resources: Add resources to build and execute NPB

Change-Id: If0939e0cc3f7b94a6a8290c483b62ec786e5e04a
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5-resources/+/29732
Reviewed-by: Bobby R. Bruce <bbruce@ucdavis.edu>
Maintainer: Bobby R. Bruce <bbruce@ucdavis.edu>
Tested-by: Bobby R. Bruce <bbruce@ucdavis.edu>
diff --git a/README.md b/README.md
old mode 100644
new mode 100755
index d1f5e58..716cc1b
--- a/README.md
+++ b/README.md
@@ -217,6 +217,21 @@
 A pre-build parsec benchmark image, for X86, can be found here:
 <http://dist.gem5.org/images/x86/ubuntu-18-04/parsec>.
 
+# Resource: NAS Parallel Benchmarks (NPB) Tests
+
+The [NAS Parallel Benchmarks](https://www.nas.nasa.gov/) (NPB) are a small set of programs designed to
+help evaluate the performance of parallel supercomputers. The set consists of
+five kernels and three pseudo-applications. gem5 resources provides a disk
+image and scripts that allow the NPB image to be run within gem5 X86
+simulations. A pre-built npb disk image can be downloaded here:
+<http://dist.gem5.org/images/x86/ubuntu-18-04/npb>.
+
+The npb resources can be found in `src/npb`. They consist of:
+- npb disk image resources
+- gem5 run scripts to execute these tests
+
+Instructions for building the npb disk image and a Linux kernel binary, and for using the gem5 run scripts to run npb, are available in the [README](src/npb/README.md) file.
+
 # Licensing
 
 Each project under the `src` is under a different license. Before using
@@ -239,3 +254,5 @@
 `src/parsec/disk-image/parsec/parsec-benchmark/LICENSE`. This is a 3-Clause
 BSD License (A Princeton University copyright). For the remaining files, please
 consult copyright notices in the source files.
+* **npb**: Consult individual copyright notices of source files in
+`src/npb`. The NAS Parallel Benchmarks utilize a permissive BSD-style license.
diff --git a/src/npb/README.md b/src/npb/README.md
new file mode 100755
index 0000000..441e1e6
--- /dev/null
+++ b/src/npb/README.md
@@ -0,0 +1,128 @@
+# NAS Parallel Benchmarks (NPB) Tests
+
+This document provides instructions to create a disk image and a Linux binary to run the NPB tests with gem5, and points to the gem5 configuration files needed to run these tests.
+The NAS Parallel Benchmarks ([NPB](https://www.nas.nasa.gov/)) belong to the category of high performance computing (HPC) workloads and consist of several kernels and pseudo applications:
+
+Kernels:
+- **IS:** Integer Sort, random memory access
+- **EP:** Embarrassingly Parallel
+- **CG:** Conjugate Gradient, irregular memory access and communication
+- **MG:** Multi-Grid on a sequence of meshes, long- and short-distance communication, memory intensive
+- **FT:** discrete 3D fast Fourier Transform, all-to-all communication
+
+Pseudo Applications:
+- **BT:** Block Tri-diagonal solver
+- **SP:** Scalar Penta-diagonal solver
+- **LU:** Lower-Upper Gauss-Seidel solver
+
+There are different classes (A, B, C, D, E and F) of the workloads, based on the data size that is used with the benchmarks. A detailed discussion of the data sizes is available [here](https://www.nas.nasa.gov/publications/npb_problem_sizes.html).
+
+We make use of a modified source of NPB for these tests, which can be found in `disk-image/npb/npb-hooks`.
+This source of NPB has ROI (region of interest) annotations for each benchmark, which gem5 uses to separate the simulation statistics of the important parts of a program from the rest. gem5 magic instructions are placed before and after the ROI; they exit the guest and transfer control to the gem5 run script, which can then dump or reset stats or switch to the CPU of interest.
+
+**Note:** The instructions in this README are based on experiments with gem5-20.
+
+We assume the following directory structure while following the instructions in this README file:
+
+```
+npb/
+  |___ gem5/                               # gem5 source code
+  |
+  |___ disk-image/
+  |      |___ shared/                      # Auxiliary files needed for disk creation
+  |      |___ npb/
+  |            |___ npb-image/             # Will be created once the disk is generated
+  |            |      |___ npb             # The generated disk image
+  |            |___ npb.json               # The Packer script to build the disk image
+  |            |___ runscript.sh           # Executes a user provided script in simulated guest
+  |            |___ post-installation.sh   # Moves runscript.sh to guest's .bashrc
+  |            |___ npb-install.sh         # Compiles NPB inside the generated disk image
+  |            |___ npb-hooks              # The NPB source (modified to function better with gem5).
+  |
+  |___ configs
+  |      |___ system                       # gem5 system config files
+  |      |___ run_npb.py                   # gem5 run script to run NPB tests
+  |
+  |___ linux                               # Linux source and binary will live here
+  |
+  |___ README.md                           # This README file
+```
+
+## Disk Image
+
+Assuming that you are in the `src/npb/` directory (the directory containing this README), first build `m5` (which is needed to create the disk image):
+
+```sh
+git clone https://gem5.googlesource.com/public/gem5
+cd gem5/util/m5
+scons build/x86/out/m5
+```
+
+Next, build the disk image using Packer:
+
+```sh
+cd disk-image
+# if packer is not already installed
+wget https://releases.hashicorp.com/packer/1.4.3/packer_1.4.3_linux_amd64.zip
+unzip packer_1.4.3_linux_amd64.zip
+
+# validate the packer script
+./packer validate npb/npb.json
+# build the disk image
+./packer build npb/npb.json
+```
+
+Once this process succeeds, the created disk image can be found at `npb/npb-image/npb`.
+A disk image already created following the above instructions can be found [here](http://dist.gem5.org/images/x86/ubuntu-18-04/npb) (**warning:** file size is 2.3 GB).
+
+For more information on the npb disk creation process using Packer, see [here](https://gem5art.readthedocs.io/en/latest/main-doc/disks.html#) and [here](https://gem5art.readthedocs.io/en/latest/tutorials/npb-tutorial.html).
+
+## gem5 Run Scripts
+
+The gem5 scripts which configure the system and run the simulation are available in `configs/`.
+The main script `run_npb.py` expects the following arguments, in this order:
+
+**kernel:** path to the Linux kernel.
+
+**disk:** path to the npb disk image.
+
+**cpu:** CPU model (`kvm`, `atomic`, `timing`).
+
+**mem_sys:** memory system (`classic`, `MI_example`, `MESI_Two_Level`, `MOESI_CMP_directory`).
+
+**benchmark:** NPB benchmark to execute (`bt.A.x`, `cg.A.x`, `ep.A.x`, `ft.A.x`, `is.A.x`, `lu.A.x`, `mg.A.x`,  `sp.A.x`).
+
+**Note:**
+By default, the instructions above for building the npb disk image build classes `A`, `B`, `C` and `D` of NPB into the disk image.
+We have only tested class `A` of the NPB.
+Replace `A` with any other class in the above listed benchmark names to test with other classes.
+
+**num_cpus:** number of CPU cores.
+
+An example of how to use these scripts:
+
+```sh
+build/X86/gem5.opt configs/run_npb.py [path to the Linux kernel] [path to the npb disk image] kvm classic bt.A.x 4
+```
+
+## Linux Kernel
+
+These tests use Linux kernel version 4.19.83, which can be compiled using the following instructions (assuming that you are in the `src/npb/` directory):
+
+```sh
+git clone https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
+cd linux
+git checkout v4.19.83
+# copy the Linux kernel configuration file for v4.19.83 from boot-tests/linux-configs/
+cp ../../boot-tests/linux-configs/config.4.19.83 .config
+make -j8
+```
+The compiled Linux binary is named `vmlinux`.
+
+**Note:** The above instructions were tested with `gcc 7.5.0`. A precompiled Linux binary can be downloaded from the following link:
+
+- [vmlinux-4.19.83](http://dist.gem5.org/kernels/x86/static/vmlinux-4.19.83)
+
+## Working Status
+
+The working status of these tests for gem5-20 can be found [here](https://www.gem5.org/documentation/benchmark_status/#npb-tests).
diff --git a/src/npb/configs/run_npb.py b/src/npb/configs/run_npb.py
new file mode 100755
index 0000000..2bbe838
--- /dev/null
+++ b/src/npb/configs/run_npb.py
@@ -0,0 +1,172 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2019 The Regents of the University of California.
+# All rights reserved.
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are
+# met: redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer;
+# redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in the
+# documentation and/or other materials provided with the distribution;
+# neither the name of the copyright holders nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+# A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+# OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+# Authors: Jason Lowe-Power, Ayaz Akram
+
+""" Script to run NAS parallel benchmarks with gem5.
+    The script expects kernel, diskimage, cpu
+    (kvm, atomic, or timing), mem_sys, benchmark to run
+    and number of cpus as arguments, in that order.
+
+    If your application has ROI annotations, this script will count the total
+    number of instructions executed in the ROI. It also tracks the elapsed
+    wallclock and simulated time.
+"""
+import errno
+import os
+import sys
+import time
+import m5
+import m5.ticks
+from m5.objects import *
+
+sys.path.append('gem5/configs/common/') # For the next line...
+import SimpleOpts
+
+from system import *
+
+def writeBenchScript(dir, bench):
+    """
+    This method creates a script in dir which will be eventually
+    passed to the simulated system (to run a specific benchmark
+    at bootup).
+    """
+    file_name = '{}/run_{}'.format(dir, bench)
+    bench_file = open(file_name,"w+")
+    bench_file.write('/home/gem5/NPB3.3-OMP/bin/{} \n'.format(bench))
+
+    # sleeping for some time (5 seconds here) makes sure
+    # that the benchmark's output has been
+    # printed to the console
+    bench_file.write('sleep 5 \n')
+    bench_file.write('m5 exit \n')
+    bench_file.close()
+    return file_name
+
+if __name__ == "__m5_main__":
+    (opts, args) = SimpleOpts.parse_args()
+    kernel, disk, cpu, mem_sys, benchmark, num_cpus = args
+
+    if cpu not in ['atomic', 'kvm', 'timing']:
+        m5.fatal("cpu not supported")
+
+    # create the system we are going to simulate
+    system = MySystem(kernel, disk, int(num_cpus), opts, no_kvm=False)
+
+
+    ruby_protocols = [ "MI_example", "MESI_Two_Level", "MOESI_CMP_directory"]
+
+    if mem_sys == "classic":
+        system = MySystem(kernel, disk, int(num_cpus), opts, no_kvm=False)
+    elif mem_sys in ruby_protocols:
+        system = MyRubySystem(kernel, disk, mem_sys, int(num_cpus), opts)
+    else:
+        m5.fatal("Bad option for mem_sys")
+
+    # Exit from guest on workbegin/workend
+    system.exit_on_work_items = True
+
+    # Create and pass a script to the simulated system to run the required
+    # benchmark
+    system.readfile = writeBenchScript(m5.options.outdir, benchmark)
+
+    # set up the root SimObject and start the simulation
+    root = Root(full_system = True, system = system)
+
+    if system.getHostParallel():
+        # Required for running kvm on multiple host cores.
+        # Uses gem5's parallel event queue feature
+        # Note: The simulator is quite picky about this number!
+        root.sim_quantum = int(1e9) # 1 ms
+
+    # needed for long-running jobs
+    m5.disableAllListeners()
+
+    # instantiate all of the objects we've created above
+    m5.instantiate()
+
+    globalStart = time.time()
+
+    print("Running the simulation")
+    print("Using cpu: {}".format(cpu))
+    exit_event = m5.simulate()
+
+    if exit_event.getCause() == "workbegin":
+        # Reached the start of ROI
+        # start of ROI is marked by an
+        # m5_work_begin() call
+        print("Resetting stats at the start of ROI!")
+        m5.stats.reset()
+        start_tick = m5.curTick()
+        start_insts = system.totalInsts()
+        # switching cpu if argument cpu == atomic or timing
+        if cpu == 'atomic':
+            system.switchCpus(system.cpu, system.atomicCpu)
+        if cpu == 'timing':
+            system.switchCpus(system.cpu, system.timingCpu)
+    else:
+        print("Unexpected termination of simulation!")
+        exit()
+
+    # Simulate the ROI
+    exit_event = m5.simulate()
+
+    # Reached the end of ROI
+    # Finish executing the benchmark
+
+    print("Dump stats at the end of the ROI!")
+    m5.stats.dump()
+    end_tick = m5.curTick()
+    end_insts = system.totalInsts()
+    m5.stats.reset()
+
+    # Switching back to KVM does not work
+    # with Ruby mem protocols, so not
+    # switching back to simulate the remaining
+    # part
+
+    if mem_sys in ruby_protocols:
+        print("Ruby Mem: Not Switching back to KVM!")
+
+    if mem_sys == 'classic':
+        # switch cpu back to kvm if atomic/timing was used for ROI
+        if cpu == 'atomic':
+            system.switchCpus(system.atomicCpu, system.cpu)
+        if cpu == 'timing':
+            system.switchCpus(system.timingCpu, system.cpu)
+
+        # Simulate the remaining part of the benchmark
+        exit_event = m5.simulate()
+
+    print("Done with the simulation")
+    print()
+    print("Performance statistics:")
+
+    print("Simulated time in ROI: %.2fs" % ((end_tick-start_tick)/1e12))
+    print("Instructions executed in ROI: %d" % ((end_insts-start_insts)))
+    print("Ran a total of", m5.curTick()/1e12, "simulated seconds")
+    print("Total wallclock time: %.2fs, %.2f min" % \
+                (time.time()-globalStart, (time.time()-globalStart)/60))
diff --git a/src/npb/configs/system/MESI_Two_Level.py b/src/npb/configs/system/MESI_Two_Level.py
new file mode 100755
index 0000000..39af672
--- /dev/null
+++ b/src/npb/configs/system/MESI_Two_Level.py
@@ -0,0 +1,342 @@
+#Copyright (c) 2020 The Regents of the University of California.
+#All Rights Reserved
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are
+# met: redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer;
+# redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in the
+# documentation and/or other materials provided with the distribution;
+# neither the name of the copyright holders nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+# A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+# OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+
+
+""" This file creates a set of Ruby caches for the MESI Two Level protocol.
+This protocol models a two-level cache hierarchy. The L1 cache is split into
+instruction and data caches.
+
+This system supports a memory size of up to 3GB.
+
+"""
+
+from __future__ import print_function
+from __future__ import absolute_import
+
+import math
+
+from m5.defines import buildEnv
+from m5.util import fatal, panic
+
+from m5.objects import *
+
+class MESITwoLevelCache(RubySystem):
+
+    def __init__(self):
+        if buildEnv['PROTOCOL'] != 'MESI_Two_Level':
+            fatal("This system assumes MESI_Two_Level!")
+
+        super(MESITwoLevelCache, self).__init__()
+
+        self._numL2Caches = 8
+
+    def setup(self, system, cpus, mem_ctrls, dma_ports, iobus):
+        """Set up the Ruby cache subsystem. Note: This can't be done in the
+           constructor because many of these items require a pointer to the
+           ruby system (self). This causes infinite recursion in initialize()
+           if we do this in the __init__.
+        """
+        # Ruby's global network.
+        self.network = MyNetwork(self)
+
+        # MESI_Two_Level example uses 5 virtual networks
+        self.number_of_virtual_networks = 5
+        self.network.number_of_virtual_networks = 5
+
+        # There is a single global list of all of the controllers to make it
+        # easier to connect everything to the global network. This can be
+        # customized depending on the topology/network requirements.
+        # L1 caches are private to a core, hence there is one L1 cache per CPU
+        # core. The number of L2 caches depends on the architecture.
+        self.controllers = \
+            [L1Cache(system, self, cpu, self._numL2Caches) for cpu in cpus] + \
+            [L2Cache(system, self, self._numL2Caches) for num in \
+            range(self._numL2Caches)] + \
+            [DirController(self, system.mem_ranges, mem_ctrls)] + \
+            [DMAController(self) for i in range(len(dma_ports))]
+
+        # Create one sequencer per CPU and dma controller.
+        # Sequencers for other controllers can be added here.
+        self.sequencers = [RubySequencer(version = i,
+                                # I/D caches are split; grab both from ctrl
+                                icache = self.controllers[i].L1Icache,
+                                dcache = self.controllers[i].L1Dcache,
+                                clk_domain = self.controllers[i].clk_domain,
+                                pio_master_port = iobus.slave,
+                                mem_master_port = iobus.slave,
+                                pio_slave_port = iobus.master
+                                ) for i in range(len(cpus))] + \
+                          [DMASequencer(version = i,
+                                        slave = port)
+                            for i,port in enumerate(dma_ports)
+                          ]
+
+        for i,c in enumerate(self.controllers[:len(cpus)]):
+            c.sequencer = self.sequencers[i]
+
+        # Connect the DMA sequencers to the DMA controllers
+        dma_ctrls = self.controllers[len(self.controllers)-len(dma_ports):]
+        for i,d in enumerate(dma_ctrls):
+            d.dma_sequencer = self.sequencers[i + len(cpus)]
+
+        self.num_of_sequencers = len(self.sequencers)
+
+        # Create the network and connect the controllers.
+        # NOTE: This is quite different if using Garnet!
+        self.network.connectControllers(self.controllers)
+        self.network.setup_buffers()
+
+        # Set up a proxy port for the system_port. Used for loading binaries and
+        # other functional-only things.
+        self.sys_port_proxy = RubyPortProxy()
+        system.system_port = self.sys_port_proxy.slave
+        self.sys_port_proxy.pio_master_port = iobus.slave
+
+        # Connect the cpu's cache, interrupt, and TLB ports to Ruby
+        for i,cpu in enumerate(cpus):
+            cpu.icache_port = self.sequencers[i].slave
+            cpu.dcache_port = self.sequencers[i].slave
+            cpu.createInterruptController()
+            isa = buildEnv['TARGET_ISA']
+            if isa == 'x86':
+                cpu.interrupts[0].pio = self.sequencers[i].master
+                cpu.interrupts[0].int_master = self.sequencers[i].slave
+                cpu.interrupts[0].int_slave = self.sequencers[i].master
+            if isa == 'x86' or isa == 'arm':
+                cpu.itb.walker.port = self.sequencers[i].slave
+                cpu.dtb.walker.port = self.sequencers[i].slave
+
+
+class L1Cache(L1Cache_Controller):
+
+    _version = 0
+    @classmethod
+    def versionCount(cls):
+        cls._version += 1 # Use count for this particular type
+        return cls._version - 1
+
+    def __init__(self, system, ruby_system, cpu, num_l2Caches):
+        """Creating L1 cache controller. Consists of both instruction
+           and data caches. The data cache is 512KB and
+           8-way set associative. The instruction cache is 32KB,
+           2-way set associative.
+        """
+        super(L1Cache, self).__init__()
+
+        self.version = self.versionCount()
+        block_size_bits = int(math.log(system.cache_line_size, 2))
+        l1i_size = '32kB'
+        l1i_assoc = '2'
+        l1d_size = '512kB'
+        l1d_assoc = '8'
+        # This is the cache memory object that stores the cache data and tags
+        self.L1Icache = RubyCache(size = l1i_size,
+                                assoc = l1i_assoc,
+                                start_index_bit = block_size_bits ,
+                                is_icache = True)
+        self.L1Dcache = RubyCache(size = l1d_size,
+                            assoc = l1d_assoc,
+                            start_index_bit = block_size_bits,
+                            is_icache = False)
+        self.l2_select_num_bits = int(math.log(num_l2Caches , 2))
+        self.clk_domain = cpu.clk_domain
+        self.prefetcher = RubyPrefetcher()
+        self.send_evictions = self.sendEvicts(cpu)
+        self.transitions_per_cycle = 4
+        self.enable_prefetch = False
+        self.ruby_system = ruby_system
+        self.connectQueues(ruby_system)
+
+    def getBlockSizeBits(self, system):
+        bits = int(math.log(system.cache_line_size, 2))
+        if 2**bits != system.cache_line_size.value:
+            panic("Cache line size not a power of 2!")
+        return bits
+
+    def sendEvicts(self, cpu):
+        """True if the CPU model or ISA requires sending evictions from caches
+           to the CPU. Three scenarios warrant forwarding evictions to the CPU:
+           1. The O3 model must keep the LSQ coherent with the caches
+           2. The x86 mwait instruction is built on top of coherence
+           3. The local exclusive monitor in ARM systems
+        """
+        if type(cpu) is DerivO3CPU or \
+           buildEnv['TARGET_ISA'] in ('x86', 'arm'):
+            return True
+        return False
+
+    def connectQueues(self, ruby_system):
+        """Connect all of the queues for this controller.
+        """
+        self.mandatoryQueue = MessageBuffer()
+        self.requestFromL1Cache = MessageBuffer()
+        self.requestFromL1Cache.master = ruby_system.network.slave
+        self.responseFromL1Cache = MessageBuffer()
+        self.responseFromL1Cache.master = ruby_system.network.slave
+        self.unblockFromL1Cache = MessageBuffer()
+        self.unblockFromL1Cache.master = ruby_system.network.slave
+
+        self.optionalQueue = MessageBuffer()
+
+        self.requestToL1Cache = MessageBuffer()
+        self.requestToL1Cache.slave = ruby_system.network.master
+        self.responseToL1Cache = MessageBuffer()
+        self.responseToL1Cache.slave = ruby_system.network.master
+
+class L2Cache(L2Cache_Controller):
+
+    _version = 0
+    @classmethod
+    def versionCount(cls):
+        cls._version += 1 # Use count for this particular type
+        return cls._version - 1
+
+    def __init__(self, system, ruby_system, num_l2Caches):
+
+        super(L2Cache, self).__init__()
+
+        self.version = self.versionCount()
+        # This is the cache memory object that stores the cache data and tags
+        self.L2cache = RubyCache(size = '1 MB',
+                                assoc = 16,
+                                start_index_bit = self.getBlockSizeBits(system,
+                                num_l2Caches))
+
+        self.transitions_per_cycle = '4'
+        self.ruby_system = ruby_system
+        self.connectQueues(ruby_system)
+
+    def getBlockSizeBits(self, system, num_l2caches):
+        l2_bits = int(math.log(num_l2caches, 2))
+        bits = int(math.log(system.cache_line_size, 2)) + l2_bits
+        return bits
+
+
+    def connectQueues(self, ruby_system):
+        """Connect all of the queues for this controller.
+        """
+        self.DirRequestFromL2Cache = MessageBuffer()
+        self.DirRequestFromL2Cache.master = ruby_system.network.slave
+        self.L1RequestFromL2Cache = MessageBuffer()
+        self.L1RequestFromL2Cache.master = ruby_system.network.slave
+        self.responseFromL2Cache = MessageBuffer()
+        self.responseFromL2Cache.master = ruby_system.network.slave
+        self.unblockToL2Cache = MessageBuffer()
+        self.unblockToL2Cache.slave = ruby_system.network.master
+        self.L1RequestToL2Cache = MessageBuffer()
+        self.L1RequestToL2Cache.slave = ruby_system.network.master
+        self.responseToL2Cache = MessageBuffer()
+        self.responseToL2Cache.slave = ruby_system.network.master
+
+
+class DirController(Directory_Controller):
+
+    _version = 0
+    @classmethod
+    def versionCount(cls):
+        cls._version += 1 # Use count for this particular type
+        return cls._version - 1
+
+    def __init__(self, ruby_system, ranges, mem_ctrls):
+        """ranges are the memory ranges assigned to this controller.
+        """
+        if len(mem_ctrls) > 1:
+            panic("This cache system can only be connected to one mem ctrl")
+        super(DirController, self).__init__()
+        self.version = self.versionCount()
+        self.addr_ranges = ranges
+        self.ruby_system = ruby_system
+        self.directory = RubyDirectoryMemory()
+        # Connect this directory to the memory side.
+        self.memory = mem_ctrls[0].port
+        self.connectQueues(ruby_system)
+
+    def connectQueues(self, ruby_system):
+        self.requestToDir = MessageBuffer()
+        self.requestToDir.slave = ruby_system.network.master
+        self.responseToDir = MessageBuffer()
+        self.responseToDir.slave = ruby_system.network.master
+        self.responseFromDir = MessageBuffer()
+        self.responseFromDir.master = ruby_system.network.slave
+        self.requestToMemory = MessageBuffer()
+        self.responseFromMemory = MessageBuffer()
+
+class DMAController(DMA_Controller):
+
+    _version = 0
+    @classmethod
+    def versionCount(cls):
+        cls._version += 1 # Use count for this particular type
+        return cls._version - 1
+
+    def __init__(self, ruby_system):
+        super(DMAController, self).__init__()
+        self.version = self.versionCount()
+        self.ruby_system = ruby_system
+        self.connectQueues(ruby_system)
+
+    def connectQueues(self, ruby_system):
+        self.mandatoryQueue = MessageBuffer()
+        self.responseFromDir = MessageBuffer(ordered = True)
+        self.responseFromDir.slave = ruby_system.network.master
+        self.requestToDir = MessageBuffer()
+        self.requestToDir.master = ruby_system.network.slave
+
+
+class MyNetwork(SimpleNetwork):
+    """A simple point-to-point network. This does not use Garnet.
+    """
+
+    def __init__(self, ruby_system):
+        super(MyNetwork, self).__init__()
+        self.netifs = []
+        self.ruby_system = ruby_system
+
+    def connectControllers(self, controllers):
+        """Connect all of the controllers to routers and connect the routers
+           together in a point-to-point network.
+        """
+        # Create one router/switch per controller in the system
+        self.routers = [Switch(router_id = i) for i in range(len(controllers))]
+
+        # Make a link from each controller to the router. The link goes
+        # externally to the network.
+        self.ext_links = [SimpleExtLink(link_id=i, ext_node=c,
+                                        int_node=self.routers[i])
+                          for i, c in enumerate(controllers)]
+
+        # Make an "internal" link (internal to the network) between every pair
+        # of routers.
+        link_count = 0
+        self.int_links = []
+        for ri in self.routers:
+            for rj in self.routers:
+                if ri == rj: continue # Don't connect a router to itself!
+                link_count += 1
+                self.int_links.append(SimpleIntLink(link_id = link_count,
+                                                    src_node = ri,
+                                                    dst_node = rj))
diff --git a/src/npb/configs/system/MI_example_caches.py b/src/npb/configs/system/MI_example_caches.py
new file mode 100755
index 0000000..309ad89
--- /dev/null
+++ b/src/npb/configs/system/MI_example_caches.py
@@ -0,0 +1,280 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2015 Jason Power
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are
+# met: redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer;
+# redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in the
+# documentation and/or other materials provided with the distribution;
+# neither the name of the copyright holders nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+# A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+# OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+# Authors: Jason Power
+
+""" This file creates a set of Ruby caches, the Ruby network, and a simple
+point-to-point topology.
+See Part 3 in the Learning gem5 book: learning.gem5.org/book/part3
+You can change simple_ruby to import from this file instead of from msi_caches
+to use the MI_example protocol instead of MSI.
+
+IMPORTANT: If you modify this file, it's likely that the Learning gem5 book
+           also needs to be updated. For now, email Jason <jason@lowepower.com>
+
+"""
+
+from __future__ import print_function
+from __future__ import absolute_import
+
+import math
+
+from m5.defines import buildEnv
+from m5.util import fatal, panic
+
+from m5.objects import *
+
+class MIExampleSystem(RubySystem):
+
+    def __init__(self):
+        if buildEnv['PROTOCOL'] != 'MI_example':
+            fatal("This system assumes MI_example!")
+
+        super(MIExampleSystem, self).__init__()
+
+    def setup(self, system, cpus, mem_ctrls, dma_ports, iobus):
+        """Set up the Ruby cache subsystem. Note: This can't be done in the
+           constructor because many of these items require a pointer to the
+           ruby system (self). This causes infinite recursion in initialize()
+           if we do this in the __init__.
+        """
+        # Ruby's global network.
+        self.network = MyNetwork(self)
+
+        # MI example uses 5 virtual networks
+        self.number_of_virtual_networks = 5
+        self.network.number_of_virtual_networks = 5
+
+        # There is a single global list of all of the controllers to make it
+        # easier to connect everything to the global network. This can be
+        # customized depending on the topology/network requirements.
+        # Create one controller for each L1 cache (and the cache mem obj.)
+        # Create a single directory controller (Really the memory cntrl)
+        self.controllers = \
+            [L1Cache(system, self, cpu) for cpu in cpus] + \
+            [DirController(self, system.mem_ranges, mem_ctrls)] + \
+            [DMAController(self) for i in range(len(dma_ports))]
+
+        # Create one sequencer per CPU. In many systems this is more
+        # complicated since you have to create sequencers for DMA controllers
+        # and other controllers, too.
+        self.sequencers = [RubySequencer(version = i,
+                                # I/D cache is combined; grab it from the ctrl
+                                icache = self.controllers[i].cacheMemory,
+                                dcache = self.controllers[i].cacheMemory,
+                                clk_domain = self.controllers[i].clk_domain,
+                                pio_master_port = iobus.slave,
+                                mem_master_port = iobus.slave,
+                                pio_slave_port = iobus.master
+                                ) for i in range(len(cpus))] + \
+                          [DMASequencer(version = i,
+                                        slave = port)
+                            for i,port in enumerate(dma_ports)
+                          ]
+
+        for i,c in enumerate(self.controllers[0:len(cpus)]):
+            c.sequencer = self.sequencers[i]
+
+        dma_ctrls = self.controllers[len(self.controllers) - len(dma_ports):]
+        for i, d in enumerate(dma_ctrls, start=len(cpus)):
+            d.dma_sequencer = self.sequencers[i]
+
+        self.num_of_sequencers = len(self.sequencers)
+
+        # Create the network and connect the controllers.
+        # NOTE: This is quite different if using Garnet!
+        self.network.connectControllers(self.controllers)
+        self.network.setup_buffers()
+
+        # Set up a proxy port for the system_port. Used for loading binaries
+        # and other functional-only things.
+        self.sys_port_proxy = RubyPortProxy()
+        system.system_port = self.sys_port_proxy.slave
+        self.sys_port_proxy.pio_master_port = iobus.slave
+
+        # Connect the cpu's cache, interrupt, and TLB ports to Ruby
+        for i,cpu in enumerate(cpus):
+            cpu.icache_port = self.sequencers[i].slave
+            cpu.dcache_port = self.sequencers[i].slave
+            cpu.createInterruptController()
+            isa = buildEnv['TARGET_ISA']
+            if isa == 'x86':
+                cpu.interrupts[0].pio = self.sequencers[i].master
+                cpu.interrupts[0].int_master = self.sequencers[i].slave
+                cpu.interrupts[0].int_slave = self.sequencers[i].master
+            if isa == 'x86' or isa == 'arm':
+                cpu.itb.walker.port = self.sequencers[i].slave
+                cpu.dtb.walker.port = self.sequencers[i].slave
+
+
+class L1Cache(L1Cache_Controller):
+
+    _version = 0
+    @classmethod
+    def versionCount(cls):
+        cls._version += 1 # Use count for this particular type
+        return cls._version - 1
+
+    def __init__(self, system, ruby_system, cpu):
+        """CPUs are needed to grab the clock domain and system is needed for
+           the cache block size.
+        """
+        super(L1Cache, self).__init__()
+
+        self.version = self.versionCount()
+        # This is the cache memory object that stores the cache data and tags
+        self.cacheMemory = RubyCache(size = '16kB',
+                               assoc = 8,
+                               start_index_bit = self.getBlockSizeBits(system))
+        self.clk_domain = cpu.clk_domain
+        self.send_evictions = self.sendEvicts(cpu)
+        self.ruby_system = ruby_system
+        self.connectQueues(ruby_system)
+
+    def getBlockSizeBits(self, system):
+        bits = int(math.log(system.cache_line_size, 2))
+        if 2**bits != system.cache_line_size.value:
+            panic("Cache line size not a power of 2!")
+        return bits
+
+    def sendEvicts(self, cpu):
+        """True if the CPU model or ISA requires sending evictions from caches
+           to the CPU. Three scenarios warrant forwarding evictions to the CPU:
+           1. The O3 model must keep the LSQ coherent with the caches
+           2. The x86 mwait instruction is built on top of coherence
+           3. The local exclusive monitor in ARM systems
+        """
+        if type(cpu) is DerivO3CPU or \
+           buildEnv['TARGET_ISA'] in ('x86', 'arm'):
+            return True
+        return False
+
+    def connectQueues(self, ruby_system):
+        """Connect all of the queues for this controller.
+        """
+        self.mandatoryQueue = MessageBuffer()
+        self.requestFromCache = MessageBuffer(ordered = True)
+        self.requestFromCache.master = ruby_system.network.slave
+        self.responseFromCache = MessageBuffer(ordered = True)
+        self.responseFromCache.master = ruby_system.network.slave
+        self.forwardToCache = MessageBuffer(ordered = True)
+        self.forwardToCache.slave = ruby_system.network.master
+        self.responseToCache = MessageBuffer(ordered = True)
+        self.responseToCache.slave = ruby_system.network.master
+
+class DirController(Directory_Controller):
+
+    _version = 0
+    @classmethod
+    def versionCount(cls):
+        cls._version += 1 # Use count for this particular type
+        return cls._version - 1
+
+    def __init__(self, ruby_system, ranges, mem_ctrls):
+        """ranges are the memory ranges assigned to this controller.
+        """
+        if len(mem_ctrls) > 1:
+            panic("This cache system can only be connected to one mem ctrl")
+        super(DirController, self).__init__()
+        self.version = self.versionCount()
+        self.addr_ranges = ranges
+        self.ruby_system = ruby_system
+        self.directory = RubyDirectoryMemory()
+        # Connect this directory to the memory side.
+        self.memory = mem_ctrls[0].port
+        self.connectQueues(ruby_system)
+
+    def connectQueues(self, ruby_system):
+        self.requestToDir = MessageBuffer(ordered = True)
+        self.requestToDir.slave = ruby_system.network.master
+        self.dmaRequestToDir = MessageBuffer(ordered = True)
+        self.dmaRequestToDir.slave = ruby_system.network.master
+
+        self.responseFromDir = MessageBuffer()
+        self.responseFromDir.master = ruby_system.network.slave
+        self.dmaResponseFromDir = MessageBuffer(ordered = True)
+        self.dmaResponseFromDir.master = ruby_system.network.slave
+        self.forwardFromDir = MessageBuffer()
+        self.forwardFromDir.master = ruby_system.network.slave
+        self.requestToMemory = MessageBuffer()
+        self.responseFromMemory = MessageBuffer()
+
+class DMAController(DMA_Controller):
+
+    _version = 0
+    @classmethod
+    def versionCount(cls):
+        cls._version += 1 # Use count for this particular type
+        return cls._version - 1
+
+    def __init__(self, ruby_system):
+        super(DMAController, self).__init__()
+        self.version = self.versionCount()
+        self.ruby_system = ruby_system
+        self.connectQueues(ruby_system)
+
+    def connectQueues(self, ruby_system):
+        self.mandatoryQueue = MessageBuffer()
+        self.requestToDir = MessageBuffer()
+        self.requestToDir.master = ruby_system.network.slave
+        self.responseFromDir = MessageBuffer(ordered = True)
+        self.responseFromDir.slave = ruby_system.network.master
+
+
+class MyNetwork(SimpleNetwork):
+    """A simple point-to-point network. This does not use Garnet.
+    """
+
+    def __init__(self, ruby_system):
+        super(MyNetwork, self).__init__()
+        self.netifs = []
+        self.ruby_system = ruby_system
+
+    def connectControllers(self, controllers):
+        """Connect all of the controllers to routers and connect the routers
+           together in a point-to-point network.
+        """
+        # Create one router/switch per controller in the system
+        self.routers = [Switch(router_id = i) for i in range(len(controllers))]
+
+        # Make a link from each controller to the router. The link goes
+        # externally to the network.
+        self.ext_links = [SimpleExtLink(link_id=i, ext_node=c,
+                                        int_node=self.routers[i])
+                          for i, c in enumerate(controllers)]
+
+        # Make an "internal" link (internal to the network) between every pair
+        # of routers.
+        link_count = 0
+        self.int_links = []
+        for ri in self.routers:
+            for rj in self.routers:
+                if ri == rj: continue # Don't connect a router to itself!
+                link_count += 1
+                self.int_links.append(SimpleIntLink(link_id = link_count,
+                                                    src_node = ri,
+                                                    dst_node = rj))
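Reviewer note: the nested loop in `connectControllers` above builds one directed internal link for every ordered pair of distinct routers, so N routers produce N * (N - 1) links, with link IDs starting at 1. The following is a minimal, gem5-free sketch of that same enumeration. The function name `build_all_to_all_links` and the tuple layout are illustrative assumptions, not part of the patch.

```python
def build_all_to_all_links(num_routers):
    """Return (link_id, src, dst) tuples for a fully connected directed graph.

    Hypothetical helper: mirrors the loop structure of connectControllers,
    using plain integers in place of Switch and SimpleIntLink objects.
    """
    links = []
    link_count = 0
    for src in range(num_routers):
        for dst in range(num_routers):
            if src == dst:
                continue  # a router is never linked to itself
            link_count += 1  # incremented before use, so IDs start at 1
            links.append((link_count, src, dst))
    return links


links = build_all_to_all_links(4)
print(len(links))  # 4 * 3 directed links
```

Because the link count grows quadratically with the number of controllers, this topology is best suited to small systems.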
diff --git a/src/npb/configs/system/MOESI_CMP_directory.py b/src/npb/configs/system/MOESI_CMP_directory.py
new file mode 100755
index 0000000..372b792
--- /dev/null
+++ b/src/npb/configs/system/MOESI_CMP_directory.py
@@ -0,0 +1,351 @@
+# Copyright (c) 2020 The Regents of the University of California.
+# All Rights Reserved
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are
+# met: redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer;
+# redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in the
+# documentation and/or other materials provided with the distribution;
+# neither the name of the copyright holders nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+# A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+# OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+
+
+""" This file creates a set of Ruby caches for the MOESI CMP directory
+protocol.
+This protocol models a two-level cache hierarchy. The L1 cache is split into
+separate instruction and data caches.
+
+This system supports a memory size of up to 3GB.
+
+"""
+
+from __future__ import print_function
+from __future__ import absolute_import
+
+import math
+
+from m5.defines import buildEnv
+from m5.util import fatal, panic
+
+from m5.objects import *
+
+class MOESICMPDirCache(RubySystem):
+
+    def __init__(self):
+        if buildEnv['PROTOCOL'] != 'MOESI_CMP_directory':
+            fatal("This system assumes MOESI_CMP_directory!")
+
+        super(MOESICMPDirCache, self).__init__()
+
+        self._numL2Caches = 8
+
+    def setup(self, system, cpus, mem_ctrls, dma_ports, iobus):
+        """Set up the Ruby cache subsystem. Note: This can't be done in the
+           constructor because many of these items require a pointer to the
+           ruby system (self). This causes infinite recursion in initialize()
+           if we do this in the __init__.
+        """
+        # Ruby's global network.
+        self.network = MyNetwork(self)
+
+        # MOESI_CMP_directory example uses 3 virtual networks
+        self.number_of_virtual_networks = 3
+        self.network.number_of_virtual_networks = 3
+
+        # There is a single global list of all of the controllers to make it
+        # easier to connect everything to the global network. This can be
+        # customized depending on the topology/network requirements.
+        # L1 caches are private to a core, hence there is one L1 cache per CPU
+        # core. The number of L2 caches depends on the architecture.
+        self.controllers = \
+            [L1Cache(system, self, cpu, self._numL2Caches) for cpu in cpus] + \
+            [L2Cache(system, self, self._numL2Caches) for num in \
+            range(self._numL2Caches)] + \
+            [DirController(self, system.mem_ranges, mem_ctrls)] + \
+            [DMAController(self) for i in range(len(dma_ports))]
+
+        # Create one sequencer per CPU and DMA controller.
+        # Sequencers for other controllers can be added here.
+        self.sequencers = [RubySequencer(version = i,
+                                # Split I/D caches; grab each from the ctrl
+                                icache = self.controllers[i].L1Icache,
+                                dcache = self.controllers[i].L1Dcache,
+                                clk_domain = self.controllers[i].clk_domain,
+                                pio_master_port = iobus.slave,
+                                mem_master_port = iobus.slave,
+                                pio_slave_port = iobus.master
+                                ) for i in range(len(cpus))] + \
+                          [DMASequencer(version = i,
+                                        slave = port)
+                            for i,port in enumerate(dma_ports)
+                          ]
+
+        for i,c in enumerate(self.controllers[:len(cpus)]):
+            c.sequencer = self.sequencers[i]
+
+        # Connect each DMA sequencer to its DMA controller.
+        dma_ctrls = self.controllers[len(self.controllers) - len(dma_ports):]
+        for i, d in enumerate(dma_ctrls, start=len(cpus)):
+            d.dma_sequencer = self.sequencers[i]
+
+        self.num_of_sequencers = len(self.sequencers)
+
+        # Create the network and connect the controllers.
+        # NOTE: This is quite different if using Garnet!
+        self.network.connectControllers(self.controllers)
+        self.network.setup_buffers()
+
+        # Set up a proxy port for the system_port. Used for loading binaries
+        # and other functional-only things.
+        self.sys_port_proxy = RubyPortProxy()
+        system.system_port = self.sys_port_proxy.slave
+        self.sys_port_proxy.pio_master_port = iobus.slave
+
+        # Connect the cpu's cache, interrupt, and TLB ports to Ruby
+        for i,cpu in enumerate(cpus):
+            cpu.icache_port = self.sequencers[i].slave
+            cpu.dcache_port = self.sequencers[i].slave
+            cpu.createInterruptController()
+            isa = buildEnv['TARGET_ISA']
+            if isa == 'x86':
+                cpu.interrupts[0].pio = self.sequencers[i].master
+                cpu.interrupts[0].int_master = self.sequencers[i].slave
+                cpu.interrupts[0].int_slave = self.sequencers[i].master
+            if isa == 'x86' or isa == 'arm':
+                cpu.itb.walker.port = self.sequencers[i].slave
+                cpu.dtb.walker.port = self.sequencers[i].slave
+
+
+class L1Cache(L1Cache_Controller):
+
+    _version = 0
+    @classmethod
+    def versionCount(cls):
+        cls._version += 1 # Use count for this particular type
+        return cls._version - 1
+
+    def __init__(self, system, ruby_system, cpu, num_l2Caches):
+        """Create an L1 cache controller, consisting of both an instruction
+           cache and a data cache. The data cache is 512kB and 8-way set
+           associative. The instruction cache is 32kB and 2-way set
+           associative.
+        """
+        super(L1Cache, self).__init__()
+
+        self.version = self.versionCount()
+        block_size_bits = int(math.log(system.cache_line_size, 2))
+        l1i_size = '32kB'
+        l1i_assoc = 2
+        l1d_size = '512kB'
+        l1d_assoc = 8
+        # This is the cache memory object that stores the cache data and tags
+        self.L1Icache = RubyCache(size = l1i_size,
+                                assoc = l1i_assoc,
+                                start_index_bit = block_size_bits,
+                                is_icache = True,
+                                dataAccessLatency = 1,
+                                tagAccessLatency = 1)
+        self.L1Dcache = RubyCache(size = l1d_size,
+                            assoc = l1d_assoc,
+                            start_index_bit = block_size_bits,
+                            is_icache = False,
+                            dataAccessLatency = 1,
+                            tagAccessLatency = 1)
+        self.clk_domain = cpu.clk_domain
+        self.prefetcher = RubyPrefetcher()
+        self.send_evictions = self.sendEvicts(cpu)
+        self.transitions_per_cycle = 4
+        self.ruby_system = ruby_system
+        self.connectQueues(ruby_system)
+
+    def getBlockSizeBits(self, system):
+        bits = int(math.log(system.cache_line_size, 2))
+        if 2**bits != system.cache_line_size.value:
+            panic("Cache line size not a power of 2!")
+        return bits
+
+    def sendEvicts(self, cpu):
+        """True if the CPU model or ISA requires sending evictions from caches
+           to the CPU. Three scenarios warrant forwarding evictions to the CPU:
+           1. The O3 model must keep the LSQ coherent with the caches
+           2. The x86 mwait instruction is built on top of coherence
+           3. The local exclusive monitor in ARM systems
+        """
+        if type(cpu) is DerivO3CPU or \
+           buildEnv['TARGET_ISA'] in ('x86', 'arm'):
+            return True
+        return False
+
+    def connectQueues(self, ruby_system):
+        """Connect all of the queues for this controller.
+        """
+        self.mandatoryQueue = MessageBuffer()
+        self.requestFromL1Cache = MessageBuffer()
+        self.requestFromL1Cache.master = ruby_system.network.slave
+        self.responseFromL1Cache = MessageBuffer()
+        self.responseFromL1Cache.master = ruby_system.network.slave
+        self.requestToL1Cache = MessageBuffer()
+        self.requestToL1Cache.slave = ruby_system.network.master
+        self.responseToL1Cache = MessageBuffer()
+        self.responseToL1Cache.slave = ruby_system.network.master
+        self.triggerQueue = MessageBuffer(ordered = True)
+
+class L2Cache(L2Cache_Controller):
+
+    _version = 0
+    @classmethod
+    def versionCount(cls):
+        cls._version += 1 # Use count for this particular type
+        return cls._version - 1
+
+    def __init__(self, system, ruby_system, num_l2Caches):
+
+        super(L2Cache, self).__init__()
+
+        self.version = self.versionCount()
+        # This is the cache memory object that stores the cache data and tags
+        self.L2cache = RubyCache(size = '1MB',
+                                assoc = 16,
+                                start_index_bit = self.getL2StartIdx(system,
+                                num_l2Caches),
+                                dataAccessLatency = 20,
+                                tagAccessLatency = 20)
+
+        self.transitions_per_cycle = 4
+        self.ruby_system = ruby_system
+        self.connectQueues(ruby_system)
+
+    def getL2StartIdx(self, system, num_l2caches):
+        l2_bits = int(math.log(num_l2caches, 2))
+        bits = int(math.log(system.cache_line_size, 2)) + l2_bits
+        return bits
+
+
+    def connectQueues(self, ruby_system):
+        """Connect all of the queues for this controller.
+        """
+        self.GlobalRequestFromL2Cache = MessageBuffer()
+        self.GlobalRequestFromL2Cache.master = ruby_system.network.slave
+        self.L1RequestFromL2Cache = MessageBuffer()
+        self.L1RequestFromL2Cache.master = ruby_system.network.slave
+        self.responseFromL2Cache = MessageBuffer()
+        self.responseFromL2Cache.master = ruby_system.network.slave
+
+        self.GlobalRequestToL2Cache = MessageBuffer()
+        self.GlobalRequestToL2Cache.slave = ruby_system.network.master
+        self.L1RequestToL2Cache = MessageBuffer()
+        self.L1RequestToL2Cache.slave = ruby_system.network.master
+        self.responseToL2Cache = MessageBuffer()
+        self.responseToL2Cache.slave = ruby_system.network.master
+        self.triggerQueue = MessageBuffer(ordered = True)
+
+
+
+class DirController(Directory_Controller):
+
+    _version = 0
+    @classmethod
+    def versionCount(cls):
+        cls._version += 1 # Use count for this particular type
+        return cls._version - 1
+
+    def __init__(self, ruby_system, ranges, mem_ctrls):
+        """ranges are the memory ranges assigned to this controller.
+        """
+        if len(mem_ctrls) > 1:
+            panic("This cache system can only be connected to one mem ctrl")
+        super(DirController, self).__init__()
+        self.version = self.versionCount()
+        self.addr_ranges = ranges
+        self.ruby_system = ruby_system
+        self.directory = RubyDirectoryMemory()
+        # Connect this directory to the memory side.
+        self.memory = mem_ctrls[0].port
+        self.connectQueues(ruby_system)
+
+    def connectQueues(self, ruby_system):
+        self.requestToDir = MessageBuffer()
+        self.requestToDir.slave = ruby_system.network.master
+        self.responseToDir = MessageBuffer()
+        self.responseToDir.slave = ruby_system.network.master
+        self.responseFromDir = MessageBuffer()
+        self.responseFromDir.master = ruby_system.network.slave
+        self.forwardFromDir = MessageBuffer()
+        self.forwardFromDir.master = ruby_system.network.slave
+        self.requestToMemory = MessageBuffer()
+        self.responseFromMemory = MessageBuffer()
+
+class DMAController(DMA_Controller):
+
+    _version = 0
+    @classmethod
+    def versionCount(cls):
+        cls._version += 1 # Use count for this particular type
+        return cls._version - 1
+
+    def __init__(self, ruby_system):
+        super(DMAController, self).__init__()
+        self.version = self.versionCount()
+        self.ruby_system = ruby_system
+        self.connectQueues(ruby_system)
+
+    def connectQueues(self, ruby_system):
+        self.mandatoryQueue = MessageBuffer()
+        self.responseFromDir = MessageBuffer()
+        self.responseFromDir.slave = ruby_system.network.master
+        self.reqToDir = MessageBuffer()
+        self.reqToDir.master = ruby_system.network.slave
+        self.respToDir = MessageBuffer()
+        self.respToDir.master = ruby_system.network.slave
+        self.triggerQueue = MessageBuffer(ordered = True)
+
+
+class MyNetwork(SimpleNetwork):
+    """A simple point-to-point network. This does not use Garnet.
+    """
+
+    def __init__(self, ruby_system):
+        super(MyNetwork, self).__init__()
+        self.netifs = []
+        self.ruby_system = ruby_system
+
+    def connectControllers(self, controllers):
+        """Connect all of the controllers to routers and connect the routers
+           together in a point-to-point network.
+        """
+        # Create one router/switch per controller in the system
+        self.routers = [Switch(router_id = i) for i in range(len(controllers))]
+
+        # Make a link from each controller to the router. The link goes
+        # externally to the network.
+        self.ext_links = [SimpleExtLink(link_id=i, ext_node=c,
+                                        int_node=self.routers[i])
+                          for i, c in enumerate(controllers)]
+
+        # Make an "internal" link (internal to the network) between every pair
+        # of routers.
+        link_count = 0
+        self.int_links = []
+        for ri in self.routers:
+            for rj in self.routers:
+                if ri == rj: continue # Don't connect a router to itself!
+                link_count += 1
+                self.int_links.append(SimpleIntLink(link_id = link_count,
+                                                    src_node = ri,
+                                                    dst_node = rj))
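Reviewer note: `getL2StartIdx` above places the L2 set index above both the block-offset bits and the bits used to select an L2 bank, that is, log2(cache line size) + log2(number of L2 caches). The sketch below reproduces that arithmetic without gem5, using exact integer checks instead of `math.log`. The helper names are hypothetical, and the 64-byte line size is an assumed example value; the count of 8 L2 caches comes from `_numL2Caches` in the patch.

```python
def log2_exact(value):
    """Return log2(value), raising if value is not a positive power of two."""
    if value <= 0 or value & (value - 1):
        raise ValueError("%r is not a positive power of 2" % (value,))
    return value.bit_length() - 1


def l2_start_index_bit(cache_line_size, num_l2_caches):
    """Hypothetical stand-in for getL2StartIdx: offset bits plus bank bits."""
    return log2_exact(cache_line_size) + log2_exact(num_l2_caches)


# Assumed 64-byte cache lines (6 offset bits) and 8 L2 banks (3 bank bits).
print(l2_start_index_bit(64, 8))
```

Note that the patch's `int(math.log(num_l2caches, 2))` silently rounds down when the bank count is not a power of two, whereas this sketch raises an error in that case.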
diff --git a/src/npb/configs/system/__init__.py b/src/npb/configs/system/__init__.py
new file mode 100755
index 0000000..3b71680
--- /dev/null
+++ b/src/npb/configs/system/__init__.py
@@ -0,0 +1,31 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2016 Jason Lowe-Power
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are
+# met: redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer;
+# redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in the
+# documentation and/or other materials provided with the distribution;
+# neither the name of the copyright holders nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+# A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+# OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+# Authors: Jason Lowe-Power
+
+from .system import MySystem
+from .ruby_system import MyRubySystem
diff --git a/src/npb/configs/system/caches.py b/src/npb/configs/system/caches.py
new file mode 100755
index 0000000..4630cea
--- /dev/null
+++ b/src/npb/configs/system/caches.py
@@ -0,0 +1,202 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2016 Jason Lowe-Power
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are
+# met: redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer;
+# redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in the
+# documentation and/or other materials provided with the distribution;
+# neither the name of the copyright holders nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+# A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+# OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+# Authors: Jason Lowe-Power
+
+""" Caches with options for a simple gem5 configuration script
+
+This file contains L1 I/D and L2 caches to be used in the simple
+gem5 configuration script. It uses the SimpleOpts wrapper to set up command
+line options from each individual class.
+"""
+
+import m5
+from m5.objects import Cache, L2XBar, StridePrefetcher, SubSystem
+from m5.params import AddrRange, AllMemory, MemorySize
+from m5.util.convert import toMemorySize
+
+import SimpleOpts
+
+# Some specific options for caches
+# For all options see src/mem/cache/BaseCache.py
+
+class PrefetchCache(Cache):
+
+    SimpleOpts.add_option("--no_prefetchers", default=False,
+                          action="store_true",
+                          help="Disable prefetchers on the caches")
+
+    def __init__(self, options):
+        super(PrefetchCache, self).__init__()
+        if not options or options.no_prefetchers:
+            return
+        self.prefetcher = StridePrefetcher()
+
+class L1Cache(PrefetchCache):
+    """Simple L1 Cache with default values"""
+
+    assoc = 8
+    tag_latency = 1
+    data_latency = 1
+    response_latency = 1
+    mshrs = 16
+    tgts_per_mshr = 20
+    writeback_clean = True
+
+    def __init__(self, options=None):
+        super(L1Cache, self).__init__(options)
+        pass
+
+    def connectBus(self, bus):
+        """Connect this cache to a memory-side bus"""
+        self.mem_side = bus.slave
+
+    def connectCPU(self, cpu):
+        """Connect this cache's port to a CPU-side port
+           This must be defined in a subclass"""
+        raise NotImplementedError
+
+class L1ICache(L1Cache):
+    """Simple L1 instruction cache with default values"""
+
+    # Set the default size
+    size = '32kB'
+
+    SimpleOpts.add_option('--l1i_size',
+                        help="L1 instruction cache size. Default: %s" % size)
+
+    def __init__(self, opts=None):
+        super(L1ICache, self).__init__(opts)
+        if not opts or not opts.l1i_size:
+            return
+        self.size = opts.l1i_size
+
+    def connectCPU(self, cpu):
+        """Connect this cache's port to a CPU icache port"""
+        self.cpu_side = cpu.icache_port
+
+class L1DCache(L1Cache):
+    """Simple L1 data cache with default values"""
+
+    # Set the default size
+    size = '32kB'
+
+    SimpleOpts.add_option('--l1d_size',
+                          help="L1 data cache size. Default: %s" % size)
+
+    def __init__(self, opts=None):
+        super(L1DCache, self).__init__(opts)
+        if not opts or not opts.l1d_size:
+            return
+        self.size = opts.l1d_size
+
+    def connectCPU(self, cpu):
+        """Connect this cache's port to a CPU dcache port"""
+        self.cpu_side = cpu.dcache_port
+
+class MMUCache(Cache):
+    # Default parameters
+    size = '8kB'
+    assoc = 4
+    tag_latency = 1
+    data_latency = 1
+    response_latency = 1
+    mshrs = 20
+    tgts_per_mshr = 12
+    writeback_clean = True
+
+    def __init__(self):
+        super(MMUCache, self).__init__()
+
+    def connectCPU(self, cpu):
+        """Connect the CPU itb and dtb to the cache
+           Note: This creates a new crossbar
+        """
+        self.mmubus = L2XBar()
+        self.cpu_side = self.mmubus.master
+        for tlb in [cpu.itb, cpu.dtb]:
+            self.mmubus.slave = tlb.walker.port
+
+    def connectBus(self, bus):
+        """Connect this cache to a memory-side bus"""
+        self.mem_side = bus.slave
+
+class L2Cache(PrefetchCache):
+    """Simple L2 Cache with default values"""
+
+    # Default parameters
+    size = '256kB'
+    assoc = 16
+    tag_latency = 10
+    data_latency = 10
+    response_latency = 1
+    mshrs = 20
+    tgts_per_mshr = 12
+    writeback_clean = True
+
+    SimpleOpts.add_option('--l2_size',
+                          help="L2 cache size. Default: %s" % size)
+
+    def __init__(self, opts=None):
+        super(L2Cache, self).__init__(opts)
+        if not opts or not opts.l2_size:
+            return
+        self.size = opts.l2_size
+
+    def connectCPUSideBus(self, bus):
+        self.cpu_side = bus.master
+
+    def connectMemSideBus(self, bus):
+        self.mem_side = bus.slave
+
+class L3Cache(Cache):
+    """Simple L3 Cache bank with default values
+       This assumes that the L3 is made up of multiple banks. This cannot
+       be used as a standalone L3 cache.
+    """
+
+    SimpleOpts.add_option('--l3_size', default = '4MB',
+                          help="L3 cache size. Default: 4MB")
+
+    # Default parameters
+    assoc = 32
+    tag_latency = 40
+    data_latency = 40
+    response_latency = 10
+    mshrs = 256
+    tgts_per_mshr = 12
+    clusivity = 'mostly_excl'
+
+    def __init__(self, opts):
+        super(L3Cache, self).__init__()
+        self.size = opts.l3_size
+
+    def connectCPUSideBus(self, bus):
+        self.cpu_side = bus.master
+
+    def connectMemSideBus(self, bus):
+        self.mem_side = bus.slave
diff --git a/src/npb/configs/system/fs_tools.py b/src/npb/configs/system/fs_tools.py
new file mode 100755
index 0000000..22a43d2
--- /dev/null
+++ b/src/npb/configs/system/fs_tools.py
@@ -0,0 +1,39 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2016 Jason Lowe-Power
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are
+# met: redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer;
+# redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in the
+# documentation and/or other materials provided with the distribution;
+# neither the name of the copyright holders nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+# A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+# OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+# Authors: Jason Lowe-Power
+
+from m5.objects import IdeDisk, CowDiskImage, RawDiskImage
+
+class CowDisk(IdeDisk):
+
+    def __init__(self, filename):
+        super(CowDisk, self).__init__()
+        self.driveID = 'master'
+        self.image = CowDiskImage(child=RawDiskImage(read_only=True),
+                                  read_only=False)
+        self.image.child.image_file = filename
diff --git a/src/npb/configs/system/ruby_system.py b/src/npb/configs/system/ruby_system.py
new file mode 100755
index 0000000..37be01e
--- /dev/null
+++ b/src/npb/configs/system/ruby_system.py
@@ -0,0 +1,236 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2016 Jason Lowe-Power
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are
+# met: redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer;
+# redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in the
+# documentation and/or other materials provided with the distribution;
+# neither the name of the copyright holders nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+# A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+# OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+# Authors: Jason Lowe-Power
+
+import m5
+from m5.objects import *
+from m5.util import convert
+from fs_tools import *
+
+
+class MyRubySystem(System):
+
+    def __init__(self, kernel, disk, mem_sys, num_cpus, opts):
+        super(MyRubySystem, self).__init__()
+        self._opts = opts
+
+        self._host_parallel = True
+
+        # Set up the clock domain and the voltage domain
+        self.clk_domain = SrcClockDomain()
+        self.clk_domain.clock = '3GHz'
+        self.clk_domain.voltage_domain = VoltageDomain()
+
+        self.mem_ranges = [AddrRange(Addr('3GB')), # All data
+                           AddrRange(0xC0000000, size=0x100000), # For I/O
+                           ]
+
+        self.initFS(num_cpus)
+
+        # Replace these paths with the path to your disk images.
+        # The first disk is the root disk. The second could be used for swap
+        # or anything else.
+        self.setDiskImages(disk, disk)
+
+        # Change this path to point to the kernel you want to use
+        self.workload.object_file = kernel
+        # Options specified on the kernel command line
+        boot_options = ['earlyprintk=ttyS0', 'console=ttyS0', 'lpj=7999923',
+                         'root=/dev/hda1']
+
+        self.workload.command_line = ' '.join(boot_options)
+
+        # Create the CPUs for our system.
+        self.createCPU(num_cpus)
+
+        self.createMemoryControllersDDR3()
+
+        # Create the cache hierarchy for the system.
+        if mem_sys == 'MI_example':
+            from MI_example_caches import MIExampleSystem
+            self.caches = MIExampleSystem()
+        elif mem_sys == 'MESI_Two_Level':
+            from MESI_Two_Level import MESITwoLevelCache
+            self.caches = MESITwoLevelCache()
+        elif mem_sys == 'MOESI_CMP_directory':
+            from MOESI_CMP_directory import MOESICMPDirCache
+            self.caches = MOESICMPDirCache()
+        self.caches.setup(self, self.cpu, self.mem_cntrls,
+                          [self.pc.south_bridge.ide.dma, self.iobus.master],
+                          self.iobus)
+
+        if self._host_parallel:
+            # To get the KVM CPUs to run on different host CPUs
+            # Specify a different event queue for each CPU
+            for i,cpu in enumerate(self.cpu):
+                for obj in cpu.descendants():
+                    obj.eventq_index = 0
+
+                # The number of event queues was chosen based on
+                # experiments with a few benchmarks.
+
+                cpu.eventq_index = i + 1
+
+    def getHostParallel(self):
+        return self._host_parallel
+
+    def totalInsts(self):
+        return sum([cpu.totalInsts() for cpu in self.cpu])
+
+    def createCPU(self, num_cpus):
+
+        # Note KVM needs a VM and atomic_noncaching
+        self.cpu = [X86KvmCPU(cpu_id = i) for i in range(num_cpus)]
+        self.kvm_vm = KvmVM()
+        self.mem_mode = 'atomic_noncaching'
+        for cpu in self.cpu:
+            cpu.createThreads()
+
+        self.atomicCpu = [AtomicSimpleCPU(cpu_id = i, switched_out = True)
+                          for i in range(num_cpus)]
+        for cpu in self.atomicCpu:
+            cpu.createThreads()
+
+        self.timingCpu = [TimingSimpleCPU(cpu_id = i, switched_out = True)
+                          for i in range(num_cpus)]
+        for cpu in self.timingCpu:
+            cpu.createThreads()
+
+    def switchCpus(self, old, new):
+        assert(new[0].switchedOut())
+        m5.switchCpus(self, zip(old, new))
+
+    def setDiskImages(self, img_path_1, img_path_2):
+        disk0 = CowDisk(img_path_1)
+        disk2 = CowDisk(img_path_2)
+        self.pc.south_bridge.ide.disks = [disk0, disk2]
+
+    def createMemoryControllersDDR3(self):
+        self._createMemoryControllers(1, DDR3_1600_8x8)
+
+    def _createMemoryControllers(self, num, cls):
+        self.mem_cntrls = [
+            cls(range = self.mem_ranges[0])
+            for i in range(num)
+        ]
+
+    def initFS(self, cpus):
+        self.pc = Pc()
+
+        self.workload = X86FsLinux()
+
+        # North Bridge
+        self.iobus = IOXBar()
+
+        # connect the io bus
+        # Note: pass in a reference to where Ruby will connect to in the future
+        # so the port isn't connected twice.
+        self.pc.attachIO(self.iobus, [self.pc.south_bridge.ide.dma])
+
+        self.intrctrl = IntrControl()
+
+        ###############################################
+
+        # Add in a Bios information structure.
+        self.workload.smbios_table.structures = [X86SMBiosBiosInformation()]
+
+        # Set up the Intel MP table
+        base_entries = []
+        ext_entries = []
+        for i in range(cpus):
+            bp = X86IntelMPProcessor(
+                    local_apic_id = i,
+                    local_apic_version = 0x14,
+                    enable = True,
+                    bootstrap = (i ==0))
+            base_entries.append(bp)
+        io_apic = X86IntelMPIOAPIC(
+                id = cpus,
+                version = 0x11,
+                enable = True,
+                address = 0xfec00000)
+        self.pc.south_bridge.io_apic.apic_id = io_apic.id
+        base_entries.append(io_apic)
+        pci_bus = X86IntelMPBus(bus_id = 0, bus_type='PCI   ')
+        base_entries.append(pci_bus)
+        isa_bus = X86IntelMPBus(bus_id = 1, bus_type='ISA   ')
+        base_entries.append(isa_bus)
+        connect_busses = X86IntelMPBusHierarchy(bus_id=1,
+                subtractive_decode=True, parent_bus=0)
+        ext_entries.append(connect_busses)
+        pci_dev4_inta = X86IntelMPIOIntAssignment(
+                interrupt_type = 'INT',
+                polarity = 'ConformPolarity',
+                trigger = 'ConformTrigger',
+                source_bus_id = 0,
+                source_bus_irq = 0 + (4 << 2),
+                dest_io_apic_id = io_apic.id,
+                dest_io_apic_intin = 16)
+        base_entries.append(pci_dev4_inta)
+        def assignISAInt(irq, apicPin):
+            assign_8259_to_apic = X86IntelMPIOIntAssignment(
+                    interrupt_type = 'ExtInt',
+                    polarity = 'ConformPolarity',
+                    trigger = 'ConformTrigger',
+                    source_bus_id = 1,
+                    source_bus_irq = irq,
+                    dest_io_apic_id = io_apic.id,
+                    dest_io_apic_intin = 0)
+            base_entries.append(assign_8259_to_apic)
+            assign_to_apic = X86IntelMPIOIntAssignment(
+                    interrupt_type = 'INT',
+                    polarity = 'ConformPolarity',
+                    trigger = 'ConformTrigger',
+                    source_bus_id = 1,
+                    source_bus_irq = irq,
+                    dest_io_apic_id = io_apic.id,
+                    dest_io_apic_intin = apicPin)
+            base_entries.append(assign_to_apic)
+        assignISAInt(0, 2)
+        assignISAInt(1, 1)
+        for i in range(3, 15):
+            assignISAInt(i, i)
+        self.workload.intel_mp_table.base_entries = base_entries
+        self.workload.intel_mp_table.ext_entries = ext_entries
+
+        entries = \
+           [
+            # The first 639kB are usable; the rest of the first MB is reserved
+            X86E820Entry(addr = 0, size = '639kB', range_type = 1),
+            X86E820Entry(addr = 0x9fc00, size = '385kB', range_type = 2),
+            # Mark the rest of physical memory as available
+            X86E820Entry(addr = 0x100000,
+                    size = '%dB' % (self.mem_ranges[0].size() - 0x100000),
+                    range_type = 1),
+            ]
+
+        # Reserve the last 64kB of the 32-bit address space for m5ops
+        entries.append(X86E820Entry(addr = 0xFFFF0000, size = '64kB',
+                                    range_type=2))
+
+        self.workload.e820_table.entries = entries
diff --git a/src/npb/configs/system/system.py b/src/npb/configs/system/system.py
new file mode 100755
index 0000000..4b154df
--- /dev/null
+++ b/src/npb/configs/system/system.py
@@ -0,0 +1,398 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2018 The Regents of the University of California
+# All Rights Reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are
+# met: redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer;
+# redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in the
+# documentation and/or other materials provided with the distribution;
+# neither the name of the copyright holders nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+# A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+# OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+# Authors: Jason Lowe-Power
+
+import m5
+from m5.objects import *
+from m5.util import convert
+from fs_tools import *
+from caches import *
+
+
+class MySystem(System):
+
+    SimpleOpts.add_option("--no_host_parallel", default=False,
+                action="store_true",
+                help="Do NOT run gem5 on multiple host threads (kvm only)")
+
+    SimpleOpts.add_option("--second_disk", default='',
+                          help="The second disk image to mount (/dev/hdb)")
+
+    def __init__(self, kernel, disk, num_cpus, opts, no_kvm=False):
+        super(MySystem, self).__init__()
+        self._opts = opts
+        self._no_kvm = no_kvm
+
+        self._host_parallel = not self._opts.no_host_parallel
+
+        # Set up the clock domain and the voltage domain
+        self.clk_domain = SrcClockDomain()
+        self.clk_domain.clock = '2.3GHz'
+        self.clk_domain.voltage_domain = VoltageDomain()
+
+        mem_size = '32GB'
+        self.mem_ranges = [AddrRange('100MB'), # For kernel
+                           AddrRange(0xC0000000, size=0x100000), # For I/O
+                           AddrRange(Addr('4GB'), size = mem_size) # All data
+                           ]
+
+        # Create the main memory bus
+        # This connects to main memory
+        self.membus = SystemXBar(width = 64) # 64-byte width
+        self.membus.badaddr_responder = BadAddr()
+        self.membus.default = Self.badaddr_responder.pio
+
+        # Set up the system port for functional access from the simulator
+        self.system_port = self.membus.slave
+
+        self.initFS(self.membus, num_cpus)
+
+
+        # Replace these paths with the path to your disk images.
+        # The first disk is the root disk. The second could be used for swap
+        # or anything else.
+
+        # Attach --second_disk as /dev/hdb when it is given; otherwise
+        # attach the root disk image a second time.
+        if opts.second_disk:
+            self.setDiskImages(disk, opts.second_disk)
+        else:
+            self.setDiskImages(disk, disk)
+
+        # Change this path to point to the kernel you want to use
+        self.workload.object_file = kernel
+        # Options specified on the kernel command line
+        boot_options = ['earlyprintk=ttyS0', 'console=ttyS0', 'lpj=7999923',
+                         'root=/dev/hda1']
+
+        self.workload.command_line = ' '.join(boot_options)
+
+        # Create the CPUs for our system.
+        self.createCPU(num_cpus)
+
+        # Create the cache hierarchy for the system.
+        self.createCacheHierarchy()
+
+        # Set up the interrupt controllers for the system (x86 specific)
+        self.setupInterrupts()
+
+        self.createMemoryControllersDDR4()
+
+        if self._host_parallel:
+            # To get the KVM CPUs to run on different host CPUs
+            # Specify a different event queue for each CPU
+            for i,cpu in enumerate(self.cpu):
+                for obj in cpu.descendants():
+                    obj.eventq_index = 0
+
+                # The number of event queues was chosen based on
+                # experiments with a few benchmarks.
+                if len(self.cpu) > 16:
+                    cpu.eventq_index = (i // 4) + 1
+                else:
+                    cpu.eventq_index = (i // 2) + 1
+
+    def getHostParallel(self):
+        return self._host_parallel
+
+    def totalInsts(self):
+        return sum([cpu.totalInsts() for cpu in self.cpu])
+
+    def createCPU(self, num_cpus):
+        if self._no_kvm:
+            self.cpu = [AtomicSimpleCPU(cpu_id = i, switched_out = False)
+                        for i in range(num_cpus)]
+            for cpu in self.cpu:
+                cpu.createThreads()
+            self.mem_mode = 'timing'
+        else:
+            # Note KVM needs a VM and atomic_noncaching
+            self.cpu = [X86KvmCPU(cpu_id = i) for i in range(num_cpus)]
+            for cpu in self.cpu:
+                cpu.createThreads()
+            self.kvm_vm = KvmVM()
+            self.mem_mode = 'atomic_noncaching'
+
+            self.atomicCpu = [AtomicSimpleCPU(cpu_id = i,
+                                              switched_out = True)
+                              for i in range(num_cpus)]
+            for cpu in self.atomicCpu:
+                cpu.createThreads()
+
+        self.timingCpu = [TimingSimpleCPU(cpu_id = i, switched_out = True)
+                          for i in range(num_cpus)]
+        for cpu in self.timingCpu:
+            cpu.createThreads()
+
+    def switchCpus(self, old, new):
+        assert(new[0].switchedOut())
+        m5.switchCpus(self, zip(old, new))
+
+    def setDiskImages(self, img_path_1, img_path_2):
+        disk0 = CowDisk(img_path_1)
+        disk2 = CowDisk(img_path_2)
+        self.pc.south_bridge.ide.disks = [disk0, disk2]
+
+    def createCacheHierarchy(self):
+        # Create an L3 cache (with crossbar)
+        self.l3bus = L2XBar(width = 64,
+                            snoop_filter = SnoopFilter(max_capacity='32MB'))
+
+        for cpu in self.cpu:
+            # Create a memory bus, a coherent crossbar, in this case
+            cpu.l2bus = L2XBar()
+
+            # Create an L1 instruction and data cache
+            cpu.icache = L1ICache(self._opts)
+            cpu.dcache = L1DCache(self._opts)
+            cpu.mmucache = MMUCache()
+
+            # Connect the instruction and data caches to the CPU
+            cpu.icache.connectCPU(cpu)
+            cpu.dcache.connectCPU(cpu)
+            cpu.mmucache.connectCPU(cpu)
+
+            # Hook the CPU ports up to the l2bus
+            cpu.icache.connectBus(cpu.l2bus)
+            cpu.dcache.connectBus(cpu.l2bus)
+            cpu.mmucache.connectBus(cpu.l2bus)
+
+            # Create an L2 cache and connect it to the l2bus
+            cpu.l2cache = L2Cache(self._opts)
+            cpu.l2cache.connectCPUSideBus(cpu.l2bus)
+
+            # Connect the L2 cache to the L3 bus
+            cpu.l2cache.connectMemSideBus(self.l3bus)
+
+        self.l3cache = L3Cache(self._opts)
+        self.l3cache.connectCPUSideBus(self.l3bus)
+
+        # Connect the L3 cache to the membus
+        self.l3cache.connectMemSideBus(self.membus)
+
+    def setupInterrupts(self):
+        for cpu in self.cpu:
+            # Create the CPU interrupt controller and connect it to the membus
+            cpu.createInterruptController()
+
+            # For x86 only, connect interrupts to the memory
+            # Note: these are directly connected to the memory bus and
+            #       not cached
+            cpu.interrupts[0].pio = self.membus.master
+            cpu.interrupts[0].int_master = self.membus.slave
+            cpu.interrupts[0].int_slave = self.membus.master
+
+    # Memory latency: Using the smaller number from [3]: 96ns
+    def createMemoryControllersDDR4(self):
+        self._createMemoryControllers(8, DDR4_2400_16x4)
+
+    def _createMemoryControllers(self, num, cls):
+        kernel_controller = self._createKernelMemoryController(cls)
+
+        ranges = self._getInterleaveRanges(self.mem_ranges[-1], num, 7, 20)
+
+        self.mem_cntrls = [
+            cls(range = ranges[i],
+                port = self.membus.master)
+            for i in range(num)
+        ] + [kernel_controller]
+
+    def _createKernelMemoryController(self, cls):
+        return cls(range = self.mem_ranges[0],
+                   port = self.membus.master)
+
+    def _getInterleaveRanges(self, rng, num, intlv_low_bit, xor_low_bit):
+        from math import log
+        bits = int(log(num, 2))
+        if 2**bits != num:
+            m5.fatal("Non-power of two number of memory controllers")
+
+        intlv_bits = bits
+        ranges = [
+            AddrRange(start=rng.start,
+                      end=rng.end,
+                      intlvHighBit = intlv_low_bit + intlv_bits - 1,
+                      xorHighBit = xor_low_bit + intlv_bits - 1,
+                      intlvBits = intlv_bits,
+                      intlvMatch = i)
+                for i in range(num)
+            ]
+
+        return ranges
+
+    def initFS(self, membus, cpus):
+        self.pc = Pc()
+        self.workload = X86FsLinux()
+
+        # Constants similar to x86_traits.hh
+        IO_address_space_base = 0x8000000000000000
+        pci_config_address_space_base = 0xc000000000000000
+        interrupts_address_space_base = 0xa000000000000000
+        APIC_range_size = 1 << 12
+
+        # North Bridge
+        self.iobus = IOXBar()
+        self.bridge = Bridge(delay='50ns')
+        self.bridge.master = self.iobus.slave
+        self.bridge.slave = membus.master
+        # Allow the bridge to pass through:
+        #  1) kernel configured PCI device memory map address: address range
+        #  [0xC0000000, 0xFFFF0000). (The upper 64kB are reserved for m5ops.)
+        #  2) the IO APIC address range (two pages, already
+        #     contained in 1),
+        #  3) everything in the IO address range up to the local APIC, and
+        #  4) then the entire PCI address space and beyond.
+        self.bridge.ranges = \
+            [
+            AddrRange(0xC0000000, 0xFFFF0000),
+            AddrRange(IO_address_space_base,
+                      interrupts_address_space_base - 1),
+            AddrRange(pci_config_address_space_base,
+                      Addr.max)
+            ]
+
+        # Create a bridge from the IO bus to the memory bus to allow access
+        # to the local APIC (two pages)
+        self.apicbridge = Bridge(delay='50ns')
+        self.apicbridge.slave = self.iobus.master
+        self.apicbridge.master = membus.slave
+        self.apicbridge.ranges = [AddrRange(interrupts_address_space_base,
+                                            interrupts_address_space_base +
+                                            cpus * APIC_range_size
+                                            - 1)]
+
+        # connect the io bus
+        self.pc.attachIO(self.iobus)
+
+        # Add a tiny cache to the IO bus.
+        # This cache is required for the classic memory model for coherence
+        self.iocache = Cache(assoc=8,
+                            tag_latency = 50,
+                            data_latency = 50,
+                            response_latency = 50,
+                            mshrs = 20,
+                            size = '1kB',
+                            tgts_per_mshr = 12,
+                            addr_ranges = self.mem_ranges)
+        self.iocache.cpu_side = self.iobus.master
+        self.iocache.mem_side = self.membus.slave
+
+        self.intrctrl = IntrControl()
+
+        ###############################################
+
+        # Add in a Bios information structure.
+        self.workload.smbios_table.structures = [X86SMBiosBiosInformation()]
+
+        # Set up the Intel MP table
+        base_entries = []
+        ext_entries = []
+        for i in range(cpus):
+            bp = X86IntelMPProcessor(
+                    local_apic_id = i,
+                    local_apic_version = 0x14,
+                    enable = True,
+                    bootstrap = (i ==0))
+            base_entries.append(bp)
+        io_apic = X86IntelMPIOAPIC(
+                id = cpus,
+                version = 0x11,
+                enable = True,
+                address = 0xfec00000)
+        self.pc.south_bridge.io_apic.apic_id = io_apic.id
+        base_entries.append(io_apic)
+        pci_bus = X86IntelMPBus(bus_id = 0, bus_type='PCI   ')
+        base_entries.append(pci_bus)
+        isa_bus = X86IntelMPBus(bus_id = 1, bus_type='ISA   ')
+        base_entries.append(isa_bus)
+        connect_busses = X86IntelMPBusHierarchy(bus_id=1,
+                subtractive_decode=True, parent_bus=0)
+        ext_entries.append(connect_busses)
+        pci_dev4_inta = X86IntelMPIOIntAssignment(
+                interrupt_type = 'INT',
+                polarity = 'ConformPolarity',
+                trigger = 'ConformTrigger',
+                source_bus_id = 0,
+                source_bus_irq = 0 + (4 << 2),
+                dest_io_apic_id = io_apic.id,
+                dest_io_apic_intin = 16)
+        base_entries.append(pci_dev4_inta)
+        def assignISAInt(irq, apicPin):
+            assign_8259_to_apic = X86IntelMPIOIntAssignment(
+                    interrupt_type = 'ExtInt',
+                    polarity = 'ConformPolarity',
+                    trigger = 'ConformTrigger',
+                    source_bus_id = 1,
+                    source_bus_irq = irq,
+                    dest_io_apic_id = io_apic.id,
+                    dest_io_apic_intin = 0)
+            base_entries.append(assign_8259_to_apic)
+            assign_to_apic = X86IntelMPIOIntAssignment(
+                    interrupt_type = 'INT',
+                    polarity = 'ConformPolarity',
+                    trigger = 'ConformTrigger',
+                    source_bus_id = 1,
+                    source_bus_irq = irq,
+                    dest_io_apic_id = io_apic.id,
+                    dest_io_apic_intin = apicPin)
+            base_entries.append(assign_to_apic)
+        assignISAInt(0, 2)
+        assignISAInt(1, 1)
+        for i in range(3, 15):
+            assignISAInt(i, i)
+        self.workload.intel_mp_table.base_entries = base_entries
+        self.workload.intel_mp_table.ext_entries = ext_entries
+
+        entries = \
+           [
+            # The first 639kB are usable; the rest of the first MB is reserved
+            X86E820Entry(addr = 0, size = '639kB', range_type = 1),
+            X86E820Entry(addr = 0x9fc00, size = '385kB', range_type = 2),
+            # Mark the rest of physical memory as available
+            X86E820Entry(addr = 0x100000,
+                    size = '%dB' % (self.mem_ranges[0].size() - 0x100000),
+                    range_type = 1),
+            ]
+        # Mark [mem_size, 3GB) as reserved if memory is less than 3GB, which
+        # forces IO devices to be mapped to [0xC0000000, 0xFFFF0000). Requests
+        # to this specific range can pass through the bridge to the iobus.
+        entries.append(X86E820Entry(addr = self.mem_ranges[0].size(),
+            size='%dB' % (0xC0000000 - self.mem_ranges[0].size()),
+            range_type=2))
+
+        # Reserve the last 64kB of the 32-bit address space for m5ops
+        entries.append(X86E820Entry(addr = 0xFFFF0000, size = '64kB',
+                                    range_type=2))
+
+        # Add the rest of memory. This is where all the actual data is
+        entries.append(X86E820Entry(addr = self.mem_ranges[-1].start,
+            size='%dB' % (self.mem_ranges[-1].size()),
+            range_type=1))
+
+        self.workload.e820_table.entries = entries
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/Changes.log b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/Changes.log
new file mode 100644
index 0000000..ccf61a3
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/Changes.log
@@ -0,0 +1,428 @@
+###########################################
+# Modification History of NPB3.x          #
+# ------------------------------          #
+#   NPB development team                  #
+#   NASA Ames Research Center             #
+#   npb@nas.nasa.gov                      #
+#   http://www.nas.nasa.gov/Software/NPB/ #
+###########################################
+
+------------------------------------------------------
+Changes in NPB3.3.1
+      (NPB3.3-SER, NPB3.3-OMP, NPB3.3-MPI )
+------------------------------------------------------
+[17-Feb-09]
+
+This is a bug fixing release of NPB3.3.
+
+1. All versions
+
+ - sys/setparams.c: fixed a problem in dealing with quoted (") flags
+   from make.def when producing npbparams.h for C.
+
+ - CG: ensure 'implicit none' used in all subroutines.
+
+2. MPI version
+
+ - Additional timers can be used for profiling purposes, similar
+   to those already included in the OMP and SER versions.
+
+ - LU:
+   * code clean up (suggested by Rob Van der Wijngaart)
+      > avoid using MPI_ANY_SOURCE in exchange_*.f, which might 
+        alter performance in some cases.
+      > delete references to sethyper and 'icomm*', which are 
+        no longer used since NPB2.2.
+   * change the low-bound limit on the sub-domain size in subdomain.f
+     from 4 to 3 in order to increase allowable process counts.
+   * allow number of processes other than power of two.
+
+ - FT: fix a non-portable way of broadcasting input parameters
+      (pointed out by Art Lazanoff)
+
+ - BT: include 'btio_cleanup' as part of the I/O timing
+
+3. OMP and SER versions
+
+ - DC: fix access to out-of-bound array elements in adc.c
+      Reported by Per Larsen of Denmark <pl@imm.dtu.dk>
+
+ - UA: fix the use of uninitialized array 'sje' in mortar_vertex() by
+      adding "call nr_init[_omp](sje,4*6*nelt,0)" in the main program.
+
+ - MG, UA: include additional timers for profiling purposes.
+
+ - Executables now use ".x" as a name extension
+
+
+------------------------------------------------------
+Changes in NPB3.3
+      (NPB3.3-SER, NPB3.3-OMP, NPB3.3-MPI )
+------------------------------------------------------
+[02-Aug-07]
+
+1. New features and improvements
+
+ - The Class E problem has been introduced in seven of the benchmarks
+   (BT, SP, LU, CG, MG, FT, and EP) in all three implementations.
+
+ - The Class D problem has been added to the IS benchmark in all 
+   three implementations.  It requires the compiler support of 
+   64-bit "long" type in C.  The MPI version of IS now allows runs 
+   up to 1024 processes.
+
+ - The Bucket Sort option (USE_BUCKETS) has been added to
+   the OpenMP version of IS and made as the default.
+
+ - Introduced the "twiddle" array in the OpenMP FT benchmark,
+   which has been used in the MPI and SER versions and seems 
+   to improve performance for larger problem sizes.
+
+ - Merged vector codes for the BT and LU benchmarks into
+   the release.
+
+ - Updates to BTIO (MPI/BT with IO subtypes):
+    * added I/O stats (I/O timing, data size written, I/O data rate)
+    * added an option for interleaving reads between writes through
+      the inputbt.data file.  Although the data file size would be
+      smaller as a result, the total amount of data written is still
+      the same.
+
+ - Made documents more consistent throughout different versions
+   (README and README.install).
+
+2. Bug fixes
+
+ - MPI/FT: fixed a verification failure for cases where NX/=NY 
+   and the 2D decomposition are used.  The bug occurred at least
+   for (Class D, NPROCS=2048) and (Class B, NPROCS=512).
+
+   fixed an output formatting problem that occurred when
+   the number of processes >= 1000.
+
+ - MPI/SP: fixed a performance regression due to improper
+   padding of array dimensions.
+
+ - MPI/IS: minor fix to support large processor counts (>=512).
+
+ - OMP/UA: fixed a race condition in mason.f, avoided the use 
+   of the LASTPRIVATE directive.
+
+ - OMP/LU: minor fix in data flushing for pipelining.
+
+ - DC: There are a number of fixes -
+   * fixed segmentation fault in both OMP and SER versions
+     caused by accessing zero-length array elements.
+     Reported by Jeff Odom <jodom@cs.umd.edu>.
+
+   * fixed a race in reporting benchmark timing in the OMP version
+
+   * fixed the use of timer in the OMP version, which limited
+     the number of threads to 64.  The limit on the number of threads
+     is now raised to a maximum of MAX_NUMBER_OF_TASKS (=256).
+
+   * made the benchmark output consistent with other NPBs.
+
+ - fixed a use of uninitialized variable in MPI/sys/setparams.c.
+   setparams in all three versions was updated to deal with 
+   make.def that contains carriage-return character ('\r').
+
+ - SER/FT: added 'implicit none' to all missing places.
+
+ - SER/IS: fixed missing variable declarations for the Bucket 
+   Sort option (when USE_BUCKETS is defined).
+
+3. Others
+
+ - The default value for collbuf_nodes in the BT I/O benchmark
+   is now set to 0, indicating no file hints will be used.
+   The setting can be changed by using the "inputbt.data" file.
+
+ - The hyperplane version of LU (LU-HP) is no longer included 
+   in the distribution.
+
+
+------------------------------------------------------
+Changes in NPB3.2.1
+      (NPB3.2-SER, NPB3.2-OMP, NPB3.2-MPI )
+------------------------------------------------------
+[27-Jul-05]
+
+This is a bug fixing release of NPB3.2.
+
+1. MPI version
+  - sys/setparams.c: removed a duplicated statement for writing
+      FT parameters and made an invalid SUBTYPE an error condition.
+      The 'duplicated statement' problem was fixed in NPB3.2 (See 
+      the note below).  However, during the final updating process, 
+      the fix was left out, even though the log file was updated.
+
+  - BT: included SUBTYPE=EPIO in the I/O verification.
+
+  - LU: bcast_inputs.f: fixed wrong data type (dp_type) used for 
+      communicating integers (nx0,ny0,nz0) with the correct type 
+      MPI_INTEGER.
+
+  - MG: fixed a mis-calculation of parameter "nr" in globals.h 
+      that caused run-time failure for NPROCS >= 512 
+      (reported by Donald Ferry of Cray).  Expanded the limit to
+      131072 processes and added error-checking code.
+
+      The use of MPI_ANY_SOURCE for MPI_Irecv inside subroutine
+      ready() could cause MPI_Wait to return a message meant for
+      the wrong k.  The problem is fixed with nbr(axis,-dir,k)
+      in place of MPI_ANY_SOURCE in the call to MPI_Irecv
+      (reported and suggested by Hideo Saito).
+
+2. OpenMP version
+  - EP: use THREADPRIVATE for working array storage. It should not
+      change performance but makes some compilers happier.
+
+  - LU: add variable "v" to FLUSH to ensure solution data properly 
+      flushed for pipeline.  This change is needed according to
+      the OpenMP 2.5 standard.
+
+  - IS: reorganized working buffers so that the count for key 
+      population could be more naturally performed.  This version
+      uses much less stack space.
+
+  - UA: implemented atomic updates with locks in order to achieve
+      better scaling on those systems that have an inefficient
+      (or even buggy) ATOMIC implementation.
+
+
+------------------------------------------------------
+Changes in NPB3.2
+      (NPB3.2-SER, NPB3.2-OMP, NPB3.2-MPI )
+------------------------------------------------------
+[07-Jan-05]
+
+1. DC version in NPB3.2-SER was converted to C from C++
+   (CLASSES S, W, A, B). 
+   sys/setparams.c file was changed appropriately.
+   
+2. OpenMP version of DC was added to NPB3.2-OMP.
+
+3. Data Traffic benchmark DT was added to NPB3.2-MPI.
+
+[24-May-04]
+
+All versions:
+   - use assumed shape "(*)" declaration in CG
+   - fixed the use of an uninitialized variable in EP
+   - avoid using integer array for assumed shape dimensions in FT
+   - fix in UA:
+      * fix the reference to file "inputua.data"
+      * avoid overindexing
+      * avoid reference to out-of-bound array elements
+      * change declaration "real*8" to "double precision"
+
+OMP version:
+   - explicitly added "SCHEDULE(STATIC)" to the OMP version
+   - use the "omp_get_wtime()" function for timer if available
+   - removed the call to "getenv" for portability
+   - change in UA:
+      * implemented an alternative approach for atomic update
+
+MPI version:
+   - removed a duplicated declaration in FT (from setparams.c)
+   - removed a duplicated declaration in BT/full_mpiio.f
+   - fixed a missing "NPROCS=" in sys/suite.awk
+
+
+------------------------------------------------------
+Changes in NPB3.1
+      (NPB3.1-MPI, NPB3.1-SER, NPB3.1-OMP)
+------------------------------------------------------
+[22-Apr-04] NPB3.1-MPI
+
+Merged the NPB2.4-MPI branch into NPB3.1 with the following changes.
+
+  - Optimized the BT memory usage.  The new version uses about 1/3 of
+    the memory used in NPB2.x.
+  - Fixed a bug in CG for running on a large number of processes
+  - Redefined the Class W size in MG so that the verification value
+    will not be too small. (see below for SER & OMP versions)
+  - Use the relative errors for verification in both CG and MG
+  - Fixed a race in 'make suite'
+
+[08-Apr-04] NPB3.1-SER and NPB3.1-OMP
+
+The following changes are made in both NPB3.1-SER and NPB3.1-OMP.
+
+1. Added the Class D problem
+   - verification values taken from NPB2.4-MPI
+   - modified variables to fit in large problem
+
+2. Improvements for LU and LU-HP:
+   - reduced the memory usage for the 'tv' variable in LU and LU-HP
+   - a more efficient memory access for variables "a,b,c,d" in LU-HP
+   - a dummy iteration added before the time step loop for consistency
+     with other benchmarks
+
+3. Improvement and fix in MG:
+   - verification in MG now uses the relative error
+     (instead of the absolute error).  This will avoid incorrect
+     verification for small reference values.
+   - redefined the class size for Class W so that the verification
+     value will not be too small.
+     In version 3.0 and earlier: 64x64x64,    40 iters
+     New size in version 3.1   : 128x128x128, 4 iters
+   - fixed incorrect verification values for Classes A and C.
+
+4. CG:
+   - use relative error for verification
+   - clean up codes for matrix initialization (makea).
+     The new code uses about 1/2 the memory of the previous version.
+
+5. Fixed makefile related issues
+   - fixed dependence on make.def for files in common.
+   - fixed a race in 'make suite'
+   - added 'LU-HP' as a valid benchmark option in makefiles
+
+The following changes are made in NPB3.1-OMP.
+
+1. Included a hyper-plane version of the LU benchmark: LU-HP
+   - based on the serial version
+
+2. The dummy 'omp_lib_dum' library is no longer used for compilation
+   without an OpenMP compiler. Conditional compilation is now used.
+
+3. Parallelization of the initialization part of MG.
+   It improves the turn-around time quite a bit for the larger
+   classes, such as class D.
+
+4. Parallelize codes for matrix initialization (makea) in CG.
+   The new code uses about 2/3 of the memory of the version in NPB3.0-OMP.
+
+5. Code clean up in SP so that the structure is more consistent
+   with the serial version.
+
+
+
+------------------------------------------------------
+Changes in NPB2.x MPI version
+------------------------------------------------------
+
+Changes in 2.4.1
+- fixed error in BT/Makefile (replaced "==" with "=")
+- added stub function accumulate_norms in BT/btio.f
+- changed type of Class B verification constants in BT/verify.f from 
+  single to double precision
+                                                       
+Changes in 2.4
+- Added I/O benchmark (subtype of BT).
+- Added Class D for all benchmarks except IS.
+- Reduced size of tabulated exponentials in FT.
+- Made minor changes to FT to prevent integer overflow for class D on 
+  systems with 32-bit integers. FT class D will not run on small 
+  numbers of processors anymore.
+
+
+------------------------------------------------------
+Changes in non-MPI versions of NPB (previously PBN3.0)
+      (NPB3.0-SER, NPB3.0-HPF, NPB3.0-OMP, NPB3.0-JAV)
+------------------------------------------------------
+
+[01-Mar-99] Initial Beta Release.
+
+[06-Apr-99] Based on report from Charles Grassl and Ramesh Menon (SGI).
+
+   1. NPB-SER, FT: file auxfnct.f -
+      lines 74 and 75 were interchanged:
+
+      double complex u0(d1+1,d2,d3), tmp(maxdim)
+      integer d1,d2,d3
+
+   2. NPB-OMP: The OpenMP standard requires reduction variables to be scalars;
+      changes were made to remove the use of array variables for reduction.
+      Relevant modifications in EP, CG, LU, SP, and BT
+
+   3. NPB-OMP: Remove compiler warnings of "Referenced scalar variables 
+      use defaults" by declaring explicitly as shared.
+      Relevant modifications in FT, LU, and BT
+
+   4. NPB-OMP, README.openmp: Explicitly spell out the requirement of
+      the static scheduling (setenv OMP_SCHEDULE "static").
+
+
+[05-Oct-99] NPB3.0-non-MPI Beta Release (02)
+
+General change to all (NPB-SER, NPB-HPF, NPB-OMP) -
+   1. Update header information for all benchmarks.
+
+   2. Allow continuation lines in 'make.def' (modification done
+      in sys/setparams.c).
+
+Change made in NPB-OMP -
+   1. 'print_results' now prints Number-Of-Threads and Mflops/s/thread.
+      The printed number is the number of threads active during the run,
+      which may not be the same as what was requested.
+
+   2. An initial data touch loop for array A is added in CG.
+
+   3. 'CRITICAL' section is used for reduction with array.
+      Relevant changes in EP, CG, LU, SP, and BT.
+
+   4. Reconfigure 'make.def' such that 'omp_lib_dum' can be activated
+      from the file for no directive compilation.
+
+   5. The "!$OMP END DO" seems needed before "!$OMP MASTER" in rhs.f
+      for both BT and SP for some f90 compilers.
+
+   6. "SCHEDULE(STATIC)" is used for the pipeline in LU to ensure
+      compliance with the OMP standard.
+
+Change made in NPB-HPF -
+   1. 'print_results' now prints Number-Of-Processes and Mflops/s/process.
+
+   2. Use more consistent output format (via print_results).
+
+   3. More consistent makefiles (via config/make.def).
+
+
+[04-Apr-00] NPB3.0-non-MPI Beta Release (03)
+
+Change made in NPB-OMP -
+   1. The OpenMP-C version of IS has been added, including more timers.
+
+   2. 'cprint_results' includes Number-Of-Threads and Mflops/s/thread.
+
+Change made in NPB-SER -
+   1. More timers included in IS.
+
+NPB-JAV has been included in NPB3.0-non-MPI.
+
+
+[31-May-01] NPB3.0-non-MPI Beta Release (04)
+
+Change made in NPB-OMP -
+   1. NPB-OMP/LU: Failure in verification for number of threads greater 
+      than the problem size is now fixed.
+
+   2. If OMP_NUM_THREADS is unset, the printout will report as "unset"
+      instead of "1"
+
+   3. NPB-OMP/IS: Allocating work_buff on the stack seems to cause problems
+      for large problem sizes (CLASS C).  "work_buff" is now allocated
+      by "malloc" on the heap for CLASS C.
+
+   4. NPB-OMP/IS: Reported by <RaeLyn.Crowell@compaq.com> - potential
+      synchronization problem could arise due to the use of "static"
+      variables inside "randlc()".  Declarations of these static variables
+      are moved out of randlc() and put in the threadprivate directive.
+
+General change to all (NPB-SER, NPB-HPF, NPB-OMP) -
+   1. Cleanup in makefiles
+
+
+[28-Aug-02] The Official NPB3.0 Release
+
+Change made in all -
+   1. Fixed a bogus verification for "NaN".
+
+   2. Name change from "PBN3.0" to "NPB3.0". Updated all the banners.
+
+   3. NPB-SER/FT: use a derived version from NPB2.3-serial.
+
+   4. NPB-HPF/FT: use a consistent printing format.
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-HPF.README b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-HPF.README
new file mode 100644
index 0000000..ff1e508
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-HPF.README
@@ -0,0 +1,4 @@
+The HPF version of NPB is not included in this distribution.
+Please download it from NPB3.0 instead.
+
+http://www.nas.nasa.gov/Software/NPB
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-JAV.README b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-JAV.README
new file mode 100644
index 0000000..b36e686
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-JAV.README
@@ -0,0 +1,4 @@
+The Java version of NPB is not included in this distribution.
+Please download it from NPB3.0 instead.
+
+http://www.nas.nasa.gov/Software/NPB
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/Makefile b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/Makefile
new file mode 100644
index 0000000..dd27503
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/Makefile
@@ -0,0 +1,106 @@
+SHELL=/bin/sh
+BENCHMARK=bt
+BENCHMARKU=BT
+VEC=
+
+include ../config/make.def
+
+
+OBJS = bt.o make_set.o initialize.o exact_solution.o exact_rhs.o \
+       set_constants.o adi.o define.o copy_faces.o rhs.o solve_subs.o \
+       x_solve$(VEC).o y_solve$(VEC).o z_solve$(VEC).o add.o error.o \
+       verify.o setup_mpi.o \
+       ${COMMON}/print_results.o ${COMMON}/timers.o
+
+include ../sys/make.common
+
+# npbparams.h is included by header.h
+# The following rule should do the trick but many make programs (not gmake)
+# will do the wrong thing and rebuild the world every time (because the
+# mod time on header.h is not changed). One solution would be to
+# touch header.h but this might cause confusion if someone has
+# accidentally deleted it. Instead, make the dependency on npbparams.h
+# explicit in all the lines below (even though dependence is indirect). 
+
+# header.h: npbparams.h
+
+${PROGRAM}: config
+	@if [ x$(VERSION) = xvec ] ; then	\
+		${MAKE} VEC=_vec exec;		\
+	elif [ x$(VERSION) = xVEC ] ; then	\
+		${MAKE} VEC=_vec exec;		\
+	else					\
+		${MAKE} exec;			\
+	fi
+
+exec: $(OBJS)
+	@if [ x$(SUBTYPE) = xfull ] ; then	\
+		${MAKE} bt-full;		\
+	elif [ x$(SUBTYPE) = xFULL ] ; then	\
+		${MAKE} bt-full;		\
+	elif [ x$(SUBTYPE) = xsimple ] ; then	\
+		${MAKE} bt-simple;		\
+	elif [ x$(SUBTYPE) = xSIMPLE ] ; then	\
+		${MAKE} bt-simple;		\
+	elif [ x$(SUBTYPE) = xfortran ] ; then	\
+		${MAKE} bt-fortran;		\
+	elif [ x$(SUBTYPE) = xFORTRAN ] ; then	\
+		${MAKE} bt-fortran;		\
+	elif [ x$(SUBTYPE) = xepio ] ; then	\
+		${MAKE} bt-epio;		\
+	elif [ x$(SUBTYPE) = xEPIO ] ; then	\
+		${MAKE} bt-epio;		\
+	else					\
+		${MAKE} bt-bt;			\
+	fi
+
+bt-bt: ${OBJS} btio.o
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} btio.o ${FMPI_LIB}
+
+bt-full: ${OBJS} full_mpiio.o btio_common.o
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM}.mpi_io_full ${OBJS} btio_common.o full_mpiio.o ${FMPI_LIB}
+
+bt-simple: ${OBJS} simple_mpiio.o btio_common.o
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM}.mpi_io_simple ${OBJS} btio_common.o simple_mpiio.o ${FMPI_LIB}
+
+bt-fortran: ${OBJS} fortran_io.o btio_common.o
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM}.fortran_io ${OBJS} btio_common.o fortran_io.o ${FMPI_LIB}
+
+bt-epio: ${OBJS} epio.o btio_common.o
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM}.ep_io ${OBJS} btio_common.o epio.o ${FMPI_LIB}
+
+.f.o:
+	${FCOMPILE} $<
+
+.c.o:
+	${CCOMPILE} $<
+
+
+bt.o:             bt.f  header.h npbparams.h  mpinpb.h
+make_set.o:       make_set.f  header.h npbparams.h  mpinpb.h
+initialize.o:     initialize.f  header.h npbparams.h
+exact_solution.o: exact_solution.f  header.h npbparams.h
+exact_rhs.o:      exact_rhs.f  header.h npbparams.h
+set_constants.o:  set_constants.f  header.h npbparams.h
+adi.o:            adi.f  header.h npbparams.h
+define.o:         define.f  header.h npbparams.h
+copy_faces.o:     copy_faces.f  header.h npbparams.h  mpinpb.h
+rhs.o:            rhs.f  header.h npbparams.h
+x_solve$(VEC).o:  x_solve$(VEC).f  header.h work_lhs$(VEC).h npbparams.h  mpinpb.h
+y_solve$(VEC).o:  y_solve$(VEC).f  header.h work_lhs$(VEC).h npbparams.h  mpinpb.h
+z_solve$(VEC).o:  z_solve$(VEC).f  header.h work_lhs$(VEC).h npbparams.h  mpinpb.h
+solve_subs.o:     solve_subs.f  npbparams.h
+add.o:            add.f  header.h npbparams.h
+error.o:          error.f  header.h npbparams.h  mpinpb.h
+verify.o:         verify.f  header.h npbparams.h  mpinpb.h
+setup_mpi.o:      setup_mpi.f mpinpb.h npbparams.h 
+btio.o:           btio.f  header.h npbparams.h
+btio_common.o:    btio_common.f mpinpb.h npbparams.h 
+fortran_io.o:     fortran_io.f mpinpb.h npbparams.h 
+simple_mpiio.o:   simple_mpiio.f mpinpb.h npbparams.h 
+full_mpiio.o:     full_mpiio.f mpinpb.h npbparams.h 
+epio.o:           epio.f mpinpb.h npbparams.h 
+
+clean:
+	- rm -f *.o *~ mputil*
+	- rm -f  npbparams.h core
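The `exec` rule in the BT Makefile above picks a link target from `SUBTYPE`, and each target writes a differently named executable. The selection can be sketched in Python as below. This is only an illustration of the shell `if`/`elif` chain: the function name `bt_link_target` is hypothetical, and the `program` argument stands in for `${PROGRAM}`, which is defined in an included file that is not shown here.

```python
from typing import Tuple


def bt_link_target(subtype: str = "", program: str = "bt") -> Tuple[str, str]:
    """Return (make target, output executable) for a given SUBTYPE value."""
    targets = {
        "full": ("bt-full", ".mpi_io_full"),
        "simple": ("bt-simple", ".mpi_io_simple"),
        "fortran": ("bt-fortran", ".fortran_io"),
        "epio": ("bt-epio", ".ep_io"),
    }
    # The Makefile tests only the all-lowercase and all-uppercase spellings;
    # anything else (including mixed case) falls through to plain BT.
    if subtype in targets or (subtype.isupper() and subtype.lower() in targets):
        target, suffix = targets[subtype.lower()]
        return target, program + suffix
    return "bt-bt", program


print(bt_link_target("FULL"))  # ('bt-full', 'bt.mpi_io_full')
print(bt_link_target("Full"))  # ('bt-bt', 'bt')
```

Note that a mixed-case value such as `Full` silently builds the plain BT benchmark, because the shell tests compare against exact strings.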
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/add.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/add.f
new file mode 100644
index 0000000..e14cde4
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/add.f
@@ -0,0 +1,30 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine  add
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     addition of update to the vector u
+c---------------------------------------------------------------------
+
+      include 'header.h'
+
+      integer  c, i, j, k, m
+
+      do     c = 1, ncells
+         do     k = start(3,c), cell_size(3,c)-end(3,c)-1
+            do     j = start(2,c), cell_size(2,c)-end(2,c)-1
+               do     i = start(1,c), cell_size(1,c)-end(1,c)-1
+                  do    m = 1, 5
+                     u(m,i,j,k,c) = u(m,i,j,k,c) + rhs(m,i,j,k,c)
+                  enddo
+               enddo
+            enddo
+         enddo
+      enddo
+
+      return
+      end
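The `add` subroutine above adds `rhs` to `u` at every point the loops visit, for all five solution components. A rough Python sketch of the same update for a single cell follows. Several things here are assumptions made for illustration: the arrays are nested lists indexed `[k][j][i][m]` rather than Fortran's `(m,i,j,k,c)`, the spatial indices are treated as 0-based, and `start` and `end` are read as per-axis counts of points to skip at the low and high faces, which is what the loop bounds suggest but is not confirmed by the code shown.

```python
def add_update(u, rhs, start, end, cell_size):
    """In place: u += rhs over the points visited by the Fortran loops."""
    # Fortran's inclusive bound cell_size - end - 1 becomes an exclusive
    # Python range bound of cell_size - end.
    for k in range(start[2], cell_size[2] - end[2]):
        for j in range(start[1], cell_size[1] - end[1]):
            for i in range(start[0], cell_size[0] - end[0]):
                for m in range(5):
                    u[k][j][i][m] += rhs[k][j][i][m]


def filled(n, value):
    """Build an n x n x n grid of 5-component points set to `value`."""
    return [[[[value] * 5 for _ in range(n)] for _ in range(n)]
            for _ in range(n)]


u = filled(4, 0.0)
rhs = filled(4, 1.0)
# Skip the low face in i and the high face in j; keep everything in k.
add_update(u, rhs, start=(1, 0, 0), end=(0, 1, 0), cell_size=(4, 4, 4))

total = sum(v for plane in u for row in plane for point in row for v in point)
print(total)  # 180.0
```

With these bounds the update touches 3 x 3 x 4 points with 5 components each, so 180 values change and the skipped faces stay at zero.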
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/adi.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/adi.f
new file mode 100644
index 0000000..58450c0
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/adi.f
@@ -0,0 +1,21 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine  adi
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      call copy_faces
+
+      call x_solve
+
+      call y_solve
+
+      call z_solve
+
+      call add
+
+      return
+      end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/bt.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/bt.f
new file mode 100644
index 0000000..c64d0a3
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/bt.f
@@ -0,0 +1,328 @@
+!-------------------------------------------------------------------------!
+!                                                                         !
+!        N  A  S     P A R A L L E L     B E N C H M A R K S  3.3         !
+!                                                                         !
+!                                   B T                                   !
+!                                                                         !
+!-------------------------------------------------------------------------!
+!                                                                         !
+!    This benchmark is part of the NAS Parallel Benchmark 3.3 suite.      !
+!    It is described in NAS Technical Reports 95-020 and 02-007.          !
+!                                                                         !
+!    Permission to use, copy, distribute and modify this software         !
+!    for any purpose with or without fee is hereby granted.  We           !
+!    request, however, that all derived work reference the NAS            !
+!    Parallel Benchmarks 3.3. This software is provided "as is"           !
+!    without express or implied warranty.                                 !
+!                                                                         !
+!    Information on NPB 3.3, including the technical report, the          !
+!    original specifications, source code, results and information        !
+!    on how to submit new results, is available at:                       !
+!                                                                         !
+!           http://www.nas.nasa.gov/Software/NPB/                         !
+!                                                                         !
+!    Send comments or suggestions to  npb@nas.nasa.gov                    !
+!                                                                         !
+!          NAS Parallel Benchmarks Group                                  !
+!          NASA Ames Research Center                                      !
+!          Mail Stop: T27A-1                                              !
+!          Moffett Field, CA   94035-1000                                 !
+!                                                                         !
+!          E-mail:  npb@nas.nasa.gov                                      !
+!          Fax:     (650) 604-3957                                        !
+!                                                                         !
+!-------------------------------------------------------------------------!
+
+c---------------------------------------------------------------------
+c
+c Authors: R. F. Van der Wijngaart
+c          T. Harris
+c          M. Yarrow
+c
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+       program MPBT
+c---------------------------------------------------------------------
+
+       include  'header.h'
+       include  'mpinpb.h'
+      
+       integer i, niter, step, c, error, fstatus
+       double precision navg, mflops, mbytes, n3
+
+       external timer_read
+       double precision t, tmax, tiominv, tpc, timer_read
+       logical verified
+       character class, cbuff*40
+       double precision t1(t_last+2), tsum(t_last+2), 
+     >                  tming(t_last+2), tmaxg(t_last+2)
+       character        t_recs(t_last+2)*8
+
+       integer wr_interval
+
+       data t_recs/'total', 'i/o', 'rhs', 'xsolve', 'ysolve', 'zsolve', 
+     >             'bpack', 'exch', 'xcomm', 'ycomm', 'zcomm',
+     >             ' totcomp', ' totcomm'/
+
+       call setup_mpi
+       if (.not. active) goto 999
+
+c---------------------------------------------------------------------
+c      Root node reads input file (if it exists) else takes
+c      defaults from parameters
+c---------------------------------------------------------------------
+       if (node .eq. root) then
+          
+          write(*, 1000)
+
+          open (unit=2,file='timer.flag',status='old',iostat=fstatus)
+          timeron = .false.
+          if (fstatus .eq. 0) then
+             timeron = .true.
+             close(2)
+          endif
+
+          open (unit=2,file='inputbt.data',status='old', iostat=fstatus)
+c
+          rd_interval = 0
+          if (fstatus .eq. 0) then
+            write(*,233) 
+ 233        format(' Reading from input file inputbt.data')
+            read (2,*) niter
+            read (2,*) dt
+            read (2,*) grid_points(1), grid_points(2), grid_points(3)
+            if (iotype .ne. 0) then
+                read (2,'(A)') cbuff
+                read (cbuff,*,iostat=i) wr_interval, rd_interval
+                if (i .ne. 0) rd_interval = 0
+                if (wr_interval .le. 0) wr_interval = wr_default
+            endif
+            if (iotype .eq. 1) then
+                read (2,*) collbuf_nodes, collbuf_size
+                write(*,*) 'collbuf_nodes ', collbuf_nodes
+                write(*,*) 'collbuf_size  ', collbuf_size
+            endif
+            close(2)
+          else
+            write(*,234) 
+            niter = niter_default
+            dt    = dt_default
+            grid_points(1) = problem_size
+            grid_points(2) = problem_size
+            grid_points(3) = problem_size
+            wr_interval = wr_default
+            if (iotype .eq. 1) then
+c             set number of nodes involved in collective buffering to 4,
+c             unless total number of nodes is smaller than that.
+c             set buffer size for collective buffering to 1MB per node
+c             collbuf_nodes = min(4,no_nodes)
+c             set default to No-File-Hints with a value of 0
+              collbuf_nodes = 0
+              collbuf_size = 1000000
+            endif
+          endif
+ 234      format(' No input file inputbt.data. Using compiled defaults')
+
+          write(*, 1001) grid_points(1), grid_points(2), grid_points(3)
+          write(*, 1002) niter, dt
+          if (no_nodes .ne. total_nodes) write(*, 1004) total_nodes
+          if (no_nodes .ne. maxcells*maxcells) 
+     >        write(*, 1005) maxcells*maxcells
+          write(*, 1003) no_nodes
+
+          if (iotype .eq. 1) write(*, 1006) 'FULL MPI-IO', wr_interval
+          if (iotype .eq. 2) write(*, 1006) 'SIMPLE MPI-IO', wr_interval
+          if (iotype .eq. 3) write(*, 1006) 'EPIO', wr_interval
+          if (iotype .eq. 4) write(*, 1006) 'FORTRAN IO', wr_interval
+
+ 1000 format(//, ' NAS Parallel Benchmarks 3.3 -- BT Benchmark ',/)
+ 1001     format(' Size: ', i4, 'x', i4, 'x', i4)
+ 1002     format(' Iterations: ', i4, '    dt: ', F11.7)
+ 1004     format(' Total number of processes: ', i5)
+ 1005     format(' WARNING: compiled for ', i5, ' processes ')
+ 1003     format(' Number of active processes: ', i5, /)
+ 1006     format(' BTIO -- ', A, ' write interval: ', i3 /)
+
+       endif
+
+       call mpi_bcast(niter, 1, MPI_INTEGER,
+     >                root, comm_setup, error)
+
+       call mpi_bcast(dt, 1, dp_type, 
+     >                root, comm_setup, error)
+
+       call mpi_bcast(grid_points(1), 3, MPI_INTEGER, 
+     >                root, comm_setup, error)
+
+       call mpi_bcast(wr_interval, 1, MPI_INTEGER,
+     >                root, comm_setup, error)
+
+       call mpi_bcast(rd_interval, 1, MPI_INTEGER,
+     >                root, comm_setup, error)
+
+       call mpi_bcast(timeron, 1, MPI_LOGICAL, 
+     >                root, comm_setup, error)
+
+       call make_set
+
+       do  c = 1, maxcells
+          if ( (cell_size(1,c) .gt. IMAX) .or.
+     >         (cell_size(2,c) .gt. JMAX) .or.
+     >         (cell_size(3,c) .gt. KMAX) ) then
+             print *,node, c, (cell_size(i,c),i=1,3)
+             print *,' Problem size too big for compiled array sizes'
+             goto 999
+          endif
+       end do
+
+       do  i = 1, t_last
+          call timer_clear(i)
+       end do
+
+       call set_constants
+
+       call initialize
+
+       call setup_btio
+       idump = 0
+
+       call lhsinit
+
+       call exact_rhs
+
+       call compute_buffer_size(5)
+
+c---------------------------------------------------------------------
+c      do one time step to touch all code, and reinitialize
+c---------------------------------------------------------------------
+       call adi
+       call initialize
+
+c---------------------------------------------------------------------
+c      Synchronize before placing time stamp
+c---------------------------------------------------------------------
+       do  i = 1, t_last
+          call timer_clear(i)
+       end do
+       call mpi_barrier(comm_setup, error)
+
+       call timer_start(1)
+
+       do  step = 1, niter
+
+          if (node .eq. root) then
+             if (mod(step, 20) .eq. 0 .or. step .eq. niter .or.
+     >           step .eq. 1) then
+                write(*, 200) step
+ 200            format(' Time step ', i4)
+             endif
+          endif
+
+          call adi
+
+          if (iotype .ne. 0) then
+              if (mod(step, wr_interval).eq.0 .or. step .eq. niter) then
+                  if (node .eq. root) then
+                      print *, 'Writing data set, time step', step
+                  endif
+                  if (step .eq. niter .and. rd_interval .gt. 1) then
+                      rd_interval = 1
+                  endif
+                  call timer_start(2)
+                  call output_timestep
+                  call timer_stop(2)
+                  idump = idump + 1
+              endif
+          endif
+       end do
+
+       call timer_start(2)
+       call btio_cleanup
+       call timer_stop(2)
+
+       call timer_stop(1)
+       t = timer_read(1)
+
+       call verify(niter, class, verified)
+
+       call mpi_reduce(t, tmax, 1, 
+     >                 dp_type, MPI_MAX, 
+     >                 root, comm_setup, error)
+
+       if (iotype .ne. 0) then
+          t = timer_read(2)
+          if (t .ne. 0.d0) t = 1.0d0 / t
+          call mpi_reduce(t, tiominv, 1, 
+     >                    dp_type, MPI_SUM, 
+     >                    root, comm_setup, error)
+       endif
+
+       if( node .eq. root ) then
+          n3 = 1.0d0*grid_points(1)*grid_points(2)*grid_points(3)
+          navg = (grid_points(1)+grid_points(2)+grid_points(3))/3.0
+          if( tmax .ne. 0. ) then
+             mflops = 1.0e-6*float(niter)*
+     >     (3478.8*n3-17655.7*navg**2+28023.7*navg)
+     >     / tmax
+          else
+             mflops = 0.0
+          endif
+
+          if (iotype .ne. 0) then
+             mbytes = n3 * 40.0 * idump * 1.0d-6
+             tiominv = tiominv / no_nodes
+             t = 0.0
+             if (tiominv .ne. 0.) t = 1.d0 / tiominv
+             tpc = 0.0
+             if (tmax .ne. 0.) tpc = t * 100.0 / tmax
+             write(*,1100) t, tpc, mbytes, mbytes*tiominv
+ 1100        format(/' BTIO -- statistics:'/
+     >               '   I/O timing in seconds   : ', f14.2/
+     >               '   I/O timing percentage   : ', f14.2/
+     >               '   Total data written (MB) : ', f14.2/
+     >               '   I/O data rate  (MB/sec) : ', f14.2)
+          endif
+
+         call print_results('BT', class, grid_points(1), 
+     >     grid_points(2), grid_points(3), niter, maxcells*maxcells, 
+     >     total_nodes, tmax, mflops, '          floating point', 
+     >     verified, npbversion,compiletime, cs1, cs2, cs3, cs4, cs5, 
+     >     cs6, '(none)')
+       endif
+
+       if (.not.timeron) goto 999
+
+       do i = 1, t_last
+          t1(i) = timer_read(i)
+       end do
+       t1(t_xsolve) = t1(t_xsolve) - t1(t_xcomm)
+       t1(t_ysolve) = t1(t_ysolve) - t1(t_ycomm)
+       t1(t_zsolve) = t1(t_zsolve) - t1(t_zcomm)
+       t1(t_last+2) = t1(t_xcomm)+t1(t_ycomm)+t1(t_zcomm)+t1(t_exch)
+       t1(t_last+1) = t1(t_total)  - t1(t_last+2)
+
+       call MPI_Reduce(t1, tsum,  t_last+2, dp_type, MPI_SUM, 
+     >                 0, comm_setup, error)
+       call MPI_Reduce(t1, tming, t_last+2, dp_type, MPI_MIN, 
+     >                 0, comm_setup, error)
+       call MPI_Reduce(t1, tmaxg, t_last+2, dp_type, MPI_MAX, 
+     >                 0, comm_setup, error)
+
+       if (node .eq. 0) then
+          write(*, 800) total_nodes
+          do i = 1, t_last+2
+             tsum(i) = tsum(i) / total_nodes
+             write(*, 810) i, t_recs(i), tming(i), tmaxg(i), tsum(i)
+          end do
+       endif
+ 800   format(' nprocs =', i6, 11x, 'minimum', 5x, 'maximum', 
+     >        5x, 'average')
+ 810   format(' timer ', i2, '(', A8, ') :', 3(2x,f10.4))
+
+ 999   continue
+       call mpi_barrier(MPI_COMM_WORLD, error)
+       call mpi_finalize(error)
+
+       end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/btio.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/btio.f
new file mode 100644
index 0000000..1fb730b
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/btio.f
@@ -0,0 +1,72 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine setup_btio
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine output_timestep
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine btio_cleanup
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine btio_verify(verified)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      logical verified
+
+      verified = .true.
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine accumulate_norms(xce_acc)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      double precision xce_acc(5)
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine checksum_timestep
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/btio_common.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/btio_common.f
new file mode 100644
index 0000000..9227a12
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/btio_common.f
@@ -0,0 +1,30 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine clear_timestep
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer cio, kio, jio, ix
+
+      do cio=1,ncells
+          do kio=0, cell_size(3,cio)-1
+              do jio=0, cell_size(2,cio)-1
+                  do ix=0,cell_size(1,cio)-1
+                            u(1,ix, jio,kio,cio) = 0
+                            u(2,ix, jio,kio,cio) = 0
+                            u(3,ix, jio,kio,cio) = 0
+                            u(4,ix, jio,kio,cio) = 0
+                            u(5,ix, jio,kio,cio) = 0
+                  enddo
+              enddo
+          enddo
+      enddo
+
+      return
+      end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/copy_faces.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/copy_faces.f
new file mode 100644
index 0000000..5261d30
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/copy_faces.f
@@ -0,0 +1,322 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine copy_faces
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     
+c This subroutine copies the face values of a variable defined on a set 
+c of cells to the overlap locations of the adjacent sets of cells. 
+c Because a set of cells interfaces in each direction with exactly one 
+c other set, we only need to fill six different buffers. We could try to 
+c overlap communication with computation, by computing
+c some internal values while communicating boundary values, but this
+c adds so much overhead that it's not clearly useful. 
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer i, j, k, c, m, requests(0:11), p0, p1, 
+     >     p2, p3, p4, p5, b_size(0:5), ss(0:5), 
+     >     sr(0:5), error, statuses(MPI_STATUS_SIZE, 0:11)
+
+c---------------------------------------------------------------------
+c     exit immediately if there are no faces to be copied           
+c---------------------------------------------------------------------
+      if (no_nodes .eq. 1) then
+         call compute_rhs
+         return
+      endif
+
+      ss(0) = start_send_east
+      ss(1) = start_send_west
+      ss(2) = start_send_north
+      ss(3) = start_send_south
+      ss(4) = start_send_top
+      ss(5) = start_send_bottom
+
+      sr(0) = start_recv_east
+      sr(1) = start_recv_west
+      sr(2) = start_recv_north
+      sr(3) = start_recv_south
+      sr(4) = start_recv_top
+      sr(5) = start_recv_bottom
+
+      b_size(0) = east_size   
+      b_size(1) = west_size   
+      b_size(2) = north_size  
+      b_size(3) = south_size  
+      b_size(4) = top_size    
+      b_size(5) = bottom_size 
+
+c---------------------------------------------------------------------
+c     because the difference stencil for the diagonalized scheme is 
+c     orthogonal, we do not have to perform the staged copying of faces, 
+c     but can send all face information simultaneously to the neighboring 
+c     cells in all directions          
+c---------------------------------------------------------------------
+      if (timeron) call timer_start(t_bpack)
+      p0 = 0
+      p1 = 0
+      p2 = 0
+      p3 = 0
+      p4 = 0
+      p5 = 0
+
+      do  c = 1, ncells
+
+c---------------------------------------------------------------------
+c     fill the buffer to be sent to eastern neighbors (i-dir)
+c---------------------------------------------------------------------
+         if (cell_coord(1,c) .ne. ncells) then
+            do   k = 0, cell_size(3,c)-1
+               do   j = 0, cell_size(2,c)-1
+                  do   i = cell_size(1,c)-2, cell_size(1,c)-1
+                     do   m = 1, 5
+                        out_buffer(ss(0)+p0) = u(m,i,j,k,c)
+                        p0 = p0 + 1
+                     end do
+                  end do
+               end do
+            end do
+         endif
+
+c---------------------------------------------------------------------
+c     fill the buffer to be sent to western neighbors 
+c---------------------------------------------------------------------
+         if (cell_coord(1,c) .ne. 1) then
+            do   k = 0, cell_size(3,c)-1
+               do   j = 0, cell_size(2,c)-1
+                  do   i = 0, 1
+                     do   m = 1, 5
+                        out_buffer(ss(1)+p1) = u(m,i,j,k,c)
+                        p1 = p1 + 1
+                     end do
+                  end do
+               end do
+            end do
+
+         endif
+
+c---------------------------------------------------------------------
+c     fill the buffer to be sent to northern neighbors (j-dir)
+c---------------------------------------------------------------------
+         if (cell_coord(2,c) .ne. ncells) then
+            do   k = 0, cell_size(3,c)-1
+               do   j = cell_size(2,c)-2, cell_size(2,c)-1
+                  do   i = 0, cell_size(1,c)-1
+                     do   m = 1, 5
+                        out_buffer(ss(2)+p2) = u(m,i,j,k,c)
+                        p2 = p2 + 1
+                     end do
+                  end do
+               end do
+            end do
+         endif
+
+c---------------------------------------------------------------------
+c     fill the buffer to be sent to southern neighbors 
+c---------------------------------------------------------------------
+         if (cell_coord(2,c).ne. 1) then
+            do   k = 0, cell_size(3,c)-1
+               do   j = 0, 1
+                  do   i = 0, cell_size(1,c)-1   
+                     do   m = 1, 5
+                        out_buffer(ss(3)+p3) = u(m,i,j,k,c)
+                        p3 = p3 + 1
+                     end do
+                  end do
+               end do
+            end do
+         endif
+
+c---------------------------------------------------------------------
+c     fill the buffer to be sent to top neighbors (k-dir)
+c---------------------------------------------------------------------
+         if (cell_coord(3,c) .ne. ncells) then
+            do   k = cell_size(3,c)-2, cell_size(3,c)-1
+               do   j = 0, cell_size(2,c)-1
+                  do   i = 0, cell_size(1,c)-1
+                     do   m = 1, 5
+                        out_buffer(ss(4)+p4) = u(m,i,j,k,c)
+                        p4 = p4 + 1
+                     end do
+                  end do
+               end do
+            end do
+         endif
+
+c---------------------------------------------------------------------
+c     fill the buffer to be sent to bottom neighbors
+c---------------------------------------------------------------------
+         if (cell_coord(3,c).ne. 1) then
+            do    k=0, 1
+               do   j = 0, cell_size(2,c)-1
+                  do   i = 0, cell_size(1,c)-1
+                     do   m = 1, 5
+                        out_buffer(ss(5)+p5) = u(m,i,j,k,c)
+                        p5 = p5 + 1
+                     end do
+                  end do
+               end do
+            end do
+         endif
+
+c---------------------------------------------------------------------
+c     end of cell loop
+c---------------------------------------------------------------------
+      end do
+      if (timeron) call timer_stop(t_bpack)
+
+      if (timeron) call timer_start(t_exch)
+      call mpi_irecv(in_buffer(sr(0)), b_size(0), 
+     >     dp_type, successor(1), WEST,  
+     >     comm_rhs, requests(0), error)
+      call mpi_irecv(in_buffer(sr(1)), b_size(1), 
+     >     dp_type, predecessor(1), EAST,  
+     >     comm_rhs, requests(1), error)
+      call mpi_irecv(in_buffer(sr(2)), b_size(2), 
+     >     dp_type, successor(2), SOUTH, 
+     >     comm_rhs, requests(2), error)
+      call mpi_irecv(in_buffer(sr(3)), b_size(3), 
+     >     dp_type, predecessor(2), NORTH, 
+     >     comm_rhs, requests(3), error)
+      call mpi_irecv(in_buffer(sr(4)), b_size(4), 
+     >     dp_type, successor(3), BOTTOM,
+     >     comm_rhs, requests(4), error)
+      call mpi_irecv(in_buffer(sr(5)), b_size(5), 
+     >     dp_type, predecessor(3), TOP,   
+     >     comm_rhs, requests(5), error)
+
+      call mpi_isend(out_buffer(ss(0)), b_size(0), 
+     >     dp_type, successor(1),   EAST, 
+     >     comm_rhs, requests(6), error)
+      call mpi_isend(out_buffer(ss(1)), b_size(1), 
+     >     dp_type, predecessor(1), WEST, 
+     >     comm_rhs, requests(7), error)
+      call mpi_isend(out_buffer(ss(2)), b_size(2), 
+     >     dp_type,successor(2),   NORTH, 
+     >     comm_rhs, requests(8), error)
+      call mpi_isend(out_buffer(ss(3)), b_size(3), 
+     >     dp_type,predecessor(2), SOUTH, 
+     >     comm_rhs, requests(9), error)
+      call mpi_isend(out_buffer(ss(4)), b_size(4), 
+     >     dp_type,successor(3),   TOP, 
+     >     comm_rhs,   requests(10), error)
+      call mpi_isend(out_buffer(ss(5)), b_size(5), 
+     >     dp_type,predecessor(3), BOTTOM, 
+     >     comm_rhs,requests(11), error)
+
+
+      call mpi_waitall(12, requests, statuses, error)
+      if (timeron) call timer_stop(t_exch)
+
+c---------------------------------------------------------------------
+c     unpack the data that has just been received
+c---------------------------------------------------------------------
+      if (timeron) call timer_start(t_bpack)
+      p0 = 0
+      p1 = 0
+      p2 = 0
+      p3 = 0
+      p4 = 0
+      p5 = 0
+
+      do   c = 1, ncells
+
+         if (cell_coord(1,c) .ne. 1) then
+            do   k = 0, cell_size(3,c)-1
+               do   j = 0, cell_size(2,c)-1
+                  do   i = -2, -1
+                     do   m = 1, 5
+                        u(m,i,j,k,c) = in_buffer(sr(1)+p0)
+                        p0 = p0 + 1
+                     end do
+                  end do
+               end do
+            end do
+         endif
+
+         if (cell_coord(1,c) .ne. ncells) then
+            do  k = 0, cell_size(3,c)-1
+               do  j = 0, cell_size(2,c)-1
+                  do  i = cell_size(1,c), cell_size(1,c)+1
+                     do   m = 1, 5
+                        u(m,i,j,k,c) = in_buffer(sr(0)+p1)
+                        p1 = p1 + 1
+                     end do
+                  end do
+               end do
+            end do
+         end if
+            
+         if (cell_coord(2,c) .ne. 1) then
+            do  k = 0, cell_size(3,c)-1
+               do   j = -2, -1
+                  do  i = 0, cell_size(1,c)-1
+                     do   m = 1, 5
+                        u(m,i,j,k,c) = in_buffer(sr(3)+p2)
+                        p2 = p2 + 1
+                     end do
+                  end do
+               end do
+            end do
+
+         endif
+            
+         if (cell_coord(2,c) .ne. ncells) then
+            do  k = 0, cell_size(3,c)-1
+               do   j = cell_size(2,c), cell_size(2,c)+1
+                  do  i = 0, cell_size(1,c)-1
+                     do   m = 1, 5
+                        u(m,i,j,k,c) = in_buffer(sr(2)+p3)
+                        p3 = p3 + 1
+                     end do
+                  end do
+               end do
+            end do
+         endif
+
+         if (cell_coord(3,c) .ne. 1) then
+            do  k = -2, -1
+               do  j = 0, cell_size(2,c)-1
+                  do  i = 0, cell_size(1,c)-1
+                     do   m = 1, 5
+                        u(m,i,j,k,c) = in_buffer(sr(5)+p4)
+                        p4 = p4 + 1
+                     end do
+                  end do
+               end do
+            end do
+         endif
+
+         if (cell_coord(3,c) .ne. ncells) then
+            do  k = cell_size(3,c), cell_size(3,c)+1
+               do  j = 0, cell_size(2,c)-1
+                  do  i = 0, cell_size(1,c)-1
+                     do   m = 1, 5
+                        u(m,i,j,k,c) = in_buffer(sr(4)+p5)
+                        p5 = p5 + 1
+                     end do
+                  end do
+               end do
+            end do
+         endif
+
+c---------------------------------------------------------------------
+c     end of cell loop
+c---------------------------------------------------------------------
+      end do
+      if (timeron) call timer_stop(t_bpack)
+
+c---------------------------------------------------------------------
+c     do the rest of the rhs that uses the copied face values          
+c---------------------------------------------------------------------
+      call compute_rhs
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/define.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/define.f
new file mode 100644
index 0000000..03c4c6e
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/define.f
@@ -0,0 +1,64 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine compute_buffer_size(dim)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      include 'header.h'
+
+      integer  c, dim, face_size
+
+      if (ncells .eq. 1) return
+
+c---------------------------------------------------------------------
+c     compute the actual sizes of the buffers; note that there is 
+c     always one cell face that doesn't need buffer space, because it 
+c     is at the boundary of the grid
+c---------------------------------------------------------------------
+      west_size = 0
+      east_size = 0
+
+      do   c = 1, ncells
+         face_size = cell_size(2,c) * cell_size(3,c) * dim * 2
+         if (cell_coord(1,c).ne.1) west_size = west_size + face_size
+         if (cell_coord(1,c).ne.ncells) east_size = east_size + 
+     >        face_size 
+      end do
+
+      north_size = 0
+      south_size = 0
+      do   c = 1, ncells
+         face_size = cell_size(1,c)*cell_size(3,c) * dim * 2
+         if (cell_coord(2,c).ne.1) south_size = south_size + face_size
+         if (cell_coord(2,c).ne.ncells) north_size = north_size + 
+     >        face_size 
+      end do
+
+      top_size = 0
+      bottom_size = 0
+      do   c = 1, ncells
+         face_size = cell_size(1,c) * cell_size(2,c) * dim * 2
+         if (cell_coord(3,c).ne.1) bottom_size = bottom_size + 
+     >        face_size
+         if (cell_coord(3,c).ne.ncells) top_size = top_size +
+     >        face_size     
+      end do
+
+      start_send_west   = 1
+      start_send_east   = start_send_west   + west_size
+      start_send_south  = start_send_east   + east_size
+      start_send_north  = start_send_south  + south_size
+      start_send_bottom = start_send_north  + north_size
+      start_send_top    = start_send_bottom + bottom_size
+      start_recv_west   = 1
+      start_recv_east   = start_recv_west   + west_size
+      start_recv_south  = start_recv_east   + east_size
+      start_recv_north  = start_recv_south  + south_size
+      start_recv_bottom = start_recv_north  + north_size
+      start_recv_top    = start_recv_bottom + bottom_size
+
+      return
+      end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/epio.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/epio.f
new file mode 100644
index 0000000..52b6309
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/epio.f
@@ -0,0 +1,165 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine setup_btio
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      character*(128) newfilenm
+      integer m
+
+      if (node .lt. 10000) then
+          write (newfilenm, 996) filenm,node
+      else
+          print *, 'error generating file names (> 10000 nodes)'
+          stop
+      endif
+
+996   format (a,'.',i4.4)
+
+      open (unit=99, file=newfilenm, form='unformatted',
+     $       status='unknown')
+
+      do m = 1, 5
+         xce_sub(m) = 0.d0
+      end do
+
+      idump_sub = 0
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine output_timestep
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer ix, iio, jio, kio, cio, aio
+
+      do cio=1,ncells
+          write(99)
+     $         ((((u(aio,ix, jio,kio,cio),aio=1,5),
+     $             ix=0, cell_size(1,cio)-1),
+     $             jio=0, cell_size(2,cio)-1),
+     $             kio=0, cell_size(3,cio)-1)
+      enddo
+
+      idump_sub = idump_sub + 1
+      if (rd_interval .gt. 0) then
+         if (idump_sub .ge. rd_interval) then
+
+            rewind(99)
+            call acc_sub_norms(idump+1)
+
+            rewind(99)
+            idump_sub = 0
+         endif
+      endif
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine acc_sub_norms(idump_cur)
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer idump_cur
+
+      integer ix, jio, kio, cio, ii, m, ichunk
+      double precision xce_single(5)
+
+      ichunk = idump_cur - idump_sub + 1
+      do ii=0, idump_sub-1
+        do cio=1,ncells
+          read(99)
+     $         ((((u(m,ix, jio,kio,cio),m=1,5),
+     $             ix=0, cell_size(1,cio)-1),
+     $             jio=0, cell_size(2,cio)-1),
+     $             kio=0, cell_size(3,cio)-1)
+        enddo
+
+        if (node .eq. root) print *, 'Reading data set ', ii+ichunk
+
+        call error_norm(xce_single)
+        do m = 1, 5
+           xce_sub(m) = xce_sub(m) + xce_single(m)
+        end do
+      enddo
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine btio_cleanup
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      close(unit=99)
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine accumulate_norms(xce_acc)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      double precision xce_acc(5)
+
+      character*(128) newfilenm
+      integer m
+
+      if (rd_interval .gt. 0) goto 20
+
+      if (node .lt. 10000) then
+          write (newfilenm, 996) filenm,node
+      else
+          print *, 'error generating file names (> 10000 nodes)'
+          stop
+      endif
+
+996   format (a,'.',i4.4)
+
+      open (unit=99, file=newfilenm,
+     $      form='unformatted')
+
+c     clear the last time step
+
+      call clear_timestep
+
+c     read back the time steps and accumulate norms
+
+      call acc_sub_norms(idump)
+
+      close(unit=99)
+
+ 20   continue
+      do m = 1, 5
+         xce_acc(m) = xce_sub(m) / dble(idump)
+      end do
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/error.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/error.f
new file mode 100644
index 0000000..147a582
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/error.f
@@ -0,0 +1,106 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine error_norm(rms)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     this subroutine computes the norm of the difference between the
+c     computed solution and the exact solution
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer c, i, j, k, m, ii, jj, kk, d, error
+      double precision xi, eta, zeta, u_exact(5), rms(5), rms_work(5),
+     >     add
+
+      do m = 1, 5 
+         rms_work(m) = 0.0d0
+      enddo
+
+      do c = 1, ncells
+         kk = 0
+         do k = cell_low(3,c), cell_high(3,c)
+            zeta = dble(k) * dnzm1
+            jj = 0
+            do j = cell_low(2,c), cell_high(2,c)
+               eta = dble(j) * dnym1
+               ii = 0
+               do i = cell_low(1,c), cell_high(1,c)
+                  xi = dble(i) * dnxm1
+                  call exact_solution(xi, eta, zeta, u_exact)
+
+                  do m = 1, 5
+                     add = u(m,ii,jj,kk,c)-u_exact(m)
+                     rms_work(m) = rms_work(m) + add*add
+                  enddo
+                  ii = ii + 1
+               enddo
+               jj = jj + 1
+            enddo
+            kk = kk + 1
+         enddo
+      enddo
+
+      call mpi_allreduce(rms_work, rms, 5, dp_type, 
+     >     MPI_SUM, comm_setup, error)
+
+      do m = 1, 5
+         do d = 1, 3
+            rms(m) = rms(m) / dble(grid_points(d)-2)
+         enddo
+         rms(m) = dsqrt(rms(m))
+      enddo
+
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine rhs_norm(rms)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer c, i, j, k, d, m, error
+      double precision rms(5), rms_work(5), add
+
+      do m = 1, 5
+         rms_work(m) = 0.0d0
+      enddo 
+
+      do c = 1, ncells
+         do k = start(3,c), cell_size(3,c)-end(3,c)-1
+            do j = start(2,c), cell_size(2,c)-end(2,c)-1
+               do i = start(1,c), cell_size(1,c)-end(1,c)-1
+                  do m = 1, 5
+                     add = rhs(m,i,j,k,c)
+                     rms_work(m) = rms_work(m) + add*add
+                  enddo 
+               enddo 
+            enddo 
+         enddo 
+      enddo 
+
+      call mpi_allreduce(rms_work, rms, 5, dp_type, 
+     >     MPI_SUM, comm_setup, error)
+
+      do m = 1, 5
+         do d = 1, 3
+            rms(m) = rms(m) / dble(grid_points(d)-2)
+         enddo 
+         rms(m) = dsqrt(rms(m))
+      enddo 
+
+      return
+      end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/exact_rhs.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/exact_rhs.f
new file mode 100644
index 0000000..26a2871
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/exact_rhs.f
@@ -0,0 +1,360 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine exact_rhs
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     compute the right hand side based on exact solution
+c---------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision dtemp(5), xi, eta, zeta, dtpp
+      integer          c, m, i, j, k, ip1, im1, jp1, 
+     >     jm1, km1, kp1
+
+
+c---------------------------------------------------------------------
+c     loop over all cells owned by this node                   
+c---------------------------------------------------------------------
+      do c = 1, ncells
+
+c---------------------------------------------------------------------
+c     initialize                                  
+c---------------------------------------------------------------------
+         do k= 0, cell_size(3,c)-1
+            do j = 0, cell_size(2,c)-1
+               do i = 0, cell_size(1,c)-1
+                  do m = 1, 5
+                     forcing(m,i,j,k,c) = 0.0d0
+                  enddo
+               enddo
+            enddo
+         enddo
+
+c---------------------------------------------------------------------
+c     xi-direction flux differences                      
+c---------------------------------------------------------------------
+         do k = start(3,c), cell_size(3,c)-end(3,c)-1
+            zeta = dble(k+cell_low(3,c)) * dnzm1
+            do j = start(2,c), cell_size(2,c)-end(2,c)-1
+               eta = dble(j+cell_low(2,c)) * dnym1
+
+               do i=-2*(1-start(1,c)), cell_size(1,c)+1-2*end(1,c)
+                  xi = dble(i+cell_low(1,c)) * dnxm1
+
+                  call exact_solution(xi, eta, zeta, dtemp)
+                  do m = 1, 5
+                     ue(i,m) = dtemp(m)
+                  enddo
+
+                  dtpp = 1.0d0 / dtemp(1)
+
+                  do m = 2, 5
+                     buf(i,m) = dtpp * dtemp(m)
+                  enddo
+
+                  cuf(i)   = buf(i,2) * buf(i,2)
+                  buf(i,1) = cuf(i) + buf(i,3) * buf(i,3) + 
+     >                 buf(i,4) * buf(i,4) 
+                  q(i) = 0.5d0*(buf(i,2)*ue(i,2) + buf(i,3)*ue(i,3) +
+     >                 buf(i,4)*ue(i,4))
+
+               enddo
+               
+               do i = start(1,c), cell_size(1,c)-end(1,c)-1
+                  im1 = i-1
+                  ip1 = i+1
+
+                  forcing(1,i,j,k,c) = forcing(1,i,j,k,c) -
+     >                 tx2*( ue(ip1,2)-ue(im1,2) )+
+     >                 dx1tx1*(ue(ip1,1)-2.0d0*ue(i,1)+ue(im1,1))
+
+                  forcing(2,i,j,k,c) = forcing(2,i,j,k,c) - tx2 * (
+     >                 (ue(ip1,2)*buf(ip1,2)+c2*(ue(ip1,5)-q(ip1)))-
+     >                 (ue(im1,2)*buf(im1,2)+c2*(ue(im1,5)-q(im1))))+
+     >                 xxcon1*(buf(ip1,2)-2.0d0*buf(i,2)+buf(im1,2))+
+     >                 dx2tx1*( ue(ip1,2)-2.0d0* ue(i,2)+ue(im1,2))
+
+                  forcing(3,i,j,k,c) = forcing(3,i,j,k,c) - tx2 * (
+     >                 ue(ip1,3)*buf(ip1,2)-ue(im1,3)*buf(im1,2))+
+     >                 xxcon2*(buf(ip1,3)-2.0d0*buf(i,3)+buf(im1,3))+
+     >                 dx3tx1*( ue(ip1,3)-2.0d0*ue(i,3) +ue(im1,3))
+                  
+                  forcing(4,i,j,k,c) = forcing(4,i,j,k,c) - tx2*(
+     >                 ue(ip1,4)*buf(ip1,2)-ue(im1,4)*buf(im1,2))+
+     >                 xxcon2*(buf(ip1,4)-2.0d0*buf(i,4)+buf(im1,4))+
+     >                 dx4tx1*( ue(ip1,4)-2.0d0* ue(i,4)+ ue(im1,4))
+
+                  forcing(5,i,j,k,c) = forcing(5,i,j,k,c) - tx2*(
+     >                 buf(ip1,2)*(c1*ue(ip1,5)-c2*q(ip1))-
+     >                 buf(im1,2)*(c1*ue(im1,5)-c2*q(im1)))+
+     >                 0.5d0*xxcon3*(buf(ip1,1)-2.0d0*buf(i,1)+
+     >                 buf(im1,1))+
+     >                 xxcon4*(cuf(ip1)-2.0d0*cuf(i)+cuf(im1))+
+     >                 xxcon5*(buf(ip1,5)-2.0d0*buf(i,5)+buf(im1,5))+
+     >                 dx5tx1*( ue(ip1,5)-2.0d0* ue(i,5)+ ue(im1,5))
+               enddo
+
+c---------------------------------------------------------------------
+c     Fourth-order dissipation                         
+c---------------------------------------------------------------------
+               if (start(1,c) .gt. 0) then
+                  do m = 1, 5
+                     i = 1
+                     forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp *
+     >                    (5.0d0*ue(i,m) - 4.0d0*ue(i+1,m) +ue(i+2,m))
+                     i = 2
+                     forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp *
+     >                    (-4.0d0*ue(i-1,m) + 6.0d0*ue(i,m) -
+     >                    4.0d0*ue(i+1,m) +       ue(i+2,m))
+                  enddo
+               endif
+
+               do i = start(1,c)*3, cell_size(1,c)-3*end(1,c)-1
+                  do m = 1, 5
+                     forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp*
+     >                    (ue(i-2,m) - 4.0d0*ue(i-1,m) +
+     >                    6.0d0*ue(i,m) - 4.0d0*ue(i+1,m) + ue(i+2,m))
+                  enddo
+               enddo
+
+               if (end(1,c) .gt. 0) then
+                  do m = 1, 5
+                     i = cell_size(1,c)-3
+                     forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp *
+     >                    (ue(i-2,m) - 4.0d0*ue(i-1,m) +
+     >                    6.0d0*ue(i,m) - 4.0d0*ue(i+1,m))
+                     i = cell_size(1,c)-2
+                     forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp *
+     >                    (ue(i-2,m) - 4.0d0*ue(i-1,m) + 5.0d0*ue(i,m))
+                  enddo
+               endif
+
+            enddo
+         enddo
+
+c---------------------------------------------------------------------
+c     eta-direction flux differences             
+c---------------------------------------------------------------------
+         do k = start(3,c), cell_size(3,c)-end(3,c)-1          
+            zeta = dble(k+cell_low(3,c)) * dnzm1
+            do i=start(1,c), cell_size(1,c)-end(1,c)-1
+               xi = dble(i+cell_low(1,c)) * dnxm1
+
+               do j=-2*(1-start(2,c)), cell_size(2,c)+1-2*end(2,c)
+                  eta = dble(j+cell_low(2,c)) * dnym1
+
+                  call exact_solution(xi, eta, zeta, dtemp)
+                  do m = 1, 5 
+                     ue(j,m) = dtemp(m)
+                  enddo
+                  
+                  dtpp = 1.0d0/dtemp(1)
+
+                  do m = 2, 5
+                     buf(j,m) = dtpp * dtemp(m)
+                  enddo
+
+                  cuf(j)   = buf(j,3) * buf(j,3)
+                  buf(j,1) = cuf(j) + buf(j,2) * buf(j,2) + 
+     >                 buf(j,4) * buf(j,4)
+                  q(j) = 0.5d0*(buf(j,2)*ue(j,2) + buf(j,3)*ue(j,3) +
+     >                 buf(j,4)*ue(j,4))
+               enddo
+
+               do j = start(2,c), cell_size(2,c)-end(2,c)-1
+                  jm1 = j-1
+                  jp1 = j+1
+                  
+                  forcing(1,i,j,k,c) = forcing(1,i,j,k,c) -
+     >                 ty2*( ue(jp1,3)-ue(jm1,3) )+
+     >                 dy1ty1*(ue(jp1,1)-2.0d0*ue(j,1)+ue(jm1,1))
+
+                  forcing(2,i,j,k,c) = forcing(2,i,j,k,c) - ty2*(
+     >                 ue(jp1,2)*buf(jp1,3)-ue(jm1,2)*buf(jm1,3))+
+     >                 yycon2*(buf(jp1,2)-2.0d0*buf(j,2)+buf(jm1,2))+
+     >                 dy2ty1*( ue(jp1,2)-2.0d0* ue(j,2)+ ue(jm1,2))
+
+                  forcing(3,i,j,k,c) = forcing(3,i,j,k,c) - ty2*(
+     >                 (ue(jp1,3)*buf(jp1,3)+c2*(ue(jp1,5)-q(jp1)))-
+     >                 (ue(jm1,3)*buf(jm1,3)+c2*(ue(jm1,5)-q(jm1))))+
+     >                 yycon1*(buf(jp1,3)-2.0d0*buf(j,3)+buf(jm1,3))+
+     >                 dy3ty1*( ue(jp1,3)-2.0d0*ue(j,3) +ue(jm1,3))
+
+                  forcing(4,i,j,k,c) = forcing(4,i,j,k,c) - ty2*(
+     >                 ue(jp1,4)*buf(jp1,3)-ue(jm1,4)*buf(jm1,3))+
+     >                 yycon2*(buf(jp1,4)-2.0d0*buf(j,4)+buf(jm1,4))+
+     >                 dy4ty1*( ue(jp1,4)-2.0d0*ue(j,4)+ ue(jm1,4))
+
+                  forcing(5,i,j,k,c) = forcing(5,i,j,k,c) - ty2*(
+     >                 buf(jp1,3)*(c1*ue(jp1,5)-c2*q(jp1))-
+     >                 buf(jm1,3)*(c1*ue(jm1,5)-c2*q(jm1)))+
+     >                 0.5d0*yycon3*(buf(jp1,1)-2.0d0*buf(j,1)+
+     >                 buf(jm1,1))+
+     >                 yycon4*(cuf(jp1)-2.0d0*cuf(j)+cuf(jm1))+
+     >                 yycon5*(buf(jp1,5)-2.0d0*buf(j,5)+buf(jm1,5))+
+     >                 dy5ty1*(ue(jp1,5)-2.0d0*ue(j,5)+ue(jm1,5))
+               enddo
+
+c---------------------------------------------------------------------
+c     Fourth-order dissipation                      
+c---------------------------------------------------------------------
+               if (start(2,c) .gt. 0) then
+                  do m = 1, 5
+                     j = 1
+                     forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp *
+     >                    (5.0d0*ue(j,m) - 4.0d0*ue(j+1,m) +ue(j+2,m))
+                     j = 2
+                     forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp *
+     >                    (-4.0d0*ue(j-1,m) + 6.0d0*ue(j,m) -
+     >                    4.0d0*ue(j+1,m) +       ue(j+2,m))
+                  enddo
+               endif
+
+               do j = start(2,c)*3, cell_size(2,c)-3*end(2,c)-1
+                  do m = 1, 5
+                     forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp*
+     >                    (ue(j-2,m) - 4.0d0*ue(j-1,m) +
+     >                    6.0d0*ue(j,m) - 4.0d0*ue(j+1,m) + ue(j+2,m))
+                  enddo
+               enddo
+
+               if (end(2,c) .gt. 0) then
+                  do m = 1, 5
+                     j = cell_size(2,c)-3
+                     forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp *
+     >                    (ue(j-2,m) - 4.0d0*ue(j-1,m) +
+     >                    6.0d0*ue(j,m) - 4.0d0*ue(j+1,m))
+                     j = cell_size(2,c)-2
+                     forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp *
+     >                    (ue(j-2,m) - 4.0d0*ue(j-1,m) + 5.0d0*ue(j,m))
+
+                  enddo
+               endif
+
+            enddo
+         enddo
+
+c---------------------------------------------------------------------
+c     zeta-direction flux differences                      
+c---------------------------------------------------------------------
+         do j=start(2,c), cell_size(2,c)-end(2,c)-1
+            eta = dble(j+cell_low(2,c)) * dnym1
+            do i = start(1,c), cell_size(1,c)-end(1,c)-1
+               xi = dble(i+cell_low(1,c)) * dnxm1
+
+               do k=-2*(1-start(3,c)), cell_size(3,c)+1-2*end(3,c)
+                  zeta = dble(k+cell_low(3,c)) * dnzm1
+
+                  call exact_solution(xi, eta, zeta, dtemp)
+                  do m = 1, 5
+                     ue(k,m) = dtemp(m)
+                  enddo
+
+                  dtpp = 1.0d0/dtemp(1)
+
+                  do m = 2, 5
+                     buf(k,m) = dtpp * dtemp(m)
+                  enddo
+
+                  cuf(k)   = buf(k,4) * buf(k,4)
+                  buf(k,1) = cuf(k) + buf(k,2) * buf(k,2) + 
+     >                 buf(k,3) * buf(k,3)
+                  q(k) = 0.5d0*(buf(k,2)*ue(k,2) + buf(k,3)*ue(k,3) +
+     >                 buf(k,4)*ue(k,4))
+               enddo
+
+               do k=start(3,c), cell_size(3,c)-end(3,c)-1
+                  km1 = k-1
+                  kp1 = k+1
+                  
+                  forcing(1,i,j,k,c) = forcing(1,i,j,k,c) -
+     >                 tz2*( ue(kp1,4)-ue(km1,4) )+
+     >                 dz1tz1*(ue(kp1,1)-2.0d0*ue(k,1)+ue(km1,1))
+
+                  forcing(2,i,j,k,c) = forcing(2,i,j,k,c) - tz2 * (
+     >                 ue(kp1,2)*buf(kp1,4)-ue(km1,2)*buf(km1,4))+
+     >                 zzcon2*(buf(kp1,2)-2.0d0*buf(k,2)+buf(km1,2))+
+     >                 dz2tz1*( ue(kp1,2)-2.0d0* ue(k,2)+ ue(km1,2))
+
+                  forcing(3,i,j,k,c) = forcing(3,i,j,k,c) - tz2 * (
+     >                 ue(kp1,3)*buf(kp1,4)-ue(km1,3)*buf(km1,4))+
+     >                 zzcon2*(buf(kp1,3)-2.0d0*buf(k,3)+buf(km1,3))+
+     >                 dz3tz1*(ue(kp1,3)-2.0d0*ue(k,3)+ue(km1,3))
+
+                  forcing(4,i,j,k,c) = forcing(4,i,j,k,c) - tz2 * (
+     >                 (ue(kp1,4)*buf(kp1,4)+c2*(ue(kp1,5)-q(kp1)))-
+     >                 (ue(km1,4)*buf(km1,4)+c2*(ue(km1,5)-q(km1))))+
+     >                 zzcon1*(buf(kp1,4)-2.0d0*buf(k,4)+buf(km1,4))+
+     >                 dz4tz1*( ue(kp1,4)-2.0d0*ue(k,4) +ue(km1,4))
+
+                  forcing(5,i,j,k,c) = forcing(5,i,j,k,c) - tz2 * (
+     >                 buf(kp1,4)*(c1*ue(kp1,5)-c2*q(kp1))-
+     >                 buf(km1,4)*(c1*ue(km1,5)-c2*q(km1)))+
+     >                 0.5d0*zzcon3*(buf(kp1,1)-2.0d0*buf(k,1)
+     >                 +buf(km1,1))+
+     >                 zzcon4*(cuf(kp1)-2.0d0*cuf(k)+cuf(km1))+
+     >                 zzcon5*(buf(kp1,5)-2.0d0*buf(k,5)+buf(km1,5))+
+     >                 dz5tz1*( ue(kp1,5)-2.0d0*ue(k,5)+ ue(km1,5))
+               enddo
+
+c---------------------------------------------------------------------
+c     Fourth-order dissipation                        
+c---------------------------------------------------------------------
+               if (start(3,c) .gt. 0) then
+                  do m = 1, 5
+                     k = 1
+                     forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp *
+     >                    (5.0d0*ue(k,m) - 4.0d0*ue(k+1,m) +ue(k+2,m))
+                     k = 2
+                     forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp *
+     >                    (-4.0d0*ue(k-1,m) + 6.0d0*ue(k,m) -
+     >                    4.0d0*ue(k+1,m) +       ue(k+2,m))
+                  enddo
+               endif
+
+               do k = start(3,c)*3, cell_size(3,c)-3*end(3,c)-1
+                  do m = 1, 5
+                     forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp*
+     >                    (ue(k-2,m) - 4.0d0*ue(k-1,m) +
+     >                    6.0d0*ue(k,m) - 4.0d0*ue(k+1,m) + ue(k+2,m))
+                  enddo
+               enddo
+
+               if (end(3,c) .gt. 0) then
+                  do m = 1, 5
+                     k = cell_size(3,c)-3
+                     forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp *
+     >                    (ue(k-2,m) - 4.0d0*ue(k-1,m) +
+     >                    6.0d0*ue(k,m) - 4.0d0*ue(k+1,m))
+                     k = cell_size(3,c)-2
+                     forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp *
+     >                    (ue(k-2,m) - 4.0d0*ue(k-1,m) + 5.0d0*ue(k,m))
+                  enddo
+               endif
+
+            enddo
+         enddo
+
+c---------------------------------------------------------------------
+c     now change the sign of the forcing function
+c---------------------------------------------------------------------
+         do k = start(3,c), cell_size(3,c)-end(3,c)-1
+            do j = start(2,c), cell_size(2,c)-end(2,c)-1
+               do i = start(1,c), cell_size(1,c)-end(1,c)-1
+                  do m = 1, 5
+                     forcing(m,i,j,k,c) = -1.d0 * forcing(m,i,j,k,c)
+                  enddo
+               enddo
+            enddo
+         enddo
+
+      enddo
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/exact_solution.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/exact_solution.f
new file mode 100644
index 0000000..b093b46
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/exact_solution.f
@@ -0,0 +1,29 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine exact_solution(xi,eta,zeta,dtemp)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     this subroutine returns the exact solution at point xi, eta, zeta
+c---------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision  xi, eta, zeta, dtemp(5)
+      integer m
+
+      do m = 1, 5
+         dtemp(m) =  ce(m,1) +
+     >     xi*(ce(m,2) + xi*(ce(m,5) + xi*(ce(m,8) + xi*ce(m,11)))) +
+     >     eta*(ce(m,3) + eta*(ce(m,6) + eta*(ce(m,9) + eta*ce(m,12))))+
+     >     zeta*(ce(m,4) + zeta*(ce(m,7) + zeta*(ce(m,10) + 
+     >     zeta*ce(m,13))))
+      enddo
+
+      return
+      end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/fortran_io.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/fortran_io.f
new file mode 100644
index 0000000..d3085a0
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/fortran_io.f
@@ -0,0 +1,174 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine setup_btio
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      character*(128) newfilenm
+      integer m, ierr
+
+      if (node.eq.root) record_length = 40/fortran_rec_sz
+      call mpi_bcast(record_length, 1, MPI_INTEGER,
+     >                root, comm_setup, ierr)
+
+      open (unit=99, file=filenm,
+     $      form='unformatted', access='direct',
+     $      recl=record_length)
+
+      do m = 1, 5
+         xce_sub(m) = 0.d0
+      end do
+
+      idump_sub = 0
+
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine output_timestep
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer ix, jio, kio, cio
+
+      do cio=1,ncells
+          do kio=0, cell_size(3,cio)-1
+              do jio=0, cell_size(2,cio)-1
+                  iseek=(cell_low(1,cio) +
+     $                   PROBLEM_SIZE*((cell_low(2,cio)+jio) +
+     $                   PROBLEM_SIZE*((cell_low(3,cio)+kio) +
+     $                   PROBLEM_SIZE*idump_sub)))
+
+                  do ix=0,cell_size(1,cio)-1
+                      write(99, rec=iseek+ix+1)
+     $                      u(1,ix, jio,kio,cio),
+     $                      u(2,ix, jio,kio,cio),
+     $                      u(3,ix, jio,kio,cio),
+     $                      u(4,ix, jio,kio,cio),
+     $                      u(5,ix, jio,kio,cio)
+                  enddo
+              enddo
+          enddo
+      enddo
+
+      idump_sub = idump_sub + 1
+      if (rd_interval .gt. 0) then
+         if (idump_sub .ge. rd_interval) then
+
+            call acc_sub_norms(idump+1)
+
+            idump_sub = 0
+         endif
+      endif
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine acc_sub_norms(idump_cur)
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer idump_cur
+
+      integer ix, jio, kio, cio, ii, m, ichunk
+      double precision xce_single(5)
+
+      ichunk = idump_cur - idump_sub + 1
+      do ii=0, idump_sub-1
+        do cio=1,ncells
+          do kio=0, cell_size(3,cio)-1
+              do jio=0, cell_size(2,cio)-1
+                  iseek=(cell_low(1,cio) +
+     $                   PROBLEM_SIZE*((cell_low(2,cio)+jio) +
+     $                   PROBLEM_SIZE*((cell_low(3,cio)+kio) +
+     $                   PROBLEM_SIZE*ii)))
+
+
+                  do ix=0,cell_size(1,cio)-1
+                      read(99, rec=iseek+ix+1)
+     $                      u(1,ix, jio,kio,cio),
+     $                      u(2,ix, jio,kio,cio),
+     $                      u(3,ix, jio,kio,cio),
+     $                      u(4,ix, jio,kio,cio),
+     $                      u(5,ix, jio,kio,cio)
+                  enddo
+              enddo
+          enddo
+        enddo
+
+        if (node .eq. root) print *, 'Reading data set ', ii+ichunk
+
+        call error_norm(xce_single)
+        do m = 1, 5
+           xce_sub(m) = xce_sub(m) + xce_single(m)
+        end do
+      enddo
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine btio_cleanup
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      close(unit=99)
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine accumulate_norms(xce_acc)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      include 'header.h'
+      include 'mpinpb.h'
+
+      double precision xce_acc(5)
+      integer m
+
+      if (rd_interval .gt. 0) goto 20
+
+      open (unit=99, file=filenm,
+     $      form='unformatted', access='direct',
+     $      recl=record_length)
+
+c     clear the last time step
+
+      call clear_timestep
+
+c     read back the time steps and accumulate norms
+
+      call acc_sub_norms(idump)
+
+      close(unit=99)
+
+ 20   continue
+      do m = 1, 5
+         xce_acc(m) = xce_sub(m) / dble(idump)
+      end do
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/full_mpiio.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/full_mpiio.f
new file mode 100644
index 0000000..ecfd41c
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/full_mpiio.f
@@ -0,0 +1,307 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine setup_btio
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer ierr
+      integer mstatus(MPI_STATUS_SIZE)
+      integer sizes(4), starts(4), subsizes(4)
+      integer cell_btype(maxcells), cell_ftype(maxcells)
+      integer cell_blength(maxcells)
+      integer info
+      character*20 cb_nodes, cb_size
+      integer c, m
+      integer cell_disp(maxcells)
+
+       call mpi_bcast(collbuf_nodes, 1, MPI_INTEGER,
+     >                root, comm_setup, ierr)
+
+       call mpi_bcast(collbuf_size, 1, MPI_INTEGER,
+     >                root, comm_setup, ierr)
+
+       if (collbuf_nodes .eq. 0) then
+          info = MPI_INFO_NULL
+       else
+          write (cb_nodes,*) collbuf_nodes
+          write (cb_size,*) collbuf_size
+          call MPI_Info_create(info, ierr)
+          call MPI_Info_set(info, 'cb_nodes', cb_nodes, ierr)
+          call MPI_Info_set(info, 'cb_buffer_size', cb_size, ierr)
+          call MPI_Info_set(info, 'collective_buffering', 'true', ierr)
+       endif
+
+       call MPI_Type_contiguous(5, MPI_DOUBLE_PRECISION,
+     $                          element, ierr)
+       call MPI_Type_commit(element, ierr)
+       call MPI_Type_extent(element, eltext, ierr)
+
+       do  c = 1, ncells
+c
+c Outer array dimensions are the same for every cell
+c
+           sizes(1) = IMAX+4
+           sizes(2) = JMAX+4
+           sizes(3) = KMAX+4
+c
+c 4th dimension is cell number, total of maxcells cells
+c
+           sizes(4) = maxcells
+c
+c Internal dimensions of cells can differ slightly between cells
+c
+           subsizes(1) = cell_size(1, c)
+           subsizes(2) = cell_size(2, c)
+           subsizes(3) = cell_size(3, c)
+c
+c Cell is 4th dimension, 1 cell per cell type to handle varying 
+c cell sub-array sizes
+c
+           subsizes(4) = 1
+
+c
+c type constructors use 0-based start addresses
+c
+           starts(1) = 2 
+           starts(2) = 2
+           starts(3) = 2
+           starts(4) = c-1
+
+c 
+c Create buftype for a cell
+c
+           call MPI_Type_create_subarray(4, sizes, subsizes, 
+     $          starts, MPI_ORDER_FORTRAN, element, 
+     $          cell_btype(c), ierr)
+c
+c block length and displacement for joining cells - 
+c 1 cell buftype per block, cell buftypes have own displacement
+c generated from cell number (4th array dimension)
+c
+           cell_blength(c) = 1
+           cell_disp(c) = 0
+
+       enddo
+c
+c Create combined buftype for all cells
+c
+       call MPI_Type_struct(ncells, cell_blength, cell_disp,
+     $            cell_btype, combined_btype, ierr)
+       call MPI_Type_commit(combined_btype, ierr)
+
+       do  c = 1, ncells
+c
+c Entire array size
+c
+           sizes(1) = PROBLEM_SIZE
+           sizes(2) = PROBLEM_SIZE
+           sizes(3) = PROBLEM_SIZE
+
+c
+c Size of c'th cell
+c
+           subsizes(1) = cell_size(1, c)
+           subsizes(2) = cell_size(2, c)
+           subsizes(3) = cell_size(3, c)
+
+c
+c Starting point in full array of c'th cell
+c
+           starts(1) = cell_low(1,c)
+           starts(2) = cell_low(2,c)
+           starts(3) = cell_low(3,c)
+
+           call MPI_Type_create_subarray(3, sizes, subsizes,
+     $          starts, MPI_ORDER_FORTRAN,
+     $          element, cell_ftype(c), ierr)
+           cell_blength(c) = 1
+           cell_disp(c) = 0
+       enddo
+
+       call MPI_Type_struct(ncells, cell_blength, cell_disp,
+     $            cell_ftype, combined_ftype, ierr)
+       call MPI_Type_commit(combined_ftype, ierr)
+
+       iseek=0
+       if (node .eq. root) then
+          call MPI_File_delete(filenm, MPI_INFO_NULL, ierr)
+       endif
+
+
+      call MPI_Barrier(comm_solve, ierr)
+
+       call MPI_File_open(comm_solve,
+     $          filenm,
+     $          MPI_MODE_RDWR+MPI_MODE_CREATE,
+     $          MPI_INFO_NULL, fp, ierr)
+
+       if (ierr .ne. MPI_SUCCESS) then
+                print *, 'Error opening file'
+                stop
+       endif
+
+        call MPI_File_set_view(fp, iseek, element, 
+     $          combined_ftype, 'native', info, ierr)
+
+       if (ierr .ne. MPI_SUCCESS) then
+                print *, 'Error setting file view'
+                stop
+       endif
+
+      do m = 1, 5
+         xce_sub(m) = 0.d0
+      end do
+
+      idump_sub = 0
+
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine output_timestep
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer mstatus(MPI_STATUS_SIZE)
+      integer ierr
+
+      call MPI_File_write_at_all(fp, iseek, u,
+     $                           1, combined_btype, mstatus, ierr)
+      if (ierr .ne. MPI_SUCCESS) then
+          print *, 'Error writing to file'
+          stop
+      endif
+
+      call MPI_Type_size(combined_btype, iosize, ierr)
+      iseek = iseek + iosize/eltext
+
+      idump_sub = idump_sub + 1
+      if (rd_interval .gt. 0) then
+         if (idump_sub .ge. rd_interval) then
+
+            iseek = 0
+            call acc_sub_norms(idump+1)
+
+            iseek = 0
+            idump_sub = 0
+         endif
+      endif
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine acc_sub_norms(idump_cur)
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer idump_cur
+
+      integer ii, m, ichunk
+      integer ierr
+      integer mstatus(MPI_STATUS_SIZE)
+      double precision xce_single(5)
+
+      ichunk = idump_cur - idump_sub + 1
+      do ii=0, idump_sub-1
+
+        call MPI_File_read_at_all(fp, iseek, u,
+     $                           1, combined_btype, mstatus, ierr)
+        if (ierr .ne. MPI_SUCCESS) then
+           print *, 'Error reading back file'
+           call MPI_File_close(fp, ierr)
+           stop
+        endif
+
+        call MPI_Type_size(combined_btype, iosize, ierr)
+        iseek = iseek + iosize/eltext
+
+        if (node .eq. root) print *, 'Reading data set ', ii+ichunk
+
+        call error_norm(xce_single)
+        do m = 1, 5
+           xce_sub(m) = xce_sub(m) + xce_single(m)
+        end do
+      enddo
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine btio_cleanup
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer ierr
+
+      call MPI_File_close(fp, ierr)
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+
+      subroutine accumulate_norms(xce_acc)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      double precision xce_acc(5)
+      integer m, ierr
+
+      if (rd_interval .gt. 0) goto 20
+
+      call MPI_File_open(comm_solve,
+     $          filenm,
+     $          MPI_MODE_RDONLY,
+     $          MPI_INFO_NULL,
+     $          fp,
+     $          ierr)
+
+      iseek = 0
+      call MPI_File_set_view(fp, iseek, element, combined_ftype,
+     $          'native', MPI_INFO_NULL, ierr)
+
+c     clear the last time step
+
+      call clear_timestep
+
+c     read back the time steps and accumulate norms
+
+      call acc_sub_norms(idump)
+
+      call MPI_File_close(fp, ierr)
+
+ 20   continue
+      do m = 1, 5
+         xce_acc(m) = xce_sub(m) / dble(idump)
+      end do
+
+      return
+      end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/header.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/header.h
new file mode 100644
index 0000000..cb815eb
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/header.h
@@ -0,0 +1,146 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+c
+c  header.h
+c
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+ 
+      implicit none
+
+c---------------------------------------------------------------------
+c The following include file is generated automatically by the
+c "setparams" utility. It defines 
+c      maxcells:      the square root of the maximum number of processors
+c      problem_size:  12, 64, 102, 162 (for class T, A, B, C)
+c      dt_default:    default time step for this problem size if no
+c                     config file
+c      niter_default: default number of iterations for this problem size
+c---------------------------------------------------------------------
+
+      include 'npbparams.h'
+
+      integer           aa, bb, cc, BLOCK_SIZE
+      parameter (aa=1, bb=2, cc=3, BLOCK_SIZE=5)
+
+      integer           ncells, grid_points(3)
+      double precision  elapsed_time
+      common /global/   elapsed_time, ncells, grid_points
+
+      double precision  tx1, tx2, tx3, ty1, ty2, ty3, tz1, tz2, tz3, 
+     >                  dx1, dx2, dx3, dx4, dx5, dy1, dy2, dy3, dy4, 
+     >                  dy5, dz1, dz2, dz3, dz4, dz5, dssp, dt, 
+     >                  ce(5,13), dxmax, dymax, dzmax, xxcon1, xxcon2, 
+     >                  xxcon3, xxcon4, xxcon5, dx1tx1, dx2tx1, dx3tx1,
+     >                  dx4tx1, dx5tx1, yycon1, yycon2, yycon3, yycon4,
+     >                  yycon5, dy1ty1, dy2ty1, dy3ty1, dy4ty1, dy5ty1,
+     >                  zzcon1, zzcon2, zzcon3, zzcon4, zzcon5, dz1tz1, 
+     >                  dz2tz1, dz3tz1, dz4tz1, dz5tz1, dnxm1, dnym1, 
+     >                  dnzm1, c1c2, c1c5, c3c4, c1345, conz1, c1, c2, 
+     >                  c3, c4, c5, c4dssp, c5dssp, dtdssp, dttx1, bt,
+     >                  dttx2, dtty1, dtty2, dttz1, dttz2, c2dttx1, 
+     >                  c2dtty1, c2dttz1, comz1, comz4, comz5, comz6, 
+     >                  c3c4tx3, c3c4ty3, c3c4tz3, c2iv, con43, con16
+
+      common /constants/ tx1, tx2, tx3, ty1, ty2, ty3, tz1, tz2, tz3,
+     >                  dx1, dx2, dx3, dx4, dx5, dy1, dy2, dy3, dy4, 
+     >                  dy5, dz1, dz2, dz3, dz4, dz5, dssp, dt, 
+     >                  ce, dxmax, dymax, dzmax, xxcon1, xxcon2, 
+     >                  xxcon3, xxcon4, xxcon5, dx1tx1, dx2tx1, dx3tx1,
+     >                  dx4tx1, dx5tx1, yycon1, yycon2, yycon3, yycon4,
+     >                  yycon5, dy1ty1, dy2ty1, dy3ty1, dy4ty1, dy5ty1,
+     >                  zzcon1, zzcon2, zzcon3, zzcon4, zzcon5, dz1tz1, 
+     >                  dz2tz1, dz3tz1, dz4tz1, dz5tz1, dnxm1, dnym1, 
+     >                  dnzm1, c1c2, c1c5, c3c4, c1345, conz1, c1, c2, 
+     >                  c3, c4, c5, c4dssp, c5dssp, dtdssp, dttx1, bt,
+     >                  dttx2, dtty1, dtty2, dttz1, dttz2, c2dttx1, 
+     >                  c2dtty1, c2dttz1, comz1, comz4, comz5, comz6, 
+     >                  c3c4tx3, c3c4ty3, c3c4tz3, c2iv, con43, con16
+
+      integer           EAST, WEST, NORTH, SOUTH, 
+     >                  BOTTOM, TOP
+
+      parameter (EAST=2000, WEST=3000,      NORTH=4000, SOUTH=5000,
+     >           BOTTOM=6000, TOP=7000)
+
+      integer cell_coord (3,maxcells), cell_low (3,maxcells), 
+     >        cell_high  (3,maxcells), cell_size(3,maxcells),
+     >        predecessor(3),          slice    (3,maxcells),
+     >        grid_size  (3),          successor(3)         ,
+     >        start      (3,maxcells), end      (3,maxcells)
+      common /partition/ cell_coord, cell_low, cell_high, cell_size,
+     >                   grid_size, successor, predecessor, slice,
+     >                   start, end
+
+      integer IMAX, JMAX, KMAX, MAX_CELL_DIM, BUF_SIZE
+
+      parameter (MAX_CELL_DIM = (problem_size/maxcells)+1)
+
+      parameter (IMAX=MAX_CELL_DIM,JMAX=MAX_CELL_DIM,KMAX=MAX_CELL_DIM)
+
+      parameter (BUF_SIZE=MAX_CELL_DIM*MAX_CELL_DIM*(maxcells-1)*60+1)
+
+      double precision 
+     >   us      (    -1:IMAX,  -1:JMAX,  -1:KMAX,   maxcells),
+     >   vs      (    -1:IMAX,  -1:JMAX,  -1:KMAX,   maxcells),
+     >   ws      (    -1:IMAX,  -1:JMAX,  -1:KMAX,   maxcells),
+     >   qs      (    -1:IMAX,  -1:JMAX,  -1:KMAX,   maxcells),
+     >   rho_i   (    -1:IMAX,  -1:JMAX,  -1:KMAX,   maxcells),
+     >   square  (    -1:IMAX,  -1:JMAX,  -1:KMAX,   maxcells),
+     >   forcing (5,   0:IMAX-1, 0:JMAX-1, 0:KMAX-1, maxcells),
+     >   u       (5,  -2:IMAX+1,-2:JMAX+1,-2:KMAX+1, maxcells),
+     >   rhs     (5,  -1:IMAX-1,-1:JMAX-1,-1:KMAX-1, maxcells),
+     >   lhsc    (5,5,-1:IMAX-1,-1:JMAX-1,-1:KMAX-1, maxcells),
+     >   backsub_info (5, 0:MAX_CELL_DIM, 0:MAX_CELL_DIM, maxcells),
+     >   in_buffer(BUF_SIZE), out_buffer(BUF_SIZE)
+      common /fields/  u, us, vs, ws, qs, rho_i, square, 
+     >                 rhs, forcing, lhsc, in_buffer, out_buffer,
+     >                 backsub_info
+
+      double precision cv(-2:MAX_CELL_DIM+1),   rhon(-2:MAX_CELL_DIM+1),
+     >                 rhos(-2:MAX_CELL_DIM+1), rhoq(-2:MAX_CELL_DIM+1),
+     >                 cuf(-2:MAX_CELL_DIM+1),  q(-2:MAX_CELL_DIM+1),
+     >                 ue(-2:MAX_CELL_DIM+1,5), buf(-2:MAX_CELL_DIM+1,5)
+      common /work_1d/ cv, rhon, rhos, rhoq, cuf, q, ue, buf
+
+      integer  west_size, east_size, bottom_size, top_size,
+     >         north_size, south_size, start_send_west, 
+     >         start_send_east, start_send_south, start_send_north,
+     >         start_send_bottom, start_send_top, start_recv_west,
+     >         start_recv_east, start_recv_south, start_recv_north,
+     >         start_recv_bottom, start_recv_top
+      common /box/ west_size, east_size, bottom_size,
+     >             top_size, north_size, south_size, 
+     >             start_send_west, start_send_east, start_send_south,
+     >             start_send_north, start_send_bottom, start_send_top,
+     >             start_recv_west, start_recv_east, start_recv_south,
+     >             start_recv_north, start_recv_bottom, start_recv_top
+
+      double precision  tmp_block(5,5), b_inverse(5,5), tmp_vec(5)
+      common /work_solve/ tmp_block, b_inverse, tmp_vec
+
+c
+c     These are used by btio
+c
+      integer collbuf_nodes, collbuf_size, iosize, eltext,
+     $        combined_btype, fp, idump, record_length, element,
+     $        combined_ftype, idump_sub, rd_interval
+      common /btio/ collbuf_nodes, collbuf_size, iosize, eltext,
+     $              combined_btype, fp, idump, record_length,
+     $              idump_sub, rd_interval
+      double precision sum(niter_default), xce_sub(5)
+      common /btio/ sum, xce_sub
+      integer*8 iseek
+      common /btio/ iseek, element, combined_ftype
+
+
+      integer t_total, t_io, t_rhs, t_xsolve, t_ysolve, t_zsolve, 
+     >        t_bpack, t_exch, t_xcomm, t_ycomm, t_zcomm, t_last
+      parameter (t_total=1, t_io=2, t_rhs=3, t_xsolve=4, t_ysolve=5, 
+     >        t_zsolve=6, t_bpack=7, t_exch=8, t_xcomm=9, 
+     >        t_ycomm=10, t_zcomm=11, t_last=11)
+      logical timeron
+      common /tflags/ timeron
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/initialize.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/initialize.f
new file mode 100644
index 0000000..274cdb1
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/initialize.f
@@ -0,0 +1,308 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine  initialize
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     This subroutine initializes the field variable u using 
+c     tri-linear transfinite interpolation of the boundary values     
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      
+      integer c, i, j, k, m, ii, jj, kk, ix, iy, iz
+      double precision  xi, eta, zeta, Pface(5,3,2), Pxi, Peta, 
+     >     Pzeta, temp(5)
+
+c---------------------------------------------------------------------
+c  Later (in compute_rhs) we compute 1/u for every element. A few of 
+c  the corner elements are not used, but it is convenient (and faster) 
+c  to compute the whole thing with a simple loop. Make sure those 
+c  values are nonzero by initializing the whole thing here. 
+c---------------------------------------------------------------------
+      do c = 1, ncells
+         do kk = -1, KMAX
+            do jj = -1, JMAX
+               do ii = -1, IMAX
+                  do m = 1, 5
+                     u(m, ii, jj, kk, c) = 1.0d0
+                  end do
+               end do
+            end do
+         end do
+      end do
+c---------------------------------------------------------------------
+
+
+
+c---------------------------------------------------------------------
+c     first store the "interpolated" values everywhere on the grid    
+c---------------------------------------------------------------------
+      do c=1, ncells
+         kk = 0
+         do k = cell_low(3,c), cell_high(3,c)
+            zeta = dble(k) * dnzm1
+            jj = 0
+            do j = cell_low(2,c), cell_high(2,c)
+               eta = dble(j) * dnym1
+               ii = 0
+               do i = cell_low(1,c), cell_high(1,c)
+                  xi = dble(i) * dnxm1
+                  
+                  do ix = 1, 2
+                     call exact_solution(dble(ix-1), eta, zeta, 
+     >                    Pface(1,1,ix))
+                  enddo
+
+                  do iy = 1, 2
+                     call exact_solution(xi, dble(iy-1) , zeta, 
+     >                    Pface(1,2,iy))
+                  enddo
+
+                  do iz = 1, 2
+                     call exact_solution(xi, eta, dble(iz-1),   
+     >                    Pface(1,3,iz))
+                  enddo
+
+                  do m = 1, 5
+                     Pxi   = xi   * Pface(m,1,2) + 
+     >                    (1.0d0-xi)   * Pface(m,1,1)
+                     Peta  = eta  * Pface(m,2,2) + 
+     >                    (1.0d0-eta)  * Pface(m,2,1)
+                     Pzeta = zeta * Pface(m,3,2) + 
+     >                    (1.0d0-zeta) * Pface(m,3,1)
+                     
+                     u(m,ii,jj,kk,c) = Pxi + Peta + Pzeta - 
+     >                    Pxi*Peta - Pxi*Pzeta - Peta*Pzeta + 
+     >                    Pxi*Peta*Pzeta
+
+                  enddo
+                  ii = ii + 1
+               enddo
+               jj = jj + 1
+            enddo
+            kk = kk+1
+         enddo
+      enddo
+
+c---------------------------------------------------------------------
+c     now store the exact values on the boundaries        
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     west face                                                  
+c---------------------------------------------------------------------
+      c = slice(1,1)
+      ii = 0
+      xi = 0.0d0
+      kk = 0
+      do k = cell_low(3,c), cell_high(3,c)
+         zeta = dble(k) * dnzm1
+         jj = 0
+         do j = cell_low(2,c), cell_high(2,c)
+            eta = dble(j) * dnym1
+            call exact_solution(xi, eta, zeta, temp)
+            do m = 1, 5
+               u(m,ii,jj,kk,c) = temp(m)
+            enddo
+            jj = jj + 1
+         enddo
+         kk = kk + 1
+      enddo
+
+c---------------------------------------------------------------------
+c     east face                                                      
+c---------------------------------------------------------------------
+      c  = slice(1,ncells)
+      ii = cell_size(1,c)-1
+      xi = 1.0d0
+      kk = 0
+      do k = cell_low(3,c), cell_high(3,c)
+         zeta = dble(k) * dnzm1
+         jj = 0
+         do j = cell_low(2,c), cell_high(2,c)
+            eta = dble(j) * dnym1
+            call exact_solution(xi, eta, zeta, temp)
+            do m = 1, 5
+               u(m,ii,jj,kk,c) = temp(m)
+            enddo
+            jj = jj + 1
+         enddo
+         kk = kk + 1
+      enddo
+
+c---------------------------------------------------------------------
+c     south face                                                 
+c---------------------------------------------------------------------
+      c = slice(2,1)
+      jj = 0
+      eta = 0.0d0
+      kk = 0
+      do k = cell_low(3,c), cell_high(3,c)
+         zeta = dble(k) * dnzm1
+         ii = 0
+         do i = cell_low(1,c), cell_high(1,c)
+            xi = dble(i) * dnxm1
+            call exact_solution(xi, eta, zeta, temp)
+            do m = 1, 5
+               u(m,ii,jj,kk,c) = temp(m)
+            enddo
+            ii = ii + 1
+         enddo
+         kk = kk + 1
+      enddo
+
+
+c---------------------------------------------------------------------
+c     north face                                    
+c---------------------------------------------------------------------
+      c = slice(2,ncells)
+      jj = cell_size(2,c)-1
+      eta = 1.0d0
+      kk = 0
+      do k = cell_low(3,c), cell_high(3,c)
+         zeta = dble(k) * dnzm1
+         ii = 0
+         do i = cell_low(1,c), cell_high(1,c)
+            xi = dble(i) * dnxm1
+            call exact_solution(xi, eta, zeta, temp)
+            do m = 1, 5
+               u(m,ii,jj,kk,c) = temp(m)
+            enddo
+            ii = ii + 1
+         enddo
+         kk = kk + 1
+      enddo
+
+c---------------------------------------------------------------------
+c     bottom face                                       
+c---------------------------------------------------------------------
+      c = slice(3,1)
+      kk = 0
+      zeta = 0.0d0
+      jj = 0
+      do j = cell_low(2,c), cell_high(2,c)
+         eta = dble(j) * dnym1
+         ii = 0
+         do i =cell_low(1,c), cell_high(1,c)
+            xi = dble(i) *dnxm1
+            call exact_solution(xi, eta, zeta, temp)
+            do m = 1, 5
+               u(m,ii,jj,kk,c) = temp(m)
+            enddo
+            ii = ii + 1
+         enddo
+         jj = jj + 1
+      enddo
+
+c---------------------------------------------------------------------
+c     top face     
+c---------------------------------------------------------------------
+      c = slice(3,ncells)
+      kk = cell_size(3,c)-1
+      zeta = 1.0d0
+      jj = 0
+      do j = cell_low(2,c), cell_high(2,c)
+         eta = dble(j) * dnym1
+         ii = 0
+         do i =cell_low(1,c), cell_high(1,c)
+            xi = dble(i) * dnxm1
+            call exact_solution(xi, eta, zeta, temp)
+            do m = 1, 5
+               u(m,ii,jj,kk,c) = temp(m)
+            enddo
+            ii = ii + 1
+         enddo
+         jj = jj + 1
+      enddo
+
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine lhsinit
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      
+      integer i, j, k, d, c, m, n
+
+c---------------------------------------------------------------------
+c     loop over all cells                                       
+c---------------------------------------------------------------------
+      do c = 1, ncells
+
+c---------------------------------------------------------------------
+c     first, initialize the start and end arrays
+c---------------------------------------------------------------------
+         do d = 1, 3
+            if (cell_coord(d,c) .eq. 1) then
+               start(d,c) = 1
+            else 
+               start(d,c) = 0
+            endif
+            if (cell_coord(d,c) .eq. ncells) then
+               end(d,c) = 1
+            else
+               end(d,c) = 0
+            endif
+         enddo
+
+c---------------------------------------------------------------------
+c     zero the whole left hand side for starters
+c---------------------------------------------------------------------
+         do k = 0, cell_size(3,c)-1
+            do j = 0, cell_size(2,c)-1
+               do i = 0, cell_size(1,c)-1
+                  do m = 1,5
+                     do n = 1, 5
+                        lhsc(m,n,i,j,k,c) = 0.0d0
+                     enddo
+                  enddo
+               enddo
+            enddo
+         enddo
+
+      enddo
+
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine lhsabinit(lhsa, lhsb, size)
+      implicit none
+
+      integer size
+      double precision lhsa(5, 5, -1:size), lhsb(5, 5, -1:size)
+
+      integer i, m, n
+
+c---------------------------------------------------------------------
+c     next, set all diagonal values to 1. This is overkill, but convenient
+c---------------------------------------------------------------------
+      do i = 0, size
+         do m = 1, 5
+            do n = 1, 5
+               lhsa(m,n,i) = 0.0d0
+               lhsb(m,n,i) = 0.0d0
+            enddo
+            lhsb(m,m,i) = 1.0d0
+         enddo
+      enddo
+
+      return
+      end
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/inputbt.data.sample b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/inputbt.data.sample
new file mode 100644
index 0000000..776654e
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/inputbt.data.sample
@@ -0,0 +1,5 @@
+200       number of time steps
+0.0008d0  dt for class A = 0.0008d0. class B = 0.0003d0  class C = 0.0001d0
+64 64 64
+5 0        write interval (optional read interval) for BTIO
+0 1000000  number of nodes in collective buffering and buffer size for BTIO
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/make_set.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/make_set.f
new file mode 100644
index 0000000..ffab37c
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/make_set.f
@@ -0,0 +1,125 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine make_set
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     This function allocates space for a set of cells and fills the set     
+c     such that communication between cells on different nodes is only
+c     nearest neighbor                                                   
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+
+      integer p, i, j, c, dir, size, excess, ierr,ierrcode
+
+c---------------------------------------------------------------------
+c     compute square root; add small number to allow for roundoff
+c     (note: this is computed in setup_mpi.f also, but prefer to do
+c     it twice because of some include file problems).
+c---------------------------------------------------------------------
+      ncells = dint(dsqrt(dble(no_nodes) + 0.00001d0))
+
+c---------------------------------------------------------------------
+c     this makes coding easier
+c---------------------------------------------------------------------
+      p = ncells
+      
+c---------------------------------------------------------------------
+c     determine the location of the cell at the bottom of the 3D 
+c     array of cells
+c---------------------------------------------------------------------
+      cell_coord(1,1) = mod(node,p) 
+      cell_coord(2,1) = node/p 
+      cell_coord(3,1) = 0
+
+c---------------------------------------------------------------------
+c     set the cell_coords for cells in the rest of the z-layers; 
+c     this comes down to a simple linear numbering in the z-direction,
+c     and to the doubly-cyclic numbering in the other directions
+c---------------------------------------------------------------------
+      do c=2, p
+         cell_coord(1,c) = mod(cell_coord(1,c-1)+1,p) 
+         cell_coord(2,c) = mod(cell_coord(2,c-1)-1+p,p) 
+         cell_coord(3,c) = c-1
+      end do
+
+c---------------------------------------------------------------------
+c     offset all the coordinates by 1 to adjust for Fortran arrays
+c---------------------------------------------------------------------
+      do dir = 1, 3
+         do c = 1, p
+            cell_coord(dir,c) = cell_coord(dir,c) + 1
+         end do
+      end do
+      
+c---------------------------------------------------------------------
+c     slice(dir,n) contains the sequence number of the cell that is in
+c     coordinate plane n in the dir direction
+c---------------------------------------------------------------------
+      do dir = 1, 3
+         do c = 1, p
+            slice(dir,cell_coord(dir,c)) = c
+         end do
+      end do
+
+
+c---------------------------------------------------------------------
+c     fill the predecessor and successor entries, using the indices 
+c     of the bottom cells (they are the same at each level of k 
+c     anyway) acting as if full periodicity pertains; note that p is
+c     added to those arguments of the mod function that could otherwise
+c     be negative, for which mod would return a negative result
+c---------------------------------------------------------------------
+      i = cell_coord(1,1)-1
+      j = cell_coord(2,1)-1
+
+      predecessor(1) = mod(i-1+p,p) + p*j
+      predecessor(2) = i + p*mod(j-1+p,p)
+      predecessor(3) = mod(i+1,p) + p*mod(j-1+p,p)
+      successor(1)   = mod(i+1,p) + p*j
+      successor(2)   = i + p*mod(j+1,p)
+      successor(3)   = mod(i-1+p,p) + p*mod(j+1,p)
+
+c---------------------------------------------------------------------
+c     now compute the sizes of the cells                                    
+c---------------------------------------------------------------------
+      do dir= 1, 3
+c---------------------------------------------------------------------
+c     set cell_coord range for each direction                            
+c---------------------------------------------------------------------
+         size   = grid_points(dir)/p
+         excess = mod(grid_points(dir),p)
+         do c=1, ncells
+            if (cell_coord(dir,c) .le. excess) then
+               cell_size(dir,c) = size+1
+               cell_low(dir,c) = (cell_coord(dir,c)-1)*(size+1)
+               cell_high(dir,c) = cell_low(dir,c)+size
+            else 
+               cell_size(dir,c) = size
+               cell_low(dir,c)  = excess*(size+1)+
+     >              (cell_coord(dir,c)-excess-1)*size
+               cell_high(dir,c) = cell_low(dir,c)+size-1
+            endif
+            if (cell_size(dir, c) .le. 2) then
+               write(*,50)
+ 50            format(' Error: Cell size too small. Min size is 3')
+               ierrcode = 1
+               call MPI_Abort(mpi_comm_world,ierrcode,ierr)
+               stop
+            endif
+         end do
+      end do
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/mpinpb.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/mpinpb.h
new file mode 100644
index 0000000..f621f08
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/mpinpb.h
@@ -0,0 +1,12 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      include 'mpif.h'
+
+      integer           node, no_nodes, total_nodes, root, comm_setup, 
+     >                  comm_solve, comm_rhs, dp_type
+      logical           active
+      common /mpistuff/ node, no_nodes, total_nodes, root, comm_setup, 
+     >                  comm_solve, comm_rhs, dp_type, active
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/rhs.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/rhs.f
new file mode 100644
index 0000000..722f750
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/rhs.f
@@ -0,0 +1,428 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine compute_rhs
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      include 'header.h'
+
+      integer c, i, j, k, m
+      double precision rho_inv, uijk, up1, um1, vijk, vp1, vm1,
+     >     wijk, wp1, wm1
+
+
+      if (timeron) call timer_start(t_rhs)
+c---------------------------------------------------------------------
+c     loop over all cells owned by this node                           
+c---------------------------------------------------------------------
+      do c = 1, ncells
+
+c---------------------------------------------------------------------
+c     compute the reciprocal of density, the velocity components,
+c     and the kinetic energy (per unit volume and per unit mass).
+c---------------------------------------------------------------------
+         do k = -1, cell_size(3,c)
+            do j = -1, cell_size(2,c)
+               do i = -1, cell_size(1,c)
+                  rho_inv = 1.0d0/u(1,i,j,k,c)
+                  rho_i(i,j,k,c) = rho_inv
+                  us(i,j,k,c) = u(2,i,j,k,c) * rho_inv
+                  vs(i,j,k,c) = u(3,i,j,k,c) * rho_inv
+                  ws(i,j,k,c) = u(4,i,j,k,c) * rho_inv
+                  square(i,j,k,c)     = 0.5d0* (
+     >                 u(2,i,j,k,c)*u(2,i,j,k,c) + 
+     >                 u(3,i,j,k,c)*u(3,i,j,k,c) +
+     >                 u(4,i,j,k,c)*u(4,i,j,k,c) ) * rho_inv
+                  qs(i,j,k,c) = square(i,j,k,c) * rho_inv
+               enddo
+            enddo
+         enddo
+
+c---------------------------------------------------------------------
+c copy the exact forcing term to the right hand side;  because 
+c this forcing term is known, we can store it on the whole of every 
+c cell,  including the boundary                   
+c---------------------------------------------------------------------
+
+         do k = 0, cell_size(3,c)-1
+            do j = 0, cell_size(2,c)-1
+               do i = 0, cell_size(1,c)-1
+                  do m = 1, 5
+                     rhs(m,i,j,k,c) = forcing(m,i,j,k,c)
+                  enddo
+               enddo
+            enddo
+         enddo
+
+
+c---------------------------------------------------------------------
+c     compute xi-direction fluxes 
+c---------------------------------------------------------------------
+         do k = start(3,c), cell_size(3,c)-end(3,c)-1
+            do j = start(2,c), cell_size(2,c)-end(2,c)-1
+               do i = start(1,c), cell_size(1,c)-end(1,c)-1
+                  uijk = us(i,j,k,c)
+                  up1  = us(i+1,j,k,c)
+                  um1  = us(i-1,j,k,c)
+
+                  rhs(1,i,j,k,c) = rhs(1,i,j,k,c) + dx1tx1 * 
+     >                 (u(1,i+1,j,k,c) - 2.0d0*u(1,i,j,k,c) + 
+     >                 u(1,i-1,j,k,c)) -
+     >                 tx2 * (u(2,i+1,j,k,c) - u(2,i-1,j,k,c))
+
+                  rhs(2,i,j,k,c) = rhs(2,i,j,k,c) + dx2tx1 * 
+     >                 (u(2,i+1,j,k,c) - 2.0d0*u(2,i,j,k,c) + 
+     >                 u(2,i-1,j,k,c)) +
+     >                 xxcon2*con43 * (up1 - 2.0d0*uijk + um1) -
+     >                 tx2 * (u(2,i+1,j,k,c)*up1 - 
+     >                 u(2,i-1,j,k,c)*um1 +
+     >                 (u(5,i+1,j,k,c)- square(i+1,j,k,c)-
+     >                 u(5,i-1,j,k,c)+ square(i-1,j,k,c))*
+     >                 c2)
+
+                  rhs(3,i,j,k,c) = rhs(3,i,j,k,c) + dx3tx1 * 
+     >                 (u(3,i+1,j,k,c) - 2.0d0*u(3,i,j,k,c) +
+     >                 u(3,i-1,j,k,c)) +
+     >                 xxcon2 * (vs(i+1,j,k,c) - 2.0d0*vs(i,j,k,c) +
+     >                 vs(i-1,j,k,c)) -
+     >                 tx2 * (u(3,i+1,j,k,c)*up1 - 
+     >                 u(3,i-1,j,k,c)*um1)
+
+                  rhs(4,i,j,k,c) = rhs(4,i,j,k,c) + dx4tx1 * 
+     >                 (u(4,i+1,j,k,c) - 2.0d0*u(4,i,j,k,c) +
+     >                 u(4,i-1,j,k,c)) +
+     >                 xxcon2 * (ws(i+1,j,k,c) - 2.0d0*ws(i,j,k,c) +
+     >                 ws(i-1,j,k,c)) -
+     >                 tx2 * (u(4,i+1,j,k,c)*up1 - 
+     >                 u(4,i-1,j,k,c)*um1)
+
+                  rhs(5,i,j,k,c) = rhs(5,i,j,k,c) + dx5tx1 * 
+     >                 (u(5,i+1,j,k,c) - 2.0d0*u(5,i,j,k,c) +
+     >                 u(5,i-1,j,k,c)) +
+     >                 xxcon3 * (qs(i+1,j,k,c) - 2.0d0*qs(i,j,k,c) +
+     >                 qs(i-1,j,k,c)) +
+     >                 xxcon4 * (up1*up1 -       2.0d0*uijk*uijk + 
+     >                 um1*um1) +
+     >                 xxcon5 * (u(5,i+1,j,k,c)*rho_i(i+1,j,k,c) - 
+     >                 2.0d0*u(5,i,j,k,c)*rho_i(i,j,k,c) +
+     >                 u(5,i-1,j,k,c)*rho_i(i-1,j,k,c)) -
+     >                 tx2 * ( (c1*u(5,i+1,j,k,c) - 
+     >                 c2*square(i+1,j,k,c))*up1 -
+     >                 (c1*u(5,i-1,j,k,c) - 
+     >                 c2*square(i-1,j,k,c))*um1 )
+               enddo
+            enddo
+         enddo
+
+c---------------------------------------------------------------------
+c     add fourth order xi-direction dissipation               
+c---------------------------------------------------------------------
+         if (start(1,c) .gt. 0) then
+            do k = start(3,c), cell_size(3,c)-end(3,c)-1
+               do j = start(2,c), cell_size(2,c)-end(2,c)-1
+                  i = 1
+                  do m = 1, 5
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c)- dssp * 
+     >                    ( 5.0d0*u(m,i,j,k,c) - 4.0d0*u(m,i+1,j,k,c) +
+     >                    u(m,i+2,j,k,c))
+                  enddo
+
+                  i = 2
+                  do m = 1, 5
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp * 
+     >                    (-4.0d0*u(m,i-1,j,k,c) + 6.0d0*u(m,i,j,k,c) -
+     >                    4.0d0*u(m,i+1,j,k,c) + u(m,i+2,j,k,c))
+                  enddo
+               enddo
+            enddo
+         endif
+
+         do k = start(3,c), cell_size(3,c)-end(3,c)-1
+            do j = start(2,c), cell_size(2,c)-end(2,c)-1
+               do i = 3*start(1,c),cell_size(1,c)-3*end(1,c)-1
+                  do m = 1, 5
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp * 
+     >                    (  u(m,i-2,j,k,c) - 4.0d0*u(m,i-1,j,k,c) + 
+     >                    6.0d0*u(m,i,j,k,c) - 4.0d0*u(m,i+1,j,k,c) + 
+     >                    u(m,i+2,j,k,c) )
+                  enddo
+               enddo
+            enddo
+         enddo
+         
+
+         if (end(1,c) .gt. 0) then
+            do k = start(3,c), cell_size(3,c)-end(3,c)-1
+               do j = start(2,c), cell_size(2,c)-end(2,c)-1
+                  i = cell_size(1,c)-3
+                  do m = 1, 5
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp *
+     >                    ( u(m,i-2,j,k,c) - 4.0d0*u(m,i-1,j,k,c) + 
+     >                    6.0d0*u(m,i,j,k,c) - 4.0d0*u(m,i+1,j,k,c) )
+                  enddo
+
+                  i = cell_size(1,c)-2
+                  do m = 1, 5
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp *
+     >                    ( u(m,i-2,j,k,c) - 4.d0*u(m,i-1,j,k,c) +
+     >                    5.d0*u(m,i,j,k,c) )
+                  enddo
+               enddo
+            enddo
+         endif
+
+c---------------------------------------------------------------------
+c     compute eta-direction fluxes 
+c---------------------------------------------------------------------
+         do k = start(3,c), cell_size(3,c)-end(3,c)-1
+            do j = start(2,c), cell_size(2,c)-end(2,c)-1
+               do i = start(1,c), cell_size(1,c)-end(1,c)-1
+                  vijk = vs(i,j,k,c)
+                  vp1  = vs(i,j+1,k,c)
+                  vm1  = vs(i,j-1,k,c)
+                  rhs(1,i,j,k,c) = rhs(1,i,j,k,c) + dy1ty1 * 
+     >                 (u(1,i,j+1,k,c) - 2.0d0*u(1,i,j,k,c) + 
+     >                 u(1,i,j-1,k,c)) -
+     >                 ty2 * (u(3,i,j+1,k,c) - u(3,i,j-1,k,c))
+                  rhs(2,i,j,k,c) = rhs(2,i,j,k,c) + dy2ty1 * 
+     >                 (u(2,i,j+1,k,c) - 2.0d0*u(2,i,j,k,c) + 
+     >                 u(2,i,j-1,k,c)) +
+     >                 yycon2 * (us(i,j+1,k,c) - 2.0d0*us(i,j,k,c) + 
+     >                 us(i,j-1,k,c)) -
+     >                 ty2 * (u(2,i,j+1,k,c)*vp1 - 
+     >                 u(2,i,j-1,k,c)*vm1)
+                  rhs(3,i,j,k,c) = rhs(3,i,j,k,c) + dy3ty1 * 
+     >                 (u(3,i,j+1,k,c) - 2.0d0*u(3,i,j,k,c) + 
+     >                 u(3,i,j-1,k,c)) +
+     >                 yycon2*con43 * (vp1 - 2.0d0*vijk + vm1) -
+     >                 ty2 * (u(3,i,j+1,k,c)*vp1 - 
+     >                 u(3,i,j-1,k,c)*vm1 +
+     >                 (u(5,i,j+1,k,c) - square(i,j+1,k,c) - 
+     >                 u(5,i,j-1,k,c) + square(i,j-1,k,c))
+     >                 *c2)
+                  rhs(4,i,j,k,c) = rhs(4,i,j,k,c) + dy4ty1 * 
+     >                 (u(4,i,j+1,k,c) - 2.0d0*u(4,i,j,k,c) + 
+     >                 u(4,i,j-1,k,c)) +
+     >                 yycon2 * (ws(i,j+1,k,c) - 2.0d0*ws(i,j,k,c) + 
+     >                 ws(i,j-1,k,c)) -
+     >                 ty2 * (u(4,i,j+1,k,c)*vp1 - 
+     >                 u(4,i,j-1,k,c)*vm1)
+                  rhs(5,i,j,k,c) = rhs(5,i,j,k,c) + dy5ty1 * 
+     >                 (u(5,i,j+1,k,c) - 2.0d0*u(5,i,j,k,c) + 
+     >                 u(5,i,j-1,k,c)) +
+     >                 yycon3 * (qs(i,j+1,k,c) - 2.0d0*qs(i,j,k,c) + 
+     >                 qs(i,j-1,k,c)) +
+     >                 yycon4 * (vp1*vp1       - 2.0d0*vijk*vijk + 
+     >                 vm1*vm1) +
+     >                 yycon5 * (u(5,i,j+1,k,c)*rho_i(i,j+1,k,c) - 
+     >                 2.0d0*u(5,i,j,k,c)*rho_i(i,j,k,c) +
+     >                 u(5,i,j-1,k,c)*rho_i(i,j-1,k,c)) -
+     >                 ty2 * ((c1*u(5,i,j+1,k,c) - 
+     >                 c2*square(i,j+1,k,c)) * vp1 -
+     >                 (c1*u(5,i,j-1,k,c) - 
+     >                 c2*square(i,j-1,k,c)) * vm1)
+               enddo
+            enddo
+         enddo
+
+c---------------------------------------------------------------------
+c     add fourth order eta-direction dissipation         
+c---------------------------------------------------------------------
+         if (start(2,c) .gt. 0) then
+            do k = start(3,c), cell_size(3,c)-end(3,c)-1
+               j = 1
+               do i = start(1,c), cell_size(1,c)-end(1,c)-1
+                  do m = 1, 5
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c)- dssp * 
+     >                    ( 5.0d0*u(m,i,j,k,c) - 4.0d0*u(m,i,j+1,k,c) +
+     >                    u(m,i,j+2,k,c))
+                  enddo
+               enddo
+
+               j = 2
+               do i = start(1,c), cell_size(1,c)-end(1,c)-1
+                  do m = 1, 5
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp * 
+     >                    (-4.0d0*u(m,i,j-1,k,c) + 6.0d0*u(m,i,j,k,c) -
+     >                    4.0d0*u(m,i,j+1,k,c) + u(m,i,j+2,k,c))
+                  enddo
+               enddo
+            enddo
+         endif
+
+         do k = start(3,c), cell_size(3,c)-end(3,c)-1
+            do j = 3*start(2,c), cell_size(2,c)-3*end(2,c)-1
+               do i = start(1,c),cell_size(1,c)-end(1,c)-1
+                  do m = 1, 5
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp * 
+     >                    (  u(m,i,j-2,k,c) - 4.0d0*u(m,i,j-1,k,c) + 
+     >                    6.0d0*u(m,i,j,k,c) - 4.0d0*u(m,i,j+1,k,c) + 
+     >                    u(m,i,j+2,k,c) )
+                  enddo
+               enddo
+            enddo
+         enddo
+         
+         if (end(2,c) .gt. 0) then
+            do k = start(3,c), cell_size(3,c)-end(3,c)-1
+               j = cell_size(2,c)-3
+               do i = start(1,c), cell_size(1,c)-end(1,c)-1
+                  do m = 1, 5
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp *
+     >                    ( u(m,i,j-2,k,c) - 4.0d0*u(m,i,j-1,k,c) + 
+     >                    6.0d0*u(m,i,j,k,c) - 4.0d0*u(m,i,j+1,k,c) )
+                  enddo
+               enddo
+
+               j = cell_size(2,c)-2
+               do i = start(1,c), cell_size(1,c)-end(1,c)-1
+                  do m = 1, 5
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp *
+     >                    ( u(m,i,j-2,k,c) - 4.d0*u(m,i,j-1,k,c) +
+     >                    5.d0*u(m,i,j,k,c) )
+                  enddo
+               enddo
+            enddo
+         endif
+
+c---------------------------------------------------------------------
+c     compute zeta-direction fluxes 
+c---------------------------------------------------------------------
+         do k = start(3,c), cell_size(3,c)-end(3,c)-1
+            do j = start(2,c), cell_size(2,c)-end(2,c)-1
+               do i = start(1,c), cell_size(1,c)-end(1,c)-1
+                  wijk = ws(i,j,k,c)
+                  wp1  = ws(i,j,k+1,c)
+                  wm1  = ws(i,j,k-1,c)
+
+                  rhs(1,i,j,k,c) = rhs(1,i,j,k,c) + dz1tz1 * 
+     >                 (u(1,i,j,k+1,c) - 2.0d0*u(1,i,j,k,c) + 
+     >                 u(1,i,j,k-1,c)) -
+     >                 tz2 * (u(4,i,j,k+1,c) - u(4,i,j,k-1,c))
+                  rhs(2,i,j,k,c) = rhs(2,i,j,k,c) + dz2tz1 * 
+     >                 (u(2,i,j,k+1,c) - 2.0d0*u(2,i,j,k,c) + 
+     >                 u(2,i,j,k-1,c)) +
+     >                 zzcon2 * (us(i,j,k+1,c) - 2.0d0*us(i,j,k,c) + 
+     >                 us(i,j,k-1,c)) -
+     >                 tz2 * (u(2,i,j,k+1,c)*wp1 - 
+     >                 u(2,i,j,k-1,c)*wm1)
+                  rhs(3,i,j,k,c) = rhs(3,i,j,k,c) + dz3tz1 * 
+     >                 (u(3,i,j,k+1,c) - 2.0d0*u(3,i,j,k,c) + 
+     >                 u(3,i,j,k-1,c)) +
+     >                 zzcon2 * (vs(i,j,k+1,c) - 2.0d0*vs(i,j,k,c) + 
+     >                 vs(i,j,k-1,c)) -
+     >                 tz2 * (u(3,i,j,k+1,c)*wp1 - 
+     >                 u(3,i,j,k-1,c)*wm1)
+                  rhs(4,i,j,k,c) = rhs(4,i,j,k,c) + dz4tz1 * 
+     >                 (u(4,i,j,k+1,c) - 2.0d0*u(4,i,j,k,c) + 
+     >                 u(4,i,j,k-1,c)) +
+     >                 zzcon2*con43 * (wp1 - 2.0d0*wijk + wm1) -
+     >                 tz2 * (u(4,i,j,k+1,c)*wp1 - 
+     >                 u(4,i,j,k-1,c)*wm1 +
+     >                 (u(5,i,j,k+1,c) - square(i,j,k+1,c) - 
+     >                 u(5,i,j,k-1,c) + square(i,j,k-1,c))
+     >                 *c2)
+                  rhs(5,i,j,k,c) = rhs(5,i,j,k,c) + dz5tz1 * 
+     >                 (u(5,i,j,k+1,c) - 2.0d0*u(5,i,j,k,c) + 
+     >                 u(5,i,j,k-1,c)) +
+     >                 zzcon3 * (qs(i,j,k+1,c) - 2.0d0*qs(i,j,k,c) + 
+     >                 qs(i,j,k-1,c)) +
+     >                 zzcon4 * (wp1*wp1 - 2.0d0*wijk*wijk + 
+     >                 wm1*wm1) +
+     >                 zzcon5 * (u(5,i,j,k+1,c)*rho_i(i,j,k+1,c) - 
+     >                 2.0d0*u(5,i,j,k,c)*rho_i(i,j,k,c) +
+     >                 u(5,i,j,k-1,c)*rho_i(i,j,k-1,c)) -
+     >                 tz2 * ( (c1*u(5,i,j,k+1,c) - 
+     >                 c2*square(i,j,k+1,c))*wp1 -
+     >                 (c1*u(5,i,j,k-1,c) - 
+     >                 c2*square(i,j,k-1,c))*wm1)
+               enddo
+            enddo
+         enddo
+
+c---------------------------------------------------------------------
+c     add fourth order zeta-direction dissipation                
+c---------------------------------------------------------------------
+         if (start(3,c) .gt. 0) then
+            k = 1
+            do j = start(2,c), cell_size(2,c)-end(2,c)-1
+               do i = start(1,c), cell_size(1,c)-end(1,c)-1
+                  do m = 1, 5
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c)- dssp * 
+     >                    ( 5.0d0*u(m,i,j,k,c) - 4.0d0*u(m,i,j,k+1,c) +
+     >                    u(m,i,j,k+2,c))
+                  enddo
+               enddo
+            enddo
+
+            k = 2
+            do j = start(2,c), cell_size(2,c)-end(2,c)-1
+               do i = start(1,c), cell_size(1,c)-end(1,c)-1
+                  do m = 1, 5
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp * 
+     >                    (-4.0d0*u(m,i,j,k-1,c) + 6.0d0*u(m,i,j,k,c) -
+     >                    4.0d0*u(m,i,j,k+1,c) + u(m,i,j,k+2,c))
+                  enddo
+               enddo
+            enddo
+         endif
+
+         do k = 3*start(3,c), cell_size(3,c)-3*end(3,c)-1
+            do j = start(2,c), cell_size(2,c)-end(2,c)-1
+               do i = start(1,c),cell_size(1,c)-end(1,c)-1
+                  do m = 1, 5
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp * 
+     >                    (  u(m,i,j,k-2,c) - 4.0d0*u(m,i,j,k-1,c) + 
+     >                    6.0d0*u(m,i,j,k,c) - 4.0d0*u(m,i,j,k+1,c) + 
+     >                    u(m,i,j,k+2,c) )
+                  enddo
+               enddo
+            enddo
+         enddo
+         
+         if (end(3,c) .gt. 0) then
+            k = cell_size(3,c)-3
+            do j = start(2,c), cell_size(2,c)-end(2,c)-1
+               do i = start(1,c), cell_size(1,c)-end(1,c)-1
+                  do m = 1, 5
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp *
+     >                    ( u(m,i,j,k-2,c) - 4.0d0*u(m,i,j,k-1,c) + 
+     >                    6.0d0*u(m,i,j,k,c) - 4.0d0*u(m,i,j,k+1,c) )
+                  enddo
+               enddo
+            enddo
+
+            k = cell_size(3,c)-2
+            do j = start(2,c), cell_size(2,c)-end(2,c)-1
+               do i = start(1,c), cell_size(1,c)-end(1,c)-1
+                  do m = 1, 5
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp *
+     >                    ( u(m,i,j,k-2,c) - 4.d0*u(m,i,j,k-1,c) +
+     >                    5.d0*u(m,i,j,k,c) )
+                  enddo
+               enddo
+            enddo
+         endif
+
+         do k = start(3,c), cell_size(3,c)-end(3,c)-1
+            do j = start(2,c), cell_size(2,c)-end(2,c)-1
+               do i = start(1,c), cell_size(1,c)-end(1,c)-1
+                  do m = 1, 5
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c) * dt
+                  enddo
+               enddo
+            enddo
+         enddo
+
+      enddo
+     
+      if (timeron) call timer_stop(t_rhs)
+     
+      return
+      end
+
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/set_constants.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/set_constants.f
new file mode 100644
index 0000000..81397d4
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/set_constants.f
@@ -0,0 +1,202 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine  set_constants
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      
+      ce(1,1)  = 2.0d0
+      ce(1,2)  = 0.0d0
+      ce(1,3)  = 0.0d0
+      ce(1,4)  = 4.0d0
+      ce(1,5)  = 5.0d0
+      ce(1,6)  = 3.0d0
+      ce(1,7)  = 0.5d0
+      ce(1,8)  = 0.02d0
+      ce(1,9)  = 0.01d0
+      ce(1,10) = 0.03d0
+      ce(1,11) = 0.5d0
+      ce(1,12) = 0.4d0
+      ce(1,13) = 0.3d0
+      
+      ce(2,1)  = 1.0d0
+      ce(2,2)  = 0.0d0
+      ce(2,3)  = 0.0d0
+      ce(2,4)  = 0.0d0
+      ce(2,5)  = 1.0d0
+      ce(2,6)  = 2.0d0
+      ce(2,7)  = 3.0d0
+      ce(2,8)  = 0.01d0
+      ce(2,9)  = 0.03d0
+      ce(2,10) = 0.02d0
+      ce(2,11) = 0.4d0
+      ce(2,12) = 0.3d0
+      ce(2,13) = 0.5d0
+
+      ce(3,1)  = 2.0d0
+      ce(3,2)  = 2.0d0
+      ce(3,3)  = 0.0d0
+      ce(3,4)  = 0.0d0
+      ce(3,5)  = 0.0d0
+      ce(3,6)  = 2.0d0
+      ce(3,7)  = 3.0d0
+      ce(3,8)  = 0.04d0
+      ce(3,9)  = 0.03d0
+      ce(3,10) = 0.05d0
+      ce(3,11) = 0.3d0
+      ce(3,12) = 0.5d0
+      ce(3,13) = 0.4d0
+
+      ce(4,1)  = 2.0d0
+      ce(4,2)  = 2.0d0
+      ce(4,3)  = 0.0d0
+      ce(4,4)  = 0.0d0
+      ce(4,5)  = 0.0d0
+      ce(4,6)  = 2.0d0
+      ce(4,7)  = 3.0d0
+      ce(4,8)  = 0.03d0
+      ce(4,9)  = 0.05d0
+      ce(4,10) = 0.04d0
+      ce(4,11) = 0.2d0
+      ce(4,12) = 0.1d0
+      ce(4,13) = 0.3d0
+
+      ce(5,1)  = 5.0d0
+      ce(5,2)  = 4.0d0
+      ce(5,3)  = 3.0d0
+      ce(5,4)  = 2.0d0
+      ce(5,5)  = 0.1d0
+      ce(5,6)  = 0.4d0
+      ce(5,7)  = 0.3d0
+      ce(5,8)  = 0.05d0
+      ce(5,9)  = 0.04d0
+      ce(5,10) = 0.03d0
+      ce(5,11) = 0.1d0
+      ce(5,12) = 0.3d0
+      ce(5,13) = 0.2d0
+
+      c1 = 1.4d0
+      c2 = 0.4d0
+      c3 = 0.1d0
+      c4 = 1.0d0
+      c5 = 1.4d0
+
+      bt = dsqrt(0.5d0)
+
+      dnxm1 = 1.0d0 / dble(grid_points(1)-1)
+      dnym1 = 1.0d0 / dble(grid_points(2)-1)
+      dnzm1 = 1.0d0 / dble(grid_points(3)-1)
+
+      c1c2 = c1 * c2
+      c1c5 = c1 * c5
+      c3c4 = c3 * c4
+      c1345 = c1c5 * c3c4
+
+      conz1 = (1.0d0-c1c5)
+
+      tx1 = 1.0d0 / (dnxm1 * dnxm1)
+      tx2 = 1.0d0 / (2.0d0 * dnxm1)
+      tx3 = 1.0d0 / dnxm1
+
+      ty1 = 1.0d0 / (dnym1 * dnym1)
+      ty2 = 1.0d0 / (2.0d0 * dnym1)
+      ty3 = 1.0d0 / dnym1
+      
+      tz1 = 1.0d0 / (dnzm1 * dnzm1)
+      tz2 = 1.0d0 / (2.0d0 * dnzm1)
+      tz3 = 1.0d0 / dnzm1
+
+      dx1 = 0.75d0
+      dx2 = 0.75d0
+      dx3 = 0.75d0
+      dx4 = 0.75d0
+      dx5 = 0.75d0
+
+      dy1 = 0.75d0
+      dy2 = 0.75d0
+      dy3 = 0.75d0
+      dy4 = 0.75d0
+      dy5 = 0.75d0
+
+      dz1 = 1.0d0
+      dz2 = 1.0d0
+      dz3 = 1.0d0
+      dz4 = 1.0d0
+      dz5 = 1.0d0
+
+      dxmax = dmax1(dx3, dx4)
+      dymax = dmax1(dy2, dy4)
+      dzmax = dmax1(dz2, dz3)
+
+      dssp = 0.25d0 * dmax1(dx1, dmax1(dy1, dz1) )
+
+      c4dssp = 4.0d0 * dssp
+      c5dssp = 5.0d0 * dssp
+
+      dttx1 = dt*tx1
+      dttx2 = dt*tx2
+      dtty1 = dt*ty1
+      dtty2 = dt*ty2
+      dttz1 = dt*tz1
+      dttz2 = dt*tz2
+
+      c2dttx1 = 2.0d0*dttx1
+      c2dtty1 = 2.0d0*dtty1
+      c2dttz1 = 2.0d0*dttz1
+
+      dtdssp = dt*dssp
+
+      comz1  = dtdssp
+      comz4  = 4.0d0*dtdssp
+      comz5  = 5.0d0*dtdssp
+      comz6  = 6.0d0*dtdssp
+
+      c3c4tx3 = c3c4*tx3
+      c3c4ty3 = c3c4*ty3
+      c3c4tz3 = c3c4*tz3
+
+      dx1tx1 = dx1*tx1
+      dx2tx1 = dx2*tx1
+      dx3tx1 = dx3*tx1
+      dx4tx1 = dx4*tx1
+      dx5tx1 = dx5*tx1
+      
+      dy1ty1 = dy1*ty1
+      dy2ty1 = dy2*ty1
+      dy3ty1 = dy3*ty1
+      dy4ty1 = dy4*ty1
+      dy5ty1 = dy5*ty1
+      
+      dz1tz1 = dz1*tz1
+      dz2tz1 = dz2*tz1
+      dz3tz1 = dz3*tz1
+      dz4tz1 = dz4*tz1
+      dz5tz1 = dz5*tz1
+
+      c2iv  = 2.5d0
+      con43 = 4.0d0/3.0d0
+      con16 = 1.0d0/6.0d0
+      
+      xxcon1 = c3c4tx3*con43*tx3
+      xxcon2 = c3c4tx3*tx3
+      xxcon3 = c3c4tx3*conz1*tx3
+      xxcon4 = c3c4tx3*con16*tx3
+      xxcon5 = c3c4tx3*c1c5*tx3
+
+      yycon1 = c3c4ty3*con43*ty3
+      yycon2 = c3c4ty3*ty3
+      yycon3 = c3c4ty3*conz1*ty3
+      yycon4 = c3c4ty3*con16*ty3
+      yycon5 = c3c4ty3*c1c5*ty3
+
+      zzcon1 = c3c4tz3*con43*tz3
+      zzcon2 = c3c4tz3*tz3
+      zzcon3 = c3c4tz3*conz1*tz3
+      zzcon4 = c3c4tz3*con16*tz3
+      zzcon5 = c3c4tz3*c1c5*tz3
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/setup_mpi.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/setup_mpi.f
new file mode 100644
index 0000000..987c6bf
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/setup_mpi.f
@@ -0,0 +1,64 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine setup_mpi
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c set up MPI stuff
+c---------------------------------------------------------------------
+
+      implicit none
+      include 'mpinpb.h'
+      include 'npbparams.h'
+      integer error, color, nc
+
+      call mpi_init(error)
+      
+      call mpi_comm_size(MPI_COMM_WORLD, total_nodes, error)
+      call mpi_comm_rank(MPI_COMM_WORLD, node, error)
+
+      if (.not. convertdouble) then
+         dp_type = MPI_DOUBLE_PRECISION
+      else
+         dp_type = MPI_REAL
+      endif
+
+c---------------------------------------------------------------------
+c     compute square root; add small number to allow for roundoff
+c---------------------------------------------------------------------
+      nc = dint(dsqrt(dble(total_nodes) + 0.00001d0))
+
+c---------------------------------------------------------------------
+c We handle a non-square number of nodes by making the excess nodes
+c inactive. However, we can never handle more cells than were compiled
+c in. 
+c---------------------------------------------------------------------
+
+      if (nc .gt. maxcells) nc = maxcells
+      if (node .ge. nc*nc) then
+         active = .false.
+         color = 1
+      else
+         active = .true.
+         color = 0
+      end if
+      
+      call mpi_comm_split(MPI_COMM_WORLD,color,node,comm_setup,error)
+      if (.not. active) return
+
+      call mpi_comm_size(comm_setup, no_nodes, error)
+      call mpi_comm_dup(comm_setup, comm_solve, error)
+      call mpi_comm_dup(comm_setup, comm_rhs, error)
+      
+c---------------------------------------------------------------------
+c     let node 0 be the root for the group (there is only one)
+c---------------------------------------------------------------------
+      root = 0
+
+      return
+      end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/simple_mpiio.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/simple_mpiio.f
new file mode 100644
index 0000000..02e2700
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/simple_mpiio.f
@@ -0,0 +1,213 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine setup_btio
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer m, ierr
+
+      iseek=0
+
+      if (node .eq. root) then
+          call MPI_File_delete(filenm, MPI_INFO_NULL, ierr)
+      endif
+
+      call MPI_Barrier(comm_solve, ierr)
+
+      call MPI_File_open(comm_solve,
+     $          filenm,
+     $          MPI_MODE_RDWR + MPI_MODE_CREATE,
+     $          MPI_INFO_NULL,
+     $          fp,
+     $          ierr)
+
+      call MPI_File_set_view(fp,
+     $          iseek, MPI_DOUBLE_PRECISION, MPI_DOUBLE_PRECISION,
+     $          'native', MPI_INFO_NULL, ierr)
+
+      if (ierr .ne. MPI_SUCCESS) then
+          print *, 'Error opening file'
+          stop
+      endif
+
+      do m = 1, 5
+         xce_sub(m) = 0.d0
+      end do
+
+      idump_sub = 0
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine output_timestep
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer count, jio, kio, cio, aio
+      integer ierr
+      integer mstatus(MPI_STATUS_SIZE)
+
+      do cio=1,ncells
+          do kio=0, cell_size(3,cio)-1
+              do jio=0, cell_size(2,cio)-1
+                  iseek=5*(cell_low(1,cio) +
+     $                   PROBLEM_SIZE*((cell_low(2,cio)+jio) +
+     $                   PROBLEM_SIZE*((cell_low(3,cio)+kio) +
+     $                   PROBLEM_SIZE*idump_sub)))
+
+                  count=5*cell_size(1,cio)
+
+                  call MPI_File_write_at(fp, iseek,
+     $                  u(1,0,jio,kio,cio),
+     $                  count, MPI_DOUBLE_PRECISION,
+     $                  mstatus, ierr)
+
+                  if (ierr .ne. MPI_SUCCESS) then
+                      print *, 'Error writing to file'
+                      stop
+                  endif
+              enddo
+          enddo
+      enddo
+
+      idump_sub = idump_sub + 1
+      if (rd_interval .gt. 0) then
+         if (idump_sub .ge. rd_interval) then
+
+            call acc_sub_norms(idump+1)
+
+            idump_sub = 0
+         endif
+      endif
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine acc_sub_norms(idump_cur)
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer idump_cur
+
+      integer count, jio, kio, cio, ii, m, ichunk
+      integer ierr
+      integer mstatus(MPI_STATUS_SIZE)
+      double precision xce_single(5)
+
+      ichunk = idump_cur - idump_sub + 1
+      do ii=0, idump_sub-1
+        do cio=1,ncells
+          do kio=0, cell_size(3,cio)-1
+              do jio=0, cell_size(2,cio)-1
+                  iseek=5*(cell_low(1,cio) +
+     $                   PROBLEM_SIZE*((cell_low(2,cio)+jio) +
+     $                   PROBLEM_SIZE*((cell_low(3,cio)+kio) +
+     $                   PROBLEM_SIZE*ii)))
+
+                  count=5*cell_size(1,cio)
+
+                  call MPI_File_read_at(fp, iseek,
+     $                  u(1,0,jio,kio,cio),
+     $                  count, MPI_DOUBLE_PRECISION,
+     $                  mstatus, ierr)
+
+                  if (ierr .ne. MPI_SUCCESS) then
+                      print *, 'Error reading back file'
+                      call MPI_File_close(fp, ierr)
+                      stop
+                  endif
+              enddo
+          enddo
+        enddo
+
+        if (node .eq. root) print *, 'Reading data set ', ii+ichunk
+
+        call error_norm(xce_single)
+        do m = 1, 5
+           xce_sub(m) = xce_sub(m) + xce_single(m)
+        end do
+      enddo
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine btio_cleanup
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer ierr
+
+      call MPI_File_close(fp, ierr)
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine accumulate_norms(xce_acc)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      double precision xce_acc(5)
+      integer m, ierr
+
+      if (rd_interval .gt. 0) goto 20
+
+      call MPI_File_open(comm_solve,
+     $          filenm,
+     $          MPI_MODE_RDONLY,
+     $          MPI_INFO_NULL,
+     $          fp,
+     $          ierr)
+
+      iseek = 0
+      call MPI_File_set_view(fp,
+     $          iseek, MPI_DOUBLE_PRECISION, MPI_DOUBLE_PRECISION,
+     $          'native', MPI_INFO_NULL, ierr)
+
+c     clear the last time step
+
+      call clear_timestep
+
+c     read back the time steps and accumulate norms
+
+      call acc_sub_norms(idump)
+
+      call MPI_File_close(fp, ierr)
+
+ 20   continue
+      do m = 1, 5
+         xce_acc(m) = xce_sub(m) / dble(idump)
+      end do
+
+      return
+      end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/solve_subs.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/solve_subs.f
new file mode 100644
index 0000000..351489a
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/solve_subs.f
@@ -0,0 +1,642 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine matvec_sub(ablock,avec,bvec)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     subtracts bvec=bvec - ablock*avec
+c---------------------------------------------------------------------
+
+      implicit none
+
+      double precision ablock,avec,bvec
+      dimension ablock(5,5),avec(5),bvec(5)
+
+c---------------------------------------------------------------------
+c            rhs(i,ic,jc,kc,ccell) = rhs(i,ic,jc,kc,ccell) 
+c     $           - lhs(i,1,ablock,ia,ja,ka,acell)*
+c---------------------------------------------------------------------
+         bvec(1) = bvec(1) - ablock(1,1)*avec(1)
+     >                     - ablock(1,2)*avec(2)
+     >                     - ablock(1,3)*avec(3)
+     >                     - ablock(1,4)*avec(4)
+     >                     - ablock(1,5)*avec(5)
+         bvec(2) = bvec(2) - ablock(2,1)*avec(1)
+     >                     - ablock(2,2)*avec(2)
+     >                     - ablock(2,3)*avec(3)
+     >                     - ablock(2,4)*avec(4)
+     >                     - ablock(2,5)*avec(5)
+         bvec(3) = bvec(3) - ablock(3,1)*avec(1)
+     >                     - ablock(3,2)*avec(2)
+     >                     - ablock(3,3)*avec(3)
+     >                     - ablock(3,4)*avec(4)
+     >                     - ablock(3,5)*avec(5)
+         bvec(4) = bvec(4) - ablock(4,1)*avec(1)
+     >                     - ablock(4,2)*avec(2)
+     >                     - ablock(4,3)*avec(3)
+     >                     - ablock(4,4)*avec(4)
+     >                     - ablock(4,5)*avec(5)
+         bvec(5) = bvec(5) - ablock(5,1)*avec(1)
+     >                     - ablock(5,2)*avec(2)
+     >                     - ablock(5,3)*avec(3)
+     >                     - ablock(5,4)*avec(4)
+     >                     - ablock(5,5)*avec(5)
+
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine matmul_sub(ablock, bblock, cblock)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     subtracts a(i,j,k) X b(i,j,k) from c(i,j,k)
+c---------------------------------------------------------------------
+
+      implicit none
+
+      double precision ablock, bblock, cblock
+      dimension ablock(5,5), bblock(5,5), cblock(5,5)
+
+
+         cblock(1,1) = cblock(1,1) - ablock(1,1)*bblock(1,1)
+     >                             - ablock(1,2)*bblock(2,1)
+     >                             - ablock(1,3)*bblock(3,1)
+     >                             - ablock(1,4)*bblock(4,1)
+     >                             - ablock(1,5)*bblock(5,1)
+         cblock(2,1) = cblock(2,1) - ablock(2,1)*bblock(1,1)
+     >                             - ablock(2,2)*bblock(2,1)
+     >                             - ablock(2,3)*bblock(3,1)
+     >                             - ablock(2,4)*bblock(4,1)
+     >                             - ablock(2,5)*bblock(5,1)
+         cblock(3,1) = cblock(3,1) - ablock(3,1)*bblock(1,1)
+     >                             - ablock(3,2)*bblock(2,1)
+     >                             - ablock(3,3)*bblock(3,1)
+     >                             - ablock(3,4)*bblock(4,1)
+     >                             - ablock(3,5)*bblock(5,1)
+         cblock(4,1) = cblock(4,1) - ablock(4,1)*bblock(1,1)
+     >                             - ablock(4,2)*bblock(2,1)
+     >                             - ablock(4,3)*bblock(3,1)
+     >                             - ablock(4,4)*bblock(4,1)
+     >                             - ablock(4,5)*bblock(5,1)
+         cblock(5,1) = cblock(5,1) - ablock(5,1)*bblock(1,1)
+     >                             - ablock(5,2)*bblock(2,1)
+     >                             - ablock(5,3)*bblock(3,1)
+     >                             - ablock(5,4)*bblock(4,1)
+     >                             - ablock(5,5)*bblock(5,1)
+         cblock(1,2) = cblock(1,2) - ablock(1,1)*bblock(1,2)
+     >                             - ablock(1,2)*bblock(2,2)
+     >                             - ablock(1,3)*bblock(3,2)
+     >                             - ablock(1,4)*bblock(4,2)
+     >                             - ablock(1,5)*bblock(5,2)
+         cblock(2,2) = cblock(2,2) - ablock(2,1)*bblock(1,2)
+     >                             - ablock(2,2)*bblock(2,2)
+     >                             - ablock(2,3)*bblock(3,2)
+     >                             - ablock(2,4)*bblock(4,2)
+     >                             - ablock(2,5)*bblock(5,2)
+         cblock(3,2) = cblock(3,2) - ablock(3,1)*bblock(1,2)
+     >                             - ablock(3,2)*bblock(2,2)
+     >                             - ablock(3,3)*bblock(3,2)
+     >                             - ablock(3,4)*bblock(4,2)
+     >                             - ablock(3,5)*bblock(5,2)
+         cblock(4,2) = cblock(4,2) - ablock(4,1)*bblock(1,2)
+     >                             - ablock(4,2)*bblock(2,2)
+     >                             - ablock(4,3)*bblock(3,2)
+     >                             - ablock(4,4)*bblock(4,2)
+     >                             - ablock(4,5)*bblock(5,2)
+         cblock(5,2) = cblock(5,2) - ablock(5,1)*bblock(1,2)
+     >                             - ablock(5,2)*bblock(2,2)
+     >                             - ablock(5,3)*bblock(3,2)
+     >                             - ablock(5,4)*bblock(4,2)
+     >                             - ablock(5,5)*bblock(5,2)
+         cblock(1,3) = cblock(1,3) - ablock(1,1)*bblock(1,3)
+     >                             - ablock(1,2)*bblock(2,3)
+     >                             - ablock(1,3)*bblock(3,3)
+     >                             - ablock(1,4)*bblock(4,3)
+     >                             - ablock(1,5)*bblock(5,3)
+         cblock(2,3) = cblock(2,3) - ablock(2,1)*bblock(1,3)
+     >                             - ablock(2,2)*bblock(2,3)
+     >                             - ablock(2,3)*bblock(3,3)
+     >                             - ablock(2,4)*bblock(4,3)
+     >                             - ablock(2,5)*bblock(5,3)
+         cblock(3,3) = cblock(3,3) - ablock(3,1)*bblock(1,3)
+     >                             - ablock(3,2)*bblock(2,3)
+     >                             - ablock(3,3)*bblock(3,3)
+     >                             - ablock(3,4)*bblock(4,3)
+     >                             - ablock(3,5)*bblock(5,3)
+         cblock(4,3) = cblock(4,3) - ablock(4,1)*bblock(1,3)
+     >                             - ablock(4,2)*bblock(2,3)
+     >                             - ablock(4,3)*bblock(3,3)
+     >                             - ablock(4,4)*bblock(4,3)
+     >                             - ablock(4,5)*bblock(5,3)
+         cblock(5,3) = cblock(5,3) - ablock(5,1)*bblock(1,3)
+     >                             - ablock(5,2)*bblock(2,3)
+     >                             - ablock(5,3)*bblock(3,3)
+     >                             - ablock(5,4)*bblock(4,3)
+     >                             - ablock(5,5)*bblock(5,3)
+         cblock(1,4) = cblock(1,4) - ablock(1,1)*bblock(1,4)
+     >                             - ablock(1,2)*bblock(2,4)
+     >                             - ablock(1,3)*bblock(3,4)
+     >                             - ablock(1,4)*bblock(4,4)
+     >                             - ablock(1,5)*bblock(5,4)
+         cblock(2,4) = cblock(2,4) - ablock(2,1)*bblock(1,4)
+     >                             - ablock(2,2)*bblock(2,4)
+     >                             - ablock(2,3)*bblock(3,4)
+     >                             - ablock(2,4)*bblock(4,4)
+     >                             - ablock(2,5)*bblock(5,4)
+         cblock(3,4) = cblock(3,4) - ablock(3,1)*bblock(1,4)
+     >                             - ablock(3,2)*bblock(2,4)
+     >                             - ablock(3,3)*bblock(3,4)
+     >                             - ablock(3,4)*bblock(4,4)
+     >                             - ablock(3,5)*bblock(5,4)
+         cblock(4,4) = cblock(4,4) - ablock(4,1)*bblock(1,4)
+     >                             - ablock(4,2)*bblock(2,4)
+     >                             - ablock(4,3)*bblock(3,4)
+     >                             - ablock(4,4)*bblock(4,4)
+     >                             - ablock(4,5)*bblock(5,4)
+         cblock(5,4) = cblock(5,4) - ablock(5,1)*bblock(1,4)
+     >                             - ablock(5,2)*bblock(2,4)
+     >                             - ablock(5,3)*bblock(3,4)
+     >                             - ablock(5,4)*bblock(4,4)
+     >                             - ablock(5,5)*bblock(5,4)
+         cblock(1,5) = cblock(1,5) - ablock(1,1)*bblock(1,5)
+     >                             - ablock(1,2)*bblock(2,5)
+     >                             - ablock(1,3)*bblock(3,5)
+     >                             - ablock(1,4)*bblock(4,5)
+     >                             - ablock(1,5)*bblock(5,5)
+         cblock(2,5) = cblock(2,5) - ablock(2,1)*bblock(1,5)
+     >                             - ablock(2,2)*bblock(2,5)
+     >                             - ablock(2,3)*bblock(3,5)
+     >                             - ablock(2,4)*bblock(4,5)
+     >                             - ablock(2,5)*bblock(5,5)
+         cblock(3,5) = cblock(3,5) - ablock(3,1)*bblock(1,5)
+     >                             - ablock(3,2)*bblock(2,5)
+     >                             - ablock(3,3)*bblock(3,5)
+     >                             - ablock(3,4)*bblock(4,5)
+     >                             - ablock(3,5)*bblock(5,5)
+         cblock(4,5) = cblock(4,5) - ablock(4,1)*bblock(1,5)
+     >                             - ablock(4,2)*bblock(2,5)
+     >                             - ablock(4,3)*bblock(3,5)
+     >                             - ablock(4,4)*bblock(4,5)
+     >                             - ablock(4,5)*bblock(5,5)
+         cblock(5,5) = cblock(5,5) - ablock(5,1)*bblock(1,5)
+     >                             - ablock(5,2)*bblock(2,5)
+     >                             - ablock(5,3)*bblock(3,5)
+     >                             - ablock(5,4)*bblock(4,5)
+     >                             - ablock(5,5)*bblock(5,5)
+
+              
+      return
+      end
+
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine binvcrhs( lhs,c,r )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     In-place Gauss-Jordan solve: c = inv(lhs)*c, r = inv(lhs)*r
+c---------------------------------------------------------------------
+
+      implicit none
+
+      double precision pivot, coeff, lhs
+      dimension lhs(5,5)
+      double precision c(5,5), r(5)
+
+c---------------------------------------------------------------------
+c     Elimination uses no row interchanges; lhs is overwritten
+c---------------------------------------------------------------------
+
+      pivot = 1.00d0/lhs(1,1)
+      lhs(1,2) = lhs(1,2)*pivot
+      lhs(1,3) = lhs(1,3)*pivot
+      lhs(1,4) = lhs(1,4)*pivot
+      lhs(1,5) = lhs(1,5)*pivot
+      c(1,1) = c(1,1)*pivot
+      c(1,2) = c(1,2)*pivot
+      c(1,3) = c(1,3)*pivot
+      c(1,4) = c(1,4)*pivot
+      c(1,5) = c(1,5)*pivot
+      r(1)   = r(1)  *pivot
+
+      coeff = lhs(2,1)
+      lhs(2,2)= lhs(2,2) - coeff*lhs(1,2)
+      lhs(2,3)= lhs(2,3) - coeff*lhs(1,3)
+      lhs(2,4)= lhs(2,4) - coeff*lhs(1,4)
+      lhs(2,5)= lhs(2,5) - coeff*lhs(1,5)
+      c(2,1) = c(2,1) - coeff*c(1,1)
+      c(2,2) = c(2,2) - coeff*c(1,2)
+      c(2,3) = c(2,3) - coeff*c(1,3)
+      c(2,4) = c(2,4) - coeff*c(1,4)
+      c(2,5) = c(2,5) - coeff*c(1,5)
+      r(2)   = r(2)   - coeff*r(1)
+
+      coeff = lhs(3,1)
+      lhs(3,2)= lhs(3,2) - coeff*lhs(1,2)
+      lhs(3,3)= lhs(3,3) - coeff*lhs(1,3)
+      lhs(3,4)= lhs(3,4) - coeff*lhs(1,4)
+      lhs(3,5)= lhs(3,5) - coeff*lhs(1,5)
+      c(3,1) = c(3,1) - coeff*c(1,1)
+      c(3,2) = c(3,2) - coeff*c(1,2)
+      c(3,3) = c(3,3) - coeff*c(1,3)
+      c(3,4) = c(3,4) - coeff*c(1,4)
+      c(3,5) = c(3,5) - coeff*c(1,5)
+      r(3)   = r(3)   - coeff*r(1)
+
+      coeff = lhs(4,1)
+      lhs(4,2)= lhs(4,2) - coeff*lhs(1,2)
+      lhs(4,3)= lhs(4,3) - coeff*lhs(1,3)
+      lhs(4,4)= lhs(4,4) - coeff*lhs(1,4)
+      lhs(4,5)= lhs(4,5) - coeff*lhs(1,5)
+      c(4,1) = c(4,1) - coeff*c(1,1)
+      c(4,2) = c(4,2) - coeff*c(1,2)
+      c(4,3) = c(4,3) - coeff*c(1,3)
+      c(4,4) = c(4,4) - coeff*c(1,4)
+      c(4,5) = c(4,5) - coeff*c(1,5)
+      r(4)   = r(4)   - coeff*r(1)
+
+      coeff = lhs(5,1)
+      lhs(5,2)= lhs(5,2) - coeff*lhs(1,2)
+      lhs(5,3)= lhs(5,3) - coeff*lhs(1,3)
+      lhs(5,4)= lhs(5,4) - coeff*lhs(1,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(1,5)
+      c(5,1) = c(5,1) - coeff*c(1,1)
+      c(5,2) = c(5,2) - coeff*c(1,2)
+      c(5,3) = c(5,3) - coeff*c(1,3)
+      c(5,4) = c(5,4) - coeff*c(1,4)
+      c(5,5) = c(5,5) - coeff*c(1,5)
+      r(5)   = r(5)   - coeff*r(1)
+
+
+      pivot = 1.00d0/lhs(2,2)
+      lhs(2,3) = lhs(2,3)*pivot
+      lhs(2,4) = lhs(2,4)*pivot
+      lhs(2,5) = lhs(2,5)*pivot
+      c(2,1) = c(2,1)*pivot
+      c(2,2) = c(2,2)*pivot
+      c(2,3) = c(2,3)*pivot
+      c(2,4) = c(2,4)*pivot
+      c(2,5) = c(2,5)*pivot
+      r(2)   = r(2)  *pivot
+
+      coeff = lhs(1,2)
+      lhs(1,3)= lhs(1,3) - coeff*lhs(2,3)
+      lhs(1,4)= lhs(1,4) - coeff*lhs(2,4)
+      lhs(1,5)= lhs(1,5) - coeff*lhs(2,5)
+      c(1,1) = c(1,1) - coeff*c(2,1)
+      c(1,2) = c(1,2) - coeff*c(2,2)
+      c(1,3) = c(1,3) - coeff*c(2,3)
+      c(1,4) = c(1,4) - coeff*c(2,4)
+      c(1,5) = c(1,5) - coeff*c(2,5)
+      r(1)   = r(1)   - coeff*r(2)
+
+      coeff = lhs(3,2)
+      lhs(3,3)= lhs(3,3) - coeff*lhs(2,3)
+      lhs(3,4)= lhs(3,4) - coeff*lhs(2,4)
+      lhs(3,5)= lhs(3,5) - coeff*lhs(2,5)
+      c(3,1) = c(3,1) - coeff*c(2,1)
+      c(3,2) = c(3,2) - coeff*c(2,2)
+      c(3,3) = c(3,3) - coeff*c(2,3)
+      c(3,4) = c(3,4) - coeff*c(2,4)
+      c(3,5) = c(3,5) - coeff*c(2,5)
+      r(3)   = r(3)   - coeff*r(2)
+
+      coeff = lhs(4,2)
+      lhs(4,3)= lhs(4,3) - coeff*lhs(2,3)
+      lhs(4,4)= lhs(4,4) - coeff*lhs(2,4)
+      lhs(4,5)= lhs(4,5) - coeff*lhs(2,5)
+      c(4,1) = c(4,1) - coeff*c(2,1)
+      c(4,2) = c(4,2) - coeff*c(2,2)
+      c(4,3) = c(4,3) - coeff*c(2,3)
+      c(4,4) = c(4,4) - coeff*c(2,4)
+      c(4,5) = c(4,5) - coeff*c(2,5)
+      r(4)   = r(4)   - coeff*r(2)
+
+      coeff = lhs(5,2)
+      lhs(5,3)= lhs(5,3) - coeff*lhs(2,3)
+      lhs(5,4)= lhs(5,4) - coeff*lhs(2,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(2,5)
+      c(5,1) = c(5,1) - coeff*c(2,1)
+      c(5,2) = c(5,2) - coeff*c(2,2)
+      c(5,3) = c(5,3) - coeff*c(2,3)
+      c(5,4) = c(5,4) - coeff*c(2,4)
+      c(5,5) = c(5,5) - coeff*c(2,5)
+      r(5)   = r(5)   - coeff*r(2)
+
+
+      pivot = 1.00d0/lhs(3,3)
+      lhs(3,4) = lhs(3,4)*pivot
+      lhs(3,5) = lhs(3,5)*pivot
+      c(3,1) = c(3,1)*pivot
+      c(3,2) = c(3,2)*pivot
+      c(3,3) = c(3,3)*pivot
+      c(3,4) = c(3,4)*pivot
+      c(3,5) = c(3,5)*pivot
+      r(3)   = r(3)  *pivot
+
+      coeff = lhs(1,3)
+      lhs(1,4)= lhs(1,4) - coeff*lhs(3,4)
+      lhs(1,5)= lhs(1,5) - coeff*lhs(3,5)
+      c(1,1) = c(1,1) - coeff*c(3,1)
+      c(1,2) = c(1,2) - coeff*c(3,2)
+      c(1,3) = c(1,3) - coeff*c(3,3)
+      c(1,4) = c(1,4) - coeff*c(3,4)
+      c(1,5) = c(1,5) - coeff*c(3,5)
+      r(1)   = r(1)   - coeff*r(3)
+
+      coeff = lhs(2,3)
+      lhs(2,4)= lhs(2,4) - coeff*lhs(3,4)
+      lhs(2,5)= lhs(2,5) - coeff*lhs(3,5)
+      c(2,1) = c(2,1) - coeff*c(3,1)
+      c(2,2) = c(2,2) - coeff*c(3,2)
+      c(2,3) = c(2,3) - coeff*c(3,3)
+      c(2,4) = c(2,4) - coeff*c(3,4)
+      c(2,5) = c(2,5) - coeff*c(3,5)
+      r(2)   = r(2)   - coeff*r(3)
+
+      coeff = lhs(4,3)
+      lhs(4,4)= lhs(4,4) - coeff*lhs(3,4)
+      lhs(4,5)= lhs(4,5) - coeff*lhs(3,5)
+      c(4,1) = c(4,1) - coeff*c(3,1)
+      c(4,2) = c(4,2) - coeff*c(3,2)
+      c(4,3) = c(4,3) - coeff*c(3,3)
+      c(4,4) = c(4,4) - coeff*c(3,4)
+      c(4,5) = c(4,5) - coeff*c(3,5)
+      r(4)   = r(4)   - coeff*r(3)
+
+      coeff = lhs(5,3)
+      lhs(5,4)= lhs(5,4) - coeff*lhs(3,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(3,5)
+      c(5,1) = c(5,1) - coeff*c(3,1)
+      c(5,2) = c(5,2) - coeff*c(3,2)
+      c(5,3) = c(5,3) - coeff*c(3,3)
+      c(5,4) = c(5,4) - coeff*c(3,4)
+      c(5,5) = c(5,5) - coeff*c(3,5)
+      r(5)   = r(5)   - coeff*r(3)
+
+
+      pivot = 1.00d0/lhs(4,4)
+      lhs(4,5) = lhs(4,5)*pivot
+      c(4,1) = c(4,1)*pivot
+      c(4,2) = c(4,2)*pivot
+      c(4,3) = c(4,3)*pivot
+      c(4,4) = c(4,4)*pivot
+      c(4,5) = c(4,5)*pivot
+      r(4)   = r(4)  *pivot
+
+      coeff = lhs(1,4)
+      lhs(1,5)= lhs(1,5) - coeff*lhs(4,5)
+      c(1,1) = c(1,1) - coeff*c(4,1)
+      c(1,2) = c(1,2) - coeff*c(4,2)
+      c(1,3) = c(1,3) - coeff*c(4,3)
+      c(1,4) = c(1,4) - coeff*c(4,4)
+      c(1,5) = c(1,5) - coeff*c(4,5)
+      r(1)   = r(1)   - coeff*r(4)
+
+      coeff = lhs(2,4)
+      lhs(2,5)= lhs(2,5) - coeff*lhs(4,5)
+      c(2,1) = c(2,1) - coeff*c(4,1)
+      c(2,2) = c(2,2) - coeff*c(4,2)
+      c(2,3) = c(2,3) - coeff*c(4,3)
+      c(2,4) = c(2,4) - coeff*c(4,4)
+      c(2,5) = c(2,5) - coeff*c(4,5)
+      r(2)   = r(2)   - coeff*r(4)
+
+      coeff = lhs(3,4)
+      lhs(3,5)= lhs(3,5) - coeff*lhs(4,5)
+      c(3,1) = c(3,1) - coeff*c(4,1)
+      c(3,2) = c(3,2) - coeff*c(4,2)
+      c(3,3) = c(3,3) - coeff*c(4,3)
+      c(3,4) = c(3,4) - coeff*c(4,4)
+      c(3,5) = c(3,5) - coeff*c(4,5)
+      r(3)   = r(3)   - coeff*r(4)
+
+      coeff = lhs(5,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(4,5)
+      c(5,1) = c(5,1) - coeff*c(4,1)
+      c(5,2) = c(5,2) - coeff*c(4,2)
+      c(5,3) = c(5,3) - coeff*c(4,3)
+      c(5,4) = c(5,4) - coeff*c(4,4)
+      c(5,5) = c(5,5) - coeff*c(4,5)
+      r(5)   = r(5)   - coeff*r(4)
+
+
+      pivot = 1.00d0/lhs(5,5)
+      c(5,1) = c(5,1)*pivot
+      c(5,2) = c(5,2)*pivot
+      c(5,3) = c(5,3)*pivot
+      c(5,4) = c(5,4)*pivot
+      c(5,5) = c(5,5)*pivot
+      r(5)   = r(5)  *pivot
+
+      coeff = lhs(1,5)
+      c(1,1) = c(1,1) - coeff*c(5,1)
+      c(1,2) = c(1,2) - coeff*c(5,2)
+      c(1,3) = c(1,3) - coeff*c(5,3)
+      c(1,4) = c(1,4) - coeff*c(5,4)
+      c(1,5) = c(1,5) - coeff*c(5,5)
+      r(1)   = r(1)   - coeff*r(5)
+
+      coeff = lhs(2,5)
+      c(2,1) = c(2,1) - coeff*c(5,1)
+      c(2,2) = c(2,2) - coeff*c(5,2)
+      c(2,3) = c(2,3) - coeff*c(5,3)
+      c(2,4) = c(2,4) - coeff*c(5,4)
+      c(2,5) = c(2,5) - coeff*c(5,5)
+      r(2)   = r(2)   - coeff*r(5)
+
+      coeff = lhs(3,5)
+      c(3,1) = c(3,1) - coeff*c(5,1)
+      c(3,2) = c(3,2) - coeff*c(5,2)
+      c(3,3) = c(3,3) - coeff*c(5,3)
+      c(3,4) = c(3,4) - coeff*c(5,4)
+      c(3,5) = c(3,5) - coeff*c(5,5)
+      r(3)   = r(3)   - coeff*r(5)
+
+      coeff = lhs(4,5)
+      c(4,1) = c(4,1) - coeff*c(5,1)
+      c(4,2) = c(4,2) - coeff*c(5,2)
+      c(4,3) = c(4,3) - coeff*c(5,3)
+      c(4,4) = c(4,4) - coeff*c(5,4)
+      c(4,5) = c(4,5) - coeff*c(5,5)
+      r(4)   = r(4)   - coeff*r(5)
+
+
+      return
+      end
+
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine binvrhs( lhs,r )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     In-place Gauss-Jordan solve: r = inv(lhs)*r
+c---------------------------------------------------------------------
+
+      implicit none
+
+      double precision pivot, coeff, lhs
+      dimension lhs(5,5)
+      double precision r(5)
+
+c---------------------------------------------------------------------
+c     Elimination uses no row interchanges; lhs is overwritten
+c---------------------------------------------------------------------
+
+
+      pivot = 1.00d0/lhs(1,1)
+      lhs(1,2) = lhs(1,2)*pivot
+      lhs(1,3) = lhs(1,3)*pivot
+      lhs(1,4) = lhs(1,4)*pivot
+      lhs(1,5) = lhs(1,5)*pivot
+      r(1)   = r(1)  *pivot
+
+      coeff = lhs(2,1)
+      lhs(2,2)= lhs(2,2) - coeff*lhs(1,2)
+      lhs(2,3)= lhs(2,3) - coeff*lhs(1,3)
+      lhs(2,4)= lhs(2,4) - coeff*lhs(1,4)
+      lhs(2,5)= lhs(2,5) - coeff*lhs(1,5)
+      r(2)   = r(2)   - coeff*r(1)
+
+      coeff = lhs(3,1)
+      lhs(3,2)= lhs(3,2) - coeff*lhs(1,2)
+      lhs(3,3)= lhs(3,3) - coeff*lhs(1,3)
+      lhs(3,4)= lhs(3,4) - coeff*lhs(1,4)
+      lhs(3,5)= lhs(3,5) - coeff*lhs(1,5)
+      r(3)   = r(3)   - coeff*r(1)
+
+      coeff = lhs(4,1)
+      lhs(4,2)= lhs(4,2) - coeff*lhs(1,2)
+      lhs(4,3)= lhs(4,3) - coeff*lhs(1,3)
+      lhs(4,4)= lhs(4,4) - coeff*lhs(1,4)
+      lhs(4,5)= lhs(4,5) - coeff*lhs(1,5)
+      r(4)   = r(4)   - coeff*r(1)
+
+      coeff = lhs(5,1)
+      lhs(5,2)= lhs(5,2) - coeff*lhs(1,2)
+      lhs(5,3)= lhs(5,3) - coeff*lhs(1,3)
+      lhs(5,4)= lhs(5,4) - coeff*lhs(1,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(1,5)
+      r(5)   = r(5)   - coeff*r(1)
+
+
+      pivot = 1.00d0/lhs(2,2)
+      lhs(2,3) = lhs(2,3)*pivot
+      lhs(2,4) = lhs(2,4)*pivot
+      lhs(2,5) = lhs(2,5)*pivot
+      r(2)   = r(2)  *pivot
+
+      coeff = lhs(1,2)
+      lhs(1,3)= lhs(1,3) - coeff*lhs(2,3)
+      lhs(1,4)= lhs(1,4) - coeff*lhs(2,4)
+      lhs(1,5)= lhs(1,5) - coeff*lhs(2,5)
+      r(1)   = r(1)   - coeff*r(2)
+
+      coeff = lhs(3,2)
+      lhs(3,3)= lhs(3,3) - coeff*lhs(2,3)
+      lhs(3,4)= lhs(3,4) - coeff*lhs(2,4)
+      lhs(3,5)= lhs(3,5) - coeff*lhs(2,5)
+      r(3)   = r(3)   - coeff*r(2)
+
+      coeff = lhs(4,2)
+      lhs(4,3)= lhs(4,3) - coeff*lhs(2,3)
+      lhs(4,4)= lhs(4,4) - coeff*lhs(2,4)
+      lhs(4,5)= lhs(4,5) - coeff*lhs(2,5)
+      r(4)   = r(4)   - coeff*r(2)
+
+      coeff = lhs(5,2)
+      lhs(5,3)= lhs(5,3) - coeff*lhs(2,3)
+      lhs(5,4)= lhs(5,4) - coeff*lhs(2,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(2,5)
+      r(5)   = r(5)   - coeff*r(2)
+
+
+      pivot = 1.00d0/lhs(3,3)
+      lhs(3,4) = lhs(3,4)*pivot
+      lhs(3,5) = lhs(3,5)*pivot
+      r(3)   = r(3)  *pivot
+
+      coeff = lhs(1,3)
+      lhs(1,4)= lhs(1,4) - coeff*lhs(3,4)
+      lhs(1,5)= lhs(1,5) - coeff*lhs(3,5)
+      r(1)   = r(1)   - coeff*r(3)
+
+      coeff = lhs(2,3)
+      lhs(2,4)= lhs(2,4) - coeff*lhs(3,4)
+      lhs(2,5)= lhs(2,5) - coeff*lhs(3,5)
+      r(2)   = r(2)   - coeff*r(3)
+
+      coeff = lhs(4,3)
+      lhs(4,4)= lhs(4,4) - coeff*lhs(3,4)
+      lhs(4,5)= lhs(4,5) - coeff*lhs(3,5)
+      r(4)   = r(4)   - coeff*r(3)
+
+      coeff = lhs(5,3)
+      lhs(5,4)= lhs(5,4) - coeff*lhs(3,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(3,5)
+      r(5)   = r(5)   - coeff*r(3)
+
+
+      pivot = 1.00d0/lhs(4,4)
+      lhs(4,5) = lhs(4,5)*pivot
+      r(4)   = r(4)  *pivot
+
+      coeff = lhs(1,4)
+      lhs(1,5)= lhs(1,5) - coeff*lhs(4,5)
+      r(1)   = r(1)   - coeff*r(4)
+
+      coeff = lhs(2,4)
+      lhs(2,5)= lhs(2,5) - coeff*lhs(4,5)
+      r(2)   = r(2)   - coeff*r(4)
+
+      coeff = lhs(3,4)
+      lhs(3,5)= lhs(3,5) - coeff*lhs(4,5)
+      r(3)   = r(3)   - coeff*r(4)
+
+      coeff = lhs(5,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(4,5)
+      r(5)   = r(5)   - coeff*r(4)
+
+
+      pivot = 1.00d0/lhs(5,5)
+      r(5)   = r(5)  *pivot
+
+      coeff = lhs(1,5)
+      r(1)   = r(1)   - coeff*r(5)
+
+      coeff = lhs(2,5)
+      r(2)   = r(2)   - coeff*r(5)
+
+      coeff = lhs(3,5)
+      r(3)   = r(3)   - coeff*r(5)
+
+      coeff = lhs(4,5)
+      r(4)   = r(4)   - coeff*r(5)
+
+
+      return
+      end
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/verify.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/verify.f
new file mode 100644
index 0000000..d1863f2
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/verify.f
@@ -0,0 +1,434 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+        subroutine verify(no_time_steps, class, verified)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c  verification routine                         
+c---------------------------------------------------------------------
+
+        include 'header.h'
+        include 'mpinpb.h'
+
+        double precision xcrref(5),xceref(5),xcrdif(5),xcedif(5), 
+     >                   epsilon, xce(5), xcr(5), dtref
+        integer m, no_time_steps
+        character class
+        logical verified
+
+c---------------------------------------------------------------------
+c   tolerance level
+c---------------------------------------------------------------------
+        epsilon = 1.0d-08
+        verified = .true.
+
+c---------------------------------------------------------------------
+c   compute the error norm and the residual norm, and exit if not printing
+c---------------------------------------------------------------------
+
+        if (iotype .ne. 0) then
+           call accumulate_norms(xce)
+        else
+           call error_norm(xce)
+        endif
+
+        call copy_faces
+
+        call rhs_norm(xcr)
+
+        do m = 1, 5
+           xcr(m) = xcr(m) / dt
+        enddo
+
+        if (node .ne. 0) return
+
+        class = 'U'
+
+        do m = 1,5
+           xcrref(m) = 1.0
+           xceref(m) = 1.0
+        end do
+
+c---------------------------------------------------------------------
+c    reference data for 12X12X12 grids after 60 time steps, with DT = 1.0d-02
+c---------------------------------------------------------------------
+        if ( (grid_points(1)  .eq. 12     ) .and. 
+     >       (grid_points(2)  .eq. 12     ) .and.
+     >       (grid_points(3)  .eq. 12     ) .and.
+     >       (no_time_steps   .eq. 60    ))  then
+
+           class = 'S'
+           dtref = 1.0d-2
+
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+         xcrref(1) = 1.7034283709541311d-01
+         xcrref(2) = 1.2975252070034097d-02
+         xcrref(3) = 3.2527926989486055d-02
+         xcrref(4) = 2.6436421275166801d-02
+         xcrref(5) = 1.9211784131744430d-01
+
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+
+         if (iotype .eq. 0) then
+           xceref(1) = 4.9976913345811579d-04
+           xceref(2) = 4.5195666782961927d-05
+           xceref(3) = 7.3973765172921357d-05
+           xceref(4) = 7.3821238632439731d-05
+           xceref(5) = 8.9269630987491446d-04
+         else
+           xceref(1) = 0.1149036328945d+02
+           xceref(2) = 0.9156788904727d+00
+           xceref(3) = 0.2857899428614d+01
+           xceref(4) = 0.2598273346734d+01
+           xceref(5) = 0.2652795397547d+02
+         endif
+
+c---------------------------------------------------------------------
+c    reference data for 24X24X24 grids after 200 time steps, with DT = 0.8d-3
+c---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 24) .and. 
+     >           (grid_points(2) .eq. 24) .and.
+     >           (grid_points(3) .eq. 24) .and.
+     >           (no_time_steps . eq. 200) ) then
+
+           class = 'W'
+           dtref = 0.8d-3
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+           xcrref(1) = 0.1125590409344d+03
+           xcrref(2) = 0.1180007595731d+02
+           xcrref(3) = 0.2710329767846d+02
+           xcrref(4) = 0.2469174937669d+02
+           xcrref(5) = 0.2638427874317d+03
+
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+
+         if (iotype .eq. 0) then
+           xceref(1) = 0.4419655736008d+01
+           xceref(2) = 0.4638531260002d+00
+           xceref(3) = 0.1011551749967d+01
+           xceref(4) = 0.9235878729944d+00
+           xceref(5) = 0.1018045837718d+02
+         else
+           xceref(1) = 0.6729594398612d+02
+           xceref(2) = 0.5264523081690d+01
+           xceref(3) = 0.1677107142637d+02
+           xceref(4) = 0.1508721463436d+02
+           xceref(5) = 0.1477018363393d+03
+         endif
+
+
+c---------------------------------------------------------------------
+c    reference data for 64X64X64 grids after 200 time steps, with DT = 0.8d-3
+c---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 64) .and. 
+     >           (grid_points(2) .eq. 64) .and.
+     >           (grid_points(3) .eq. 64) .and.
+     >           (no_time_steps . eq. 200) ) then
+
+           class = 'A'
+           dtref = 0.8d-3
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+         xcrref(1) = 1.0806346714637264d+02
+         xcrref(2) = 1.1319730901220813d+01
+         xcrref(3) = 2.5974354511582465d+01
+         xcrref(4) = 2.3665622544678910d+01
+         xcrref(5) = 2.5278963211748344d+02
+
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+
+         if (iotype .eq. 0) then
+           xceref(1) = 4.2348416040525025d+00
+           xceref(2) = 4.4390282496995698d-01
+           xceref(3) = 9.6692480136345650d-01
+           xceref(4) = 8.8302063039765474d-01
+           xceref(5) = 9.7379901770829278d+00
+         else
+           xceref(1) = 0.6482218724961d+02
+           xceref(2) = 0.5066461714527d+01
+           xceref(3) = 0.1613931961359d+02
+           xceref(4) = 0.1452010201481d+02
+           xceref(5) = 0.1420099377681d+03
+         endif
+
+c---------------------------------------------------------------------
+c    reference data for 102X102X102 grids after 200 time steps,
+c    with DT = 3.0d-04
+c---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 102) .and. 
+     >           (grid_points(2) .eq. 102) .and.
+     >           (grid_points(3) .eq. 102) .and.
+     >           (no_time_steps . eq. 200) ) then
+
+           class = 'B'
+           dtref = 3.0d-4
+
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+         xcrref(1) = 1.4233597229287254d+03
+         xcrref(2) = 9.9330522590150238d+01
+         xcrref(3) = 3.5646025644535285d+02
+         xcrref(4) = 3.2485447959084092d+02
+         xcrref(5) = 3.2707541254659363d+03
+
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+
+         if (iotype .eq. 0) then
+           xceref(1) = 5.2969847140936856d+01
+           xceref(2) = 4.4632896115670668d+00
+           xceref(3) = 1.3122573342210174d+01
+           xceref(4) = 1.2006925323559144d+01
+           xceref(5) = 1.2459576151035986d+02
+         else
+           xceref(1) = 0.1477545106464d+03
+           xceref(2) = 0.1108895555053d+02
+           xceref(3) = 0.3698065590331d+02
+           xceref(4) = 0.3310505581440d+02
+           xceref(5) = 0.3157928282563d+03
+         endif
+
+c---------------------------------------------------------------------
+c    reference data for 162X162X162 grids after 200 time steps,
+c    with DT = 1.0d-04
+c---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 162) .and. 
+     >           (grid_points(2) .eq. 162) .and.
+     >           (grid_points(3) .eq. 162) .and.
+     >           (no_time_steps . eq. 200) ) then
+
+           class = 'C'
+           dtref = 1.0d-4
+
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+         xcrref(1) = 0.62398116551764615d+04
+         xcrref(2) = 0.50793239190423964d+03
+         xcrref(3) = 0.15423530093013596d+04
+         xcrref(4) = 0.13302387929291190d+04
+         xcrref(5) = 0.11604087428436455d+05
+
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+
+         if (iotype .eq. 0) then
+           xceref(1) = 0.16462008369091265d+03
+           xceref(2) = 0.11497107903824313d+02
+           xceref(3) = 0.41207446207461508d+02
+           xceref(4) = 0.37087651059694167d+02
+           xceref(5) = 0.36211053051841265d+03
+         else
+           xceref(1) = 0.2597156483475d+03
+           xceref(2) = 0.1985384289495d+02
+           xceref(3) = 0.6517950485788d+02
+           xceref(4) = 0.5757235541520d+02
+           xceref(5) = 0.5215668188726d+03
+         endif 
+
+
+c---------------------------------------------------------------------
+c    reference data for 408x408x408 grids after 250 time steps,
+c    with DT = 0.2d-04
+c---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 408) .and. 
+     >           (grid_points(2) .eq. 408) .and.
+     >           (grid_points(3) .eq. 408) .and.
+     >           (no_time_steps . eq. 250) ) then
+
+           class = 'D'
+           dtref = 0.2d-4
+
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+         xcrref(1) = 0.2533188551738d+05
+         xcrref(2) = 0.2346393716980d+04
+         xcrref(3) = 0.6294554366904d+04
+         xcrref(4) = 0.5352565376030d+04
+         xcrref(5) = 0.3905864038618d+05
+
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+
+         if (iotype .eq. 0) then
+           xceref(1) = 0.3100009377557d+03
+           xceref(2) = 0.2424086324913d+02
+           xceref(3) = 0.7782212022645d+02
+           xceref(4) = 0.6835623860116d+02
+           xceref(5) = 0.6065737200368d+03
+         else
+           xceref(1) = 0.3813781566713d+03
+           xceref(2) = 0.3160872966198d+02
+           xceref(3) = 0.9593576357290d+02
+           xceref(4) = 0.8363391989815d+02
+           xceref(5) = 0.7063466087423d+03
+         endif
+
+
+c---------------------------------------------------------------------
+c    reference data for 1020x1020x1020 grids after 250 time steps,
+c    with DT = 0.4d-05
+c---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 1020) .and. 
+     >           (grid_points(2) .eq. 1020) .and.
+     >           (grid_points(3) .eq. 1020) .and.
+     >           (no_time_steps . eq. 250) ) then
+
+           class = 'E'
+           dtref = 0.4d-5
+
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+         xcrref(1) = 0.9795372484517d+05
+         xcrref(2) = 0.9739814511521d+04
+         xcrref(3) = 0.2467606342965d+05
+         xcrref(4) = 0.2092419572860d+05
+         xcrref(5) = 0.1392138856939d+06
+
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+
+         if (iotype .eq. 0) then
+           xceref(1) = 0.4327562208414d+03
+           xceref(2) = 0.3699051964887d+02
+           xceref(3) = 0.1089845040954d+03
+           xceref(4) = 0.9462517622043d+02
+           xceref(5) = 0.7765512765309d+03
+         else
+c  wr_interval = 5
+           xceref(1) = 0.4729898413058d+03
+           xceref(2) = 0.4145899331704d+02
+           xceref(3) = 0.1192850917138d+03
+           xceref(4) = 0.1032746026932d+03
+           xceref(5) = 0.8270322177634d+03
+c  wr_interval = 10
+c          xceref(1) = 0.4718135916251d+03
+c          xceref(2) = 0.4132620259096d+02
+c          xceref(3) = 0.1189831133503d+03
+c          xceref(4) = 0.1030212798803d+03
+c          xceref(5) = 0.8255924078458d+03
+        endif
+
+        else
+           verified = .false.
+        endif
+
+c---------------------------------------------------------------------
+c    verification test for residuals if gridsize is one of 
+c    the defined grid sizes above (class .ne. 'U')
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c    Compute the difference of solution values and the known reference 
+c    values.
+c---------------------------------------------------------------------
+        do m = 1, 5
+           
+           xcrdif(m) = dabs((xcr(m)-xcrref(m))/xcrref(m)) 
+           xcedif(m) = dabs((xce(m)-xceref(m))/xceref(m))
+           
+        enddo
+
+c---------------------------------------------------------------------
+c    Output the comparison of computed results to known cases.
+c---------------------------------------------------------------------
+
+        if (class .ne. 'U') then
+           write(*, 1990) class
+ 1990      format(' Verification being performed for class ', a)
+           write (*,2000) epsilon
+ 2000      format(' accuracy setting for epsilon = ', E20.13)
+           verified = (dabs(dt-dtref) .le. epsilon)
+           if (.not.verified) then  
+              class = 'U'
+              write (*,1000) dtref
+ 1000         format(' DT does not match the reference value of ', 
+     >                 E15.8)
+           endif
+        else 
+           write(*, 1995)
+ 1995      format(' Unknown class')
+        endif
+
+
+        if (class .ne. 'U') then
+           write (*,2001) 
+        else
+           write (*, 2005)
+        endif
+
+ 2001   format(' Comparison of RMS-norms of residual')
+ 2005   format(' RMS-norms of residual')
+        do m = 1, 5
+           if (class .eq. 'U') then
+              write(*, 2015) m, xcr(m)
+           else if (xcrdif(m) .le. epsilon) then
+              write (*,2011) m,xcr(m),xcrref(m),xcrdif(m)
+           else 
+              verified = .false.
+              write (*,2010) m,xcr(m),xcrref(m),xcrdif(m)
+           endif
+        enddo
+
+        if (class .ne. 'U') then
+           write (*,2002)
+        else
+           write (*,2006)
+        endif
+ 2002   format(' Comparison of RMS-norms of solution error')
+ 2006   format(' RMS-norms of solution error')
+        
+        do m = 1, 5
+           if (class .eq. 'U') then
+              write(*, 2015) m, xce(m)
+           else if (xcedif(m) .le. epsilon) then
+              write (*,2011) m,xce(m),xceref(m),xcedif(m)
+           else
+              verified = .false.
+              write (*,2010) m,xce(m),xceref(m),xcedif(m)
+           endif
+        enddo
+        
+ 2010   format(' FAILURE: ', i2, E20.13, E20.13, E20.13)
+ 2011   format('          ', i2, E20.13, E20.13, E20.13)
+ 2015   format('          ', i2, E20.13)
+        
+        if (class .eq. 'U') then
+           write(*, 2022)
+           write(*, 2023)
+ 2022      format(' No reference values provided')
+ 2023      format(' No verification performed')
+        else if (verified) then
+           write(*, 2020)
+ 2020      format(' Verification Successful')
+        else
+           write(*, 2021)
+ 2021      format(' Verification failed')
+        endif
+
+        return
+
+
+        end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/work_lhs.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/work_lhs.h
new file mode 100644
index 0000000..d9bc9e4
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/work_lhs.h
@@ -0,0 +1,14 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+c
+c  work_lhs.h
+c
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      double precision fjac(5, 5, -2:MAX_CELL_DIM+1),
+     >                 njac(5, 5, -2:MAX_CELL_DIM+1),
+     >                 lhsa(5, 5, -1:MAX_CELL_DIM),
+     >                 lhsb(5, 5, -1:MAX_CELL_DIM),
+     >                 tmp1, tmp2, tmp3
+      common /work_lhs/ fjac, njac, lhsa, lhsb, tmp1, tmp2, tmp3
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/work_lhs_vec.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/work_lhs_vec.h
new file mode 100644
index 0000000..a97054f
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/work_lhs_vec.h
@@ -0,0 +1,14 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+c
+c  work_lhs_vec.h
+c
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      double precision fjac(5, 5, -2:MAX_CELL_DIM+1, -2:MAX_CELL_DIM+1),
+     >                 njac(5, 5, -2:MAX_CELL_DIM+1, -2:MAX_CELL_DIM+1),
+     >                 lhsa(5, 5, -1:MAX_CELL_DIM,   -1:MAX_CELL_DIM),
+     >                 lhsb(5, 5, -1:MAX_CELL_DIM,   -1:MAX_CELL_DIM),
+     >                 tmp1, tmp2, tmp3
+      common /work_lhs/ fjac, njac, lhsa, lhsb, tmp1, tmp2, tmp3
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/x_solve.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/x_solve.f
new file mode 100644
index 0000000..ecc2c02
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/x_solve.f
@@ -0,0 +1,771 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine x_solve
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     
+c     Performs line solves in X direction by first factoring
+c     the block-tridiagonal matrix into an upper triangular matrix, 
+c     and then performing back substitution to solve for the unknown
+c     vectors of each line.  
+c     
+c     Make sure we treat elements zero to cell_size in the direction
+c     of the sweep.
+c     
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+      integer  c, istart, stage,
+     >     first, last, recv_id, error, r_status(MPI_STATUS_SIZE),
+     >     isize,jsize,ksize,send_id
+
+      istart = 0
+
+      if (timeron) call timer_start(t_xsolve)
+c---------------------------------------------------------------------
+c     in our terminology, stage is the number of the cell in the x-direction,
+c     i.e. stage = 1 is the start of the line and stage = ncells is the end
+c---------------------------------------------------------------------
+      do stage = 1,ncells
+         c = slice(1,stage)
+         isize = cell_size(1,c) - 1
+         jsize = cell_size(2,c) - 1
+         ksize = cell_size(3,c) - 1
+         
+c---------------------------------------------------------------------
+c     set last-cell flag
+c---------------------------------------------------------------------
+         if (stage .eq. ncells) then
+            last = 1
+         else
+            last = 0
+         endif
+
+         if (stage .eq. 1) then
+c---------------------------------------------------------------------
+c     This is the first cell, so solve without receiving data
+c---------------------------------------------------------------------
+            first = 1
+c            call lhsx(c)
+            call x_solve_cell(first,last,c)
+         else
+c---------------------------------------------------------------------
+c     Not the first cell of this line, so receive info from
+c     processor working on preceding cell
+c---------------------------------------------------------------------
+            first = 0
+            if (timeron) call timer_start(t_xcomm)
+            call x_receive_solve_info(recv_id,c)
+c---------------------------------------------------------------------
+c     overlap computations and communications
+c---------------------------------------------------------------------
+c            call lhsx(c)
+c---------------------------------------------------------------------
+c     wait for completion
+c---------------------------------------------------------------------
+            call mpi_wait(send_id,r_status,error)
+            call mpi_wait(recv_id,r_status,error)
+            if (timeron) call timer_stop(t_xcomm)
+c---------------------------------------------------------------------
+c     install C'(istart) and rhs'(istart) to be used in this cell
+c---------------------------------------------------------------------
+            call x_unpack_solve_info(c)
+            call x_solve_cell(first,last,c)
+         endif
+
+         if (last .eq. 0) call x_send_solve_info(send_id,c)
+      enddo
+
+c---------------------------------------------------------------------
+c     now perform backsubstitution in reverse direction
+c---------------------------------------------------------------------
+      do stage = ncells, 1, -1
+         c = slice(1,stage)
+         first = 0
+         last = 0
+         if (stage .eq. 1) first = 1
+         if (stage .eq. ncells) then
+            last = 1
+c---------------------------------------------------------------------
+c     last cell, so perform back substitution without waiting
+c---------------------------------------------------------------------
+            call x_backsubstitute(first, last,c)
+         else
+            if (timeron) call timer_start(t_xcomm)
+            call x_receive_backsub_info(recv_id,c)
+            call mpi_wait(send_id,r_status,error)
+            call mpi_wait(recv_id,r_status,error)
+            if (timeron) call timer_stop(t_xcomm)
+            call x_unpack_backsub_info(c)
+            call x_backsubstitute(first,last,c)
+         endif
+         if (first .eq. 0) call x_send_backsub_info(send_id,c)
+      enddo
+
+      if (timeron) call timer_stop(t_xsolve)
+
+      return
+      end
+      
+      
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine x_unpack_solve_info(c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     unpack C'(-1) and rhs'(-1) for
+c     all j and k
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      integer j,k,m,n,ptr,c,istart 
+
+      istart = 0
+      ptr = 0
+      do k=0,KMAX-1
+         do j=0,JMAX-1
+            do m=1,BLOCK_SIZE
+               do n=1,BLOCK_SIZE
+                  lhsc(m,n,istart-1,j,k,c) = out_buffer(ptr+n)
+               enddo
+               ptr = ptr+BLOCK_SIZE
+            enddo
+            do n=1,BLOCK_SIZE
+               rhs(n,istart-1,j,k,c) = out_buffer(ptr+n)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      
+      subroutine x_send_solve_info(send_id,c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     pack up and send C'(iend) and rhs'(iend) for
+c     all j and k
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer j,k,m,n,isize,ptr,c,jp,kp
+      integer error,send_id,buffer_size 
+
+      isize = cell_size(1,c)-1
+      jp = cell_coord(2,c) - 1
+      kp = cell_coord(3,c) - 1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*
+     >     (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE)
+
+c---------------------------------------------------------------------
+c     pack up buffer
+c---------------------------------------------------------------------
+      ptr = 0
+      do k=0,KMAX-1
+         do j=0,JMAX-1
+            do m=1,BLOCK_SIZE
+               do n=1,BLOCK_SIZE
+                  in_buffer(ptr+n) = lhsc(m,n,isize,j,k,c)
+               enddo
+               ptr = ptr+BLOCK_SIZE
+            enddo
+            do n=1,BLOCK_SIZE
+               in_buffer(ptr+n) = rhs(n,isize,j,k,c)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+c---------------------------------------------------------------------
+c     send buffer 
+c---------------------------------------------------------------------
+      if (timeron) call timer_start(t_xcomm)
+      call mpi_isend(in_buffer, buffer_size,
+     >     dp_type, successor(1),
+     >     WEST+jp+kp*NCELLS, comm_solve,
+     >     send_id,error)
+      if (timeron) call timer_stop(t_xcomm)
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine x_send_backsub_info(send_id,c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     pack up and send U(istart) for all j and k
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer j,k,n,ptr,c,istart,jp,kp
+      integer error,send_id,buffer_size
+
+c---------------------------------------------------------------------
+c     Send element 0 to previous processor
+c---------------------------------------------------------------------
+      istart = 0
+      jp = cell_coord(2,c)-1
+      kp = cell_coord(3,c)-1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE
+      ptr = 0
+      do k=0,KMAX-1
+         do j=0,JMAX-1
+            do n=1,BLOCK_SIZE
+               in_buffer(ptr+n) = rhs(n,istart,j,k,c)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+      if (timeron) call timer_start(t_xcomm)
+      call mpi_isend(in_buffer, buffer_size,
+     >     dp_type, predecessor(1), 
+     >     EAST+jp+kp*NCELLS, comm_solve, 
+     >     send_id,error)
+      if (timeron) call timer_stop(t_xcomm)
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine x_unpack_backsub_info(c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     unpack U(isize) for all j and k
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      integer j,k,n,ptr,c
+
+      ptr = 0
+      do k=0,KMAX-1
+         do j=0,JMAX-1
+            do n=1,BLOCK_SIZE
+               backsub_info(n,j,k,c) = out_buffer(ptr+n)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine x_receive_backsub_info(recv_id,c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     post mpi receives
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer error,recv_id,jp,kp,c,buffer_size
+      jp = cell_coord(2,c) - 1
+      kp = cell_coord(3,c) - 1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE
+      call mpi_irecv(out_buffer, buffer_size,
+     >     dp_type, successor(1), 
+     >     EAST+jp+kp*NCELLS, comm_solve, 
+     >     recv_id, error)
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine x_receive_solve_info(recv_id,c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     post mpi receives 
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer jp,kp,recv_id,error,c,buffer_size
+      jp = cell_coord(2,c) - 1
+      kp = cell_coord(3,c) - 1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*
+     >     (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE)
+      call mpi_irecv(out_buffer, buffer_size,
+     >     dp_type, predecessor(1), 
+     >     WEST+jp+kp*NCELLS,  comm_solve, 
+     >     recv_id, error)
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      
+      subroutine x_backsubstitute(first, last, c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     back solve: if last cell, then generate U(isize)=rhs(isize)
+c     else assume U(isize) was loaded by x_unpack_backsub_info,
+c     so just use it.
+c     After the call, U(istart) will be sent to the next cell.
+c---------------------------------------------------------------------
+
+      include 'header.h'
+
+      integer first, last, c, i, j, k
+      integer m,n,isize,jsize,ksize,istart
+      
+      istart = 0
+      isize = cell_size(1,c)-1
+      jsize = cell_size(2,c)-end(2,c)-1      
+      ksize = cell_size(3,c)-end(3,c)-1
+      if (last .eq. 0) then
+         do k=start(3,c),ksize
+            do j=start(2,c),jsize
+c---------------------------------------------------------------------
+c     U(isize) uses info from previous cell if not last cell
+c---------------------------------------------------------------------
+               do m=1,BLOCK_SIZE
+                  do n=1,BLOCK_SIZE
+                     rhs(m,isize,j,k,c) = rhs(m,isize,j,k,c) 
+     >                    - lhsc(m,n,isize,j,k,c)*
+     >                    backsub_info(n,j,k,c)
+c---------------------------------------------------------------------
+c     rhs(m,isize,j,k,c) = rhs(m,isize,j,k,c) 
+c     $                    - lhsc(m,n,isize,j,k,c)*rhs(n,isize+1,j,k,c)
+c---------------------------------------------------------------------
+                  enddo
+               enddo
+            enddo
+         enddo
+      endif
+      do k=start(3,c),ksize
+         do j=start(2,c),jsize
+            do i=isize-1,istart,-1
+               do m=1,BLOCK_SIZE
+                  do n=1,BLOCK_SIZE
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c) 
+     >                    - lhsc(m,n,i,j,k,c)*rhs(n,i+1,j,k,c)
+                  enddo
+               enddo
+            enddo
+         enddo
+      enddo
+
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine x_solve_cell(first,last,c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     performs Gaussian elimination on this cell.
+c     
+c     assumes that unpacking routines for non-first cells 
+c     preload C' and rhs' from previous cell.
+c     
+c     assumes the send happens outside this routine, but that
+c     C'(IMAX) and rhs'(IMAX) will be sent to the next cell
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'work_lhs.h'
+
+      integer first,last,c
+      integer i,j,k,isize,ksize,jsize,istart
+
+      istart = 0
+      isize = cell_size(1,c)-1
+      jsize = cell_size(2,c)-end(2,c)-1
+      ksize = cell_size(3,c)-end(3,c)-1
+
+      call lhsabinit(lhsa, lhsb, isize)
+
+      do k=start(3,c),ksize 
+         do j=start(2,c),jsize
+
+c---------------------------------------------------------------------
+c     This function computes the left hand side in the xi-direction
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     determine a (labeled f) and n jacobians for cell c
+c---------------------------------------------------------------------
+            do i = start(1,c)-1, cell_size(1,c) - end(1,c)
+
+               tmp1 = rho_i(i,j,k,c)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+c---------------------------------------------------------------------
+c     
+c---------------------------------------------------------------------
+               fjac(1,1,i) = 0.0d+00
+               fjac(1,2,i) = 1.0d+00
+               fjac(1,3,i) = 0.0d+00
+               fjac(1,4,i) = 0.0d+00
+               fjac(1,5,i) = 0.0d+00
+
+               fjac(2,1,i) = -(u(2,i,j,k,c) * tmp2 * 
+     >              u(2,i,j,k,c))
+     >              + c2 * qs(i,j,k,c)
+               fjac(2,2,i) = ( 2.0d+00 - c2 )
+     >              * ( u(2,i,j,k,c) * tmp1 )
+               fjac(2,3,i) = - c2 * ( u(3,i,j,k,c) * tmp1 )
+               fjac(2,4,i) = - c2 * ( u(4,i,j,k,c) * tmp1 )
+               fjac(2,5,i) = c2
+
+               fjac(3,1,i) = - ( u(2,i,j,k,c)*u(3,i,j,k,c) ) * tmp2
+               fjac(3,2,i) = u(3,i,j,k,c) * tmp1
+               fjac(3,3,i) = u(2,i,j,k,c) * tmp1
+               fjac(3,4,i) = 0.0d+00
+               fjac(3,5,i) = 0.0d+00
+
+               fjac(4,1,i) = - ( u(2,i,j,k,c)*u(4,i,j,k,c) ) * tmp2
+               fjac(4,2,i) = u(4,i,j,k,c) * tmp1
+               fjac(4,3,i) = 0.0d+00
+               fjac(4,4,i) = u(2,i,j,k,c) * tmp1
+               fjac(4,5,i) = 0.0d+00
+
+               fjac(5,1,i) = ( c2 * 2.0d0 * qs(i,j,k,c)
+     >              - c1 * ( u(5,i,j,k,c) * tmp1 ) )
+     >              * ( u(2,i,j,k,c) * tmp1 )
+               fjac(5,2,i) = c1 *  u(5,i,j,k,c) * tmp1 
+     >              - c2
+     >              * ( u(2,i,j,k,c)*u(2,i,j,k,c) * tmp2
+     >              + qs(i,j,k,c) )
+               fjac(5,3,i) = - c2 * ( u(3,i,j,k,c)*u(2,i,j,k,c) )
+     >              * tmp2
+               fjac(5,4,i) = - c2 * ( u(4,i,j,k,c)*u(2,i,j,k,c) )
+     >              * tmp2
+               fjac(5,5,i) = c1 * ( u(2,i,j,k,c) * tmp1 )
+
+               njac(1,1,i) = 0.0d+00
+               njac(1,2,i) = 0.0d+00
+               njac(1,3,i) = 0.0d+00
+               njac(1,4,i) = 0.0d+00
+               njac(1,5,i) = 0.0d+00
+
+               njac(2,1,i) = - con43 * c3c4 * tmp2 * u(2,i,j,k,c)
+               njac(2,2,i) =   con43 * c3c4 * tmp1
+               njac(2,3,i) =   0.0d+00
+               njac(2,4,i) =   0.0d+00
+               njac(2,5,i) =   0.0d+00
+
+               njac(3,1,i) = - c3c4 * tmp2 * u(3,i,j,k,c)
+               njac(3,2,i) =   0.0d+00
+               njac(3,3,i) =   c3c4 * tmp1
+               njac(3,4,i) =   0.0d+00
+               njac(3,5,i) =   0.0d+00
+
+               njac(4,1,i) = - c3c4 * tmp2 * u(4,i,j,k,c)
+               njac(4,2,i) =   0.0d+00 
+               njac(4,3,i) =   0.0d+00
+               njac(4,4,i) =   c3c4 * tmp1
+               njac(4,5,i) =   0.0d+00
+
+               njac(5,1,i) = - ( con43 * c3c4
+     >              - c1345 ) * tmp3 * (u(2,i,j,k,c)**2)
+     >              - ( c3c4 - c1345 ) * tmp3 * (u(3,i,j,k,c)**2)
+     >              - ( c3c4 - c1345 ) * tmp3 * (u(4,i,j,k,c)**2)
+     >              - c1345 * tmp2 * u(5,i,j,k,c)
+
+               njac(5,2,i) = ( con43 * c3c4
+     >              - c1345 ) * tmp2 * u(2,i,j,k,c)
+               njac(5,3,i) = ( c3c4 - c1345 ) * tmp2 * u(3,i,j,k,c)
+               njac(5,4,i) = ( c3c4 - c1345 ) * tmp2 * u(4,i,j,k,c)
+               njac(5,5,i) = ( c1345 ) * tmp1
+
+            enddo
+c---------------------------------------------------------------------
+c     now jacobians set, so form left hand side in x direction
+c---------------------------------------------------------------------
+            do i = start(1,c), isize - end(1,c)
+
+               tmp1 = dt * tx1
+               tmp2 = dt * tx2
+
+               lhsa(1,1,i) = - tmp2 * fjac(1,1,i-1)
+     >              - tmp1 * njac(1,1,i-1)
+     >              - tmp1 * dx1 
+               lhsa(1,2,i) = - tmp2 * fjac(1,2,i-1)
+     >              - tmp1 * njac(1,2,i-1)
+               lhsa(1,3,i) = - tmp2 * fjac(1,3,i-1)
+     >              - tmp1 * njac(1,3,i-1)
+               lhsa(1,4,i) = - tmp2 * fjac(1,4,i-1)
+     >              - tmp1 * njac(1,4,i-1)
+               lhsa(1,5,i) = - tmp2 * fjac(1,5,i-1)
+     >              - tmp1 * njac(1,5,i-1)
+
+               lhsa(2,1,i) = - tmp2 * fjac(2,1,i-1)
+     >              - tmp1 * njac(2,1,i-1)
+               lhsa(2,2,i) = - tmp2 * fjac(2,2,i-1)
+     >              - tmp1 * njac(2,2,i-1)
+     >              - tmp1 * dx2
+               lhsa(2,3,i) = - tmp2 * fjac(2,3,i-1)
+     >              - tmp1 * njac(2,3,i-1)
+               lhsa(2,4,i) = - tmp2 * fjac(2,4,i-1)
+     >              - tmp1 * njac(2,4,i-1)
+               lhsa(2,5,i) = - tmp2 * fjac(2,5,i-1)
+     >              - tmp1 * njac(2,5,i-1)
+
+               lhsa(3,1,i) = - tmp2 * fjac(3,1,i-1)
+     >              - tmp1 * njac(3,1,i-1)
+               lhsa(3,2,i) = - tmp2 * fjac(3,2,i-1)
+     >              - tmp1 * njac(3,2,i-1)
+               lhsa(3,3,i) = - tmp2 * fjac(3,3,i-1)
+     >              - tmp1 * njac(3,3,i-1)
+     >              - tmp1 * dx3 
+               lhsa(3,4,i) = - tmp2 * fjac(3,4,i-1)
+     >              - tmp1 * njac(3,4,i-1)
+               lhsa(3,5,i) = - tmp2 * fjac(3,5,i-1)
+     >              - tmp1 * njac(3,5,i-1)
+
+               lhsa(4,1,i) = - tmp2 * fjac(4,1,i-1)
+     >              - tmp1 * njac(4,1,i-1)
+               lhsa(4,2,i) = - tmp2 * fjac(4,2,i-1)
+     >              - tmp1 * njac(4,2,i-1)
+               lhsa(4,3,i) = - tmp2 * fjac(4,3,i-1)
+     >              - tmp1 * njac(4,3,i-1)
+               lhsa(4,4,i) = - tmp2 * fjac(4,4,i-1)
+     >              - tmp1 * njac(4,4,i-1)
+     >              - tmp1 * dx4
+               lhsa(4,5,i) = - tmp2 * fjac(4,5,i-1)
+     >              - tmp1 * njac(4,5,i-1)
+
+               lhsa(5,1,i) = - tmp2 * fjac(5,1,i-1)
+     >              - tmp1 * njac(5,1,i-1)
+               lhsa(5,2,i) = - tmp2 * fjac(5,2,i-1)
+     >              - tmp1 * njac(5,2,i-1)
+               lhsa(5,3,i) = - tmp2 * fjac(5,3,i-1)
+     >              - tmp1 * njac(5,3,i-1)
+               lhsa(5,4,i) = - tmp2 * fjac(5,4,i-1)
+     >              - tmp1 * njac(5,4,i-1)
+               lhsa(5,5,i) = - tmp2 * fjac(5,5,i-1)
+     >              - tmp1 * njac(5,5,i-1)
+     >              - tmp1 * dx5
+
+               lhsb(1,1,i) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(1,1,i)
+     >              + tmp1 * 2.0d+00 * dx1
+               lhsb(1,2,i) = tmp1 * 2.0d+00 * njac(1,2,i)
+               lhsb(1,3,i) = tmp1 * 2.0d+00 * njac(1,3,i)
+               lhsb(1,4,i) = tmp1 * 2.0d+00 * njac(1,4,i)
+               lhsb(1,5,i) = tmp1 * 2.0d+00 * njac(1,5,i)
+
+               lhsb(2,1,i) = tmp1 * 2.0d+00 * njac(2,1,i)
+               lhsb(2,2,i) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(2,2,i)
+     >              + tmp1 * 2.0d+00 * dx2
+               lhsb(2,3,i) = tmp1 * 2.0d+00 * njac(2,3,i)
+               lhsb(2,4,i) = tmp1 * 2.0d+00 * njac(2,4,i)
+               lhsb(2,5,i) = tmp1 * 2.0d+00 * njac(2,5,i)
+
+               lhsb(3,1,i) = tmp1 * 2.0d+00 * njac(3,1,i)
+               lhsb(3,2,i) = tmp1 * 2.0d+00 * njac(3,2,i)
+               lhsb(3,3,i) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(3,3,i)
+     >              + tmp1 * 2.0d+00 * dx3
+               lhsb(3,4,i) = tmp1 * 2.0d+00 * njac(3,4,i)
+               lhsb(3,5,i) = tmp1 * 2.0d+00 * njac(3,5,i)
+
+               lhsb(4,1,i) = tmp1 * 2.0d+00 * njac(4,1,i)
+               lhsb(4,2,i) = tmp1 * 2.0d+00 * njac(4,2,i)
+               lhsb(4,3,i) = tmp1 * 2.0d+00 * njac(4,3,i)
+               lhsb(4,4,i) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(4,4,i)
+     >              + tmp1 * 2.0d+00 * dx4
+               lhsb(4,5,i) = tmp1 * 2.0d+00 * njac(4,5,i)
+
+               lhsb(5,1,i) = tmp1 * 2.0d+00 * njac(5,1,i)
+               lhsb(5,2,i) = tmp1 * 2.0d+00 * njac(5,2,i)
+               lhsb(5,3,i) = tmp1 * 2.0d+00 * njac(5,3,i)
+               lhsb(5,4,i) = tmp1 * 2.0d+00 * njac(5,4,i)
+               lhsb(5,5,i) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(5,5,i)
+     >              + tmp1 * 2.0d+00 * dx5
+
+               lhsc(1,1,i,j,k,c) =  tmp2 * fjac(1,1,i+1)
+     >              - tmp1 * njac(1,1,i+1)
+     >              - tmp1 * dx1
+               lhsc(1,2,i,j,k,c) =  tmp2 * fjac(1,2,i+1)
+     >              - tmp1 * njac(1,2,i+1)
+               lhsc(1,3,i,j,k,c) =  tmp2 * fjac(1,3,i+1)
+     >              - tmp1 * njac(1,3,i+1)
+               lhsc(1,4,i,j,k,c) =  tmp2 * fjac(1,4,i+1)
+     >              - tmp1 * njac(1,4,i+1)
+               lhsc(1,5,i,j,k,c) =  tmp2 * fjac(1,5,i+1)
+     >              - tmp1 * njac(1,5,i+1)
+
+               lhsc(2,1,i,j,k,c) =  tmp2 * fjac(2,1,i+1)
+     >              - tmp1 * njac(2,1,i+1)
+               lhsc(2,2,i,j,k,c) =  tmp2 * fjac(2,2,i+1)
+     >              - tmp1 * njac(2,2,i+1)
+     >              - tmp1 * dx2
+               lhsc(2,3,i,j,k,c) =  tmp2 * fjac(2,3,i+1)
+     >              - tmp1 * njac(2,3,i+1)
+               lhsc(2,4,i,j,k,c) =  tmp2 * fjac(2,4,i+1)
+     >              - tmp1 * njac(2,4,i+1)
+               lhsc(2,5,i,j,k,c) =  tmp2 * fjac(2,5,i+1)
+     >              - tmp1 * njac(2,5,i+1)
+
+               lhsc(3,1,i,j,k,c) =  tmp2 * fjac(3,1,i+1)
+     >              - tmp1 * njac(3,1,i+1)
+               lhsc(3,2,i,j,k,c) =  tmp2 * fjac(3,2,i+1)
+     >              - tmp1 * njac(3,2,i+1)
+               lhsc(3,3,i,j,k,c) =  tmp2 * fjac(3,3,i+1)
+     >              - tmp1 * njac(3,3,i+1)
+     >              - tmp1 * dx3
+               lhsc(3,4,i,j,k,c) =  tmp2 * fjac(3,4,i+1)
+     >              - tmp1 * njac(3,4,i+1)
+               lhsc(3,5,i,j,k,c) =  tmp2 * fjac(3,5,i+1)
+     >              - tmp1 * njac(3,5,i+1)
+
+               lhsc(4,1,i,j,k,c) =  tmp2 * fjac(4,1,i+1)
+     >              - tmp1 * njac(4,1,i+1)
+               lhsc(4,2,i,j,k,c) =  tmp2 * fjac(4,2,i+1)
+     >              - tmp1 * njac(4,2,i+1)
+               lhsc(4,3,i,j,k,c) =  tmp2 * fjac(4,3,i+1)
+     >              - tmp1 * njac(4,3,i+1)
+               lhsc(4,4,i,j,k,c) =  tmp2 * fjac(4,4,i+1)
+     >              - tmp1 * njac(4,4,i+1)
+     >              - tmp1 * dx4
+               lhsc(4,5,i,j,k,c) =  tmp2 * fjac(4,5,i+1)
+     >              - tmp1 * njac(4,5,i+1)
+
+               lhsc(5,1,i,j,k,c) =  tmp2 * fjac(5,1,i+1)
+     >              - tmp1 * njac(5,1,i+1)
+               lhsc(5,2,i,j,k,c) =  tmp2 * fjac(5,2,i+1)
+     >              - tmp1 * njac(5,2,i+1)
+               lhsc(5,3,i,j,k,c) =  tmp2 * fjac(5,3,i+1)
+     >              - tmp1 * njac(5,3,i+1)
+               lhsc(5,4,i,j,k,c) =  tmp2 * fjac(5,4,i+1)
+     >              - tmp1 * njac(5,4,i+1)
+               lhsc(5,5,i,j,k,c) =  tmp2 * fjac(5,5,i+1)
+     >              - tmp1 * njac(5,5,i+1)
+     >              - tmp1 * dx5
+
+            enddo
+
+
+c---------------------------------------------------------------------
+c     outermost do loops - sweeping in i direction
+c---------------------------------------------------------------------
+            if (first .eq. 1) then 
+
+c---------------------------------------------------------------------
+c     multiply c(istart,j,k) by b_inverse and copy back to c
+c     multiply rhs(istart) by b_inverse(istart) and copy to rhs
+c---------------------------------------------------------------------
+               call binvcrhs( lhsb(1,1,istart),
+     >                        lhsc(1,1,istart,j,k,c),
+     >                        rhs(1,istart,j,k,c) )
+
+            endif
+
+c---------------------------------------------------------------------
+c     begin innermost do loop
+c     do all the elements of the cell unless last
+c---------------------------------------------------------------------
+            do i=istart+first,isize-last
+
+c---------------------------------------------------------------------
+c     rhs(i) = rhs(i) - A*rhs(i-1)
+c---------------------------------------------------------------------
+               call matvec_sub(lhsa(1,1,i),
+     >                         rhs(1,i-1,j,k,c),rhs(1,i,j,k,c))
+
+c---------------------------------------------------------------------
+c     B(i) = B(i) - C(i-1)*A(i)
+c---------------------------------------------------------------------
+               call matmul_sub(lhsa(1,1,i),
+     >                         lhsc(1,1,i-1,j,k,c),
+     >                         lhsb(1,1,i))
+
+
+c---------------------------------------------------------------------
+c     multiply c(i,j,k) by b_inverse and copy back to c
+c     multiply rhs(i,j,k) by b_inverse(i,j,k) and copy to rhs
+c---------------------------------------------------------------------
+               call binvcrhs( lhsb(1,1,i),
+     >                        lhsc(1,1,i,j,k,c),
+     >                        rhs(1,i,j,k,c) )
+
+            enddo
+
+c---------------------------------------------------------------------
+c     Now finish up special cases for last cell
+c---------------------------------------------------------------------
+            if (last .eq. 1) then
+
+c---------------------------------------------------------------------
+c     rhs(isize) = rhs(isize) - A*rhs(isize-1)
+c---------------------------------------------------------------------
+               call matvec_sub(lhsa(1,1,isize),
+     >                         rhs(1,isize-1,j,k,c),rhs(1,isize,j,k,c))
+
+c---------------------------------------------------------------------
+c     B(isize) = B(isize) - C(isize-1)*A(isize)
+c---------------------------------------------------------------------
+               call matmul_sub(lhsa(1,1,isize),
+     >                         lhsc(1,1,isize-1,j,k,c),
+     >                         lhsb(1,1,isize))
+
+c---------------------------------------------------------------------
+c     multiply rhs() by b_inverse() and copy to rhs
+c---------------------------------------------------------------------
+               call binvrhs( lhsb(1,1,isize),
+     >                       rhs(1,isize,j,k,c) )
+
+            endif
+         enddo
+      enddo
+
+
+      return
+      end
+      
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/x_solve_vec.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/x_solve_vec.f
new file mode 100644
index 0000000..a4deef2
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/x_solve_vec.f
@@ -0,0 +1,799 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine x_solve
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     
+c     Performs line solves in X direction by first factoring
+c     the block-tridiagonal matrix into an upper triangular matrix, 
+c     and then performing back substitution to solve for the unknown
+c     vectors of each line.  
+c     
+c     Make sure we treat elements zero to cell_size in the direction
+c     of the sweep.
+c     
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+      integer  c, istart, stage,
+     >     first, last, recv_id, error, r_status(MPI_STATUS_SIZE),
+     >     isize,jsize,ksize,send_id
+
+      istart = 0
+
+      if (timeron) call timer_start(t_xsolve)
+c---------------------------------------------------------------------
+c     in our terminology stage is the number of the cell in the x-direction
+c     i.e. stage = 1 means the start of the line, stage = ncells means end
+c---------------------------------------------------------------------
+      do stage = 1,ncells
+         c = slice(1,stage)
+         isize = cell_size(1,c) - 1
+         jsize = cell_size(2,c) - 1
+         ksize = cell_size(3,c) - 1
+         
+c---------------------------------------------------------------------
+c     set last-cell flag
+c---------------------------------------------------------------------
+         if (stage .eq. ncells) then
+            last = 1
+         else
+            last = 0
+         endif
+
+         if (stage .eq. 1) then
+c---------------------------------------------------------------------
+c     This is the first cell, so solve without receiving data
+c---------------------------------------------------------------------
+            first = 1
+c            call lhsx(c)
+            call x_solve_cell(first,last,c)
+         else
+c---------------------------------------------------------------------
+c     Not the first cell of this line, so receive info from
+c     processor working on preceding cell
+c---------------------------------------------------------------------
+            first = 0
+            if (timeron) call timer_start(t_xcomm)
+            call x_receive_solve_info(recv_id,c)
+c---------------------------------------------------------------------
+c     overlap computations and communications
+c---------------------------------------------------------------------
+c            call lhsx(c)
+c---------------------------------------------------------------------
+c     wait for completion
+c---------------------------------------------------------------------
+            call mpi_wait(send_id,r_status,error)
+            call mpi_wait(recv_id,r_status,error)
+            if (timeron) call timer_stop(t_xcomm)
+c---------------------------------------------------------------------
+c     install C'(istart) and rhs'(istart) to be used in this cell
+c---------------------------------------------------------------------
+            call x_unpack_solve_info(c)
+            call x_solve_cell(first,last,c)
+         endif
+
+         if (last .eq. 0) call x_send_solve_info(send_id,c)
+      enddo
+
+c---------------------------------------------------------------------
+c     now perform backsubstitution in reverse direction
+c---------------------------------------------------------------------
+      do stage = ncells, 1, -1
+         c = slice(1,stage)
+         first = 0
+         last = 0
+         if (stage .eq. 1) first = 1
+         if (stage .eq. ncells) then
+            last = 1
+c---------------------------------------------------------------------
+c     last cell, so perform back substitution without waiting
+c---------------------------------------------------------------------
+            call x_backsubstitute(first, last,c)
+         else
+            if (timeron) call timer_start(t_xcomm)
+            call x_receive_backsub_info(recv_id,c)
+            call mpi_wait(send_id,r_status,error)
+            call mpi_wait(recv_id,r_status,error)
+            if (timeron) call timer_stop(t_xcomm)
+            call x_unpack_backsub_info(c)
+            call x_backsubstitute(first,last,c)
+         endif
+         if (first .eq. 0) call x_send_backsub_info(send_id,c)
+      enddo
+
+      if (timeron) call timer_stop(t_xsolve)
+
+      return
+      end
+      
+      
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine x_unpack_solve_info(c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     unpack C'(-1) and rhs'(-1) for
+c     all j and k
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      integer j,k,m,n,ptr,c,istart 
+
+      istart = 0
+      ptr = 0
+      do k=0,KMAX-1
+         do j=0,JMAX-1
+            do m=1,BLOCK_SIZE
+               do n=1,BLOCK_SIZE
+                  lhsc(m,n,istart-1,j,k,c) = out_buffer(ptr+n)
+               enddo
+               ptr = ptr+BLOCK_SIZE
+            enddo
+            do n=1,BLOCK_SIZE
+               rhs(n,istart-1,j,k,c) = out_buffer(ptr+n)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      
+      subroutine x_send_solve_info(send_id,c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     pack up and send C'(iend) and rhs'(iend) for
+c     all j and k
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer j,k,m,n,isize,ptr,c,jp,kp
+      integer error,send_id,buffer_size 
+
+      isize = cell_size(1,c)-1
+      jp = cell_coord(2,c) - 1
+      kp = cell_coord(3,c) - 1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*
+     >     (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE)
+
+c---------------------------------------------------------------------
+c     pack up buffer
+c---------------------------------------------------------------------
+      ptr = 0
+      do k=0,KMAX-1
+         do j=0,JMAX-1
+            do m=1,BLOCK_SIZE
+               do n=1,BLOCK_SIZE
+                  in_buffer(ptr+n) = lhsc(m,n,isize,j,k,c)
+               enddo
+               ptr = ptr+BLOCK_SIZE
+            enddo
+            do n=1,BLOCK_SIZE
+               in_buffer(ptr+n) = rhs(n,isize,j,k,c)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+c---------------------------------------------------------------------
+c     send buffer 
+c---------------------------------------------------------------------
+      if (timeron) call timer_start(t_xcomm)
+      call mpi_isend(in_buffer, buffer_size,
+     >     dp_type, successor(1),
+     >     WEST+jp+kp*NCELLS, comm_solve,
+     >     send_id,error)
+      if (timeron) call timer_stop(t_xcomm)
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine x_send_backsub_info(send_id,c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     pack up and send U(istart) for all j and k
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer j,k,n,ptr,c,istart,jp,kp
+      integer error,send_id,buffer_size
+
+c---------------------------------------------------------------------
+c     Send element 0 to previous processor
+c---------------------------------------------------------------------
+      istart = 0
+      jp = cell_coord(2,c)-1
+      kp = cell_coord(3,c)-1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE
+      ptr = 0
+      do k=0,KMAX-1
+         do j=0,JMAX-1
+            do n=1,BLOCK_SIZE
+               in_buffer(ptr+n) = rhs(n,istart,j,k,c)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+      if (timeron) call timer_start(t_xcomm)
+      call mpi_isend(in_buffer, buffer_size,
+     >     dp_type, predecessor(1), 
+     >     EAST+jp+kp*NCELLS, comm_solve, 
+     >     send_id,error)
+      if (timeron) call timer_stop(t_xcomm)
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine x_unpack_backsub_info(c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     unpack U(isize) for all j and k
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      integer j,k,n,ptr,c
+
+      ptr = 0
+      do k=0,KMAX-1
+         do j=0,JMAX-1
+            do n=1,BLOCK_SIZE
+               backsub_info(n,j,k,c) = out_buffer(ptr+n)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine x_receive_backsub_info(recv_id,c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     post mpi receives
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer error,recv_id,jp,kp,c,buffer_size
+      jp = cell_coord(2,c) - 1
+      kp = cell_coord(3,c) - 1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE
+      call mpi_irecv(out_buffer, buffer_size,
+     >     dp_type, successor(1), 
+     >     EAST+jp+kp*NCELLS, comm_solve, 
+     >     recv_id, error)
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine x_receive_solve_info(recv_id,c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     post mpi receives 
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer jp,kp,recv_id,error,c,buffer_size
+      jp = cell_coord(2,c) - 1
+      kp = cell_coord(3,c) - 1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*
+     >     (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE)
+      call mpi_irecv(out_buffer, buffer_size,
+     >     dp_type, predecessor(1), 
+     >     WEST+jp+kp*NCELLS,  comm_solve, 
+     >     recv_id, error)
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      
+      subroutine x_backsubstitute(first, last, c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     back solve: if last cell, then generate U(isize)=rhs(isize)
+c     else assume U(isize) is loaded by x_unpack_backsub_info,
+c     so just use it.
+c     After the call, U(istart) will be sent to the next cell.
+c---------------------------------------------------------------------
+
+      include 'header.h'
+
+      integer first, last, c, i, j, k
+      integer m,n,isize,jsize,ksize,istart
+      
+      istart = 0
+      isize = cell_size(1,c)-1
+      jsize = cell_size(2,c)-end(2,c)-1      
+      ksize = cell_size(3,c)-end(3,c)-1
+      if (last .eq. 0) then
+         do k=start(3,c),ksize
+            do j=start(2,c),jsize
+c---------------------------------------------------------------------
+c     U(isize) uses info from previous cell if not last cell
+c---------------------------------------------------------------------
+               do m=1,BLOCK_SIZE
+                  do n=1,BLOCK_SIZE
+                     rhs(m,isize,j,k,c) = rhs(m,isize,j,k,c) 
+     >                    - lhsc(m,n,isize,j,k,c)*
+     >                    backsub_info(n,j,k,c)
+c---------------------------------------------------------------------
+c     rhs(m,isize,j,k,c) = rhs(m,isize,j,k,c) 
+c     $                    - lhsc(m,n,isize,j,k,c)*rhs(n,isize+1,j,k,c)
+c---------------------------------------------------------------------
+                  enddo
+               enddo
+            enddo
+         enddo
+      endif
+      do k=start(3,c),ksize
+         do j=start(2,c),jsize
+            do i=isize-1,istart,-1
+               do m=1,BLOCK_SIZE
+                  do n=1,BLOCK_SIZE
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c) 
+     >                    - lhsc(m,n,i,j,k,c)*rhs(n,i+1,j,k,c)
+                  enddo
+               enddo
+            enddo
+         enddo
+      enddo
+
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine x_solve_cell(first,last,c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     performs Gaussian elimination on this cell.
+c     
+c     assumes that unpacking routines for non-first cells 
+c     preload C' and rhs' from previous cell.
+c     
+c     assumes the send happens outside this routine, and that
+c     C'(IMAX) and rhs'(IMAX) will be sent to the next cell
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'work_lhs_vec.h'
+
+      integer first,last,c
+      integer i,j,k,m,n,isize,ksize,jsize,istart
+
+      istart = 0
+      isize = cell_size(1,c)-1
+      jsize = cell_size(2,c)-end(2,c)-1
+      ksize = cell_size(3,c)-end(3,c)-1
+
+c---------------------------------------------------------------------
+c     zero the left hand side for starters
+c     set diagonal values to 1. This is overkill, but convenient
+c---------------------------------------------------------------------
+      do j = 0, jsize
+         do m = 1, 5
+            do n = 1, 5
+               lhsa(m,n,0,j) = 0.0d0
+               lhsb(m,n,0,j) = 0.0d0
+               lhsa(m,n,isize,j) = 0.0d0
+               lhsb(m,n,isize,j) = 0.0d0
+            enddo
+            lhsb(m,m,0,j) = 1.0d0
+            lhsb(m,m,isize,j) = 1.0d0
+         enddo
+      enddo
+
+      do k=start(3,c),ksize 
+
+c---------------------------------------------------------------------
+c     This function computes the left hand side in the xi-direction
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     determine a (labeled f) and n jacobians for cell c
+c---------------------------------------------------------------------
+         do j=start(2,c),jsize
+            do i = start(1,c)-1, cell_size(1,c) - end(1,c)
+
+               tmp1 = rho_i(i,j,k,c)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+c---------------------------------------------------------------------
+c     
+c---------------------------------------------------------------------
+               fjac(1,1,i,j) = 0.0d+00
+               fjac(1,2,i,j) = 1.0d+00
+               fjac(1,3,i,j) = 0.0d+00
+               fjac(1,4,i,j) = 0.0d+00
+               fjac(1,5,i,j) = 0.0d+00
+
+               fjac(2,1,i,j) = -(u(2,i,j,k,c) * tmp2 * 
+     >              u(2,i,j,k,c))
+     >              + c2 * qs(i,j,k,c)
+               fjac(2,2,i,j) = ( 2.0d+00 - c2 )
+     >              * ( u(2,i,j,k,c) * tmp1 )
+               fjac(2,3,i,j) = - c2 * ( u(3,i,j,k,c) * tmp1 )
+               fjac(2,4,i,j) = - c2 * ( u(4,i,j,k,c) * tmp1 )
+               fjac(2,5,i,j) = c2
+
+               fjac(3,1,i,j) = - ( u(2,i,j,k,c)*u(3,i,j,k,c) ) * tmp2
+               fjac(3,2,i,j) = u(3,i,j,k,c) * tmp1
+               fjac(3,3,i,j) = u(2,i,j,k,c) * tmp1
+               fjac(3,4,i,j) = 0.0d+00
+               fjac(3,5,i,j) = 0.0d+00
+
+               fjac(4,1,i,j) = - ( u(2,i,j,k,c)*u(4,i,j,k,c) ) * tmp2
+               fjac(4,2,i,j) = u(4,i,j,k,c) * tmp1
+               fjac(4,3,i,j) = 0.0d+00
+               fjac(4,4,i,j) = u(2,i,j,k,c) * tmp1
+               fjac(4,5,i,j) = 0.0d+00
+
+               fjac(5,1,i,j) = ( c2 * 2.0d0 * qs(i,j,k,c)
+     >              - c1 * ( u(5,i,j,k,c) * tmp1 ) )
+     >              * ( u(2,i,j,k,c) * tmp1 )
+               fjac(5,2,i,j) = c1 *  u(5,i,j,k,c) * tmp1 
+     >              - c2
+     >              * ( u(2,i,j,k,c)*u(2,i,j,k,c) * tmp2
+     >              + qs(i,j,k,c) )
+               fjac(5,3,i,j) = - c2 * ( u(3,i,j,k,c)*u(2,i,j,k,c) )
+     >              * tmp2
+               fjac(5,4,i,j) = - c2 * ( u(4,i,j,k,c)*u(2,i,j,k,c) )
+     >              * tmp2
+               fjac(5,5,i,j) = c1 * ( u(2,i,j,k,c) * tmp1 )
+
+               njac(1,1,i,j) = 0.0d+00
+               njac(1,2,i,j) = 0.0d+00
+               njac(1,3,i,j) = 0.0d+00
+               njac(1,4,i,j) = 0.0d+00
+               njac(1,5,i,j) = 0.0d+00
+
+               njac(2,1,i,j) = - con43 * c3c4 * tmp2 * u(2,i,j,k,c)
+               njac(2,2,i,j) =   con43 * c3c4 * tmp1
+               njac(2,3,i,j) =   0.0d+00
+               njac(2,4,i,j) =   0.0d+00
+               njac(2,5,i,j) =   0.0d+00
+
+               njac(3,1,i,j) = - c3c4 * tmp2 * u(3,i,j,k,c)
+               njac(3,2,i,j) =   0.0d+00
+               njac(3,3,i,j) =   c3c4 * tmp1
+               njac(3,4,i,j) =   0.0d+00
+               njac(3,5,i,j) =   0.0d+00
+
+               njac(4,1,i,j) = - c3c4 * tmp2 * u(4,i,j,k,c)
+               njac(4,2,i,j) =   0.0d+00 
+               njac(4,3,i,j) =   0.0d+00
+               njac(4,4,i,j) =   c3c4 * tmp1
+               njac(4,5,i,j) =   0.0d+00
+
+               njac(5,1,i,j) = - ( con43 * c3c4
+     >              - c1345 ) * tmp3 * (u(2,i,j,k,c)**2)
+     >              - ( c3c4 - c1345 ) * tmp3 * (u(3,i,j,k,c)**2)
+     >              - ( c3c4 - c1345 ) * tmp3 * (u(4,i,j,k,c)**2)
+     >              - c1345 * tmp2 * u(5,i,j,k,c)
+
+               njac(5,2,i,j) = ( con43 * c3c4
+     >              - c1345 ) * tmp2 * u(2,i,j,k,c)
+               njac(5,3,i,j) = ( c3c4 - c1345 ) * tmp2 * u(3,i,j,k,c)
+               njac(5,4,i,j) = ( c3c4 - c1345 ) * tmp2 * u(4,i,j,k,c)
+               njac(5,5,i,j) = ( c1345 ) * tmp1
+
+            enddo
+         enddo
+
+c---------------------------------------------------------------------
+c     now jacobians set, so form left hand side in x direction
+c---------------------------------------------------------------------
+         do j=start(2,c),jsize
+            do i = start(1,c), isize - end(1,c)
+
+               tmp1 = dt * tx1
+               tmp2 = dt * tx2
+
+               lhsa(1,1,i,j) = - tmp2 * fjac(1,1,i-1,j)
+     >              - tmp1 * njac(1,1,i-1,j)
+     >              - tmp1 * dx1 
+               lhsa(1,2,i,j) = - tmp2 * fjac(1,2,i-1,j)
+     >              - tmp1 * njac(1,2,i-1,j)
+               lhsa(1,3,i,j) = - tmp2 * fjac(1,3,i-1,j)
+     >              - tmp1 * njac(1,3,i-1,j)
+               lhsa(1,4,i,j) = - tmp2 * fjac(1,4,i-1,j)
+     >              - tmp1 * njac(1,4,i-1,j)
+               lhsa(1,5,i,j) = - tmp2 * fjac(1,5,i-1,j)
+     >              - tmp1 * njac(1,5,i-1,j)
+
+               lhsa(2,1,i,j) = - tmp2 * fjac(2,1,i-1,j)
+     >              - tmp1 * njac(2,1,i-1,j)
+               lhsa(2,2,i,j) = - tmp2 * fjac(2,2,i-1,j)
+     >              - tmp1 * njac(2,2,i-1,j)
+     >              - tmp1 * dx2
+               lhsa(2,3,i,j) = - tmp2 * fjac(2,3,i-1,j)
+     >              - tmp1 * njac(2,3,i-1,j)
+               lhsa(2,4,i,j) = - tmp2 * fjac(2,4,i-1,j)
+     >              - tmp1 * njac(2,4,i-1,j)
+               lhsa(2,5,i,j) = - tmp2 * fjac(2,5,i-1,j)
+     >              - tmp1 * njac(2,5,i-1,j)
+
+               lhsa(3,1,i,j) = - tmp2 * fjac(3,1,i-1,j)
+     >              - tmp1 * njac(3,1,i-1,j)
+               lhsa(3,2,i,j) = - tmp2 * fjac(3,2,i-1,j)
+     >              - tmp1 * njac(3,2,i-1,j)
+               lhsa(3,3,i,j) = - tmp2 * fjac(3,3,i-1,j)
+     >              - tmp1 * njac(3,3,i-1,j)
+     >              - tmp1 * dx3 
+               lhsa(3,4,i,j) = - tmp2 * fjac(3,4,i-1,j)
+     >              - tmp1 * njac(3,4,i-1,j)
+               lhsa(3,5,i,j) = - tmp2 * fjac(3,5,i-1,j)
+     >              - tmp1 * njac(3,5,i-1,j)
+
+               lhsa(4,1,i,j) = - tmp2 * fjac(4,1,i-1,j)
+     >              - tmp1 * njac(4,1,i-1,j)
+               lhsa(4,2,i,j) = - tmp2 * fjac(4,2,i-1,j)
+     >              - tmp1 * njac(4,2,i-1,j)
+               lhsa(4,3,i,j) = - tmp2 * fjac(4,3,i-1,j)
+     >              - tmp1 * njac(4,3,i-1,j)
+               lhsa(4,4,i,j) = - tmp2 * fjac(4,4,i-1,j)
+     >              - tmp1 * njac(4,4,i-1,j)
+     >              - tmp1 * dx4
+               lhsa(4,5,i,j) = - tmp2 * fjac(4,5,i-1,j)
+     >              - tmp1 * njac(4,5,i-1,j)
+
+               lhsa(5,1,i,j) = - tmp2 * fjac(5,1,i-1,j)
+     >              - tmp1 * njac(5,1,i-1,j)
+               lhsa(5,2,i,j) = - tmp2 * fjac(5,2,i-1,j)
+     >              - tmp1 * njac(5,2,i-1,j)
+               lhsa(5,3,i,j) = - tmp2 * fjac(5,3,i-1,j)
+     >              - tmp1 * njac(5,3,i-1,j)
+               lhsa(5,4,i,j) = - tmp2 * fjac(5,4,i-1,j)
+     >              - tmp1 * njac(5,4,i-1,j)
+               lhsa(5,5,i,j) = - tmp2 * fjac(5,5,i-1,j)
+     >              - tmp1 * njac(5,5,i-1,j)
+     >              - tmp1 * dx5
+
+               lhsb(1,1,i,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(1,1,i,j)
+     >              + tmp1 * 2.0d+00 * dx1
+               lhsb(1,2,i,j) = tmp1 * 2.0d+00 * njac(1,2,i,j)
+               lhsb(1,3,i,j) = tmp1 * 2.0d+00 * njac(1,3,i,j)
+               lhsb(1,4,i,j) = tmp1 * 2.0d+00 * njac(1,4,i,j)
+               lhsb(1,5,i,j) = tmp1 * 2.0d+00 * njac(1,5,i,j)
+
+               lhsb(2,1,i,j) = tmp1 * 2.0d+00 * njac(2,1,i,j)
+               lhsb(2,2,i,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(2,2,i,j)
+     >              + tmp1 * 2.0d+00 * dx2
+               lhsb(2,3,i,j) = tmp1 * 2.0d+00 * njac(2,3,i,j)
+               lhsb(2,4,i,j) = tmp1 * 2.0d+00 * njac(2,4,i,j)
+               lhsb(2,5,i,j) = tmp1 * 2.0d+00 * njac(2,5,i,j)
+
+               lhsb(3,1,i,j) = tmp1 * 2.0d+00 * njac(3,1,i,j)
+               lhsb(3,2,i,j) = tmp1 * 2.0d+00 * njac(3,2,i,j)
+               lhsb(3,3,i,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(3,3,i,j)
+     >              + tmp1 * 2.0d+00 * dx3
+               lhsb(3,4,i,j) = tmp1 * 2.0d+00 * njac(3,4,i,j)
+               lhsb(3,5,i,j) = tmp1 * 2.0d+00 * njac(3,5,i,j)
+
+               lhsb(4,1,i,j) = tmp1 * 2.0d+00 * njac(4,1,i,j)
+               lhsb(4,2,i,j) = tmp1 * 2.0d+00 * njac(4,2,i,j)
+               lhsb(4,3,i,j) = tmp1 * 2.0d+00 * njac(4,3,i,j)
+               lhsb(4,4,i,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(4,4,i,j)
+     >              + tmp1 * 2.0d+00 * dx4
+               lhsb(4,5,i,j) = tmp1 * 2.0d+00 * njac(4,5,i,j)
+
+               lhsb(5,1,i,j) = tmp1 * 2.0d+00 * njac(5,1,i,j)
+               lhsb(5,2,i,j) = tmp1 * 2.0d+00 * njac(5,2,i,j)
+               lhsb(5,3,i,j) = tmp1 * 2.0d+00 * njac(5,3,i,j)
+               lhsb(5,4,i,j) = tmp1 * 2.0d+00 * njac(5,4,i,j)
+               lhsb(5,5,i,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(5,5,i,j)
+     >              + tmp1 * 2.0d+00 * dx5
+
+               lhsc(1,1,i,j,k,c) =  tmp2 * fjac(1,1,i+1,j)
+     >              - tmp1 * njac(1,1,i+1,j)
+     >              - tmp1 * dx1
+               lhsc(1,2,i,j,k,c) =  tmp2 * fjac(1,2,i+1,j)
+     >              - tmp1 * njac(1,2,i+1,j)
+               lhsc(1,3,i,j,k,c) =  tmp2 * fjac(1,3,i+1,j)
+     >              - tmp1 * njac(1,3,i+1,j)
+               lhsc(1,4,i,j,k,c) =  tmp2 * fjac(1,4,i+1,j)
+     >              - tmp1 * njac(1,4,i+1,j)
+               lhsc(1,5,i,j,k,c) =  tmp2 * fjac(1,5,i+1,j)
+     >              - tmp1 * njac(1,5,i+1,j)
+
+               lhsc(2,1,i,j,k,c) =  tmp2 * fjac(2,1,i+1,j)
+     >              - tmp1 * njac(2,1,i+1,j)
+               lhsc(2,2,i,j,k,c) =  tmp2 * fjac(2,2,i+1,j)
+     >              - tmp1 * njac(2,2,i+1,j)
+     >              - tmp1 * dx2
+               lhsc(2,3,i,j,k,c) =  tmp2 * fjac(2,3,i+1,j)
+     >              - tmp1 * njac(2,3,i+1,j)
+               lhsc(2,4,i,j,k,c) =  tmp2 * fjac(2,4,i+1,j)
+     >              - tmp1 * njac(2,4,i+1,j)
+               lhsc(2,5,i,j,k,c) =  tmp2 * fjac(2,5,i+1,j)
+     >              - tmp1 * njac(2,5,i+1,j)
+
+               lhsc(3,1,i,j,k,c) =  tmp2 * fjac(3,1,i+1,j)
+     >              - tmp1 * njac(3,1,i+1,j)
+               lhsc(3,2,i,j,k,c) =  tmp2 * fjac(3,2,i+1,j)
+     >              - tmp1 * njac(3,2,i+1,j)
+               lhsc(3,3,i,j,k,c) =  tmp2 * fjac(3,3,i+1,j)
+     >              - tmp1 * njac(3,3,i+1,j)
+     >              - tmp1 * dx3
+               lhsc(3,4,i,j,k,c) =  tmp2 * fjac(3,4,i+1,j)
+     >              - tmp1 * njac(3,4,i+1,j)
+               lhsc(3,5,i,j,k,c) =  tmp2 * fjac(3,5,i+1,j)
+     >              - tmp1 * njac(3,5,i+1,j)
+
+               lhsc(4,1,i,j,k,c) =  tmp2 * fjac(4,1,i+1,j)
+     >              - tmp1 * njac(4,1,i+1,j)
+               lhsc(4,2,i,j,k,c) =  tmp2 * fjac(4,2,i+1,j)
+     >              - tmp1 * njac(4,2,i+1,j)
+               lhsc(4,3,i,j,k,c) =  tmp2 * fjac(4,3,i+1,j)
+     >              - tmp1 * njac(4,3,i+1,j)
+               lhsc(4,4,i,j,k,c) =  tmp2 * fjac(4,4,i+1,j)
+     >              - tmp1 * njac(4,4,i+1,j)
+     >              - tmp1 * dx4
+               lhsc(4,5,i,j,k,c) =  tmp2 * fjac(4,5,i+1,j)
+     >              - tmp1 * njac(4,5,i+1,j)
+
+               lhsc(5,1,i,j,k,c) =  tmp2 * fjac(5,1,i+1,j)
+     >              - tmp1 * njac(5,1,i+1,j)
+               lhsc(5,2,i,j,k,c) =  tmp2 * fjac(5,2,i+1,j)
+     >              - tmp1 * njac(5,2,i+1,j)
+               lhsc(5,3,i,j,k,c) =  tmp2 * fjac(5,3,i+1,j)
+     >              - tmp1 * njac(5,3,i+1,j)
+               lhsc(5,4,i,j,k,c) =  tmp2 * fjac(5,4,i+1,j)
+     >              - tmp1 * njac(5,4,i+1,j)
+               lhsc(5,5,i,j,k,c) =  tmp2 * fjac(5,5,i+1,j)
+     >              - tmp1 * njac(5,5,i+1,j)
+     >              - tmp1 * dx5
+
+            enddo
+         enddo
+
+
+c---------------------------------------------------------------------
+c     outermost do loops - sweeping in i direction
+c---------------------------------------------------------------------
+         if (first .eq. 1) then 
+
+c---------------------------------------------------------------------
+c     multiply c(istart,j,k) by b_inverse and copy back to c
+c     multiply rhs(istart) by b_inverse(istart) and copy to rhs
+c---------------------------------------------------------------------
+!dir$ ivdep
+            do j=start(2,c),jsize
+               call binvcrhs( lhsb(1,1,istart,j),
+     >                        lhsc(1,1,istart,j,k,c),
+     >                        rhs(1,istart,j,k,c) )
+            enddo
+
+         endif
+
+c---------------------------------------------------------------------
+c     begin innermost do loop
+c     do all the elements of the cell unless last 
+c---------------------------------------------------------------------
+!dir$ ivdep
+!dir$ interchange(i,j)
+         do j=start(2,c),jsize
+            do i=istart+first,isize-last
+
+c---------------------------------------------------------------------
+c     rhs(i) = rhs(i) - A*rhs(i-1)
+c---------------------------------------------------------------------
+               call matvec_sub(lhsa(1,1,i,j),
+     >                         rhs(1,i-1,j,k,c),rhs(1,i,j,k,c))
+
+c---------------------------------------------------------------------
+c     B(i) = B(i) - C(i-1)*A(i)
+c---------------------------------------------------------------------
+               call matmul_sub(lhsa(1,1,i,j),
+     >                         lhsc(1,1,i-1,j,k,c),
+     >                         lhsb(1,1,i,j))
+
+
+c---------------------------------------------------------------------
+c     multiply c(i,j,k) by b_inverse and copy back to c
+c     multiply rhs(i,j,k) by b_inverse(i,j,k) and copy to rhs
+c---------------------------------------------------------------------
+               call binvcrhs( lhsb(1,1,i,j),
+     >                        lhsc(1,1,i,j,k,c),
+     >                        rhs(1,i,j,k,c) )
+
+            enddo
+         enddo
+
+c---------------------------------------------------------------------
+c     Now finish up special cases for last cell
+c---------------------------------------------------------------------
+         if (last .eq. 1) then
+
+!dir$ ivdep
+            do j=start(2,c),jsize
+c---------------------------------------------------------------------
+c     rhs(isize) = rhs(isize) - A*rhs(isize-1)
+c---------------------------------------------------------------------
+               call matvec_sub(lhsa(1,1,isize,j),
+     >                         rhs(1,isize-1,j,k,c),rhs(1,isize,j,k,c))
+
+c---------------------------------------------------------------------
+c     B(isize) = B(isize) - C(isize-1)*A(isize)
+c---------------------------------------------------------------------
+               call matmul_sub(lhsa(1,1,isize,j),
+     >                         lhsc(1,1,isize-1,j,k,c),
+     >                         lhsb(1,1,isize,j))
+
+c---------------------------------------------------------------------
+c     multiply rhs() by b_inverse() and copy to rhs
+c---------------------------------------------------------------------
+               call binvrhs( lhsb(1,1,isize,j),
+     >                       rhs(1,isize,j,k,c) )
+            enddo
+
+         endif
+      enddo
+
+
+      return
+      end
+      
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/y_solve.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/y_solve.f
new file mode 100644
index 0000000..50a028b
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/y_solve.f
@@ -0,0 +1,781 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine y_solve
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     Performs line solves in Y direction by first factoring
+c     the block-tridiagonal matrix into an upper triangular matrix, 
+c     and then performing back substitution to solve for the unknown
+c     vectors of each line.  
+c     
+c     Make sure we treat elements zero to cell_size in the direction
+c     of the sweep.
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer 
+     >     c, jstart, stage,
+     >     first, last, recv_id, error, r_status(MPI_STATUS_SIZE),
+     >     isize,jsize,ksize,send_id
+
+      jstart = 0
+
+      if (timeron) call timer_start(t_ysolve)
+c---------------------------------------------------------------------
+c     in our terminology stage is the number of the cell in the y-direction
+c     i.e. stage = 1 means the start of the line, stage = ncells means the end
+c---------------------------------------------------------------------
+      do stage = 1,ncells
+         c = slice(2,stage)
+         isize = cell_size(1,c) - 1
+         jsize = cell_size(2,c) - 1
+         ksize = cell_size(3,c) - 1
+
+c---------------------------------------------------------------------
+c     set last-cell flag
+c---------------------------------------------------------------------
+         if (stage .eq. ncells) then
+            last = 1
+         else
+            last = 0
+         endif
+
+         if (stage .eq. 1) then
+c---------------------------------------------------------------------
+c     This is the first cell, so solve without receiving data
+c---------------------------------------------------------------------
+            first = 1
+c            call lhsy(c)
+            call y_solve_cell(first,last,c)
+         else
+c---------------------------------------------------------------------
+c     Not the first cell of this line, so receive info from
+c     processor working on preceding cell
+c---------------------------------------------------------------------
+            first = 0
+            if (timeron) call timer_start(t_ycomm)
+            call y_receive_solve_info(recv_id,c)
+c---------------------------------------------------------------------
+c     overlap computations and communications
+c---------------------------------------------------------------------
+c            call lhsy(c)
+c---------------------------------------------------------------------
+c     wait for completion
+c---------------------------------------------------------------------
+            call mpi_wait(send_id,r_status,error)
+            call mpi_wait(recv_id,r_status,error)
+            if (timeron) call timer_stop(t_ycomm)
+c---------------------------------------------------------------------
+c     install C'(jstart+1) and rhs'(jstart+1) to be used in this cell
+c---------------------------------------------------------------------
+            call y_unpack_solve_info(c)
+            call y_solve_cell(first,last,c)
+         endif
+
+         if (last .eq. 0) call y_send_solve_info(send_id,c)
+      enddo
+
+c---------------------------------------------------------------------
+c     now perform backsubstitution in reverse direction
+c---------------------------------------------------------------------
+      do stage = ncells, 1, -1
+         c = slice(2,stage)
+         first = 0
+         last = 0
+         if (stage .eq. 1) first = 1
+         if (stage .eq. ncells) then
+            last = 1
+c---------------------------------------------------------------------
+c     last cell, so perform back substitute without waiting
+c---------------------------------------------------------------------
+            call y_backsubstitute(first, last,c)
+         else
+            if (timeron) call timer_start(t_ycomm)
+            call y_receive_backsub_info(recv_id,c)
+            call mpi_wait(send_id,r_status,error)
+            call mpi_wait(recv_id,r_status,error)
+            if (timeron) call timer_stop(t_ycomm)
+            call y_unpack_backsub_info(c)
+            call y_backsubstitute(first,last,c)
+         endif
+         if (first .eq. 0) call y_send_backsub_info(send_id,c)
+      enddo
+
+      if (timeron) call timer_stop(t_ysolve)
+
+      return
+      end
+      
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      
+      subroutine y_unpack_solve_info(c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     unpack C'(-1) and rhs'(-1) for
+c     all i and k
+c---------------------------------------------------------------------
+
+      include 'header.h'
+
+      integer i,k,m,n,ptr,c,jstart 
+
+      jstart = 0
+      ptr = 0
+      do k=0,KMAX-1
+         do i=0,IMAX-1
+            do m=1,BLOCK_SIZE
+               do n=1,BLOCK_SIZE
+                  lhsc(m,n,i,jstart-1,k,c) = out_buffer(ptr+n)
+               enddo
+               ptr = ptr+BLOCK_SIZE
+            enddo
+            do n=1,BLOCK_SIZE
+               rhs(n,i,jstart-1,k,c) = out_buffer(ptr+n)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      
+      subroutine y_send_solve_info(send_id,c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     pack up and send C'(jend) and rhs'(jend) for
+c     all i and k
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer i,k,m,n,jsize,ptr,c,ip,kp
+      integer error,send_id,buffer_size 
+
+      jsize = cell_size(2,c)-1
+      ip = cell_coord(1,c) - 1
+      kp = cell_coord(3,c) - 1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*
+     >     (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE)
+
+c---------------------------------------------------------------------
+c     pack up buffer
+c---------------------------------------------------------------------
+      ptr = 0
+      do k=0,KMAX-1
+         do i=0,IMAX-1
+            do m=1,BLOCK_SIZE
+               do n=1,BLOCK_SIZE
+                  in_buffer(ptr+n) = lhsc(m,n,i,jsize,k,c)
+               enddo
+               ptr = ptr+BLOCK_SIZE
+            enddo
+            do n=1,BLOCK_SIZE
+               in_buffer(ptr+n) = rhs(n,i,jsize,k,c)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+c---------------------------------------------------------------------
+c     send buffer 
+c---------------------------------------------------------------------
+      if (timeron) call timer_start(t_ycomm)
+      call mpi_isend(in_buffer, buffer_size,
+     >     dp_type, successor(2),
+     >     SOUTH+ip+kp*NCELLS, comm_solve,
+     >     send_id,error)
+      if (timeron) call timer_stop(t_ycomm)
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine y_send_backsub_info(send_id,c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     pack up and send U(jstart) for all i and k
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer i,k,n,ptr,c,jstart,ip,kp
+      integer error,send_id,buffer_size
+
+c---------------------------------------------------------------------
+c     Send element 0 to previous processor
+c---------------------------------------------------------------------
+      jstart = 0
+      ip = cell_coord(1,c)-1
+      kp = cell_coord(3,c)-1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE
+      ptr = 0
+      do k=0,KMAX-1
+         do i=0,IMAX-1
+            do n=1,BLOCK_SIZE
+               in_buffer(ptr+n) = rhs(n,i,jstart,k,c)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+      if (timeron) call timer_start(t_ycomm)
+      call mpi_isend(in_buffer, buffer_size,
+     >     dp_type, predecessor(2), 
+     >     NORTH+ip+kp*NCELLS, comm_solve, 
+     >     send_id,error)
+      if (timeron) call timer_stop(t_ycomm)
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine y_unpack_backsub_info(c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     unpack U(jsize) for all i and k
+c---------------------------------------------------------------------
+
+      include 'header.h'
+
+      integer i,k,n,ptr,c 
+
+      ptr = 0
+      do k=0,KMAX-1
+         do i=0,IMAX-1
+            do n=1,BLOCK_SIZE
+               backsub_info(n,i,k,c) = out_buffer(ptr+n)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine y_receive_backsub_info(recv_id,c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     post mpi receives
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer error,recv_id,ip,kp,c,buffer_size
+      ip = cell_coord(1,c) - 1
+      kp = cell_coord(3,c) - 1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE
+      call mpi_irecv(out_buffer, buffer_size,
+     >     dp_type, successor(2), 
+     >     NORTH+ip+kp*NCELLS, comm_solve, 
+     >     recv_id, error)
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine y_receive_solve_info(recv_id,c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     post mpi receives 
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer ip,kp,recv_id,error,c,buffer_size
+      ip = cell_coord(1,c) - 1
+      kp = cell_coord(3,c) - 1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*
+     >     (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE)
+      call mpi_irecv(out_buffer, buffer_size, 
+     >     dp_type, predecessor(2), 
+     >     SOUTH+ip+kp*NCELLS,  comm_solve, 
+     >     recv_id, error)
+
+      return
+      end
+      
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine y_backsubstitute(first, last, c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     back solve: if last cell, then generate U(jsize)=rhs(jsize)
+c     else assume U(jsize) was loaded by y_unpack_backsub_info
+c     so just use it
+c     after call u(jstart) will be sent to next cell
+c---------------------------------------------------------------------
+
+      include 'header.h'
+
+      integer first, last, c, i, k
+      integer m,n,j,jsize,isize,ksize,jstart
+      
+      jstart = 0
+      isize = cell_size(1,c)-end(1,c)-1      
+      jsize = cell_size(2,c)-1
+      ksize = cell_size(3,c)-end(3,c)-1
+      if (last .eq. 0) then
+         do k=start(3,c),ksize
+            do i=start(1,c),isize
+c---------------------------------------------------------------------
+c     U(jsize) uses info from previous cell if not last cell
+c---------------------------------------------------------------------
+               do m=1,BLOCK_SIZE
+                  do n=1,BLOCK_SIZE
+                     rhs(m,i,jsize,k,c) = rhs(m,i,jsize,k,c) 
+     >                    - lhsc(m,n,i,jsize,k,c)*
+     >                    backsub_info(n,i,k,c)
+                  enddo
+               enddo
+            enddo
+         enddo
+      endif
+      do k=start(3,c),ksize
+         do j=jsize-1,jstart,-1
+            do i=start(1,c),isize
+               do m=1,BLOCK_SIZE
+                  do n=1,BLOCK_SIZE
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c) 
+     >                    - lhsc(m,n,i,j,k,c)*rhs(n,i,j+1,k,c)
+                  enddo
+               enddo
+            enddo
+         enddo
+      enddo
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine y_solve_cell(first,last,c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     performs Gaussian elimination on this cell.
+c     
+c     assumes that unpacking routines for non-first cells 
+c     preload C' and rhs' from previous cell.
+c     
+c     assumed send happens outside this routine, but that
+c     c'(JMAX) and rhs'(JMAX) will be sent to next cell
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'work_lhs.h'
+
+      integer first,last,c
+      integer i,j,k,isize,ksize,jsize,jstart
+      double precision utmp(6,-2:JMAX+1)
+
+      jstart = 0
+      isize = cell_size(1,c)-end(1,c)-1
+      jsize = cell_size(2,c)-1
+      ksize = cell_size(3,c)-end(3,c)-1
+
+      call lhsabinit(lhsa, lhsb, jsize)
+
+      do k=start(3,c),ksize 
+         do i=start(1,c),isize
+
+c---------------------------------------------------------------------
+c     This function computes the left hand side for the three y-factors   
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     Compute the indices for storing the tri-diagonal matrix;
+c     determine a (labeled f) and n jacobians for cell c
+c---------------------------------------------------------------------
+            do j = start(2,c)-1, cell_size(2,c)-end(2,c)
+               utmp(1,j) = 1.0d0 / u(1,i,j,k,c)
+               utmp(2,j) = u(2,i,j,k,c)
+               utmp(3,j) = u(3,i,j,k,c)
+               utmp(4,j) = u(4,i,j,k,c)
+               utmp(5,j) = u(5,i,j,k,c)
+               utmp(6,j) = qs(i,j,k,c)
+            end do
+
+            do j = start(2,c)-1, cell_size(2,c)-end(2,c)
+
+               tmp1 = utmp(1,j)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               fjac(1,1,j) = 0.0d+00
+               fjac(1,2,j) = 0.0d+00
+               fjac(1,3,j) = 1.0d+00
+               fjac(1,4,j) = 0.0d+00
+               fjac(1,5,j) = 0.0d+00
+
+               fjac(2,1,j) = - ( utmp(2,j)*utmp(3,j) )
+     >              * tmp2
+               fjac(2,2,j) = utmp(3,j) * tmp1
+               fjac(2,3,j) = utmp(2,j) * tmp1
+               fjac(2,4,j) = 0.0d+00
+               fjac(2,5,j) = 0.0d+00
+
+               fjac(3,1,j) = - ( utmp(3,j)*utmp(3,j)*tmp2)
+     >              + c2 * utmp(6,j)
+               fjac(3,2,j) = - c2 *  utmp(2,j) * tmp1
+               fjac(3,3,j) = ( 2.0d+00 - c2 )
+     >              *  utmp(3,j) * tmp1 
+               fjac(3,4,j) = - c2 * utmp(4,j) * tmp1 
+               fjac(3,5,j) = c2
+
+               fjac(4,1,j) = - ( utmp(3,j)*utmp(4,j) )
+     >              * tmp2
+               fjac(4,2,j) = 0.0d+00
+               fjac(4,3,j) = utmp(4,j) * tmp1
+               fjac(4,4,j) = utmp(3,j) * tmp1
+               fjac(4,5,j) = 0.0d+00
+
+               fjac(5,1,j) = ( c2 * 2.0d0 * utmp(6,j)
+     >              - c1 * utmp(5,j) * tmp1 ) 
+     >              * utmp(3,j) * tmp1 
+               fjac(5,2,j) = - c2 * utmp(2,j)*utmp(3,j) 
+     >              * tmp2
+               fjac(5,3,j) = c1 * utmp(5,j) * tmp1 
+     >              - c2 * ( utmp(6,j)
+     >              + utmp(3,j)*utmp(3,j) * tmp2 )
+               fjac(5,4,j) = - c2 * ( utmp(3,j)*utmp(4,j) )
+     >              * tmp2
+               fjac(5,5,j) = c1 * utmp(3,j) * tmp1 
+
+               njac(1,1,j) = 0.0d+00
+               njac(1,2,j) = 0.0d+00
+               njac(1,3,j) = 0.0d+00
+               njac(1,4,j) = 0.0d+00
+               njac(1,5,j) = 0.0d+00
+
+               njac(2,1,j) = - c3c4 * tmp2 * utmp(2,j)
+               njac(2,2,j) =   c3c4 * tmp1
+               njac(2,3,j) =   0.0d+00
+               njac(2,4,j) =   0.0d+00
+               njac(2,5,j) =   0.0d+00
+
+               njac(3,1,j) = - con43 * c3c4 * tmp2 * utmp(3,j)
+               njac(3,2,j) =   0.0d+00
+               njac(3,3,j) =   con43 * c3c4 * tmp1
+               njac(3,4,j) =   0.0d+00
+               njac(3,5,j) =   0.0d+00
+
+               njac(4,1,j) = - c3c4 * tmp2 * utmp(4,j)
+               njac(4,2,j) =   0.0d+00
+               njac(4,3,j) =   0.0d+00
+               njac(4,4,j) =   c3c4 * tmp1
+               njac(4,5,j) =   0.0d+00
+
+               njac(5,1,j) = - (  c3c4
+     >              - c1345 ) * tmp3 * (utmp(2,j)**2)
+     >              - ( con43 * c3c4
+     >              - c1345 ) * tmp3 * (utmp(3,j)**2)
+     >              - ( c3c4 - c1345 ) * tmp3 * (utmp(4,j)**2)
+     >              - c1345 * tmp2 * utmp(5,j)
+
+               njac(5,2,j) = (  c3c4 - c1345 ) * tmp2 * utmp(2,j)
+               njac(5,3,j) = ( con43 * c3c4
+     >              - c1345 ) * tmp2 * utmp(3,j)
+               njac(5,4,j) = ( c3c4 - c1345 ) * tmp2 * utmp(4,j)
+               njac(5,5,j) = ( c1345 ) * tmp1
+
+            enddo
+
+c---------------------------------------------------------------------
+c     now jacobians set, so form left hand side in y direction
+c---------------------------------------------------------------------
+            do j = start(2,c), jsize-end(2,c)
+
+               tmp1 = dt * ty1
+               tmp2 = dt * ty2
+
+               lhsa(1,1,j) = - tmp2 * fjac(1,1,j-1)
+     >              - tmp1 * njac(1,1,j-1)
+     >              - tmp1 * dy1 
+               lhsa(1,2,j) = - tmp2 * fjac(1,2,j-1)
+     >              - tmp1 * njac(1,2,j-1)
+               lhsa(1,3,j) = - tmp2 * fjac(1,3,j-1)
+     >              - tmp1 * njac(1,3,j-1)
+               lhsa(1,4,j) = - tmp2 * fjac(1,4,j-1)
+     >              - tmp1 * njac(1,4,j-1)
+               lhsa(1,5,j) = - tmp2 * fjac(1,5,j-1)
+     >              - tmp1 * njac(1,5,j-1)
+
+               lhsa(2,1,j) = - tmp2 * fjac(2,1,j-1)
+     >              - tmp1 * njac(2,1,j-1)
+               lhsa(2,2,j) = - tmp2 * fjac(2,2,j-1)
+     >              - tmp1 * njac(2,2,j-1)
+     >              - tmp1 * dy2
+               lhsa(2,3,j) = - tmp2 * fjac(2,3,j-1)
+     >              - tmp1 * njac(2,3,j-1)
+               lhsa(2,4,j) = - tmp2 * fjac(2,4,j-1)
+     >              - tmp1 * njac(2,4,j-1)
+               lhsa(2,5,j) = - tmp2 * fjac(2,5,j-1)
+     >              - tmp1 * njac(2,5,j-1)
+
+               lhsa(3,1,j) = - tmp2 * fjac(3,1,j-1)
+     >              - tmp1 * njac(3,1,j-1)
+               lhsa(3,2,j) = - tmp2 * fjac(3,2,j-1)
+     >              - tmp1 * njac(3,2,j-1)
+               lhsa(3,3,j) = - tmp2 * fjac(3,3,j-1)
+     >              - tmp1 * njac(3,3,j-1)
+     >              - tmp1 * dy3 
+               lhsa(3,4,j) = - tmp2 * fjac(3,4,j-1)
+     >              - tmp1 * njac(3,4,j-1)
+               lhsa(3,5,j) = - tmp2 * fjac(3,5,j-1)
+     >              - tmp1 * njac(3,5,j-1)
+
+               lhsa(4,1,j) = - tmp2 * fjac(4,1,j-1)
+     >              - tmp1 * njac(4,1,j-1)
+               lhsa(4,2,j) = - tmp2 * fjac(4,2,j-1)
+     >              - tmp1 * njac(4,2,j-1)
+               lhsa(4,3,j) = - tmp2 * fjac(4,3,j-1)
+     >              - tmp1 * njac(4,3,j-1)
+               lhsa(4,4,j) = - tmp2 * fjac(4,4,j-1)
+     >              - tmp1 * njac(4,4,j-1)
+     >              - tmp1 * dy4
+               lhsa(4,5,j) = - tmp2 * fjac(4,5,j-1)
+     >              - tmp1 * njac(4,5,j-1)
+
+               lhsa(5,1,j) = - tmp2 * fjac(5,1,j-1)
+     >              - tmp1 * njac(5,1,j-1)
+               lhsa(5,2,j) = - tmp2 * fjac(5,2,j-1)
+     >              - tmp1 * njac(5,2,j-1)
+               lhsa(5,3,j) = - tmp2 * fjac(5,3,j-1)
+     >              - tmp1 * njac(5,3,j-1)
+               lhsa(5,4,j) = - tmp2 * fjac(5,4,j-1)
+     >              - tmp1 * njac(5,4,j-1)
+               lhsa(5,5,j) = - tmp2 * fjac(5,5,j-1)
+     >              - tmp1 * njac(5,5,j-1)
+     >              - tmp1 * dy5
+
+               lhsb(1,1,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(1,1,j)
+     >              + tmp1 * 2.0d+00 * dy1
+               lhsb(1,2,j) = tmp1 * 2.0d+00 * njac(1,2,j)
+               lhsb(1,3,j) = tmp1 * 2.0d+00 * njac(1,3,j)
+               lhsb(1,4,j) = tmp1 * 2.0d+00 * njac(1,4,j)
+               lhsb(1,5,j) = tmp1 * 2.0d+00 * njac(1,5,j)
+
+               lhsb(2,1,j) = tmp1 * 2.0d+00 * njac(2,1,j)
+               lhsb(2,2,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(2,2,j)
+     >              + tmp1 * 2.0d+00 * dy2
+               lhsb(2,3,j) = tmp1 * 2.0d+00 * njac(2,3,j)
+               lhsb(2,4,j) = tmp1 * 2.0d+00 * njac(2,4,j)
+               lhsb(2,5,j) = tmp1 * 2.0d+00 * njac(2,5,j)
+
+               lhsb(3,1,j) = tmp1 * 2.0d+00 * njac(3,1,j)
+               lhsb(3,2,j) = tmp1 * 2.0d+00 * njac(3,2,j)
+               lhsb(3,3,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(3,3,j)
+     >              + tmp1 * 2.0d+00 * dy3
+               lhsb(3,4,j) = tmp1 * 2.0d+00 * njac(3,4,j)
+               lhsb(3,5,j) = tmp1 * 2.0d+00 * njac(3,5,j)
+
+               lhsb(4,1,j) = tmp1 * 2.0d+00 * njac(4,1,j)
+               lhsb(4,2,j) = tmp1 * 2.0d+00 * njac(4,2,j)
+               lhsb(4,3,j) = tmp1 * 2.0d+00 * njac(4,3,j)
+               lhsb(4,4,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(4,4,j)
+     >              + tmp1 * 2.0d+00 * dy4
+               lhsb(4,5,j) = tmp1 * 2.0d+00 * njac(4,5,j)
+
+               lhsb(5,1,j) = tmp1 * 2.0d+00 * njac(5,1,j)
+               lhsb(5,2,j) = tmp1 * 2.0d+00 * njac(5,2,j)
+               lhsb(5,3,j) = tmp1 * 2.0d+00 * njac(5,3,j)
+               lhsb(5,4,j) = tmp1 * 2.0d+00 * njac(5,4,j)
+               lhsb(5,5,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(5,5,j) 
+     >              + tmp1 * 2.0d+00 * dy5
+
+               lhsc(1,1,i,j,k,c) =  tmp2 * fjac(1,1,j+1)
+     >              - tmp1 * njac(1,1,j+1)
+     >              - tmp1 * dy1
+               lhsc(1,2,i,j,k,c) =  tmp2 * fjac(1,2,j+1)
+     >              - tmp1 * njac(1,2,j+1)
+               lhsc(1,3,i,j,k,c) =  tmp2 * fjac(1,3,j+1)
+     >              - tmp1 * njac(1,3,j+1)
+               lhsc(1,4,i,j,k,c) =  tmp2 * fjac(1,4,j+1)
+     >              - tmp1 * njac(1,4,j+1)
+               lhsc(1,5,i,j,k,c) =  tmp2 * fjac(1,5,j+1)
+     >              - tmp1 * njac(1,5,j+1)
+
+               lhsc(2,1,i,j,k,c) =  tmp2 * fjac(2,1,j+1)
+     >              - tmp1 * njac(2,1,j+1)
+               lhsc(2,2,i,j,k,c) =  tmp2 * fjac(2,2,j+1)
+     >              - tmp1 * njac(2,2,j+1)
+     >              - tmp1 * dy2
+               lhsc(2,3,i,j,k,c) =  tmp2 * fjac(2,3,j+1)
+     >              - tmp1 * njac(2,3,j+1)
+               lhsc(2,4,i,j,k,c) =  tmp2 * fjac(2,4,j+1)
+     >              - tmp1 * njac(2,4,j+1)
+               lhsc(2,5,i,j,k,c) =  tmp2 * fjac(2,5,j+1)
+     >              - tmp1 * njac(2,5,j+1)
+
+               lhsc(3,1,i,j,k,c) =  tmp2 * fjac(3,1,j+1)
+     >              - tmp1 * njac(3,1,j+1)
+               lhsc(3,2,i,j,k,c) =  tmp2 * fjac(3,2,j+1)
+     >              - tmp1 * njac(3,2,j+1)
+               lhsc(3,3,i,j,k,c) =  tmp2 * fjac(3,3,j+1)
+     >              - tmp1 * njac(3,3,j+1)
+     >              - tmp1 * dy3
+               lhsc(3,4,i,j,k,c) =  tmp2 * fjac(3,4,j+1)
+     >              - tmp1 * njac(3,4,j+1)
+               lhsc(3,5,i,j,k,c) =  tmp2 * fjac(3,5,j+1)
+     >              - tmp1 * njac(3,5,j+1)
+
+               lhsc(4,1,i,j,k,c) =  tmp2 * fjac(4,1,j+1)
+     >              - tmp1 * njac(4,1,j+1)
+               lhsc(4,2,i,j,k,c) =  tmp2 * fjac(4,2,j+1)
+     >              - tmp1 * njac(4,2,j+1)
+               lhsc(4,3,i,j,k,c) =  tmp2 * fjac(4,3,j+1)
+     >              - tmp1 * njac(4,3,j+1)
+               lhsc(4,4,i,j,k,c) =  tmp2 * fjac(4,4,j+1)
+     >              - tmp1 * njac(4,4,j+1)
+     >              - tmp1 * dy4
+               lhsc(4,5,i,j,k,c) =  tmp2 * fjac(4,5,j+1)
+     >              - tmp1 * njac(4,5,j+1)
+
+               lhsc(5,1,i,j,k,c) =  tmp2 * fjac(5,1,j+1)
+     >              - tmp1 * njac(5,1,j+1)
+               lhsc(5,2,i,j,k,c) =  tmp2 * fjac(5,2,j+1)
+     >              - tmp1 * njac(5,2,j+1)
+               lhsc(5,3,i,j,k,c) =  tmp2 * fjac(5,3,j+1)
+     >              - tmp1 * njac(5,3,j+1)
+               lhsc(5,4,i,j,k,c) =  tmp2 * fjac(5,4,j+1)
+     >              - tmp1 * njac(5,4,j+1)
+               lhsc(5,5,i,j,k,c) =  tmp2 * fjac(5,5,j+1)
+     >              - tmp1 * njac(5,5,j+1)
+     >              - tmp1 * dy5
+
+            enddo
+
+
+c---------------------------------------------------------------------
+c     outer most do loops - sweeping in i direction
+c---------------------------------------------------------------------
+            if (first .eq. 1) then 
+
+c---------------------------------------------------------------------
+c     multiply c(i,jstart,k) by b_inverse and copy back to c
+c     multiply rhs(jstart) by b_inverse(jstart) and copy to rhs
+c---------------------------------------------------------------------
+               call binvcrhs( lhsb(1,1,jstart),
+     >                        lhsc(1,1,i,jstart,k,c),
+     >                        rhs(1,i,jstart,k,c) )
+
+            endif
+
+c---------------------------------------------------------------------
+c     begin inner most do loop
+c     do all the elements of the cell unless last 
+c---------------------------------------------------------------------
+            do j=jstart+first,jsize-last
+
+c---------------------------------------------------------------------
+c     subtract A*lhs_vector(j-1) from lhs_vector(j)
+c     
+c     rhs(j) = rhs(j) - A*rhs(j-1)
+c---------------------------------------------------------------------
+               call matvec_sub(lhsa(1,1,j),
+     >                         rhs(1,i,j-1,k,c),rhs(1,i,j,k,c))
+
+c---------------------------------------------------------------------
+c     B(j) = B(j) - C(j-1)*A(j)
+c---------------------------------------------------------------------
+               call matmul_sub(lhsa(1,1,j),
+     >                         lhsc(1,1,i,j-1,k,c),
+     >                         lhsb(1,1,j))
+
+c---------------------------------------------------------------------
+c     multiply c(i,j,k) by b_inverse and copy back to c
+c     multiply rhs(i,1,k) by b_inverse(i,1,k) and copy to rhs
+c---------------------------------------------------------------------
+               call binvcrhs( lhsb(1,1,j),
+     >                        lhsc(1,1,i,j,k,c),
+     >                        rhs(1,i,j,k,c) )
+
+            enddo
+
+c---------------------------------------------------------------------
+c     Now finish up special cases for last cell
+c---------------------------------------------------------------------
+            if (last .eq. 1) then
+
+c---------------------------------------------------------------------
+c     rhs(jsize) = rhs(jsize) - A*rhs(jsize-1)
+c---------------------------------------------------------------------
+               call matvec_sub(lhsa(1,1,jsize),
+     >                         rhs(1,i,jsize-1,k,c),rhs(1,i,jsize,k,c))
+
+c---------------------------------------------------------------------
+c     B(jsize) = B(jsize) - C(jsize-1)*A(jsize)
+c     call matmul_sub(aa,i,jsize,k,c,
+c     $              cc,i,jsize-1,k,c,bb,i,jsize,k,c)
+c---------------------------------------------------------------------
+               call matmul_sub(lhsa(1,1,jsize),
+     >                         lhsc(1,1,i,jsize-1,k,c),
+     >                         lhsb(1,1,jsize))
+
+c---------------------------------------------------------------------
+c     multiply rhs(jsize) by b_inverse(jsize) and copy to rhs
+c---------------------------------------------------------------------
+               call binvrhs( lhsb(1,1,jsize),
+     >                       rhs(1,i,jsize,k,c) )
+
+            endif
+         enddo
+      enddo
+
+
+      return
+      end
+      
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/y_solve_vec.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/y_solve_vec.f
new file mode 100644
index 0000000..e954028
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/y_solve_vec.f
@@ -0,0 +1,798 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine y_solve
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     Performs line solves in Y direction by first factoring
+c     the block-tridiagonal matrix into an upper triangular matrix, 
+c     and then performing back substitution to solve for the unknown
+c     vectors of each line.  
+c     
+c     Make sure we treat elements zero to cell_size in the direction
+c     of the sweep.
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer 
+     >     c, jstart, stage,
+     >     first, last, recv_id, error, r_status(MPI_STATUS_SIZE),
+     >     isize,jsize,ksize,send_id
+
+      jstart = 0
+
+      if (timeron) call timer_start(t_ysolve)
+c---------------------------------------------------------------------
+c     in our terminology stage is the number of the cell in the y-direction
+c     i.e. stage = 1 means the start of the line, stage = ncells means the end
+c---------------------------------------------------------------------
+      do stage = 1,ncells
+         c = slice(2,stage)
+         isize = cell_size(1,c) - 1
+         jsize = cell_size(2,c) - 1
+         ksize = cell_size(3,c) - 1
+
+c---------------------------------------------------------------------
+c     set last-cell flag
+c---------------------------------------------------------------------
+         if (stage .eq. ncells) then
+            last = 1
+         else
+            last = 0
+         endif
+
+         if (stage .eq. 1) then
+c---------------------------------------------------------------------
+c     This is the first cell, so solve without receiving data
+c---------------------------------------------------------------------
+            first = 1
+c            call lhsy(c)
+            call y_solve_cell(first,last,c)
+         else
+c---------------------------------------------------------------------
+c     Not the first cell of this line, so receive info from
+c     processor working on preceding cell
+c---------------------------------------------------------------------
+            first = 0
+            if (timeron) call timer_start(t_ycomm)
+            call y_receive_solve_info(recv_id,c)
+c---------------------------------------------------------------------
+c     overlap computations and communications
+c---------------------------------------------------------------------
+c            call lhsy(c)
+c---------------------------------------------------------------------
+c     wait for completion
+c---------------------------------------------------------------------
+            call mpi_wait(send_id,r_status,error)
+            call mpi_wait(recv_id,r_status,error)
+            if (timeron) call timer_stop(t_ycomm)
+c---------------------------------------------------------------------
+c     install C'(jstart-1) and rhs'(jstart-1) to be used in this cell
+c---------------------------------------------------------------------
+            call y_unpack_solve_info(c)
+            call y_solve_cell(first,last,c)
+         endif
+
+         if (last .eq. 0) call y_send_solve_info(send_id,c)
+      enddo
+
+c---------------------------------------------------------------------
+c     now perform backsubstitution in reverse direction
+c---------------------------------------------------------------------
+      do stage = ncells, 1, -1
+         c = slice(2,stage)
+         first = 0
+         last = 0
+         if (stage .eq. 1) first = 1
+         if (stage .eq. ncells) then
+            last = 1
+c---------------------------------------------------------------------
+c     last cell, so perform back substitution without waiting
+c---------------------------------------------------------------------
+            call y_backsubstitute(first, last,c)
+         else
+            if (timeron) call timer_start(t_ycomm)
+            call y_receive_backsub_info(recv_id,c)
+            call mpi_wait(send_id,r_status,error)
+            call mpi_wait(recv_id,r_status,error)
+            if (timeron) call timer_stop(t_ycomm)
+            call y_unpack_backsub_info(c)
+            call y_backsubstitute(first,last,c)
+         endif
+         if (first .eq. 0) call y_send_backsub_info(send_id,c)
+      enddo
+
+      if (timeron) call timer_stop(t_ysolve)
+
+      return
+      end
+      
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      
+      subroutine y_unpack_solve_info(c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     unpack C'(-1) and rhs'(-1) for
+c     all i and k
+c---------------------------------------------------------------------
+
+      include 'header.h'
+
+      integer i,k,m,n,ptr,c,jstart 
+
+      jstart = 0
+      ptr = 0
+      do k=0,KMAX-1
+         do i=0,IMAX-1
+            do m=1,BLOCK_SIZE
+               do n=1,BLOCK_SIZE
+                  lhsc(m,n,i,jstart-1,k,c) = out_buffer(ptr+n)
+               enddo
+               ptr = ptr+BLOCK_SIZE
+            enddo
+            do n=1,BLOCK_SIZE
+               rhs(n,i,jstart-1,k,c) = out_buffer(ptr+n)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      
+      subroutine y_send_solve_info(send_id,c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     pack up and send C'(jend) and rhs'(jend) for
+c     all i and k
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer i,k,m,n,jsize,ptr,c,ip,kp
+      integer error,send_id,buffer_size 
+
+      jsize = cell_size(2,c)-1
+      ip = cell_coord(1,c) - 1
+      kp = cell_coord(3,c) - 1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*
+     >     (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE)
+
+c---------------------------------------------------------------------
+c     pack up buffer
+c---------------------------------------------------------------------
+      ptr = 0
+      do k=0,KMAX-1
+         do i=0,IMAX-1
+            do m=1,BLOCK_SIZE
+               do n=1,BLOCK_SIZE
+                  in_buffer(ptr+n) = lhsc(m,n,i,jsize,k,c)
+               enddo
+               ptr = ptr+BLOCK_SIZE
+            enddo
+            do n=1,BLOCK_SIZE
+               in_buffer(ptr+n) = rhs(n,i,jsize,k,c)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+c---------------------------------------------------------------------
+c     send buffer 
+c---------------------------------------------------------------------
+      if (timeron) call timer_start(t_ycomm)
+      call mpi_isend(in_buffer, buffer_size,
+     >     dp_type, successor(2),
+     >     SOUTH+ip+kp*NCELLS, comm_solve,
+     >     send_id,error)
+      if (timeron) call timer_stop(t_ycomm)
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine y_send_backsub_info(send_id,c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     pack up and send U(jstart) for all i and k
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer i,k,n,ptr,c,jstart,ip,kp
+      integer error,send_id,buffer_size
+
+c---------------------------------------------------------------------
+c     Send element 0 to previous processor
+c---------------------------------------------------------------------
+      jstart = 0
+      ip = cell_coord(1,c)-1
+      kp = cell_coord(3,c)-1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE
+      ptr = 0
+      do k=0,KMAX-1
+         do i=0,IMAX-1
+            do n=1,BLOCK_SIZE
+               in_buffer(ptr+n) = rhs(n,i,jstart,k,c)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+      if (timeron) call timer_start(t_ycomm)
+      call mpi_isend(in_buffer, buffer_size,
+     >     dp_type, predecessor(2), 
+     >     NORTH+ip+kp*NCELLS, comm_solve, 
+     >     send_id,error)
+      if (timeron) call timer_stop(t_ycomm)
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine y_unpack_backsub_info(c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     unpack U(jsize) for all i and k
+c---------------------------------------------------------------------
+
+      include 'header.h'
+
+      integer i,k,n,ptr,c 
+
+      ptr = 0
+      do k=0,KMAX-1
+         do i=0,IMAX-1
+            do n=1,BLOCK_SIZE
+               backsub_info(n,i,k,c) = out_buffer(ptr+n)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine y_receive_backsub_info(recv_id,c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     post mpi receives
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer error,recv_id,ip,kp,c,buffer_size
+      ip = cell_coord(1,c) - 1
+      kp = cell_coord(3,c) - 1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE
+      call mpi_irecv(out_buffer, buffer_size,
+     >     dp_type, successor(2), 
+     >     NORTH+ip+kp*NCELLS, comm_solve, 
+     >     recv_id, error)
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine y_receive_solve_info(recv_id,c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     post mpi receives 
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer ip,kp,recv_id,error,c,buffer_size
+      ip = cell_coord(1,c) - 1
+      kp = cell_coord(3,c) - 1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*
+     >     (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE)
+      call mpi_irecv(out_buffer, buffer_size, 
+     >     dp_type, predecessor(2), 
+     >     SOUTH+ip+kp*NCELLS,  comm_solve, 
+     >     recv_id, error)
+
+      return
+      end
+      
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine y_backsubstitute(first, last, c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     back solve: if last cell, then generate U(jsize)=rhs(jsize)
+c     else assume U(jsize) was loaded by y_unpack_backsub_info,
+c     so just use it
+c     after call u(jstart) will be sent to next cell
+c---------------------------------------------------------------------
+
+      include 'header.h'
+
+      integer first, last, c, i, k
+      integer m,n,j,jsize,isize,ksize,jstart
+      
+      jstart = 0
+      isize = cell_size(1,c)-end(1,c)-1      
+      jsize = cell_size(2,c)-1
+      ksize = cell_size(3,c)-end(3,c)-1
+      if (last .eq. 0) then
+         do k=start(3,c),ksize
+            do i=start(1,c),isize
+c---------------------------------------------------------------------
+c     U(jsize) uses info from previous cell if not last cell
+c---------------------------------------------------------------------
+               do m=1,BLOCK_SIZE
+                  do n=1,BLOCK_SIZE
+                     rhs(m,i,jsize,k,c) = rhs(m,i,jsize,k,c) 
+     >                    - lhsc(m,n,i,jsize,k,c)*
+     >                    backsub_info(n,i,k,c)
+                  enddo
+               enddo
+            enddo
+         enddo
+      endif
+      do k=start(3,c),ksize
+         do j=jsize-1,jstart,-1
+            do i=start(1,c),isize
+               do m=1,BLOCK_SIZE
+                  do n=1,BLOCK_SIZE
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c) 
+     >                    - lhsc(m,n,i,j,k,c)*rhs(n,i,j+1,k,c)
+                  enddo
+               enddo
+            enddo
+         enddo
+      enddo
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine y_solve_cell(first,last,c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     performs Gaussian elimination on this cell.
+c     
+c     assumes that unpacking routines for non-first cells 
+c     preload C' and rhs' from previous cell.
+c     
+c     assumes send happens outside this routine, but that
+c     c'(JMAX) and rhs'(JMAX) will be sent to next cell
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'work_lhs_vec.h'
+
+      integer first,last,c
+      integer i,j,k,m,n,isize,ksize,jsize,jstart
+
+      jstart = 0
+      isize = cell_size(1,c)-end(1,c)-1
+      jsize = cell_size(2,c)-1
+      ksize = cell_size(3,c)-end(3,c)-1
+
+c---------------------------------------------------------------------
+c     zero the left hand side for starters
+c     set diagonal values to 1. This is overkill, but convenient
+c---------------------------------------------------------------------
+      do i = 0, isize
+         do m = 1, 5
+            do n = 1, 5
+               lhsa(m,n,i,0) = 0.0d0
+               lhsb(m,n,i,0) = 0.0d0
+               lhsa(m,n,i,jsize) = 0.0d0
+               lhsb(m,n,i,jsize) = 0.0d0
+            enddo
+            lhsb(m,m,i,0) = 1.0d0
+            lhsb(m,m,i,jsize) = 1.0d0
+         enddo
+      enddo
+
+      do k=start(3,c),ksize 
+
+c---------------------------------------------------------------------
+c     This function computes the left hand side for the three y-factors 
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     Compute the indices for storing the tri-diagonal matrix;
+c     determine a (labeled f) and n jacobians for cell c
+c---------------------------------------------------------------------
+
+         do j = start(2,c)-1, cell_size(2,c)-end(2,c)
+            do i=start(1,c),isize
+
+               tmp1 = 1.0d0 / u(1,i,j,k,c)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               fjac(1,1,i,j) = 0.0d+00
+               fjac(1,2,i,j) = 0.0d+00
+               fjac(1,3,i,j) = 1.0d+00
+               fjac(1,4,i,j) = 0.0d+00
+               fjac(1,5,i,j) = 0.0d+00
+
+               fjac(2,1,i,j) = - ( u(2,i,j,k,c)*u(3,i,j,k,c) )
+     >              * tmp2
+               fjac(2,2,i,j) = u(3,i,j,k,c) * tmp1
+               fjac(2,3,i,j) = u(2,i,j,k,c) * tmp1
+               fjac(2,4,i,j) = 0.0d+00
+               fjac(2,5,i,j) = 0.0d+00
+
+               fjac(3,1,i,j) = - ( u(3,i,j,k,c)*u(3,i,j,k,c)*tmp2)
+     >              + c2 * qs(i,j,k,c)
+               fjac(3,2,i,j) = - c2 *  u(2,i,j,k,c) * tmp1
+               fjac(3,3,i,j) = ( 2.0d+00 - c2 )
+     >              *  u(3,i,j,k,c) * tmp1 
+               fjac(3,4,i,j) = - c2 * u(4,i,j,k,c) * tmp1 
+               fjac(3,5,i,j) = c2
+
+               fjac(4,1,i,j) = - ( u(3,i,j,k,c)*u(4,i,j,k,c) )
+     >              * tmp2
+               fjac(4,2,i,j) = 0.0d+00
+               fjac(4,3,i,j) = u(4,i,j,k,c) * tmp1
+               fjac(4,4,i,j) = u(3,i,j,k,c) * tmp1
+               fjac(4,5,i,j) = 0.0d+00
+
+               fjac(5,1,i,j) = ( c2 * 2.0d0 * qs(i,j,k,c)
+     >              - c1 * u(5,i,j,k,c) * tmp1 ) 
+     >              * u(3,i,j,k,c) * tmp1 
+               fjac(5,2,i,j) = - c2 * u(2,i,j,k,c)*u(3,i,j,k,c) 
+     >              * tmp2
+               fjac(5,3,i,j) = c1 * u(5,i,j,k,c) * tmp1 
+     >              - c2 * ( qs(i,j,k,c)
+     >              + u(3,i,j,k,c)*u(3,i,j,k,c) * tmp2 )
+               fjac(5,4,i,j) = - c2 * ( u(3,i,j,k,c)*u(4,i,j,k,c) )
+     >              * tmp2
+               fjac(5,5,i,j) = c1 * u(3,i,j,k,c) * tmp1 
+
+               njac(1,1,i,j) = 0.0d+00
+               njac(1,2,i,j) = 0.0d+00
+               njac(1,3,i,j) = 0.0d+00
+               njac(1,4,i,j) = 0.0d+00
+               njac(1,5,i,j) = 0.0d+00
+
+               njac(2,1,i,j) = - c3c4 * tmp2 * u(2,i,j,k,c)
+               njac(2,2,i,j) =   c3c4 * tmp1
+               njac(2,3,i,j) =   0.0d+00
+               njac(2,4,i,j) =   0.0d+00
+               njac(2,5,i,j) =   0.0d+00
+
+               njac(3,1,i,j) = - con43 * c3c4 * tmp2 * u(3,i,j,k,c)
+               njac(3,2,i,j) =   0.0d+00
+               njac(3,3,i,j) =   con43 * c3c4 * tmp1
+               njac(3,4,i,j) =   0.0d+00
+               njac(3,5,i,j) =   0.0d+00
+
+               njac(4,1,i,j) = - c3c4 * tmp2 * u(4,i,j,k,c)
+               njac(4,2,i,j) =   0.0d+00
+               njac(4,3,i,j) =   0.0d+00
+               njac(4,4,i,j) =   c3c4 * tmp1
+               njac(4,5,i,j) =   0.0d+00
+
+               njac(5,1,i,j) = - (  c3c4
+     >              - c1345 ) * tmp3 * (u(2,i,j,k,c)**2)
+     >              - ( con43 * c3c4
+     >              - c1345 ) * tmp3 * (u(3,i,j,k,c)**2)
+     >              - ( c3c4 - c1345 ) * tmp3 * (u(4,i,j,k,c)**2)
+     >              - c1345 * tmp2 * u(5,i,j,k,c)
+
+               njac(5,2,i,j) = (  c3c4 - c1345 ) * tmp2 * u(2,i,j,k,c)
+               njac(5,3,i,j) = ( con43 * c3c4
+     >              - c1345 ) * tmp2 * u(3,i,j,k,c)
+               njac(5,4,i,j) = ( c3c4 - c1345 ) * tmp2 * u(4,i,j,k,c)
+               njac(5,5,i,j) = ( c1345 ) * tmp1
+
+            enddo
+         enddo
+
+c---------------------------------------------------------------------
+c     now Jacobians are set, so form left hand side in y direction
+c---------------------------------------------------------------------
+         do j = start(2,c), jsize-end(2,c)
+            do i=start(1,c),isize
+
+               tmp1 = dt * ty1
+               tmp2 = dt * ty2
+
+               lhsa(1,1,i,j) = - tmp2 * fjac(1,1,i,j-1)
+     >              - tmp1 * njac(1,1,i,j-1)
+     >              - tmp1 * dy1 
+               lhsa(1,2,i,j) = - tmp2 * fjac(1,2,i,j-1)
+     >              - tmp1 * njac(1,2,i,j-1)
+               lhsa(1,3,i,j) = - tmp2 * fjac(1,3,i,j-1)
+     >              - tmp1 * njac(1,3,i,j-1)
+               lhsa(1,4,i,j) = - tmp2 * fjac(1,4,i,j-1)
+     >              - tmp1 * njac(1,4,i,j-1)
+               lhsa(1,5,i,j) = - tmp2 * fjac(1,5,i,j-1)
+     >              - tmp1 * njac(1,5,i,j-1)
+
+               lhsa(2,1,i,j) = - tmp2 * fjac(2,1,i,j-1)
+     >              - tmp1 * njac(2,1,i,j-1)
+               lhsa(2,2,i,j) = - tmp2 * fjac(2,2,i,j-1)
+     >              - tmp1 * njac(2,2,i,j-1)
+     >              - tmp1 * dy2
+               lhsa(2,3,i,j) = - tmp2 * fjac(2,3,i,j-1)
+     >              - tmp1 * njac(2,3,i,j-1)
+               lhsa(2,4,i,j) = - tmp2 * fjac(2,4,i,j-1)
+     >              - tmp1 * njac(2,4,i,j-1)
+               lhsa(2,5,i,j) = - tmp2 * fjac(2,5,i,j-1)
+     >              - tmp1 * njac(2,5,i,j-1)
+
+               lhsa(3,1,i,j) = - tmp2 * fjac(3,1,i,j-1)
+     >              - tmp1 * njac(3,1,i,j-1)
+               lhsa(3,2,i,j) = - tmp2 * fjac(3,2,i,j-1)
+     >              - tmp1 * njac(3,2,i,j-1)
+               lhsa(3,3,i,j) = - tmp2 * fjac(3,3,i,j-1)
+     >              - tmp1 * njac(3,3,i,j-1)
+     >              - tmp1 * dy3 
+               lhsa(3,4,i,j) = - tmp2 * fjac(3,4,i,j-1)
+     >              - tmp1 * njac(3,4,i,j-1)
+               lhsa(3,5,i,j) = - tmp2 * fjac(3,5,i,j-1)
+     >              - tmp1 * njac(3,5,i,j-1)
+
+               lhsa(4,1,i,j) = - tmp2 * fjac(4,1,i,j-1)
+     >              - tmp1 * njac(4,1,i,j-1)
+               lhsa(4,2,i,j) = - tmp2 * fjac(4,2,i,j-1)
+     >              - tmp1 * njac(4,2,i,j-1)
+               lhsa(4,3,i,j) = - tmp2 * fjac(4,3,i,j-1)
+     >              - tmp1 * njac(4,3,i,j-1)
+               lhsa(4,4,i,j) = - tmp2 * fjac(4,4,i,j-1)
+     >              - tmp1 * njac(4,4,i,j-1)
+     >              - tmp1 * dy4
+               lhsa(4,5,i,j) = - tmp2 * fjac(4,5,i,j-1)
+     >              - tmp1 * njac(4,5,i,j-1)
+
+               lhsa(5,1,i,j) = - tmp2 * fjac(5,1,i,j-1)
+     >              - tmp1 * njac(5,1,i,j-1)
+               lhsa(5,2,i,j) = - tmp2 * fjac(5,2,i,j-1)
+     >              - tmp1 * njac(5,2,i,j-1)
+               lhsa(5,3,i,j) = - tmp2 * fjac(5,3,i,j-1)
+     >              - tmp1 * njac(5,3,i,j-1)
+               lhsa(5,4,i,j) = - tmp2 * fjac(5,4,i,j-1)
+     >              - tmp1 * njac(5,4,i,j-1)
+               lhsa(5,5,i,j) = - tmp2 * fjac(5,5,i,j-1)
+     >              - tmp1 * njac(5,5,i,j-1)
+     >              - tmp1 * dy5
+
+               lhsb(1,1,i,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(1,1,i,j)
+     >              + tmp1 * 2.0d+00 * dy1
+               lhsb(1,2,i,j) = tmp1 * 2.0d+00 * njac(1,2,i,j)
+               lhsb(1,3,i,j) = tmp1 * 2.0d+00 * njac(1,3,i,j)
+               lhsb(1,4,i,j) = tmp1 * 2.0d+00 * njac(1,4,i,j)
+               lhsb(1,5,i,j) = tmp1 * 2.0d+00 * njac(1,5,i,j)
+
+               lhsb(2,1,i,j) = tmp1 * 2.0d+00 * njac(2,1,i,j)
+               lhsb(2,2,i,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(2,2,i,j)
+     >              + tmp1 * 2.0d+00 * dy2
+               lhsb(2,3,i,j) = tmp1 * 2.0d+00 * njac(2,3,i,j)
+               lhsb(2,4,i,j) = tmp1 * 2.0d+00 * njac(2,4,i,j)
+               lhsb(2,5,i,j) = tmp1 * 2.0d+00 * njac(2,5,i,j)
+
+               lhsb(3,1,i,j) = tmp1 * 2.0d+00 * njac(3,1,i,j)
+               lhsb(3,2,i,j) = tmp1 * 2.0d+00 * njac(3,2,i,j)
+               lhsb(3,3,i,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(3,3,i,j)
+     >              + tmp1 * 2.0d+00 * dy3
+               lhsb(3,4,i,j) = tmp1 * 2.0d+00 * njac(3,4,i,j)
+               lhsb(3,5,i,j) = tmp1 * 2.0d+00 * njac(3,5,i,j)
+
+               lhsb(4,1,i,j) = tmp1 * 2.0d+00 * njac(4,1,i,j)
+               lhsb(4,2,i,j) = tmp1 * 2.0d+00 * njac(4,2,i,j)
+               lhsb(4,3,i,j) = tmp1 * 2.0d+00 * njac(4,3,i,j)
+               lhsb(4,4,i,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(4,4,i,j)
+     >              + tmp1 * 2.0d+00 * dy4
+               lhsb(4,5,i,j) = tmp1 * 2.0d+00 * njac(4,5,i,j)
+
+               lhsb(5,1,i,j) = tmp1 * 2.0d+00 * njac(5,1,i,j)
+               lhsb(5,2,i,j) = tmp1 * 2.0d+00 * njac(5,2,i,j)
+               lhsb(5,3,i,j) = tmp1 * 2.0d+00 * njac(5,3,i,j)
+               lhsb(5,4,i,j) = tmp1 * 2.0d+00 * njac(5,4,i,j)
+               lhsb(5,5,i,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(5,5,i,j) 
+     >              + tmp1 * 2.0d+00 * dy5
+
+               lhsc(1,1,i,j,k,c) =  tmp2 * fjac(1,1,i,j+1)
+     >              - tmp1 * njac(1,1,i,j+1)
+     >              - tmp1 * dy1
+               lhsc(1,2,i,j,k,c) =  tmp2 * fjac(1,2,i,j+1)
+     >              - tmp1 * njac(1,2,i,j+1)
+               lhsc(1,3,i,j,k,c) =  tmp2 * fjac(1,3,i,j+1)
+     >              - tmp1 * njac(1,3,i,j+1)
+               lhsc(1,4,i,j,k,c) =  tmp2 * fjac(1,4,i,j+1)
+     >              - tmp1 * njac(1,4,i,j+1)
+               lhsc(1,5,i,j,k,c) =  tmp2 * fjac(1,5,i,j+1)
+     >              - tmp1 * njac(1,5,i,j+1)
+
+               lhsc(2,1,i,j,k,c) =  tmp2 * fjac(2,1,i,j+1)
+     >              - tmp1 * njac(2,1,i,j+1)
+               lhsc(2,2,i,j,k,c) =  tmp2 * fjac(2,2,i,j+1)
+     >              - tmp1 * njac(2,2,i,j+1)
+     >              - tmp1 * dy2
+               lhsc(2,3,i,j,k,c) =  tmp2 * fjac(2,3,i,j+1)
+     >              - tmp1 * njac(2,3,i,j+1)
+               lhsc(2,4,i,j,k,c) =  tmp2 * fjac(2,4,i,j+1)
+     >              - tmp1 * njac(2,4,i,j+1)
+               lhsc(2,5,i,j,k,c) =  tmp2 * fjac(2,5,i,j+1)
+     >              - tmp1 * njac(2,5,i,j+1)
+
+               lhsc(3,1,i,j,k,c) =  tmp2 * fjac(3,1,i,j+1)
+     >              - tmp1 * njac(3,1,i,j+1)
+               lhsc(3,2,i,j,k,c) =  tmp2 * fjac(3,2,i,j+1)
+     >              - tmp1 * njac(3,2,i,j+1)
+               lhsc(3,3,i,j,k,c) =  tmp2 * fjac(3,3,i,j+1)
+     >              - tmp1 * njac(3,3,i,j+1)
+     >              - tmp1 * dy3
+               lhsc(3,4,i,j,k,c) =  tmp2 * fjac(3,4,i,j+1)
+     >              - tmp1 * njac(3,4,i,j+1)
+               lhsc(3,5,i,j,k,c) =  tmp2 * fjac(3,5,i,j+1)
+     >              - tmp1 * njac(3,5,i,j+1)
+
+               lhsc(4,1,i,j,k,c) =  tmp2 * fjac(4,1,i,j+1)
+     >              - tmp1 * njac(4,1,i,j+1)
+               lhsc(4,2,i,j,k,c) =  tmp2 * fjac(4,2,i,j+1)
+     >              - tmp1 * njac(4,2,i,j+1)
+               lhsc(4,3,i,j,k,c) =  tmp2 * fjac(4,3,i,j+1)
+     >              - tmp1 * njac(4,3,i,j+1)
+               lhsc(4,4,i,j,k,c) =  tmp2 * fjac(4,4,i,j+1)
+     >              - tmp1 * njac(4,4,i,j+1)
+     >              - tmp1 * dy4
+               lhsc(4,5,i,j,k,c) =  tmp2 * fjac(4,5,i,j+1)
+     >              - tmp1 * njac(4,5,i,j+1)
+
+               lhsc(5,1,i,j,k,c) =  tmp2 * fjac(5,1,i,j+1)
+     >              - tmp1 * njac(5,1,i,j+1)
+               lhsc(5,2,i,j,k,c) =  tmp2 * fjac(5,2,i,j+1)
+     >              - tmp1 * njac(5,2,i,j+1)
+               lhsc(5,3,i,j,k,c) =  tmp2 * fjac(5,3,i,j+1)
+     >              - tmp1 * njac(5,3,i,j+1)
+               lhsc(5,4,i,j,k,c) =  tmp2 * fjac(5,4,i,j+1)
+     >              - tmp1 * njac(5,4,i,j+1)
+               lhsc(5,5,i,j,k,c) =  tmp2 * fjac(5,5,i,j+1)
+     >              - tmp1 * njac(5,5,i,j+1)
+     >              - tmp1 * dy5
+
+            enddo
+         enddo
+
+
+c---------------------------------------------------------------------
+c     outermost do loops - sweeping in i direction
+c---------------------------------------------------------------------
+         if (first .eq. 1) then 
+
+c---------------------------------------------------------------------
+c     multiply c(i,jstart,k) by b_inverse and copy back to c
+c     multiply rhs(jstart) by b_inverse(jstart) and copy to rhs
+c---------------------------------------------------------------------
+!dir$ ivdep
+            do i=start(1,c),isize
+               call binvcrhs( lhsb(1,1,i,jstart),
+     >                        lhsc(1,1,i,jstart,k,c),
+     >                        rhs(1,i,jstart,k,c) )
+            enddo
+
+         endif
+
+c---------------------------------------------------------------------
+c     begin innermost do loop
+c     do all the elements of the cell unless last 
+c---------------------------------------------------------------------
+         do j=jstart+first,jsize-last
+!dir$ ivdep
+            do i=start(1,c),isize
+
+c---------------------------------------------------------------------
+c     subtract A*lhs_vector(j-1) from lhs_vector(j)
+c     
+c     rhs(j) = rhs(j) - A*rhs(j-1)
+c---------------------------------------------------------------------
+               call matvec_sub(lhsa(1,1,i,j),
+     >                         rhs(1,i,j-1,k,c),rhs(1,i,j,k,c))
+
+c---------------------------------------------------------------------
+c     B(j) = B(j) - C(j-1)*A(j)
+c---------------------------------------------------------------------
+               call matmul_sub(lhsa(1,1,i,j),
+     >                         lhsc(1,1,i,j-1,k,c),
+     >                         lhsb(1,1,i,j))
+
+c---------------------------------------------------------------------
+c     multiply c(i,j,k) by b_inverse and copy back to c
+c     multiply rhs(i,1,k) by b_inverse(i,1,k) and copy to rhs
+c---------------------------------------------------------------------
+               call binvcrhs( lhsb(1,1,i,j),
+     >                        lhsc(1,1,i,j,k,c),
+     >                        rhs(1,i,j,k,c) )
+
+            enddo
+         enddo
+
+c---------------------------------------------------------------------
+c     Now finish up special cases for last cell
+c---------------------------------------------------------------------
+         if (last .eq. 1) then
+
+!dir$ ivdep
+            do i=start(1,c),isize
+c---------------------------------------------------------------------
+c     rhs(jsize) = rhs(jsize) - A*rhs(jsize-1)
+c---------------------------------------------------------------------
+               call matvec_sub(lhsa(1,1,i,jsize),
+     >                         rhs(1,i,jsize-1,k,c),rhs(1,i,jsize,k,c))
+
+c---------------------------------------------------------------------
+c     B(jsize) = B(jsize) - C(jsize-1)*A(jsize)
+c     call matmul_sub(aa,i,jsize,k,c,
+c     $              cc,i,jsize-1,k,c,bb,i,jsize,k,c)
+c---------------------------------------------------------------------
+               call matmul_sub(lhsa(1,1,i,jsize),
+     >                         lhsc(1,1,i,jsize-1,k,c),
+     >                         lhsb(1,1,i,jsize))
+
+c---------------------------------------------------------------------
+c     multiply rhs(jsize) by b_inverse(jsize) and copy to rhs
+c---------------------------------------------------------------------
+               call binvrhs( lhsb(1,1,i,jsize),
+     >                       rhs(1,i,jsize,k,c) )
+            enddo
+
+         endif
+      enddo
+
+
+      return
+      end
+      
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/z_solve.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/z_solve.f
new file mode 100644
index 0000000..796fccd
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/z_solve.f
@@ -0,0 +1,786 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine z_solve
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     Performs line solves in Z direction by first factoring
+c     the block-tridiagonal matrix into an upper triangular matrix, 
+c     and then performing back substitution to solve for the unknown
+c     vectors of each line.  
+c     
+c     Make sure we treat elements zero to cell_size in the direction
+c     of the sweep.
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer c, kstart, stage,
+     >     first, last, recv_id, error, r_status(MPI_STATUS_SIZE),
+     >     isize,jsize,ksize,send_id
+
+      kstart = 0
+
+      if (timeron) call timer_start(t_zsolve)
+c---------------------------------------------------------------------
+c     in our terminology stage is the number of the cell in the z-direction,
+c     i.e. stage = 1 means the start of the line, stage = ncells means the end
+c---------------------------------------------------------------------
+      do stage = 1,ncells
+         c = slice(3,stage)
+         isize = cell_size(1,c) - 1
+         jsize = cell_size(2,c) - 1
+         ksize = cell_size(3,c) - 1
+c---------------------------------------------------------------------
+c     set last-cell flag
+c---------------------------------------------------------------------
+         if (stage .eq. ncells) then
+            last = 1
+         else
+            last = 0
+         endif
+
+         if (stage .eq. 1) then
+c---------------------------------------------------------------------
+c     This is the first cell, so solve without receiving data
+c---------------------------------------------------------------------
+            first = 1
+c            call lhsz(c)
+            call z_solve_cell(first,last,c)
+         else
+c---------------------------------------------------------------------
+c     Not the first cell of this line, so receive info from
+c     processor working on preceding cell
+c---------------------------------------------------------------------
+            first = 0
+            if (timeron) call timer_start(t_zcomm)
+            call z_receive_solve_info(recv_id,c)
+c---------------------------------------------------------------------
+c     overlap computations and communications
+c---------------------------------------------------------------------
+c            call lhsz(c)
+c---------------------------------------------------------------------
+c     wait for completion
+c---------------------------------------------------------------------
+            call mpi_wait(send_id,r_status,error)
+            call mpi_wait(recv_id,r_status,error)
+            if (timeron) call timer_stop(t_zcomm)
+c---------------------------------------------------------------------
+c     install C'(kstart-1) and rhs'(kstart-1) to be used in this cell
+c---------------------------------------------------------------------
+            call z_unpack_solve_info(c)
+            call z_solve_cell(first,last,c)
+         endif
+
+         if (last .eq. 0) call z_send_solve_info(send_id,c)
+      enddo
+
+c---------------------------------------------------------------------
+c     now perform backsubstitution in reverse direction
+c---------------------------------------------------------------------
+      do stage = ncells, 1, -1
+         c = slice(3,stage)
+         first = 0
+         last = 0
+         if (stage .eq. 1) first = 1
+         if (stage .eq. ncells) then
+            last = 1
+c---------------------------------------------------------------------
+c     last cell, so perform back substitute without waiting
+c---------------------------------------------------------------------
+            call z_backsubstitute(first, last,c)
+         else
+            if (timeron) call timer_start(t_zcomm)
+            call z_receive_backsub_info(recv_id,c)
+            call mpi_wait(send_id,r_status,error)
+            call mpi_wait(recv_id,r_status,error)
+            if (timeron) call timer_stop(t_zcomm)
+            call z_unpack_backsub_info(c)
+            call z_backsubstitute(first,last,c)
+         endif
+         if (first .eq. 0) call z_send_backsub_info(send_id,c)
+      enddo
+
+      if (timeron) call timer_stop(t_zsolve)
+
+      return
+      end
+      
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      
+      subroutine z_unpack_solve_info(c)
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     unpack C'(-1) and rhs'(-1) for
+c     all i and j
+c---------------------------------------------------------------------
+
+      include 'header.h'
+
+      integer i,j,m,n,ptr,c,kstart 
+
+      kstart = 0
+      ptr = 0
+      do j=0,JMAX-1
+         do i=0,IMAX-1
+            do m=1,BLOCK_SIZE
+               do n=1,BLOCK_SIZE
+                  lhsc(m,n,i,j,kstart-1,c) = out_buffer(ptr+n)
+               enddo
+               ptr = ptr+BLOCK_SIZE
+            enddo
+            do n=1,BLOCK_SIZE
+               rhs(n,i,j,kstart-1,c) = out_buffer(ptr+n)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      
+      subroutine z_send_solve_info(send_id,c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     pack up and send C'(kend) and rhs'(kend) for
+c     all i and j
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer i,j,m,n,ksize,ptr,c,ip,jp
+      integer error,send_id,buffer_size
+
+      ksize = cell_size(3,c)-1
+      ip = cell_coord(1,c) - 1
+      jp = cell_coord(2,c) - 1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*
+     >     (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE)
+
+c---------------------------------------------------------------------
+c     pack up buffer
+c---------------------------------------------------------------------
+      ptr = 0
+      do j=0,JMAX-1
+         do i=0,IMAX-1
+            do m=1,BLOCK_SIZE
+               do n=1,BLOCK_SIZE
+                  in_buffer(ptr+n) = lhsc(m,n,i,j,ksize,c)
+               enddo
+               ptr = ptr+BLOCK_SIZE
+            enddo
+            do n=1,BLOCK_SIZE
+               in_buffer(ptr+n) = rhs(n,i,j,ksize,c)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+c---------------------------------------------------------------------
+c     send buffer 
+c---------------------------------------------------------------------
+      if (timeron) call timer_start(t_zcomm)
+      call mpi_isend(in_buffer, buffer_size,
+     >     dp_type, successor(3),
+     >     BOTTOM+ip+jp*NCELLS, comm_solve,
+     >     send_id,error)
+      if (timeron) call timer_stop(t_zcomm)
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine z_send_backsub_info(send_id,c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     pack up and send U(kstart) for all i and j
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer i,j,n,ptr,c,kstart,ip,jp
+      integer error,send_id,buffer_size
+
+c---------------------------------------------------------------------
+c     Send element 0 to previous processor
+c---------------------------------------------------------------------
+      kstart = 0
+      ip = cell_coord(1,c)-1
+      jp = cell_coord(2,c)-1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE
+      ptr = 0
+      do j=0,JMAX-1
+         do i=0,IMAX-1
+            do n=1,BLOCK_SIZE
+               in_buffer(ptr+n) = rhs(n,i,j,kstart,c)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+      if (timeron) call timer_start(t_zcomm)
+      call mpi_isend(in_buffer, buffer_size,
+     >     dp_type, predecessor(3), 
+     >     TOP+ip+jp*NCELLS, comm_solve, 
+     >     send_id,error)
+      if (timeron) call timer_stop(t_zcomm)
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine z_unpack_backsub_info(c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     unpack U(ksize) for all i and j
+c---------------------------------------------------------------------
+
+      include 'header.h'
+
+      integer i,j,n,ptr,c
+
+      ptr = 0
+      do j=0,JMAX-1
+         do i=0,IMAX-1
+            do n=1,BLOCK_SIZE
+               backsub_info(n,i,j,c) = out_buffer(ptr+n)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine z_receive_backsub_info(recv_id,c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     post mpi receives
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer error,recv_id,ip,jp,c,buffer_size
+      ip = cell_coord(1,c) - 1
+      jp = cell_coord(2,c) - 1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE
+      call mpi_irecv(out_buffer, buffer_size,
+     >     dp_type, successor(3), 
+     >     TOP+ip+jp*NCELLS, comm_solve, 
+     >     recv_id, error)
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine z_receive_solve_info(recv_id,c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     post mpi receives 
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer ip,jp,recv_id,error,c,buffer_size
+      ip = cell_coord(1,c) - 1
+      jp = cell_coord(2,c) - 1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*
+     >     (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE)
+      call mpi_irecv(out_buffer, buffer_size,
+     >     dp_type, predecessor(3), 
+     >     BOTTOM+ip+jp*NCELLS, comm_solve,
+     >     recv_id, error)
+
+      return
+      end
+      
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine z_backsubstitute(first, last, c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     back solve: if last cell, then generate U(ksize)=rhs(ksize)
+c     else assume U(ksize) is loaded by z_unpack_backsub_info
+c     so just use it
+c     after call u(kstart) will be sent to next cell
+c---------------------------------------------------------------------
+
+      include 'header.h'
+
+      integer first, last, c, i, k
+      integer m,n,j,jsize,isize,ksize,kstart
+      
+      kstart = 0
+      isize = cell_size(1,c)-end(1,c)-1      
+      jsize = cell_size(2,c)-end(2,c)-1
+      ksize = cell_size(3,c)-1
+      if (last .eq. 0) then
+         do j=start(2,c),jsize
+            do i=start(1,c),isize
+c---------------------------------------------------------------------
+c     U(ksize) uses info from previous cell if not last cell
+c---------------------------------------------------------------------
+               do m=1,BLOCK_SIZE
+                  do n=1,BLOCK_SIZE
+                     rhs(m,i,j,ksize,c) = rhs(m,i,j,ksize,c) 
+     >                    - lhsc(m,n,i,j,ksize,c)*
+     >                    backsub_info(n,i,j,c)
+                  enddo
+               enddo
+            enddo
+         enddo
+      endif
+      do k=ksize-1,kstart,-1
+         do j=start(2,c),jsize
+            do i=start(1,c),isize
+               do m=1,BLOCK_SIZE
+                  do n=1,BLOCK_SIZE
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c) 
+     >                    - lhsc(m,n,i,j,k,c)*rhs(n,i,j,k+1,c)
+                  enddo
+               enddo
+            enddo
+         enddo
+      enddo
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine z_solve_cell(first,last,c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     performs Gaussian elimination on this cell.
+c     
+c     assumes that unpacking routines for non-first cells 
+c     preload C' and rhs' from previous cell.
+c     
+c     assumed send happens outside this routine, but that
+c     c'(KMAX) and rhs'(KMAX) will be sent to next cell.
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'work_lhs.h'
+
+      integer first,last,c
+      integer i,j,k,isize,ksize,jsize,kstart
+      double precision utmp(6,-2:KMAX+1)
+
+      kstart = 0
+      isize = cell_size(1,c)-end(1,c)-1
+      jsize = cell_size(2,c)-end(2,c)-1
+      ksize = cell_size(3,c)-1
+
+      call lhsabinit(lhsa, lhsb, ksize)
+
+      do j=start(2,c),jsize 
+         do i=start(1,c),isize
+
+c---------------------------------------------------------------------
+c     This function computes the left hand side for the three z-factors   
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     Compute the indices for storing the block-diagonal matrix;
+c     determine c (labeled f) and s jacobians for cell c
+c---------------------------------------------------------------------
+            do k = start(3,c)-1, cell_size(3,c)-end(3,c)
+               utmp(1,k) = 1.0d0 / u(1,i,j,k,c)
+               utmp(2,k) = u(2,i,j,k,c)
+               utmp(3,k) = u(3,i,j,k,c)
+               utmp(4,k) = u(4,i,j,k,c)
+               utmp(5,k) = u(5,i,j,k,c)
+               utmp(6,k) = qs(i,j,k,c)
+            end do
+
+            do k = start(3,c)-1, cell_size(3,c)-end(3,c)
+
+               tmp1 = utmp(1,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               fjac(1,1,k) = 0.0d+00
+               fjac(1,2,k) = 0.0d+00
+               fjac(1,3,k) = 0.0d+00
+               fjac(1,4,k) = 1.0d+00
+               fjac(1,5,k) = 0.0d+00
+
+               fjac(2,1,k) = - ( utmp(2,k)*utmp(4,k) ) 
+     >              * tmp2 
+               fjac(2,2,k) = utmp(4,k) * tmp1
+               fjac(2,3,k) = 0.0d+00
+               fjac(2,4,k) = utmp(2,k) * tmp1
+               fjac(2,5,k) = 0.0d+00
+
+               fjac(3,1,k) = - ( utmp(3,k)*utmp(4,k) )
+     >              * tmp2 
+               fjac(3,2,k) = 0.0d+00
+               fjac(3,3,k) = utmp(4,k) * tmp1
+               fjac(3,4,k) = utmp(3,k) * tmp1
+               fjac(3,5,k) = 0.0d+00
+
+               fjac(4,1,k) = - (utmp(4,k)*utmp(4,k) * tmp2 ) 
+     >              + c2 * utmp(6,k)
+               fjac(4,2,k) = - c2 *  utmp(2,k) * tmp1 
+               fjac(4,3,k) = - c2 *  utmp(3,k) * tmp1
+               fjac(4,4,k) = ( 2.0d+00 - c2 )
+     >              *  utmp(4,k) * tmp1 
+               fjac(4,5,k) = c2
+
+               fjac(5,1,k) = ( c2 * 2.0d0 * utmp(6,k)
+     >              - c1 * ( utmp(5,k) * tmp1 ) )
+     >              * ( utmp(4,k) * tmp1 )
+               fjac(5,2,k) = - c2 * ( utmp(2,k)*utmp(4,k) )
+     >              * tmp2 
+               fjac(5,3,k) = - c2 * ( utmp(3,k)*utmp(4,k) )
+     >              * tmp2
+               fjac(5,4,k) = c1 * ( utmp(5,k) * tmp1 )
+     >              - c2 * ( utmp(6,k)
+     >              + utmp(4,k)*utmp(4,k) * tmp2 )
+               fjac(5,5,k) = c1 * utmp(4,k) * tmp1
+
+               njac(1,1,k) = 0.0d+00
+               njac(1,2,k) = 0.0d+00
+               njac(1,3,k) = 0.0d+00
+               njac(1,4,k) = 0.0d+00
+               njac(1,5,k) = 0.0d+00
+
+               njac(2,1,k) = - c3c4 * tmp2 * utmp(2,k)
+               njac(2,2,k) =   c3c4 * tmp1
+               njac(2,3,k) =   0.0d+00
+               njac(2,4,k) =   0.0d+00
+               njac(2,5,k) =   0.0d+00
+
+               njac(3,1,k) = - c3c4 * tmp2 * utmp(3,k)
+               njac(3,2,k) =   0.0d+00
+               njac(3,3,k) =   c3c4 * tmp1
+               njac(3,4,k) =   0.0d+00
+               njac(3,5,k) =   0.0d+00
+
+               njac(4,1,k) = - con43 * c3c4 * tmp2 * utmp(4,k)
+               njac(4,2,k) =   0.0d+00
+               njac(4,3,k) =   0.0d+00
+               njac(4,4,k) =   con43 * c3 * c4 * tmp1
+               njac(4,5,k) =   0.0d+00
+
+               njac(5,1,k) = - (  c3c4
+     >              - c1345 ) * tmp3 * (utmp(2,k)**2)
+     >              - ( c3c4 - c1345 ) * tmp3 * (utmp(3,k)**2)
+     >              - ( con43 * c3c4
+     >              - c1345 ) * tmp3 * (utmp(4,k)**2)
+     >              - c1345 * tmp2 * utmp(5,k)
+
+               njac(5,2,k) = (  c3c4 - c1345 ) * tmp2 * utmp(2,k)
+               njac(5,3,k) = (  c3c4 - c1345 ) * tmp2 * utmp(3,k)
+               njac(5,4,k) = ( con43 * c3c4
+     >              - c1345 ) * tmp2 * utmp(4,k)
+               njac(5,5,k) = ( c1345 )* tmp1
+
+
+            enddo
+
+c---------------------------------------------------------------------
+c     now jacobians set, so form left hand side in z direction
+c---------------------------------------------------------------------
+            do k = start(3,c), ksize-end(3,c)
+
+               tmp1 = dt * tz1
+               tmp2 = dt * tz2
+
+               lhsa(1,1,k) = - tmp2 * fjac(1,1,k-1)
+     >              - tmp1 * njac(1,1,k-1)
+     >              - tmp1 * dz1 
+               lhsa(1,2,k) = - tmp2 * fjac(1,2,k-1)
+     >              - tmp1 * njac(1,2,k-1)
+               lhsa(1,3,k) = - tmp2 * fjac(1,3,k-1)
+     >              - tmp1 * njac(1,3,k-1)
+               lhsa(1,4,k) = - tmp2 * fjac(1,4,k-1)
+     >              - tmp1 * njac(1,4,k-1)
+               lhsa(1,5,k) = - tmp2 * fjac(1,5,k-1)
+     >              - tmp1 * njac(1,5,k-1)
+
+               lhsa(2,1,k) = - tmp2 * fjac(2,1,k-1)
+     >              - tmp1 * njac(2,1,k-1)
+               lhsa(2,2,k) = - tmp2 * fjac(2,2,k-1)
+     >              - tmp1 * njac(2,2,k-1)
+     >              - tmp1 * dz2
+               lhsa(2,3,k) = - tmp2 * fjac(2,3,k-1)
+     >              - tmp1 * njac(2,3,k-1)
+               lhsa(2,4,k) = - tmp2 * fjac(2,4,k-1)
+     >              - tmp1 * njac(2,4,k-1)
+               lhsa(2,5,k) = - tmp2 * fjac(2,5,k-1)
+     >              - tmp1 * njac(2,5,k-1)
+
+               lhsa(3,1,k) = - tmp2 * fjac(3,1,k-1)
+     >              - tmp1 * njac(3,1,k-1)
+               lhsa(3,2,k) = - tmp2 * fjac(3,2,k-1)
+     >              - tmp1 * njac(3,2,k-1)
+               lhsa(3,3,k) = - tmp2 * fjac(3,3,k-1)
+     >              - tmp1 * njac(3,3,k-1)
+     >              - tmp1 * dz3 
+               lhsa(3,4,k) = - tmp2 * fjac(3,4,k-1)
+     >              - tmp1 * njac(3,4,k-1)
+               lhsa(3,5,k) = - tmp2 * fjac(3,5,k-1)
+     >              - tmp1 * njac(3,5,k-1)
+
+               lhsa(4,1,k) = - tmp2 * fjac(4,1,k-1)
+     >              - tmp1 * njac(4,1,k-1)
+               lhsa(4,2,k) = - tmp2 * fjac(4,2,k-1)
+     >              - tmp1 * njac(4,2,k-1)
+               lhsa(4,3,k) = - tmp2 * fjac(4,3,k-1)
+     >              - tmp1 * njac(4,3,k-1)
+               lhsa(4,4,k) = - tmp2 * fjac(4,4,k-1)
+     >              - tmp1 * njac(4,4,k-1)
+     >              - tmp1 * dz4
+               lhsa(4,5,k) = - tmp2 * fjac(4,5,k-1)
+     >              - tmp1 * njac(4,5,k-1)
+
+               lhsa(5,1,k) = - tmp2 * fjac(5,1,k-1)
+     >              - tmp1 * njac(5,1,k-1)
+               lhsa(5,2,k) = - tmp2 * fjac(5,2,k-1)
+     >              - tmp1 * njac(5,2,k-1)
+               lhsa(5,3,k) = - tmp2 * fjac(5,3,k-1)
+     >              - tmp1 * njac(5,3,k-1)
+               lhsa(5,4,k) = - tmp2 * fjac(5,4,k-1)
+     >              - tmp1 * njac(5,4,k-1)
+               lhsa(5,5,k) = - tmp2 * fjac(5,5,k-1)
+     >              - tmp1 * njac(5,5,k-1)
+     >              - tmp1 * dz5
+
+               lhsb(1,1,k) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(1,1,k)
+     >              + tmp1 * 2.0d+00 * dz1
+               lhsb(1,2,k) = tmp1 * 2.0d+00 * njac(1,2,k)
+               lhsb(1,3,k) = tmp1 * 2.0d+00 * njac(1,3,k)
+               lhsb(1,4,k) = tmp1 * 2.0d+00 * njac(1,4,k)
+               lhsb(1,5,k) = tmp1 * 2.0d+00 * njac(1,5,k)
+
+               lhsb(2,1,k) = tmp1 * 2.0d+00 * njac(2,1,k)
+               lhsb(2,2,k) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(2,2,k)
+     >              + tmp1 * 2.0d+00 * dz2
+               lhsb(2,3,k) = tmp1 * 2.0d+00 * njac(2,3,k)
+               lhsb(2,4,k) = tmp1 * 2.0d+00 * njac(2,4,k)
+               lhsb(2,5,k) = tmp1 * 2.0d+00 * njac(2,5,k)
+
+               lhsb(3,1,k) = tmp1 * 2.0d+00 * njac(3,1,k)
+               lhsb(3,2,k) = tmp1 * 2.0d+00 * njac(3,2,k)
+               lhsb(3,3,k) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(3,3,k)
+     >              + tmp1 * 2.0d+00 * dz3
+               lhsb(3,4,k) = tmp1 * 2.0d+00 * njac(3,4,k)
+               lhsb(3,5,k) = tmp1 * 2.0d+00 * njac(3,5,k)
+
+               lhsb(4,1,k) = tmp1 * 2.0d+00 * njac(4,1,k)
+               lhsb(4,2,k) = tmp1 * 2.0d+00 * njac(4,2,k)
+               lhsb(4,3,k) = tmp1 * 2.0d+00 * njac(4,3,k)
+               lhsb(4,4,k) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(4,4,k)
+     >              + tmp1 * 2.0d+00 * dz4
+               lhsb(4,5,k) = tmp1 * 2.0d+00 * njac(4,5,k)
+
+               lhsb(5,1,k) = tmp1 * 2.0d+00 * njac(5,1,k)
+               lhsb(5,2,k) = tmp1 * 2.0d+00 * njac(5,2,k)
+               lhsb(5,3,k) = tmp1 * 2.0d+00 * njac(5,3,k)
+               lhsb(5,4,k) = tmp1 * 2.0d+00 * njac(5,4,k)
+               lhsb(5,5,k) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(5,5,k) 
+     >              + tmp1 * 2.0d+00 * dz5
+
+               lhsc(1,1,i,j,k,c) =  tmp2 * fjac(1,1,k+1)
+     >              - tmp1 * njac(1,1,k+1)
+     >              - tmp1 * dz1
+               lhsc(1,2,i,j,k,c) =  tmp2 * fjac(1,2,k+1)
+     >              - tmp1 * njac(1,2,k+1)
+               lhsc(1,3,i,j,k,c) =  tmp2 * fjac(1,3,k+1)
+     >              - tmp1 * njac(1,3,k+1)
+               lhsc(1,4,i,j,k,c) =  tmp2 * fjac(1,4,k+1)
+     >              - tmp1 * njac(1,4,k+1)
+               lhsc(1,5,i,j,k,c) =  tmp2 * fjac(1,5,k+1)
+     >              - tmp1 * njac(1,5,k+1)
+
+               lhsc(2,1,i,j,k,c) =  tmp2 * fjac(2,1,k+1)
+     >              - tmp1 * njac(2,1,k+1)
+               lhsc(2,2,i,j,k,c) =  tmp2 * fjac(2,2,k+1)
+     >              - tmp1 * njac(2,2,k+1)
+     >              - tmp1 * dz2
+               lhsc(2,3,i,j,k,c) =  tmp2 * fjac(2,3,k+1)
+     >              - tmp1 * njac(2,3,k+1)
+               lhsc(2,4,i,j,k,c) =  tmp2 * fjac(2,4,k+1)
+     >              - tmp1 * njac(2,4,k+1)
+               lhsc(2,5,i,j,k,c) =  tmp2 * fjac(2,5,k+1)
+     >              - tmp1 * njac(2,5,k+1)
+
+               lhsc(3,1,i,j,k,c) =  tmp2 * fjac(3,1,k+1)
+     >              - tmp1 * njac(3,1,k+1)
+               lhsc(3,2,i,j,k,c) =  tmp2 * fjac(3,2,k+1)
+     >              - tmp1 * njac(3,2,k+1)
+               lhsc(3,3,i,j,k,c) =  tmp2 * fjac(3,3,k+1)
+     >              - tmp1 * njac(3,3,k+1)
+     >              - tmp1 * dz3
+               lhsc(3,4,i,j,k,c) =  tmp2 * fjac(3,4,k+1)
+     >              - tmp1 * njac(3,4,k+1)
+               lhsc(3,5,i,j,k,c) =  tmp2 * fjac(3,5,k+1)
+     >              - tmp1 * njac(3,5,k+1)
+
+               lhsc(4,1,i,j,k,c) =  tmp2 * fjac(4,1,k+1)
+     >              - tmp1 * njac(4,1,k+1)
+               lhsc(4,2,i,j,k,c) =  tmp2 * fjac(4,2,k+1)
+     >              - tmp1 * njac(4,2,k+1)
+               lhsc(4,3,i,j,k,c) =  tmp2 * fjac(4,3,k+1)
+     >              - tmp1 * njac(4,3,k+1)
+               lhsc(4,4,i,j,k,c) =  tmp2 * fjac(4,4,k+1)
+     >              - tmp1 * njac(4,4,k+1)
+     >              - tmp1 * dz4
+               lhsc(4,5,i,j,k,c) =  tmp2 * fjac(4,5,k+1)
+     >              - tmp1 * njac(4,5,k+1)
+
+               lhsc(5,1,i,j,k,c) =  tmp2 * fjac(5,1,k+1)
+     >              - tmp1 * njac(5,1,k+1)
+               lhsc(5,2,i,j,k,c) =  tmp2 * fjac(5,2,k+1)
+     >              - tmp1 * njac(5,2,k+1)
+               lhsc(5,3,i,j,k,c) =  tmp2 * fjac(5,3,k+1)
+     >              - tmp1 * njac(5,3,k+1)
+               lhsc(5,4,i,j,k,c) =  tmp2 * fjac(5,4,k+1)
+     >              - tmp1 * njac(5,4,k+1)
+               lhsc(5,5,i,j,k,c) =  tmp2 * fjac(5,5,k+1)
+     >              - tmp1 * njac(5,5,k+1)
+     >              - tmp1 * dz5
+
+            enddo
+
+
+c---------------------------------------------------------------------
+c     outer most do loops - sweeping in k direction
+c---------------------------------------------------------------------
+            if (first .eq. 1) then 
+
+c---------------------------------------------------------------------
+c     multiply c(i,j,kstart) by b_inverse and copy back to c
+c     multiply rhs(kstart) by b_inverse(kstart) and copy to rhs
+c---------------------------------------------------------------------
+               call binvcrhs( lhsb(1,1,kstart),
+     >                        lhsc(1,1,i,j,kstart,c),
+     >                        rhs(1,i,j,kstart,c) )
+
+            endif
+
+c---------------------------------------------------------------------
+c     begin inner most do loop
+c     do all the elements of the cell unless last 
+c---------------------------------------------------------------------
+            do k=kstart+first,ksize-last
+
+c---------------------------------------------------------------------
+c     subtract A*lhs_vector(k-1) from lhs_vector(k)
+c     
+c     rhs(k) = rhs(k) - A*rhs(k-1)
+c---------------------------------------------------------------------
+               call matvec_sub(lhsa(1,1,k),
+     >                         rhs(1,i,j,k-1,c),rhs(1,i,j,k,c))
+
+c---------------------------------------------------------------------
+c     B(k) = B(k) - C(k-1)*A(k)
+c     call matmul_sub(aa,i,j,k,c,cc,i,j,k-1,c,bb,i,j,k,c)
+c---------------------------------------------------------------------
+               call matmul_sub(lhsa(1,1,k),
+     >                         lhsc(1,1,i,j,k-1,c),
+     >                         lhsb(1,1,k))
+
+c---------------------------------------------------------------------
+c     multiply c(i,j,k) by b_inverse and copy back to c
+c     multiply rhs(i,j,k) by b_inverse(i,j,k) and copy to rhs
+c---------------------------------------------------------------------
+               call binvcrhs( lhsb(1,1,k),
+     >                        lhsc(1,1,i,j,k,c),
+     >                        rhs(1,i,j,k,c) )
+
+            enddo
+
+c---------------------------------------------------------------------
+c     Now finish up special cases for last cell
+c---------------------------------------------------------------------
+            if (last .eq. 1) then
+
+c---------------------------------------------------------------------
+c     rhs(ksize) = rhs(ksize) - A*rhs(ksize-1)
+c---------------------------------------------------------------------
+               call matvec_sub(lhsa(1,1,ksize),
+     >                         rhs(1,i,j,ksize-1,c),rhs(1,i,j,ksize,c))
+
+c---------------------------------------------------------------------
+c     B(ksize) = B(ksize) - C(ksize-1)*A(ksize)
+c     call matmul_sub(aa,i,j,ksize,c,
+c     $              cc,i,j,ksize-1,c,bb,i,j,ksize,c)
+c---------------------------------------------------------------------
+               call matmul_sub(lhsa(1,1,ksize),
+     >                         lhsc(1,1,i,j,ksize-1,c),
+     >                         lhsb(1,1,ksize))
+
+c---------------------------------------------------------------------
+c     multiply rhs(ksize) by b_inverse(ksize) and copy to rhs
+c---------------------------------------------------------------------
+               call binvrhs( lhsb(1,1,ksize),
+     >                       rhs(1,i,j,ksize,c) )
+
+            endif
+         enddo
+      enddo
+
+
+      return
+      end
+      
+
+
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/z_solve_vec.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/z_solve_vec.f
new file mode 100644
index 0000000..bb84b0e
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/BT/z_solve_vec.f
@@ -0,0 +1,803 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine z_solve
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     Performs line solves in Z direction by first factoring
+c     the block-tridiagonal matrix into an upper triangular matrix, 
+c     and then performing back substitution to solve for the unknown
+c     vectors of each line.  
+c     
+c     Make sure we treat elements zero to cell_size in the direction
+c     of the sweep.
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer c, kstart, stage,
+     >     first, last, recv_id, error, r_status(MPI_STATUS_SIZE),
+     >     isize,jsize,ksize,send_id
+
+      kstart = 0
+
+      if (timeron) call timer_start(t_zsolve)
+c---------------------------------------------------------------------
+c     in our terminology stage is the number of the cell in the z-direction
+c     i.e. stage = 1 means the start of the line, stage = ncells means the end
+c---------------------------------------------------------------------
+      do stage = 1,ncells
+         c = slice(3,stage)
+         isize = cell_size(1,c) - 1
+         jsize = cell_size(2,c) - 1
+         ksize = cell_size(3,c) - 1
+c---------------------------------------------------------------------
+c     set last-cell flag
+c---------------------------------------------------------------------
+         if (stage .eq. ncells) then
+            last = 1
+         else
+            last = 0
+         endif
+
+         if (stage .eq. 1) then
+c---------------------------------------------------------------------
+c     This is the first cell, so solve without receiving data
+c---------------------------------------------------------------------
+            first = 1
+c            call lhsz(c)
+            call z_solve_cell(first,last,c)
+         else
+c---------------------------------------------------------------------
+c     Not the first cell of this line, so receive info from
+c     processor working on preceeding cell
+c---------------------------------------------------------------------
+            first = 0
+            if (timeron) call timer_start(t_zcomm)
+            call z_receive_solve_info(recv_id,c)
+c---------------------------------------------------------------------
+c     overlap computations and communications
+c---------------------------------------------------------------------
+c            call lhsz(c)
+c---------------------------------------------------------------------
+c     wait for completion
+c---------------------------------------------------------------------
+            call mpi_wait(send_id,r_status,error)
+            call mpi_wait(recv_id,r_status,error)
+            if (timeron) call timer_stop(t_zcomm)
+c---------------------------------------------------------------------
+c     install C'(kstart-1) and rhs'(kstart-1) to be used in this cell
+c---------------------------------------------------------------------
+            call z_unpack_solve_info(c)
+            call z_solve_cell(first,last,c)
+         endif
+
+         if (last .eq. 0) call z_send_solve_info(send_id,c)
+      enddo
+
+c---------------------------------------------------------------------
+c     now perform backsubstitution in reverse direction
+c---------------------------------------------------------------------
+      do stage = ncells, 1, -1
+         c = slice(3,stage)
+         first = 0
+         last = 0
+         if (stage .eq. 1) first = 1
+         if (stage .eq. ncells) then
+            last = 1
+c---------------------------------------------------------------------
+c     last cell, so perform back substitute without waiting
+c---------------------------------------------------------------------
+            call z_backsubstitute(first, last,c)
+         else
+            if (timeron) call timer_start(t_zcomm)
+            call z_receive_backsub_info(recv_id,c)
+            call mpi_wait(send_id,r_status,error)
+            call mpi_wait(recv_id,r_status,error)
+            if (timeron) call timer_stop(t_zcomm)
+            call z_unpack_backsub_info(c)
+            call z_backsubstitute(first,last,c)
+         endif
+         if (first .eq. 0) call z_send_backsub_info(send_id,c)
+      enddo
+
+      if (timeron) call timer_stop(t_zsolve)
+
+      return
+      end
+      
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      
+      subroutine z_unpack_solve_info(c)
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     unpack C'(-1) and rhs'(-1) for
+c     all i and j
+c---------------------------------------------------------------------
+
+      include 'header.h'
+
+      integer i,j,m,n,ptr,c,kstart 
+
+      kstart = 0
+      ptr = 0
+      do j=0,JMAX-1
+         do i=0,IMAX-1
+            do m=1,BLOCK_SIZE
+               do n=1,BLOCK_SIZE
+                  lhsc(m,n,i,j,kstart-1,c) = out_buffer(ptr+n)
+               enddo
+               ptr = ptr+BLOCK_SIZE
+            enddo
+            do n=1,BLOCK_SIZE
+               rhs(n,i,j,kstart-1,c) = out_buffer(ptr+n)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      
+      subroutine z_send_solve_info(send_id,c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     pack up and send C'(kend) and rhs'(kend) for
+c     all i and j
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer i,j,m,n,ksize,ptr,c,ip,jp
+      integer error,send_id,buffer_size
+
+      ksize = cell_size(3,c)-1
+      ip = cell_coord(1,c) - 1
+      jp = cell_coord(2,c) - 1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*
+     >     (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE)
+
+c---------------------------------------------------------------------
+c     pack up buffer
+c---------------------------------------------------------------------
+      ptr = 0
+      do j=0,JMAX-1
+         do i=0,IMAX-1
+            do m=1,BLOCK_SIZE
+               do n=1,BLOCK_SIZE
+                  in_buffer(ptr+n) = lhsc(m,n,i,j,ksize,c)
+               enddo
+               ptr = ptr+BLOCK_SIZE
+            enddo
+            do n=1,BLOCK_SIZE
+               in_buffer(ptr+n) = rhs(n,i,j,ksize,c)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+c---------------------------------------------------------------------
+c     send buffer 
+c---------------------------------------------------------------------
+      if (timeron) call timer_start(t_zcomm)
+      call mpi_isend(in_buffer, buffer_size,
+     >     dp_type, successor(3),
+     >     BOTTOM+ip+jp*NCELLS, comm_solve,
+     >     send_id,error)
+      if (timeron) call timer_stop(t_zcomm)
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine z_send_backsub_info(send_id,c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     pack up and send U(kstart) for all i and j
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer i,j,n,ptr,c,kstart,ip,jp
+      integer error,send_id,buffer_size
+
+c---------------------------------------------------------------------
+c     Send element 0 to previous processor
+c---------------------------------------------------------------------
+      kstart = 0
+      ip = cell_coord(1,c)-1
+      jp = cell_coord(2,c)-1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE
+      ptr = 0
+      do j=0,JMAX-1
+         do i=0,IMAX-1
+            do n=1,BLOCK_SIZE
+               in_buffer(ptr+n) = rhs(n,i,j,kstart,c)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+      if (timeron) call timer_start(t_zcomm)
+      call mpi_isend(in_buffer, buffer_size,
+     >     dp_type, predecessor(3), 
+     >     TOP+ip+jp*NCELLS, comm_solve, 
+     >     send_id,error)
+      if (timeron) call timer_stop(t_zcomm)
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine z_unpack_backsub_info(c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     unpack U(ksize) for all i and j
+c---------------------------------------------------------------------
+
+      include 'header.h'
+
+      integer i,j,n,ptr,c
+
+      ptr = 0
+      do j=0,JMAX-1
+         do i=0,IMAX-1
+            do n=1,BLOCK_SIZE
+               backsub_info(n,i,j,c) = out_buffer(ptr+n)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine z_receive_backsub_info(recv_id,c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     post mpi receives
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer error,recv_id,ip,jp,c,buffer_size
+      ip = cell_coord(1,c) - 1
+      jp = cell_coord(2,c) - 1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE
+      call mpi_irecv(out_buffer, buffer_size,
+     >     dp_type, successor(3), 
+     >     TOP+ip+jp*NCELLS, comm_solve, 
+     >     recv_id, error)
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine z_receive_solve_info(recv_id,c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     post mpi receives 
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'mpinpb.h'
+
+      integer ip,jp,recv_id,error,c,buffer_size
+      ip = cell_coord(1,c) - 1
+      jp = cell_coord(2,c) - 1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*
+     >     (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE)
+      call mpi_irecv(out_buffer, buffer_size,
+     >     dp_type, predecessor(3), 
+     >     BOTTOM+ip+jp*NCELLS, comm_solve,
+     >     recv_id, error)
+
+      return
+      end
+      
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine z_backsubstitute(first, last, c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     back solve: if last cell, then generate U(ksize)=rhs(ksize)
+c     else assume U(ksize) is loaded by z_unpack_backsub_info
+c     so just use it
+c     after call u(kstart) will be sent to next cell
+c---------------------------------------------------------------------
+
+      include 'header.h'
+
+      integer first, last, c, i, k
+      integer m,n,j,jsize,isize,ksize,kstart
+      
+      kstart = 0
+      isize = cell_size(1,c)-end(1,c)-1      
+      jsize = cell_size(2,c)-end(2,c)-1
+      ksize = cell_size(3,c)-1
+      if (last .eq. 0) then
+         do j=start(2,c),jsize
+            do i=start(1,c),isize
+c---------------------------------------------------------------------
+c     U(ksize) uses info from previous cell if not last cell
+c---------------------------------------------------------------------
+               do m=1,BLOCK_SIZE
+                  do n=1,BLOCK_SIZE
+                     rhs(m,i,j,ksize,c) = rhs(m,i,j,ksize,c) 
+     >                    - lhsc(m,n,i,j,ksize,c)*
+     >                    backsub_info(n,i,j,c)
+                  enddo
+               enddo
+            enddo
+         enddo
+      endif
+      do k=ksize-1,kstart,-1
+         do j=start(2,c),jsize
+            do i=start(1,c),isize
+               do m=1,BLOCK_SIZE
+                  do n=1,BLOCK_SIZE
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c) 
+     >                    - lhsc(m,n,i,j,k,c)*rhs(n,i,j,k+1,c)
+                  enddo
+               enddo
+            enddo
+         enddo
+      enddo
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine z_solve_cell(first,last,c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     performs gaussian elimination on this cell.
+c     
+c     assumes that unpacking routines for non-first cells 
+c     preload C' and rhs' from previous cell.
+c     
+c     assumed send happens outside this routine, but that
+c     c'(KMAX) and rhs'(KMAX) will be sent to next cell.
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'work_lhs_vec.h'
+
+      integer first,last,c
+      integer i,j,k,m,n,isize,ksize,jsize,kstart
+
+      kstart = 0
+      isize = cell_size(1,c)-end(1,c)-1
+      jsize = cell_size(2,c)-end(2,c)-1
+      ksize = cell_size(3,c)-1
+
+c---------------------------------------------------------------------
+c     zero the left hand side for starters
+c     set diagonal values to 1. This is overkill, but convenient
+c---------------------------------------------------------------------
+      do i = 0, isize
+         do m = 1, 5
+            do n = 1, 5
+               lhsa(m,n,i,0) = 0.0d0
+               lhsb(m,n,i,0) = 0.0d0
+               lhsa(m,n,i,ksize) = 0.0d0
+               lhsb(m,n,i,ksize) = 0.0d0
+            enddo
+            lhsb(m,m,i,0) = 1.0d0
+            lhsb(m,m,i,ksize) = 1.0d0
+         enddo
+      enddo
+
+      do j=start(2,c),jsize 
+
+c---------------------------------------------------------------------
+c     This function computes the left hand side for the three z-factors 
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     Compute the indices for storing the block-diagonal matrix;
+c     determine c (labeled f) and s jacobians for cell c
+c---------------------------------------------------------------------
+
+         do k = start(3,c)-1, cell_size(3,c)-end(3,c)
+            do i=start(1,c),isize
+
+               tmp1 = 1.0d0 / u(1,i,j,k,c)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               fjac(1,1,i,k) = 0.0d+00
+               fjac(1,2,i,k) = 0.0d+00
+               fjac(1,3,i,k) = 0.0d+00
+               fjac(1,4,i,k) = 1.0d+00
+               fjac(1,5,i,k) = 0.0d+00
+
+               fjac(2,1,i,k) = - ( u(2,i,j,k,c)*u(4,i,j,k,c) ) 
+     >              * tmp2 
+               fjac(2,2,i,k) = u(4,i,j,k,c) * tmp1
+               fjac(2,3,i,k) = 0.0d+00
+               fjac(2,4,i,k) = u(2,i,j,k,c) * tmp1
+               fjac(2,5,i,k) = 0.0d+00
+
+               fjac(3,1,i,k) = - ( u(3,i,j,k,c)*u(4,i,j,k,c) )
+     >              * tmp2 
+               fjac(3,2,i,k) = 0.0d+00
+               fjac(3,3,i,k) = u(4,i,j,k,c) * tmp1
+               fjac(3,4,i,k) = u(3,i,j,k,c) * tmp1
+               fjac(3,5,i,k) = 0.0d+00
+
+               fjac(4,1,i,k) = - (u(4,i,j,k,c)*u(4,i,j,k,c) * tmp2 ) 
+     >              + c2 * qs(i,j,k,c)
+               fjac(4,2,i,k) = - c2 *  u(2,i,j,k,c) * tmp1 
+               fjac(4,3,i,k) = - c2 *  u(3,i,j,k,c) * tmp1
+               fjac(4,4,i,k) = ( 2.0d+00 - c2 )
+     >              *  u(4,i,j,k,c) * tmp1 
+               fjac(4,5,i,k) = c2
+
+               fjac(5,1,i,k) = ( c2 * 2.0d0 * qs(i,j,k,c)
+     >              - c1 * ( u(5,i,j,k,c) * tmp1 ) )
+     >              * ( u(4,i,j,k,c) * tmp1 )
+               fjac(5,2,i,k) = - c2 * ( u(2,i,j,k,c)*u(4,i,j,k,c) )
+     >              * tmp2 
+               fjac(5,3,i,k) = - c2 * ( u(3,i,j,k,c)*u(4,i,j,k,c) )
+     >              * tmp2
+               fjac(5,4,i,k) = c1 * ( u(5,i,j,k,c) * tmp1 )
+     >              - c2 * ( qs(i,j,k,c)
+     >              + u(4,i,j,k,c)*u(4,i,j,k,c) * tmp2 )
+               fjac(5,5,i,k) = c1 * u(4,i,j,k,c) * tmp1
+
+               njac(1,1,i,k) = 0.0d+00
+               njac(1,2,i,k) = 0.0d+00
+               njac(1,3,i,k) = 0.0d+00
+               njac(1,4,i,k) = 0.0d+00
+               njac(1,5,i,k) = 0.0d+00
+
+               njac(2,1,i,k) = - c3c4 * tmp2 * u(2,i,j,k,c)
+               njac(2,2,i,k) =   c3c4 * tmp1
+               njac(2,3,i,k) =   0.0d+00
+               njac(2,4,i,k) =   0.0d+00
+               njac(2,5,i,k) =   0.0d+00
+
+               njac(3,1,i,k) = - c3c4 * tmp2 * u(3,i,j,k,c)
+               njac(3,2,i,k) =   0.0d+00
+               njac(3,3,i,k) =   c3c4 * tmp1
+               njac(3,4,i,k) =   0.0d+00
+               njac(3,5,i,k) =   0.0d+00
+
+               njac(4,1,i,k) = - con43 * c3c4 * tmp2 * u(4,i,j,k,c)
+               njac(4,2,i,k) =   0.0d+00
+               njac(4,3,i,k) =   0.0d+00
+               njac(4,4,i,k) =   con43 * c3 * c4 * tmp1
+               njac(4,5,i,k) =   0.0d+00
+
+               njac(5,1,i,k) = - (  c3c4
+     >              - c1345 ) * tmp3 * (u(2,i,j,k,c)**2)
+     >              - ( c3c4 - c1345 ) * tmp3 * (u(3,i,j,k,c)**2)
+     >              - ( con43 * c3c4
+     >              - c1345 ) * tmp3 * (u(4,i,j,k,c)**2)
+     >              - c1345 * tmp2 * u(5,i,j,k,c)
+
+               njac(5,2,i,k) = (  c3c4 - c1345 ) * tmp2 * u(2,i,j,k,c)
+               njac(5,3,i,k) = (  c3c4 - c1345 ) * tmp2 * u(3,i,j,k,c)
+               njac(5,4,i,k) = ( con43 * c3c4
+     >              - c1345 ) * tmp2 * u(4,i,j,k,c)
+               njac(5,5,i,k) = ( c1345 )* tmp1
+
+
+            enddo
+         enddo
+
+c---------------------------------------------------------------------
+c     now jacobians set, so form left hand side in z direction
+c---------------------------------------------------------------------
+         do k = start(3,c), ksize-end(3,c)
+            do i=start(1,c),isize
+
+               tmp1 = dt * tz1
+               tmp2 = dt * tz2
+
+               lhsa(1,1,i,k) = - tmp2 * fjac(1,1,i,k-1)
+     >              - tmp1 * njac(1,1,i,k-1)
+     >              - tmp1 * dz1 
+               lhsa(1,2,i,k) = - tmp2 * fjac(1,2,i,k-1)
+     >              - tmp1 * njac(1,2,i,k-1)
+               lhsa(1,3,i,k) = - tmp2 * fjac(1,3,i,k-1)
+     >              - tmp1 * njac(1,3,i,k-1)
+               lhsa(1,4,i,k) = - tmp2 * fjac(1,4,i,k-1)
+     >              - tmp1 * njac(1,4,i,k-1)
+               lhsa(1,5,i,k) = - tmp2 * fjac(1,5,i,k-1)
+     >              - tmp1 * njac(1,5,i,k-1)
+
+               lhsa(2,1,i,k) = - tmp2 * fjac(2,1,i,k-1)
+     >              - tmp1 * njac(2,1,i,k-1)
+               lhsa(2,2,i,k) = - tmp2 * fjac(2,2,i,k-1)
+     >              - tmp1 * njac(2,2,i,k-1)
+     >              - tmp1 * dz2
+               lhsa(2,3,i,k) = - tmp2 * fjac(2,3,i,k-1)
+     >              - tmp1 * njac(2,3,i,k-1)
+               lhsa(2,4,i,k) = - tmp2 * fjac(2,4,i,k-1)
+     >              - tmp1 * njac(2,4,i,k-1)
+               lhsa(2,5,i,k) = - tmp2 * fjac(2,5,i,k-1)
+     >              - tmp1 * njac(2,5,i,k-1)
+
+               lhsa(3,1,i,k) = - tmp2 * fjac(3,1,i,k-1)
+     >              - tmp1 * njac(3,1,i,k-1)
+               lhsa(3,2,i,k) = - tmp2 * fjac(3,2,i,k-1)
+     >              - tmp1 * njac(3,2,i,k-1)
+               lhsa(3,3,i,k) = - tmp2 * fjac(3,3,i,k-1)
+     >              - tmp1 * njac(3,3,i,k-1)
+     >              - tmp1 * dz3 
+               lhsa(3,4,i,k) = - tmp2 * fjac(3,4,i,k-1)
+     >              - tmp1 * njac(3,4,i,k-1)
+               lhsa(3,5,i,k) = - tmp2 * fjac(3,5,i,k-1)
+     >              - tmp1 * njac(3,5,i,k-1)
+
+               lhsa(4,1,i,k) = - tmp2 * fjac(4,1,i,k-1)
+     >              - tmp1 * njac(4,1,i,k-1)
+               lhsa(4,2,i,k) = - tmp2 * fjac(4,2,i,k-1)
+     >              - tmp1 * njac(4,2,i,k-1)
+               lhsa(4,3,i,k) = - tmp2 * fjac(4,3,i,k-1)
+     >              - tmp1 * njac(4,3,i,k-1)
+               lhsa(4,4,i,k) = - tmp2 * fjac(4,4,i,k-1)
+     >              - tmp1 * njac(4,4,i,k-1)
+     >              - tmp1 * dz4
+               lhsa(4,5,i,k) = - tmp2 * fjac(4,5,i,k-1)
+     >              - tmp1 * njac(4,5,i,k-1)
+
+               lhsa(5,1,i,k) = - tmp2 * fjac(5,1,i,k-1)
+     >              - tmp1 * njac(5,1,i,k-1)
+               lhsa(5,2,i,k) = - tmp2 * fjac(5,2,i,k-1)
+     >              - tmp1 * njac(5,2,i,k-1)
+               lhsa(5,3,i,k) = - tmp2 * fjac(5,3,i,k-1)
+     >              - tmp1 * njac(5,3,i,k-1)
+               lhsa(5,4,i,k) = - tmp2 * fjac(5,4,i,k-1)
+     >              - tmp1 * njac(5,4,i,k-1)
+               lhsa(5,5,i,k) = - tmp2 * fjac(5,5,i,k-1)
+     >              - tmp1 * njac(5,5,i,k-1)
+     >              - tmp1 * dz5
+
+               lhsb(1,1,i,k) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(1,1,i,k)
+     >              + tmp1 * 2.0d+00 * dz1
+               lhsb(1,2,i,k) = tmp1 * 2.0d+00 * njac(1,2,i,k)
+               lhsb(1,3,i,k) = tmp1 * 2.0d+00 * njac(1,3,i,k)
+               lhsb(1,4,i,k) = tmp1 * 2.0d+00 * njac(1,4,i,k)
+               lhsb(1,5,i,k) = tmp1 * 2.0d+00 * njac(1,5,i,k)
+
+               lhsb(2,1,i,k) = tmp1 * 2.0d+00 * njac(2,1,i,k)
+               lhsb(2,2,i,k) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(2,2,i,k)
+     >              + tmp1 * 2.0d+00 * dz2
+               lhsb(2,3,i,k) = tmp1 * 2.0d+00 * njac(2,3,i,k)
+               lhsb(2,4,i,k) = tmp1 * 2.0d+00 * njac(2,4,i,k)
+               lhsb(2,5,i,k) = tmp1 * 2.0d+00 * njac(2,5,i,k)
+
+               lhsb(3,1,i,k) = tmp1 * 2.0d+00 * njac(3,1,i,k)
+               lhsb(3,2,i,k) = tmp1 * 2.0d+00 * njac(3,2,i,k)
+               lhsb(3,3,i,k) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(3,3,i,k)
+     >              + tmp1 * 2.0d+00 * dz3
+               lhsb(3,4,i,k) = tmp1 * 2.0d+00 * njac(3,4,i,k)
+               lhsb(3,5,i,k) = tmp1 * 2.0d+00 * njac(3,5,i,k)
+
+               lhsb(4,1,i,k) = tmp1 * 2.0d+00 * njac(4,1,i,k)
+               lhsb(4,2,i,k) = tmp1 * 2.0d+00 * njac(4,2,i,k)
+               lhsb(4,3,i,k) = tmp1 * 2.0d+00 * njac(4,3,i,k)
+               lhsb(4,4,i,k) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(4,4,i,k)
+     >              + tmp1 * 2.0d+00 * dz4
+               lhsb(4,5,i,k) = tmp1 * 2.0d+00 * njac(4,5,i,k)
+
+               lhsb(5,1,i,k) = tmp1 * 2.0d+00 * njac(5,1,i,k)
+               lhsb(5,2,i,k) = tmp1 * 2.0d+00 * njac(5,2,i,k)
+               lhsb(5,3,i,k) = tmp1 * 2.0d+00 * njac(5,3,i,k)
+               lhsb(5,4,i,k) = tmp1 * 2.0d+00 * njac(5,4,i,k)
+               lhsb(5,5,i,k) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(5,5,i,k) 
+     >              + tmp1 * 2.0d+00 * dz5
+
+               lhsc(1,1,i,j,k,c) =  tmp2 * fjac(1,1,i,k+1)
+     >              - tmp1 * njac(1,1,i,k+1)
+     >              - tmp1 * dz1
+               lhsc(1,2,i,j,k,c) =  tmp2 * fjac(1,2,i,k+1)
+     >              - tmp1 * njac(1,2,i,k+1)
+               lhsc(1,3,i,j,k,c) =  tmp2 * fjac(1,3,i,k+1)
+     >              - tmp1 * njac(1,3,i,k+1)
+               lhsc(1,4,i,j,k,c) =  tmp2 * fjac(1,4,i,k+1)
+     >              - tmp1 * njac(1,4,i,k+1)
+               lhsc(1,5,i,j,k,c) =  tmp2 * fjac(1,5,i,k+1)
+     >              - tmp1 * njac(1,5,i,k+1)
+
+               lhsc(2,1,i,j,k,c) =  tmp2 * fjac(2,1,i,k+1)
+     >              - tmp1 * njac(2,1,i,k+1)
+               lhsc(2,2,i,j,k,c) =  tmp2 * fjac(2,2,i,k+1)
+     >              - tmp1 * njac(2,2,i,k+1)
+     >              - tmp1 * dz2
+               lhsc(2,3,i,j,k,c) =  tmp2 * fjac(2,3,i,k+1)
+     >              - tmp1 * njac(2,3,i,k+1)
+               lhsc(2,4,i,j,k,c) =  tmp2 * fjac(2,4,i,k+1)
+     >              - tmp1 * njac(2,4,i,k+1)
+               lhsc(2,5,i,j,k,c) =  tmp2 * fjac(2,5,i,k+1)
+     >              - tmp1 * njac(2,5,i,k+1)
+
+               lhsc(3,1,i,j,k,c) =  tmp2 * fjac(3,1,i,k+1)
+     >              - tmp1 * njac(3,1,i,k+1)
+               lhsc(3,2,i,j,k,c) =  tmp2 * fjac(3,2,i,k+1)
+     >              - tmp1 * njac(3,2,i,k+1)
+               lhsc(3,3,i,j,k,c) =  tmp2 * fjac(3,3,i,k+1)
+     >              - tmp1 * njac(3,3,i,k+1)
+     >              - tmp1 * dz3
+               lhsc(3,4,i,j,k,c) =  tmp2 * fjac(3,4,i,k+1)
+     >              - tmp1 * njac(3,4,i,k+1)
+               lhsc(3,5,i,j,k,c) =  tmp2 * fjac(3,5,i,k+1)
+     >              - tmp1 * njac(3,5,i,k+1)
+
+               lhsc(4,1,i,j,k,c) =  tmp2 * fjac(4,1,i,k+1)
+     >              - tmp1 * njac(4,1,i,k+1)
+               lhsc(4,2,i,j,k,c) =  tmp2 * fjac(4,2,i,k+1)
+     >              - tmp1 * njac(4,2,i,k+1)
+               lhsc(4,3,i,j,k,c) =  tmp2 * fjac(4,3,i,k+1)
+     >              - tmp1 * njac(4,3,i,k+1)
+               lhsc(4,4,i,j,k,c) =  tmp2 * fjac(4,4,i,k+1)
+     >              - tmp1 * njac(4,4,i,k+1)
+     >              - tmp1 * dz4
+               lhsc(4,5,i,j,k,c) =  tmp2 * fjac(4,5,i,k+1)
+     >              - tmp1 * njac(4,5,i,k+1)
+
+               lhsc(5,1,i,j,k,c) =  tmp2 * fjac(5,1,i,k+1)
+     >              - tmp1 * njac(5,1,i,k+1)
+               lhsc(5,2,i,j,k,c) =  tmp2 * fjac(5,2,i,k+1)
+     >              - tmp1 * njac(5,2,i,k+1)
+               lhsc(5,3,i,j,k,c) =  tmp2 * fjac(5,3,i,k+1)
+     >              - tmp1 * njac(5,3,i,k+1)
+               lhsc(5,4,i,j,k,c) =  tmp2 * fjac(5,4,i,k+1)
+     >              - tmp1 * njac(5,4,i,k+1)
+               lhsc(5,5,i,j,k,c) =  tmp2 * fjac(5,5,i,k+1)
+     >              - tmp1 * njac(5,5,i,k+1)
+     >              - tmp1 * dz5
+
+            enddo
+         enddo
+
+
+c---------------------------------------------------------------------
+c     outer most do loops - sweeping in i direction
+c---------------------------------------------------------------------
+         if (first .eq. 1) then 
+
+c---------------------------------------------------------------------
+c     multiply c(i,j,kstart) by b_inverse and copy back to c
+c     multiply rhs(kstart) by b_inverse(kstart) and copy to rhs
+c---------------------------------------------------------------------
+!dir$ ivdep
+            do i=start(1,c),isize
+               call binvcrhs( lhsb(1,1,i,kstart),
+     >                        lhsc(1,1,i,j,kstart,c),
+     >                        rhs(1,i,j,kstart,c) )
+            enddo
+
+         endif
+
+c---------------------------------------------------------------------
+c     begin inner most do loop
+c     do all the elements of the cell unless last 
+c---------------------------------------------------------------------
+         do k=kstart+first,ksize-last
+!dir$ ivdep
+            do i=start(1,c),isize
+
+c---------------------------------------------------------------------
+c     subtract A*lhs_vector(k-1) from lhs_vector(k)
+c     
+c     rhs(k) = rhs(k) - A*rhs(k-1)
+c---------------------------------------------------------------------
+               call matvec_sub(lhsa(1,1,i,k),
+     >                         rhs(1,i,j,k-1,c),rhs(1,i,j,k,c))
+
+c---------------------------------------------------------------------
+c     B(k) = B(k) - C(k-1)*A(k)
+c     call matmul_sub(aa,i,j,k,c,cc,i,j,k-1,c,bb,i,j,k,c)
+c---------------------------------------------------------------------
+               call matmul_sub(lhsa(1,1,i,k),
+     >                         lhsc(1,1,i,j,k-1,c),
+     >                         lhsb(1,1,i,k))
+
+c---------------------------------------------------------------------
+c     multiply c(i,j,k) by b_inverse and copy back to c
+c     multiply rhs(i,j,1) by b_inverse(i,j,1) and copy to rhs
+c---------------------------------------------------------------------
+               call binvcrhs( lhsb(1,1,i,k),
+     >                        lhsc(1,1,i,j,k,c),
+     >                        rhs(1,i,j,k,c) )
+
+            enddo
+         enddo
+
+c---------------------------------------------------------------------
+c     Now finish up special cases for last cell
+c---------------------------------------------------------------------
+         if (last .eq. 1) then
+
+!dir$ ivdep
+            do i=start(1,c),isize
+c---------------------------------------------------------------------
+c     rhs(ksize) = rhs(ksize) - A*rhs(ksize-1)
+c---------------------------------------------------------------------
+               call matvec_sub(lhsa(1,1,i,ksize),
+     >                         rhs(1,i,j,ksize-1,c),rhs(1,i,j,ksize,c))
+
+c---------------------------------------------------------------------
+c     B(ksize) = B(ksize) - C(ksize-1)*A(ksize)
+c     call matmul_sub(aa,i,j,ksize,c,
+c     $              cc,i,j,ksize-1,c,bb,i,j,ksize,c)
+c---------------------------------------------------------------------
+               call matmul_sub(lhsa(1,1,i,ksize),
+     >                         lhsc(1,1,i,j,ksize-1,c),
+     >                         lhsb(1,1,i,ksize))
+
+c---------------------------------------------------------------------
+c     multiply rhs(ksize) by b_inverse(ksize) and copy to rhs
+c---------------------------------------------------------------------
+               call binvrhs( lhsb(1,1,i,ksize),
+     >                       rhs(1,i,j,ksize,c) )
+            enddo
+
+         endif
+      enddo
+
+
+      return
+      end
+      
+
+
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/CG/Makefile b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/CG/Makefile
new file mode 100644
index 0000000..e9f0c98
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/CG/Makefile
@@ -0,0 +1,23 @@
+SHELL=/bin/sh
+BENCHMARK=cg
+BENCHMARKU=CG
+
+include ../config/make.def
+
+OBJS = cg.o ${COMMON}/print_results.o  \
+       ${COMMON}/${RAND}.o ${COMMON}/timers.o
+
+include ../sys/make.common
+
+${PROGRAM}: config ${OBJS}
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} ${FMPI_LIB}
+
+cg.o:		cg.f  mpinpb.h npbparams.h timing.h
+	${FCOMPILE} cg.f
+
+clean:
+	- rm -f *.o *~ 
+	- rm -f npbparams.h core
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/CG/cg.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/CG/cg.f
new file mode 100644
index 0000000..9a82466
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/CG/cg.f
@@ -0,0 +1,1864 @@
+!-------------------------------------------------------------------------!
+!                                                                         !
+!        N  A  S     P A R A L L E L     B E N C H M A R K S  3.3         !
+!                                                                         !
+!                                   C G                                   !
+!                                                                         !
+!-------------------------------------------------------------------------!
+!                                                                         !
+!    This benchmark is part of the NAS Parallel Benchmark 3.3 suite.      !
+!    It is described in NAS Technical Reports 95-020 and 02-007           !
+!                                                                         !
+!    Permission to use, copy, distribute and modify this software         !
+!    for any purpose with or without fee is hereby granted.  We           !
+!    request, however, that all derived work reference the NAS            !
+!    Parallel Benchmarks 3.3. This software is provided "as is"           !
+!    without express or implied warranty.                                 !
+!                                                                         !
+!    Information on NPB 3.3, including the technical report, the          !
+!    original specifications, source code, results and information        !
+!    on how to submit new results, is available at:                       !
+!                                                                         !
+!           http://www.nas.nasa.gov/Software/NPB/                         !
+!                                                                         !
+!    Send comments or suggestions to  npb@nas.nasa.gov                    !
+!                                                                         !
+!          NAS Parallel Benchmarks Group                                  !
+!          NASA Ames Research Center                                      !
+!          Mail Stop: T27A-1                                              !
+!          Moffett Field, CA   94035-1000                                 !
+!                                                                         !
+!          E-mail:  npb@nas.nasa.gov                                      !
+!          Fax:     (650) 604-3957                                        !
+!                                                                         !
+!-------------------------------------------------------------------------!
+
+
+c---------------------------------------------------------------------
+c
+c Authors: M. Yarrow
+c          C. Kuszmaul
+c          R. F. Van der Wijngaart
+c          H. Jin
+c
+c---------------------------------------------------------------------
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      program cg
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+
+      implicit none
+
+      include 'mpinpb.h'
+      include 'timing.h'
+      integer status(MPI_STATUS_SIZE), request, ierr
+
+      include 'npbparams.h'
+
+c---------------------------------------------------------------------
+c  num_procs must be a power of 2, and num_procs=num_proc_cols*num_proc_rows.
+c  num_proc_cols and num_proc_rows are to be found in npbparams.h.
+c  When num_procs is not square, then num_proc_cols must be = 2*num_proc_rows.
+c---------------------------------------------------------------------
+      integer    num_procs 
+      parameter( num_procs = num_proc_cols * num_proc_rows )
+
+
+
+c---------------------------------------------------------------------
+c  Class-specific parameters:
+c  They appear here for reference only.
+c  These are their values; however, this info is imported from the
+c  npbparams.h include file, which is written by the sys/setparams.c program.
+c---------------------------------------------------------------------
+
+C----------
+C  Class S:
+C----------
+CC       parameter( na=1400, 
+CC      >           nonzer=7, 
+CC      >           shift=10., 
+CC      >           niter=15,
+CC      >           rcond=1.0d-1 )
+C----------
+C  Class W:
+C----------
+CC       parameter( na=7000,
+CC      >           nonzer=8, 
+CC      >           shift=12., 
+CC      >           niter=15,
+CC      >           rcond=1.0d-1 )
+C----------
+C  Class A:
+C----------
+CC       parameter( na=14000,
+CC      >           nonzer=11, 
+CC      >           shift=20., 
+CC      >           niter=15,
+CC      >           rcond=1.0d-1 )
+C----------
+C  Class B:
+C----------
+CC       parameter( na=75000, 
+CC      >           nonzer=13, 
+CC      >           shift=60., 
+CC      >           niter=75,
+CC      >           rcond=1.0d-1 )
+C----------
+C  Class C:
+C----------
+CC       parameter( na=150000, 
+CC      >           nonzer=15, 
+CC      >           shift=110., 
+CC      >           niter=75,
+CC      >           rcond=1.0d-1 )
+C----------
+C  Class D:
+C----------
+CC       parameter( na=1500000, 
+CC      >           nonzer=21, 
+CC      >           shift=500., 
+CC      >           niter=100,
+CC      >           rcond=1.0d-1 )
+C----------
+C  Class E:
+C----------
+CC       parameter( na=9000000, 
+CC      >           nonzer=26, 
+CC      >           shift=1500., 
+CC      >           niter=100,
+CC      >           rcond=1.0d-1 )
+
+
+
+      integer    nz
+      parameter( nz = na*(nonzer+1)/num_procs*(nonzer+1)+nonzer
+     >              + na*(nonzer+2+num_procs/256)/num_proc_cols )
+
+
+
+      common / partit_size  /  naa, nzz, 
+     >                         npcols, nprows,
+     >                         proc_col, proc_row,
+     >                         firstrow, 
+     >                         lastrow, 
+     >                         firstcol, 
+     >                         lastcol,
+     >                         exch_proc,
+     >                         exch_recv_length,
+     >                         send_start,
+     >                         send_len
+      integer                  naa, nzz, 
+     >                         npcols, nprows,
+     >                         proc_col, proc_row,
+     >                         firstrow, 
+     >                         lastrow, 
+     >                         firstcol, 
+     >                         lastcol,
+     >                         exch_proc,
+     >                         exch_recv_length,
+     >                         send_start,
+     >                         send_len
+
+
+      common / main_int_mem /  colidx,     rowstr,
+     >                         iv,         arow,     acol
+      integer                  colidx(nz), rowstr(na+1),
+     >                         iv(2*na+1), arow(nz), acol(nz)
+
+
+      common / main_flt_mem /  v,       aelt,     a,
+     >                         x,
+     >                         z,
+     >                         p,
+     >                         q,
+     >                         r,
+     >                         w
+      double precision         v(na+1), aelt(nz), a(nz),
+     >                         x(na/num_proc_rows+2),
+     >                         z(na/num_proc_rows+2),
+     >                         p(na/num_proc_rows+2),
+     >                         q(na/num_proc_rows+2),
+     >                         r(na/num_proc_rows+2),
+     >                         w(na/num_proc_rows+2)
+
+
+      common /urando/          amult, tran
+      double precision         amult, tran
+
+
+
+      integer            l2npcols
+      integer            reduce_exch_proc(num_proc_cols)
+      integer            reduce_send_starts(num_proc_cols)
+      integer            reduce_send_lengths(num_proc_cols)
+      integer            reduce_recv_starts(num_proc_cols)
+      integer            reduce_recv_lengths(num_proc_cols)
+
+      integer            i, j, k, it
+
+      double precision   zeta, randlc
+      external           randlc
+      double precision   rnorm
+      double precision   norm_temp1(2), norm_temp2(2)
+
+      double precision   t, tmax, mflops
+      external           timer_read
+      double precision   timer_read
+      character          class
+      logical            verified
+      double precision   zeta_verify_value, epsilon, err
+
+      double precision tsum(t_last+2), t1(t_last+2),
+     >                 tming(t_last+2), tmaxg(t_last+2)
+      character        t_recs(t_last+2)*8
+
+      data t_recs/'total', 'conjg', 'rcomm', 'ncomm',
+     >            ' totcomp', ' totcomm'/
+
+
+c---------------------------------------------------------------------
+c  Set up mpi initialization and number of proc testing
+c---------------------------------------------------------------------
+      call initialize_mpi
+
+
+      if( na .eq. 1400 .and. 
+     &    nonzer .eq. 7 .and. 
+     &    niter .eq. 15 .and.
+     &    shift .eq. 10.d0 ) then
+         class = 'S'
+         zeta_verify_value = 8.5971775078648d0
+      else if( na .eq. 7000 .and. 
+     &         nonzer .eq. 8 .and. 
+     &         niter .eq. 15 .and.
+     &         shift .eq. 12.d0 ) then
+         class = 'W'
+         zeta_verify_value = 10.362595087124d0
+      else if( na .eq. 14000 .and. 
+     &         nonzer .eq. 11 .and. 
+     &         niter .eq. 15 .and.
+     &         shift .eq. 20.d0 ) then
+         class = 'A'
+         zeta_verify_value = 17.130235054029d0
+      else if( na .eq. 75000 .and. 
+     &         nonzer .eq. 13 .and. 
+     &         niter .eq. 75 .and.
+     &         shift .eq. 60.d0 ) then
+         class = 'B'
+         zeta_verify_value = 22.712745482631d0
+      else if( na .eq. 150000 .and. 
+     &         nonzer .eq. 15 .and. 
+     &         niter .eq. 75 .and.
+     &         shift .eq. 110.d0 ) then
+         class = 'C'
+         zeta_verify_value = 28.973605592845d0
+      else if( na .eq. 1500000 .and. 
+     &         nonzer .eq. 21 .and. 
+     &         niter .eq. 100 .and.
+     &         shift .eq. 500.d0 ) then
+         class = 'D'
+         zeta_verify_value = 52.514532105794d0
+      else if( na .eq. 9000000 .and. 
+     &         nonzer .eq. 26 .and. 
+     &         niter .eq. 100 .and.
+     &         shift .eq. 1.5d3 ) then
+         class = 'E'
+         zeta_verify_value = 77.522164599383d0
+      else
+         class = 'U'
+      endif
+
+      if( me .eq. root )then
+         write( *,1000 ) 
+         write( *,1001 ) na
+         write( *,1002 ) niter
+         write( *,1003 ) nprocs
+         write( *,1004 ) nonzer
+         write( *,1005 ) shift
+ 1000 format(//,' NAS Parallel Benchmarks 3.3 -- CG Benchmark', /)
+ 1001 format(' Size: ', i10 )
+ 1002 format(' Iterations: ', i5 )
+ 1003 format(' Number of active processes: ', i5 )
+ 1004 format(' Number of nonzeroes per row: ', i8)
+ 1005 format(' Eigenvalue shift: ', e8.3)
+      endif
+
+      if (.not. convertdouble) then
+         dp_type = MPI_DOUBLE_PRECISION
+      else
+         dp_type = MPI_REAL
+      endif
+
+
+      naa = na
+      nzz = nz
+
+
+c---------------------------------------------------------------------
+c  Set up processor info, such as whether sq num of procs, etc
+c---------------------------------------------------------------------
+      call setup_proc_info( num_procs, 
+     >                      num_proc_rows, 
+     >                      num_proc_cols )
+
+
+c---------------------------------------------------------------------
+c  Set up partition's submatrix info: firstcol, lastcol, firstrow, lastrow
+c---------------------------------------------------------------------
+      call setup_submatrix_info( l2npcols,
+     >                           reduce_exch_proc,
+     >                           reduce_send_starts,
+     >                           reduce_send_lengths,
+     >                           reduce_recv_starts,
+     >                           reduce_recv_lengths )
+
+
+      do i = 1, t_last
+         call timer_clear(i)
+      end do
+
+c---------------------------------------------------------------------
+c  Initialize random number generator
+c---------------------------------------------------------------------
+      tran    = 314159265.0D0
+      amult   = 1220703125.0D0
+      zeta    = randlc( tran, amult )
+
+c---------------------------------------------------------------------
+c  Set up partition's sparse random matrix for given class size
+c---------------------------------------------------------------------
+      call makea(naa, nzz, a, colidx, rowstr, nonzer,
+     >           firstrow, lastrow, firstcol, lastcol, 
+     >           rcond, arow, acol, aelt, v, iv, shift)
+
+
+
+c---------------------------------------------------------------------
+c  Note: as a result of the above call to makea:
+c        values of j used in indexing rowstr go from 1 --> lastrow-firstrow+1
+c        values of colidx which are col indexes go from firstcol --> lastcol
+c        So:
+c        Shift the col index vals from actual (firstcol --> lastcol ) 
+c        to local, i.e., (1 --> lastcol-firstcol+1)
+c---------------------------------------------------------------------
+      do j=1,lastrow-firstrow+1
+         do k=rowstr(j),rowstr(j+1)-1
+            colidx(k) = colidx(k) - firstcol + 1
+         enddo
+      enddo
+
+c---------------------------------------------------------------------
+c  set starting vector to (1, 1, .... 1)
+c---------------------------------------------------------------------
+      do i = 1, na/num_proc_rows+1
+         x(i) = 1.0D0
+      enddo
+
+      zeta  = 0.0d0
+
+c---------------------------------------------------------------------
+c---->
+c  Do one iteration untimed to init all code and data page tables
+c---->                    (then reinit, start timing, to niter its)
+c---------------------------------------------------------------------
+      do it = 1, 1
+
+c---------------------------------------------------------------------
+c  The call to the conjugate gradient routine:
+c---------------------------------------------------------------------
+         call conj_grad ( colidx,
+     >                    rowstr,
+     >                    x,
+     >                    z,
+     >                    a,
+     >                    p,
+     >                    q,
+     >                    r,
+     >                    w,
+     >                    rnorm, 
+     >                    l2npcols,
+     >                    reduce_exch_proc,
+     >                    reduce_send_starts,
+     >                    reduce_send_lengths,
+     >                    reduce_recv_starts,
+     >                    reduce_recv_lengths )
+
+c---------------------------------------------------------------------
+c  zeta = shift + 1/(x.z)
+c  So, first: (x.z)
+c  Also, find norm of z
+c  So, first: (z.z)
+c---------------------------------------------------------------------
+         norm_temp1(1) = 0.0d0
+         norm_temp1(2) = 0.0d0
+         do j=1, lastcol-firstcol+1
+            norm_temp1(1) = norm_temp1(1) + x(j)*z(j)
+            norm_temp1(2) = norm_temp1(2) + z(j)*z(j)
+         enddo
+
+         do i = 1, l2npcols
+            if (timeron) call timer_start(t_ncomm)
+            call mpi_irecv( norm_temp2,
+     >                      2, 
+     >                      dp_type,
+     >                      reduce_exch_proc(i),
+     >                      i,
+     >                      mpi_comm_world,
+     >                      request,
+     >                      ierr )
+            call mpi_send(  norm_temp1,
+     >                      2, 
+     >                      dp_type,
+     >                      reduce_exch_proc(i),
+     >                      i,
+     >                      mpi_comm_world,
+     >                      ierr )
+            call mpi_wait( request, status, ierr )
+            if (timeron) call timer_stop(t_ncomm)
+
+            norm_temp1(1) = norm_temp1(1) + norm_temp2(1)
+            norm_temp1(2) = norm_temp1(2) + norm_temp2(2)
+         enddo
+
+         norm_temp1(2) = 1.0d0 / sqrt( norm_temp1(2) )
+
+
+c---------------------------------------------------------------------
+c  Normalize z to obtain x
+c---------------------------------------------------------------------
+         do j=1, lastcol-firstcol+1      
+            x(j) = norm_temp1(2)*z(j)    
+         enddo                           
+
+
+      enddo                              ! end of do one iteration untimed
+
+
+c---------------------------------------------------------------------
+c  set starting vector to (1, 1, .... 1)
+c---------------------------------------------------------------------
+c
+c  NOTE: a questionable limit on size:  should this be na/num_proc_cols+1 ?
+c
+      do i = 1, na/num_proc_rows+1
+         x(i) = 1.0D0
+      enddo
+
+      zeta  = 0.0d0
+
+c---------------------------------------------------------------------
+c  Synchronize and start timing
+c---------------------------------------------------------------------
+      do i = 1, t_last
+         call timer_clear(i)
+      end do
+      call mpi_barrier( mpi_comm_world,
+     >                  ierr )
+
+      call timer_clear( 1 )
+      call timer_start( 1 )
+
+c---------------------------------------------------------------------
+c---->
+c  Main Iteration for inverse power method
+c---->
+c---------------------------------------------------------------------
+      do it = 1, niter
+
+c---------------------------------------------------------------------
+c  The call to the conjugate gradient routine:
+c---------------------------------------------------------------------
+         call conj_grad ( colidx,
+     >                    rowstr,
+     >                    x,
+     >                    z,
+     >                    a,
+     >                    p,
+     >                    q,
+     >                    r,
+     >                    w,
+     >                    rnorm, 
+     >                    l2npcols,
+     >                    reduce_exch_proc,
+     >                    reduce_send_starts,
+     >                    reduce_send_lengths,
+     >                    reduce_recv_starts,
+     >                    reduce_recv_lengths )
+
+
+c---------------------------------------------------------------------
+c  zeta = shift + 1/(x.z)
+c  So, first: (x.z)
+c  Also, find norm of z
+c  So, first: (z.z)
+c---------------------------------------------------------------------
+         norm_temp1(1) = 0.0d0
+         norm_temp1(2) = 0.0d0
+         do j=1, lastcol-firstcol+1
+            norm_temp1(1) = norm_temp1(1) + x(j)*z(j)
+            norm_temp1(2) = norm_temp1(2) + z(j)*z(j)
+         enddo
+
+         do i = 1, l2npcols
+            if (timeron) call timer_start(t_ncomm)
+            call mpi_irecv( norm_temp2,
+     >                      2, 
+     >                      dp_type,
+     >                      reduce_exch_proc(i),
+     >                      i,
+     >                      mpi_comm_world,
+     >                      request,
+     >                      ierr )
+            call mpi_send(  norm_temp1,
+     >                      2, 
+     >                      dp_type,
+     >                      reduce_exch_proc(i),
+     >                      i,
+     >                      mpi_comm_world,
+     >                      ierr )
+            call mpi_wait( request, status, ierr )
+            if (timeron) call timer_stop(t_ncomm)
+
+            norm_temp1(1) = norm_temp1(1) + norm_temp2(1)
+            norm_temp1(2) = norm_temp1(2) + norm_temp2(2)
+         enddo
+
+         norm_temp1(2) = 1.0d0 / sqrt( norm_temp1(2) )
+
+
+         if( me .eq. root )then
+            zeta = shift + 1.0d0 / norm_temp1(1)
+            if( it .eq. 1 ) write( *,9000 )
+            write( *,9001 ) it, rnorm, zeta
+         endif
+ 9000 format( /,'   iteration           ||r||                 zeta' )
+ 9001 format( 4x, i5, 7x, e20.14, f20.13 )
+
+c---------------------------------------------------------------------
+c  Normalize z to obtain x
+c---------------------------------------------------------------------
+         do j=1, lastcol-firstcol+1      
+            x(j) = norm_temp1(2)*z(j)    
+         enddo                           
+
+
+      enddo                              ! end of main iter inv pow meth
+
+      call timer_stop( 1 )
+
+c---------------------------------------------------------------------
+c  End of timed section
+c---------------------------------------------------------------------
+
+      t = timer_read( 1 )
+
+      call mpi_reduce( t,
+     >                 tmax,
+     >                 1, 
+     >                 dp_type,
+     >                 MPI_MAX,
+     >                 root,
+     >                 mpi_comm_world,
+     >                 ierr )
+
+      if( me .eq. root )then
+         write(*,100)
+ 100     format(' Benchmark completed ')
+
+         epsilon = 1.d-10
+         if (class .ne. 'U') then
+
+            err = abs( zeta - zeta_verify_value )/zeta_verify_value
+            if( err .le. epsilon ) then
+               verified = .TRUE.
+               write(*, 200)
+               write(*, 201) zeta
+               write(*, 202) err
+ 200           format(' VERIFICATION SUCCESSFUL ')
+ 201           format(' Zeta is    ', E20.13)
+ 202           format(' Error is   ', E20.13)
+            else
+               verified = .FALSE.
+               write(*, 300) 
+               write(*, 301) zeta
+               write(*, 302) zeta_verify_value
+ 300           format(' VERIFICATION FAILED')
+ 301           format(' Zeta                ', E20.13)
+ 302           format(' The correct zeta is ', E20.13)
+            endif
+         else
+            verified = .FALSE.
+            write (*, 400)
+            write (*, 401)
+            write (*, 201) zeta
+ 400        format(' Problem size unknown')
+ 401        format(' NO VERIFICATION PERFORMED')
+         endif
+
+
+         if( tmax .ne. 0. ) then
+            mflops = float( 2*niter*na )
+     &                  * ( 3.+float( nonzer*(nonzer+1) )
+     &                    + 25.*(5.+float( nonzer*(nonzer+1) ))
+     &                    + 3. ) / tmax / 1000000.0
+         else
+            mflops = 0.0
+         endif
+
+         call print_results('CG', class, na, 0, 0,
+     >                      niter, nnodes_compiled, nprocs, tmax,
+     >                      mflops, '          floating point', 
+     >                      verified, npbversion, compiletime,
+     >                      cs1, cs2, cs3, cs4, cs5, cs6, cs7)
+
+
+      endif
+
+
+      if (.not.timeron) goto 999
+
+      do i = 1, t_last
+         t1(i) = timer_read(i)
+      end do
+      t1(t_conjg) = t1(t_conjg) - t1(t_rcomm)
+      t1(t_last+2) = t1(t_rcomm) + t1(t_ncomm)
+      t1(t_last+1) = t1(t_total) - t1(t_last+2)
+
+      call MPI_Reduce(t1, tsum,  t_last+2, dp_type, MPI_SUM, 
+     >                0, MPI_COMM_WORLD, ierr)
+      call MPI_Reduce(t1, tming, t_last+2, dp_type, MPI_MIN, 
+     >                0, MPI_COMM_WORLD, ierr)
+      call MPI_Reduce(t1, tmaxg, t_last+2, dp_type, MPI_MAX, 
+     >                0, MPI_COMM_WORLD, ierr)
+
+      if (me .eq. 0) then
+         write(*, 800) nprocs
+         do i = 1, t_last+2
+            tsum(i) = tsum(i) / nprocs
+            write(*, 810) i, t_recs(i), tming(i), tmaxg(i), tsum(i)
+         end do
+      endif
+ 800  format(' nprocs =', i6, 11x, 'minimum', 5x, 'maximum', 
+     >       5x, 'average')
+ 810  format(' timer ', i2, '(', A8, ') :', 3(2x,f10.4))
+
+ 999  continue
+      call mpi_finalize(ierr)
+
+
+
+      end                              ! end main
+
+
+
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      subroutine initialize_mpi
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'mpinpb.h'
+      include 'timing.h'
+
+      integer   ierr, fstatus
+
+
+      call mpi_init( ierr )
+      call mpi_comm_rank( mpi_comm_world, me, ierr )
+      call mpi_comm_size( mpi_comm_world, nprocs, ierr )
+      root = 0
+
+      if (me .eq. root) then
+         open (unit=2,file='timer.flag',status='old',iostat=fstatus)
+         timeron = .false.
+         if (fstatus .eq. 0) then
+            timeron = .true.
+            close(2)
+         endif
+      endif
+
+      call mpi_bcast(timeron, 1, MPI_LOGICAL, 0, mpi_comm_world, ierr)
+
+      return
+      end
+
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      subroutine setup_proc_info( num_procs, 
+     >                            num_proc_rows, 
+     >                            num_proc_cols )
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'mpinpb.h'
+
+      common / partit_size  /  naa, nzz, 
+     >                         npcols, nprows,
+     >                         proc_col, proc_row,
+     >                         firstrow, 
+     >                         lastrow, 
+     >                         firstcol, 
+     >                         lastcol,
+     >                         exch_proc,
+     >                         exch_recv_length,
+     >                         send_start,
+     >                         send_len
+      integer                  naa, nzz, 
+     >                         npcols, nprows,
+     >                         proc_col, proc_row,
+     >                         firstrow, 
+     >                         lastrow, 
+     >                         firstcol, 
+     >                         lastcol,
+     >                         exch_proc,
+     >                         exch_recv_length,
+     >                         send_start,
+     >                         send_len
+
+      integer   num_procs, num_proc_cols, num_proc_rows
+      integer   i, ierr
+      integer   log2nprocs
+
+c---------------------------------------------------------------------
+c  num_procs must be a power of 2, and num_procs=num_proc_cols*num_proc_rows
+c  When num_procs is not square, then num_proc_cols = 2*num_proc_rows
+c---------------------------------------------------------------------
+c  First, nprocs must match the compiled num_procs; power-of-two checks follow.
+c---------------------------------------------------------------------
+      if( nprocs .ne. num_procs )then
+          if( me .eq. root ) write( *,9000 ) nprocs, num_procs
+ 9000     format(      /,'Error: ',/,'num of procs allocated   (', 
+     >                 i4, ' )',
+     >                 /,'is not equal to',/,
+     >                 'compiled number of procs (',
+     >                 i4, ' )',/   )
+          call mpi_finalize(ierr)
+          stop
+      endif
+
+
+      i = num_proc_cols
+ 100  continue
+          if( i .ne. 1 .and. i/2*2 .ne. i )then
+              if ( me .eq. root ) then  
+                 write( *,* ) 'Error: num_proc_cols is ',
+     >                         num_proc_cols,
+     >                        ' which is not a power of two'
+              endif
+              call mpi_finalize(ierr)
+              stop
+          endif
+          i = i / 2
+          if( i .ne. 0 )then
+              goto 100
+          endif
+      
+      i = num_proc_rows
+ 200  continue
+          if( i .ne. 1 .and. i/2*2 .ne. i )then
+              if ( me .eq. root ) then 
+                 write( *,* ) 'Error: num_proc_rows is ',
+     >                         num_proc_rows,
+     >                        ' which is not a power of two'
+              endif
+              call mpi_finalize(ierr)
+              stop
+          endif
+          i = i / 2
+          if( i .ne. 0 )then
+              goto 200
+          endif
+      
+      log2nprocs = 0
+      i = nprocs
+ 300  continue
+          if( i .ne. 1 .and. i/2*2 .ne. i )then
+              write( *,* ) 'Error: nprocs is ',
+     >                      nprocs,
+     >                      ' which is not a power of two'
+              call mpi_finalize(ierr)
+              stop
+          endif
+          i = i / 2
+          if( i .ne. 0 )then
+              log2nprocs = log2nprocs + 1
+              goto 300
+          endif
+
+CC       write( *,* ) 'nprocs, log2nprocs: ',nprocs,log2nprocs
+
+      
+      npcols = num_proc_cols
+      nprows = num_proc_rows
+
+
+      return
+      end
+
+
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      subroutine setup_submatrix_info( l2npcols,
+     >                                 reduce_exch_proc,
+     >                                 reduce_send_starts,
+     >                                 reduce_send_lengths,
+     >                                 reduce_recv_starts,
+     >                                 reduce_recv_lengths )
+     >                                 
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'mpinpb.h'
+
+      integer      col_size, row_size
+
+      common / partit_size  /  naa, nzz, 
+     >                         npcols, nprows,
+     >                         proc_col, proc_row,
+     >                         firstrow, 
+     >                         lastrow, 
+     >                         firstcol, 
+     >                         lastcol,
+     >                         exch_proc,
+     >                         exch_recv_length,
+     >                         send_start,
+     >                         send_len
+      integer                  naa, nzz, 
+     >                         npcols, nprows,
+     >                         proc_col, proc_row,
+     >                         firstrow, 
+     >                         lastrow, 
+     >                         firstcol, 
+     >                         lastcol,
+     >                         exch_proc,
+     >                         exch_recv_length,
+     >                         send_start,
+     >                         send_len
+
+      integer   reduce_exch_proc(*)
+      integer   reduce_send_starts(*)
+      integer   reduce_send_lengths(*)
+      integer   reduce_recv_starts(*)
+      integer   reduce_recv_lengths(*)
+
+      integer   i, j
+      integer   div_factor
+      integer   l2npcols
+
+
+      proc_row = me / npcols
+      proc_col = me - proc_row*npcols
+
+
+
+c---------------------------------------------------------------------
+c  If naa evenly divisible by npcols, then it is evenly divisible 
+c  by nprows 
+c---------------------------------------------------------------------
+
+      if( naa/npcols*npcols .eq. naa )then
+          col_size = naa/npcols
+          firstcol = proc_col*col_size + 1
+          lastcol  = firstcol - 1 + col_size
+          row_size = naa/nprows
+          firstrow = proc_row*row_size + 1
+          lastrow  = firstrow - 1 + row_size
+c---------------------------------------------------------------------
+c  If naa not evenly divisible by npcols, then first subdivide for nprows
+c  and then, if npcols not equal to nprows (i.e., not a sq number of procs), 
+c  get col subdivisions by dividing by 2 each row subdivision.
+c---------------------------------------------------------------------
+      else
+          if( proc_row .lt. naa - naa/nprows*nprows)then
+              row_size = naa/nprows+ 1
+              firstrow = proc_row*row_size + 1
+              lastrow  = firstrow - 1 + row_size
+          else
+              row_size = naa/nprows
+              firstrow = (naa - naa/nprows*nprows)*(row_size+1)
+     >                 + (proc_row-(naa-naa/nprows*nprows))
+     >                     *row_size + 1
+              lastrow  = firstrow - 1 + row_size
+          endif
+          if( npcols .eq. nprows )then
+              if( proc_col .lt. naa - naa/npcols*npcols )then
+                  col_size = naa/npcols+ 1
+                  firstcol = proc_col*col_size + 1
+                  lastcol  = firstcol - 1 + col_size
+              else
+                  col_size = naa/npcols
+                  firstcol = (naa - naa/npcols*npcols)*(col_size+1)
+     >                     + (proc_col-(naa-naa/npcols*npcols))
+     >                         *col_size + 1
+                  lastcol  = firstcol - 1 + col_size
+              endif
+          else
+              if( (proc_col/2) .lt. 
+     >                           naa - naa/(npcols/2)*(npcols/2) )then
+                  col_size = naa/(npcols/2) + 1
+                  firstcol = (proc_col/2)*col_size + 1
+                  lastcol  = firstcol - 1 + col_size
+              else
+                  col_size = naa/(npcols/2)
+                  firstcol = (naa - naa/(npcols/2)*(npcols/2))
+     >                                                 *(col_size+1)
+     >               + ((proc_col/2)-(naa-naa/(npcols/2)*(npcols/2)))
+     >                         *col_size + 1
+                  lastcol  = firstcol - 1 + col_size
+              endif
+CC               write( *,* ) col_size,firstcol,lastcol
+              if( mod( me,2 ) .eq. 0 )then
+                  lastcol  = firstcol - 1 + (col_size-1)/2 + 1
+              else
+                  firstcol = firstcol + (col_size-1)/2 + 1
+                  lastcol  = firstcol - 1 + col_size/2
+CC                   write( *,* ) firstcol,lastcol
+              endif
+          endif
+      endif
+
+
+
+      if( npcols .eq. nprows )then
+          send_start = 1
+          send_len   = lastrow - firstrow + 1
+      else
+          if( mod( me,2 ) .eq. 0 )then
+              send_start = 1
+              send_len   = (1 + lastrow-firstrow+1)/2
+          else
+              send_start = (1 + lastrow-firstrow+1)/2 + 1
+              send_len   = (lastrow-firstrow+1)/2
+          endif
+      endif
+          
+
+
+
+c---------------------------------------------------------------------
+c  Transpose exchange processor
+c---------------------------------------------------------------------
+
+      if( npcols .eq. nprows )then
+          exch_proc = mod( me,nprows )*nprows + me/nprows
+      else
+          exch_proc = 2*(mod( me/2,nprows )*nprows + me/2/nprows)
+     >                 + mod( me,2 )
+      endif
+
+
+
+      i = npcols / 2
+      l2npcols = 0
+      do while( i .gt. 0 )
+         l2npcols = l2npcols + 1
+         i = i / 2
+      enddo
+
+
+c---------------------------------------------------------------------
+c  Set up the reduce phase schedules...
+c---------------------------------------------------------------------
+
+      div_factor = npcols
+      do i = 1, l2npcols
+
+         j = mod( proc_col+div_factor/2, div_factor )
+     >     + proc_col / div_factor * div_factor
+         reduce_exch_proc(i) = proc_row*npcols + j
+
+         div_factor = div_factor / 2
+
+      enddo
+
+
+      do i = l2npcols, 1, -1
+
+            if( nprows .eq. npcols )then
+               reduce_send_starts(i)  = send_start
+               reduce_send_lengths(i) = send_len
+               reduce_recv_lengths(i) = lastrow - firstrow + 1
+            else
+               reduce_recv_lengths(i) = send_len
+               if( i .eq. l2npcols )then
+                  reduce_send_lengths(i) = lastrow-firstrow+1 - send_len
+                  if( me/2*2 .eq. me )then
+                     reduce_send_starts(i) = send_start + send_len
+                  else
+                     reduce_send_starts(i) = 1
+                  endif
+               else
+                  reduce_send_lengths(i) = send_len
+                  reduce_send_starts(i)  = send_start
+               endif
+            endif
+            reduce_recv_starts(i) = send_start
+
+      enddo
+
+
+      exch_recv_length = lastcol - firstcol + 1
+
+
+      return
+      end
+
+
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      subroutine conj_grad ( colidx,
+     >                       rowstr,
+     >                       x,
+     >                       z,
+     >                       a,
+     >                       p,
+     >                       q,
+     >                       r,
+     >                       w,
+     >                       rnorm, 
+     >                       l2npcols,
+     >                       reduce_exch_proc,
+     >                       reduce_send_starts,
+     >                       reduce_send_lengths,
+     >                       reduce_recv_starts,
+     >                       reduce_recv_lengths )
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c  Floating point arrays here are named as in NPB1 spec discussion of
+c  CG algorithm
+c---------------------------------------------------------------------
+ 
+      implicit none
+
+      include 'mpinpb.h'
+      include 'timing.h'
+
+      integer status(MPI_STATUS_SIZE ), request
+
+
+      common / partit_size  /  naa, nzz, 
+     >                         npcols, nprows,
+     >                         proc_col, proc_row,
+     >                         firstrow, 
+     >                         lastrow, 
+     >                         firstcol, 
+     >                         lastcol,
+     >                         exch_proc,
+     >                         exch_recv_length,
+     >                         send_start,
+     >                         send_len
+      integer                  naa, nzz, 
+     >                         npcols, nprows,
+     >                         proc_col, proc_row,
+     >                         firstrow, 
+     >                         lastrow, 
+     >                         firstcol, 
+     >                         lastcol,
+     >                         exch_proc,
+     >                         exch_recv_length,
+     >                         send_start,
+     >                         send_len
+
+
+
+      double precision   x(*),
+     >                   z(*),
+     >                   a(nzz)
+      integer            colidx(nzz), rowstr(naa+1)
+
+      double precision   p(*),
+     >                   q(*),
+     >                   r(*),               
+     >                   w(*)                ! used as work temporary
+
+      integer   l2npcols
+      integer   reduce_exch_proc(l2npcols)
+      integer   reduce_send_starts(l2npcols)
+      integer   reduce_send_lengths(l2npcols)
+      integer   reduce_recv_starts(l2npcols)
+      integer   reduce_recv_lengths(l2npcols)
+
+      integer   i, j, k, ierr
+      integer   cgit, cgitmax
+
+      double precision   d, sum, rho, rho0, alpha, beta, rnorm
+
+      external         timer_read
+      double precision timer_read
+
+      data      cgitmax / 25 /
+
+
+      if (timeron) call timer_start(t_conjg)
+c---------------------------------------------------------------------
+c  Initialize the CG algorithm:
+c---------------------------------------------------------------------
+      do j=1,naa/nprows+1
+         q(j) = 0.0d0
+         z(j) = 0.0d0
+         r(j) = x(j)
+         p(j) = r(j)
+         w(j) = 0.0d0                 
+      enddo
+
+
+c---------------------------------------------------------------------
+c  rho = r.r
+c  Now, obtain the norm of r: First, sum squares of r elements locally...
+c---------------------------------------------------------------------
+      sum = 0.0d0
+      do j=1, lastcol-firstcol+1
+         sum = sum + r(j)*r(j)
+      enddo
+
+c---------------------------------------------------------------------
+c  Exchange and sum with procs identified in reduce_exch_proc
+c  (This is equivalent to mpi_allreduce.)
+c  Sum the partial sums of rho, leaving rho on all processors
+c---------------------------------------------------------------------
+      do i = 1, l2npcols
+         if (timeron) call timer_start(t_rcomm)
+         call mpi_irecv( rho,
+     >                   1,
+     >                   dp_type,
+     >                   reduce_exch_proc(i),
+     >                   i,
+     >                   mpi_comm_world,
+     >                   request,
+     >                   ierr )
+         call mpi_send(  sum,
+     >                   1,
+     >                   dp_type,
+     >                   reduce_exch_proc(i),
+     >                   i,
+     >                   mpi_comm_world,
+     >                   ierr )
+         call mpi_wait( request, status, ierr )
+         if (timeron) call timer_stop(t_rcomm)
+
+         sum = sum + rho
+      enddo
+      rho = sum
+
+
+
+c---------------------------------------------------------------------
+c---->
+c  The conj grad iteration loop
+c---->
+c---------------------------------------------------------------------
+      do cgit = 1, cgitmax
+
+
+c---------------------------------------------------------------------
+c  q = A.p
+c  The partition submatrix-vector multiply: use workspace w
+c---------------------------------------------------------------------
+         do j=1,lastrow-firstrow+1
+            sum = 0.d0
+            do k=rowstr(j),rowstr(j+1)-1
+               sum = sum + a(k)*p(colidx(k))
+            enddo
+            w(j) = sum
+         enddo
+
+c---------------------------------------------------------------------
+c  Sum the partition submatrix-vec A.p's across rows
+c  Exchange and sum piece of w with procs identified in reduce_exch_proc
+c---------------------------------------------------------------------
+         do i = l2npcols, 1, -1
+            if (timeron) call timer_start(t_rcomm)
+            call mpi_irecv( q(reduce_recv_starts(i)),
+     >                      reduce_recv_lengths(i),
+     >                      dp_type,
+     >                      reduce_exch_proc(i),
+     >                      i,
+     >                      mpi_comm_world,
+     >                      request,
+     >                      ierr )
+            call mpi_send(  w(reduce_send_starts(i)),
+     >                      reduce_send_lengths(i),
+     >                      dp_type,
+     >                      reduce_exch_proc(i),
+     >                      i,
+     >                      mpi_comm_world,
+     >                      ierr )
+            call mpi_wait( request, status, ierr )
+            if (timeron) call timer_stop(t_rcomm)
+            do j=send_start,send_start + reduce_recv_lengths(i) - 1
+               w(j) = w(j) + q(j)
+            enddo
+         enddo
+      
+
+c---------------------------------------------------------------------
+c  Exchange piece of q with transpose processor:
+c---------------------------------------------------------------------
+         if( l2npcols .ne. 0 )then
+            if (timeron) call timer_start(t_rcomm)
+            call mpi_irecv( q,               
+     >                      exch_recv_length,
+     >                      dp_type,
+     >                      exch_proc,
+     >                      1,
+     >                      mpi_comm_world,
+     >                      request,
+     >                      ierr )
+
+            call mpi_send(  w(send_start),   
+     >                      send_len,
+     >                      dp_type,
+     >                      exch_proc,
+     >                      1,
+     >                      mpi_comm_world,
+     >                      ierr )
+            call mpi_wait( request, status, ierr )
+            if (timeron) call timer_stop(t_rcomm)
+         else
+            do j=1,exch_recv_length
+               q(j) = w(j)
+            enddo
+         endif
+
+
+c---------------------------------------------------------------------
+c  Clear w for reuse...
+c---------------------------------------------------------------------
+         do j=1, max( lastrow-firstrow+1, lastcol-firstcol+1 )
+            w(j) = 0.0d0
+         enddo
+         
+
+c---------------------------------------------------------------------
+c  Obtain p.q
+c---------------------------------------------------------------------
+         sum = 0.0d0
+         do j=1, lastcol-firstcol+1
+            sum = sum + p(j)*q(j)
+         enddo
+
+c---------------------------------------------------------------------
+c  Obtain d with a sum-reduce
+c---------------------------------------------------------------------
+         do i = 1, l2npcols
+            if (timeron) call timer_start(t_rcomm)
+            call mpi_irecv( d,
+     >                      1,
+     >                      dp_type,
+     >                      reduce_exch_proc(i),
+     >                      i,
+     >                      mpi_comm_world,
+     >                      request,
+     >                      ierr )
+            call mpi_send(  sum,
+     >                      1,
+     >                      dp_type,
+     >                      reduce_exch_proc(i),
+     >                      i,
+     >                      mpi_comm_world,
+     >                      ierr )
+
+            call mpi_wait( request, status, ierr )
+            if (timeron) call timer_stop(t_rcomm)
+
+            sum = sum + d
+         enddo
+         d = sum
+
+
+c---------------------------------------------------------------------
+c  Obtain alpha = rho / (p.q)
+c---------------------------------------------------------------------
+         alpha = rho / d
+
+c---------------------------------------------------------------------
+c  Save a temporary of rho
+c---------------------------------------------------------------------
+         rho0 = rho
+
+c---------------------------------------------------------------------
+c  Obtain z = z + alpha*p
+c  and    r = r - alpha*q
+c---------------------------------------------------------------------
+         do j=1, lastcol-firstcol+1
+            z(j) = z(j) + alpha*p(j)
+            r(j) = r(j) - alpha*q(j)
+         enddo
+            
+c---------------------------------------------------------------------
+c  rho = r.r
+c  Now, obtain the norm of r: First, sum squares of r elements locally...
+c---------------------------------------------------------------------
+         sum = 0.0d0
+         do j=1, lastcol-firstcol+1
+            sum = sum + r(j)*r(j)
+         enddo
+
+c---------------------------------------------------------------------
+c  Obtain rho with a sum-reduce
+c---------------------------------------------------------------------
+         do i = 1, l2npcols
+            if (timeron) call timer_start(t_rcomm)
+            call mpi_irecv( rho,
+     >                      1,
+     >                      dp_type,
+     >                      reduce_exch_proc(i),
+     >                      i,
+     >                      mpi_comm_world,
+     >                      request,
+     >                      ierr )
+            call mpi_send(  sum,
+     >                      1,
+     >                      dp_type,
+     >                      reduce_exch_proc(i),
+     >                      i,
+     >                      mpi_comm_world,
+     >                      ierr )
+            call mpi_wait( request, status, ierr )
+            if (timeron) call timer_stop(t_rcomm)
+
+            sum = sum + rho
+         enddo
+         rho = sum
+
+c---------------------------------------------------------------------
+c  Obtain beta:
+c---------------------------------------------------------------------
+         beta = rho / rho0
+
+c---------------------------------------------------------------------
+c  p = r + beta*p
+c---------------------------------------------------------------------
+         do j=1, lastcol-firstcol+1
+            p(j) = r(j) + beta*p(j)
+         enddo
+
+
+
+      enddo                             ! end of do cgit=1,cgitmax
+
+
+
+c---------------------------------------------------------------------
+c  Compute residual norm explicitly:  ||r|| = ||x - A.z||
+c  First, form A.z
+c  The partition submatrix-vector multiply
+c---------------------------------------------------------------------
+      do j=1,lastrow-firstrow+1
+         sum = 0.d0
+         do k=rowstr(j),rowstr(j+1)-1
+            sum = sum + a(k)*z(colidx(k))
+         enddo
+         w(j) = sum
+      enddo
+
+
+
+c---------------------------------------------------------------------
+c  Sum the partition submatrix-vec A.z's across rows
+c---------------------------------------------------------------------
+      do i = l2npcols, 1, -1
+         if (timeron) call timer_start(t_rcomm)
+         call mpi_irecv( r(reduce_recv_starts(i)),
+     >                   reduce_recv_lengths(i),
+     >                   dp_type,
+     >                   reduce_exch_proc(i),
+     >                   i,
+     >                   mpi_comm_world,
+     >                   request,
+     >                   ierr )
+         call mpi_send(  w(reduce_send_starts(i)),
+     >                   reduce_send_lengths(i),
+     >                   dp_type,
+     >                   reduce_exch_proc(i),
+     >                   i,
+     >                   mpi_comm_world,
+     >                   ierr )
+         call mpi_wait( request, status, ierr )
+         if (timeron) call timer_stop(t_rcomm)
+
+         do j=send_start,send_start + reduce_recv_lengths(i) - 1
+            w(j) = w(j) + r(j)
+         enddo
+      enddo
+      
+
+c---------------------------------------------------------------------
+c  Exchange piece of r with transpose processor:
+c---------------------------------------------------------------------
+      if( l2npcols .ne. 0 )then
+         if (timeron) call timer_start(t_rcomm)
+         call mpi_irecv( r,               
+     >                   exch_recv_length,
+     >                   dp_type,
+     >                   exch_proc,
+     >                   1,
+     >                   mpi_comm_world,
+     >                   request,
+     >                   ierr )
+   
+         call mpi_send(  w(send_start),   
+     >                   send_len,
+     >                   dp_type,
+     >                   exch_proc,
+     >                   1,
+     >                   mpi_comm_world,
+     >                   ierr )
+         call mpi_wait( request, status, ierr )
+         if (timeron) call timer_stop(t_rcomm)
+      else
+         do j=1,exch_recv_length
+            r(j) = w(j)
+         enddo
+      endif
+
+
+c---------------------------------------------------------------------
+c  At this point, r contains A.z
+c---------------------------------------------------------------------
+      sum = 0.0d0
+      do j=1, lastcol-firstcol+1
+         d   = x(j) - r(j)
+         sum = sum + d*d
+      enddo
+         
+c---------------------------------------------------------------------
+c  Obtain d with a sum-reduce
+c---------------------------------------------------------------------
+      do i = 1, l2npcols
+         if (timeron) call timer_start(t_rcomm)
+         call mpi_irecv( d,
+     >                   1,
+     >                   dp_type,
+     >                   reduce_exch_proc(i),
+     >                   i,
+     >                   mpi_comm_world,
+     >                   request,
+     >                   ierr )
+         call mpi_send(  sum,
+     >                   1,
+     >                   dp_type,
+     >                   reduce_exch_proc(i),
+     >                   i,
+     >                   mpi_comm_world,
+     >                   ierr )
+         call mpi_wait( request, status, ierr )
+         if (timeron) call timer_stop(t_rcomm)
+
+         sum = sum + d
+      enddo
+      d = sum
+
+
+      if( me .eq. root ) rnorm = sqrt( d )
+
+      if (timeron) call timer_stop(t_conjg)
+
+
+      return
+      end                               ! end of routine conj_grad
+
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      subroutine makea( n, nz, a, colidx, rowstr, nonzer,
+     >                  firstrow, lastrow, firstcol, lastcol,
+     >                  rcond, arow, acol, aelt, v, iv, shift )
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit            none
+      integer             n, nz, nonzer
+      integer             firstrow, lastrow, firstcol, lastcol
+      integer             colidx(nz), rowstr(n+1)
+      integer             iv(2*n+1), arow(nz), acol(nz)
+      double precision    v(n+1), aelt(nz)
+      double precision    rcond, a(nz), shift
+
+c---------------------------------------------------------------------
+c       generate the test problem for benchmark 6
+c       makea generates a sparse matrix with a
+c       prescribed sparsity distribution
+c
+c       parameter    type        usage
+c
+c       input
+c
+c       n            i           number of cols/rows of matrix
+c       nz           i           nonzeros as declared array size
+c       rcond        r*8         condition number
+c       shift        r*8         main diagonal shift
+c
+c       output
+c
+c       a            r*8         array for nonzeros
+c       colidx       i           col indices
+c       rowstr       i           row pointers
+c
+c       workspace
+c
+c       iv, arow, acol i
+c       v, aelt        r*8
+c---------------------------------------------------------------------
+
+      integer i, nnza, iouter, ivelt, ivelt1, irow, nzv, jcol
+
+c---------------------------------------------------------------------
+c      nonzer is approximately  (int(sqrt(nnza /n)));
+c---------------------------------------------------------------------
+
+      double precision  size, ratio, scale
+      external          sparse, sprnvc, vecset
+
+      size = 1.0D0
+      ratio = rcond ** (1.0D0 / dfloat(n))
+      nnza = 0
+
+c---------------------------------------------------------------------
+c  Initialize iv(n+1 .. 2n) to zero.
+c  Used by sprnvc to mark nonzero positions
+c---------------------------------------------------------------------
+
+      do i = 1, n
+           iv(n+i) = 0
+      enddo
+      do iouter = 1, n
+         nzv = nonzer
+         call sprnvc( n, nzv, v, colidx, iv(1), iv(n+1) )
+         call vecset( n, v, colidx, nzv, iouter, .5D0 )
+         do ivelt = 1, nzv
+              jcol = colidx(ivelt)
+              if (jcol.ge.firstcol .and. jcol.le.lastcol) then
+                 scale = size * v(ivelt)
+                 do ivelt1 = 1, nzv
+                    irow = colidx(ivelt1)
+                    if (irow.ge.firstrow .and. irow.le.lastrow) then
+                       nnza = nnza + 1
+                       if (nnza .gt. nz) goto 9999
+                       acol(nnza) = jcol
+                       arow(nnza) = irow
+                       aelt(nnza) = v(ivelt1) * scale
+                    endif
+                 enddo
+              endif
+         enddo
+         size = size * ratio
+      enddo
+
+
+c---------------------------------------------------------------------
+c       ... add the identity * (rcond - shift) to the generated matrix;
+c           with shift = 0 the smallest eigenvalue is bounded below by rcond
+c---------------------------------------------------------------------
+        do i = firstrow, lastrow
+           if (i.ge.firstcol .and. i.le.lastcol) then
+              iouter = n + i
+              nnza = nnza + 1
+              if (nnza .gt. nz) goto 9999
+              acol(nnza) = i
+              arow(nnza) = i
+              aelt(nnza) = rcond - shift
+           endif
+        enddo
+
+
+c---------------------------------------------------------------------
+c       ... make the sparse matrix from list of elements with duplicates
+c           (v and iv are used as  workspace)
+c---------------------------------------------------------------------
+      call sparse( a, colidx, rowstr, n, arow, acol, aelt,
+     >             firstrow, lastrow,
+     >             v, iv(1), iv(n+1), nnza )
+      return
+
+ 9999 continue
+      write(*,*) 'Space for matrix elements exceeded in makea'
+      write(*,*) 'nnza, nzmax = ',nnza, nz
+      write(*,*) ' iouter = ',iouter
+
+      stop
+      end
+c-------end   of makea------------------------------
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      subroutine sparse( a, colidx, rowstr, n, arow, acol, aelt,
+     >                   firstrow, lastrow,
+     >                   x, mark, nzloc, nnza )
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit           none
+      integer            colidx(*), rowstr(*)
+      integer            firstrow, lastrow
+      integer            n, arow(*), acol(*), nnza
+      double precision   a(*), aelt(*)
+
+c---------------------------------------------------------------------
+c       rows range from firstrow to lastrow
+c       the rowstr pointers are defined for nrows = lastrow-firstrow+1 values
+c---------------------------------------------------------------------
+      integer            nzloc(n), nrows
+      double precision   x(n)
+      logical            mark(n)
+
+c---------------------------------------------------
+c       generate a sparse matrix from a list of
+c       [col, row, element] tri
+c---------------------------------------------------
+
+      integer            i, j, jajp1, nza, k, nzrow
+      double precision   xi
+
+c---------------------------------------------------------------------
+c    how many rows of result
+c---------------------------------------------------------------------
+      nrows = lastrow - firstrow + 1
+
+c---------------------------------------------------------------------
+c     ...count the number of triples in each row
+c---------------------------------------------------------------------
+      do j = 1, n
+         rowstr(j) = 0
+         mark(j) = .false.
+      enddo
+      rowstr(n+1) = 0
+
+      do nza = 1, nnza
+         j = (arow(nza) - firstrow + 1) + 1
+         rowstr(j) = rowstr(j) + 1
+      enddo
+
+      rowstr(1) = 1
+      do j = 2, nrows+1
+         rowstr(j) = rowstr(j) + rowstr(j-1)
+      enddo
+
+
+c---------------------------------------------------------------------
+c     ... rowstr(j) now is the location of the first nonzero
+c           of row j of a
+c---------------------------------------------------------------------
+
+
+c---------------------------------------------------------------------
+c     ... do a bucket sort of the triples on the row index
+c---------------------------------------------------------------------
+      do nza = 1, nnza
+         j = arow(nza) - firstrow + 1
+         k = rowstr(j)
+         a(k) = aelt(nza)
+         colidx(k) = acol(nza)
+         rowstr(j) = rowstr(j) + 1
+      enddo
+
+
+c---------------------------------------------------------------------
+c       ... rowstr(j) now points to the first element of row j+1
+c---------------------------------------------------------------------
+      do j = nrows, 1, -1
+          rowstr(j+1) = rowstr(j)
+      enddo
+      rowstr(1) = 1
+
+
+c---------------------------------------------------------------------
+c       ... generate the actual output rows by adding elements
+c---------------------------------------------------------------------
+      nza = 0
+      do i = 1, n
+          x(i)    = 0.0
+          mark(i) = .false.
+      enddo
+
+      jajp1 = rowstr(1)
+      do j = 1, nrows
+         nzrow = 0
+
+c---------------------------------------------------------------------
+c          ...loop over the jth row of a
+c---------------------------------------------------------------------
+         do k = jajp1 , rowstr(j+1)-1
+            i = colidx(k)
+            x(i) = x(i) + a(k)
+            if ( (.not. mark(i)) .and. (x(i) .ne. 0.D0)) then
+             mark(i) = .true.
+             nzrow = nzrow + 1
+             nzloc(nzrow) = i
+            endif
+         enddo
+
+c---------------------------------------------------------------------
+c          ... extract the nonzeros of this row
+c---------------------------------------------------------------------
+         do k = 1, nzrow
+            i = nzloc(k)
+            mark(i) = .false.
+            xi = x(i)
+            x(i) = 0.D0
+            if (xi .ne. 0.D0) then
+             nza = nza + 1
+             a(nza) = xi
+             colidx(nza) = i
+            endif
+         enddo
+         jajp1 = rowstr(j+1)
+         rowstr(j+1) = nza + rowstr(1)
+      enddo
+CC       write (*, 11000) nza
+      return
+11000   format ( //,'final nonzero count in sparse ',
+     1            /,'number of nonzeros       = ', i16 )
+      end
+c-------end   of sparse-----------------------------
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      subroutine sprnvc( n, nz, v, iv, nzloc, mark )
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit           none
+      double precision   v(*)
+      integer            n, nz, iv(*), nzloc(n), nn1
+      integer mark(n)
+      common /urando/    amult, tran
+      double precision   amult, tran
+
+
+c---------------------------------------------------------------------
+c       generate a sparse n-vector (v, iv)
+c       having nzv nonzeros
+c
+c       mark(i) is set to 1 if position i is nonzero.
+c       mark is all zero on entry and is reset to all zero before exit
+c       this corrects a performance bug found by John G. Lewis, caused by
+c       reinitialization of mark on every one of the n calls to sprnvc
+c---------------------------------------------------------------------
+
+        integer            nzrow, nzv, ii, i, icnvrt
+
+        external           randlc, icnvrt
+        double precision   randlc, vecelt, vecloc
+
+
+        nzv = 0
+        nzrow = 0
+        nn1 = 1
+ 50     continue
+          nn1 = 2 * nn1
+          if (nn1 .lt. n) goto 50
+
+c---------------------------------------------------------------------
+c    nn1 is the smallest power of two not less than n
+c---------------------------------------------------------------------
+
+100     continue
+        if (nzv .ge. nz) goto 110
+         vecelt = randlc( tran, amult )
+
+c---------------------------------------------------------------------
+c   generate an integer between 1 and n in a portable manner
+c---------------------------------------------------------------------
+         vecloc = randlc(tran, amult)
+         i = icnvrt(vecloc, nn1) + 1
+         if (i .gt. n) goto 100
+
+c---------------------------------------------------------------------
+c  was this integer generated already?
+c---------------------------------------------------------------------
+         if (mark(i) .eq. 0) then
+            mark(i) = 1
+            nzrow = nzrow + 1
+            nzloc(nzrow) = i
+            nzv = nzv + 1
+            v(nzv) = vecelt
+            iv(nzv) = i
+         endif
+         goto 100
+110      continue
+      do ii = 1, nzrow
+         i = nzloc(ii)
+         mark(i) = 0
+      enddo
+      return
+      end
+c-------end   of sprnvc-----------------------------
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      function icnvrt(x, ipwr2)
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit           none
+      double precision   x
+      integer            ipwr2, icnvrt
+
+c---------------------------------------------------------------------
+c    scale a double precision number x in (0,1) by a power of 2 and chop it
+c---------------------------------------------------------------------
+      icnvrt = int(ipwr2 * x)
+
+      return
+      end
+c-------end   of icnvrt-----------------------------
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      subroutine vecset(n, v, iv, nzv, i, val)
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit           none
+      integer            n, iv(*), nzv, i, k
+      double precision   v(*), val
+
+c---------------------------------------------------------------------
+c       set ith element of sparse vector (v, iv) with
+c       nzv nonzeros to val
+c---------------------------------------------------------------------
+
+      logical set
+
+      set = .false.
+      do k = 1, nzv
+         if (iv(k) .eq. i) then
+            v(k) = val
+            set  = .true.
+         endif
+      enddo
+      if (.not. set) then
+         nzv     = nzv + 1
+         v(nzv)  = val
+         iv(nzv) = i
+      endif
+      return
+      end
+c-------end   of vecset-----------------------------
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/CG/mpinpb.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/CG/mpinpb.h
new file mode 100644
index 0000000..1f0368c
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/CG/mpinpb.h
@@ -0,0 +1,9 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      include           'mpif.h'
+
+      integer           me, nprocs, root, dp_type
+      common /mpistuff/ me, nprocs, root, dp_type
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/CG/timing.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/CG/timing.h
new file mode 100644
index 0000000..2000af1
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/CG/timing.h
@@ -0,0 +1,5 @@
+      integer t_total, t_conjg, t_rcomm, t_ncomm, t_last
+      parameter (t_total=1, t_conjg=2, t_rcomm=3, t_ncomm=4, t_last=4)
+
+      logical timeron
+      common /timers/ timeron
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/DT/DGraph.c b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/DT/DGraph.c
new file mode 100644
index 0000000..5d5839d
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/DT/DGraph.c
@@ -0,0 +1,184 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include "DGraph.h"
+
+DGArc *newArc(DGNode *tl,DGNode *hd){
+  DGArc *ar=(DGArc *)malloc(sizeof(DGArc));
+  ar->tail=tl;
+  ar->head=hd;
+  return ar;
+}
+void arcShow(DGArc *ar){
+  DGNode *tl=(DGNode *)ar->tail,
+         *hd=(DGNode *)ar->head;
+  fprintf(stderr,"%d. |%s ->%s\n",ar->id,tl->name,hd->name);
+}
+
+DGNode *newNode(char *nm){
+  DGNode *nd=(DGNode *)malloc(sizeof(DGNode));
+  nd->attribute=0;
+  nd->color=0;
+  nd->inDegree=0;
+  nd->outDegree=0;
+  nd->maxInDegree=SMALL_BLOCK_SIZE;
+  nd->maxOutDegree=SMALL_BLOCK_SIZE;
+  nd->inArc=(DGArc **)malloc(nd->maxInDegree*sizeof(DGArc*));
+  nd->outArc=(DGArc **)malloc(nd->maxOutDegree*sizeof(DGArc*));
+  nd->name=strdup(nm);
+  nd->feat=NULL;
+  return nd;
+}
+void nodeShow(DGNode* nd){
+  fprintf( stderr,"%3d.%s: (%d,%d)\n",
+	           nd->id,nd->name,nd->inDegree,nd->outDegree);
+/*
+  if(nd->verified==1) fprintf(stderr,"%ld.%s\t: usable.",nd->id,nd->name);
+  else if(nd->verified==0)  fprintf(stderr,"%ld.%s\t: unusable.",nd->id,nd->name);
+  else  fprintf(stderr,"%ld.%s\t: notverified.",nd->id,nd->name);   
+*/
+}
+
+DGraph* newDGraph(char* nm){
+  DGraph *dg=(DGraph *)malloc(sizeof(DGraph));
+  dg->numNodes=0;
+  dg->numArcs=0;
+  dg->maxNodes=BLOCK_SIZE;
+  dg->maxArcs=BLOCK_SIZE;
+  dg->node=(DGNode **)malloc(dg->maxNodes*sizeof(DGNode*));
+  dg->arc=(DGArc **)malloc(dg->maxArcs*sizeof(DGArc*));
+  dg->name=strdup(nm);
+  return dg;
+}
+int AttachNode(DGraph* dg, DGNode* nd) {
+  int i=0,j,len=0;
+  DGNode **nds =NULL, *tmpnd=NULL;
+  DGArc **ar=NULL;
+
+	if (dg->numNodes == dg->maxNodes-1 ) {
+	  dg->maxNodes += BLOCK_SIZE;
+          nds =(DGNode **) calloc(dg->maxNodes,sizeof(DGNode*));
+	  memcpy(nds,dg->node,(dg->maxNodes-BLOCK_SIZE)*sizeof(DGNode*));
+	  free(dg->node);
+	  dg->node=nds;
+	}
+
+        len = strlen( nd->name);
+	for (i = 0; i < dg->numNodes; i++) {
+	  tmpnd =dg->node[ i];
+	  ar=NULL;
+	  if ( strlen( tmpnd->name) != len ) continue;
+	  if ( strncmp( nd->name, tmpnd->name, len) ) continue;
+	  if ( nd->inDegree > 0 ) {
+	    tmpnd->maxInDegree += nd->maxInDegree;
+            ar =(DGArc **) calloc(tmpnd->maxInDegree,sizeof(DGArc*));
+	    memcpy(ar,tmpnd->inArc,(tmpnd->inDegree)*sizeof(DGArc*));
+	    free(tmpnd->inArc);
+	    tmpnd->inArc=ar;
+	    for (j = 0; j < nd->inDegree; j++ ) {
+	      nd->inArc[ j]->head = tmpnd;
+	    }
+	    memcpy( &(tmpnd->inArc[ tmpnd->inDegree]), nd->inArc, nd->inDegree*sizeof( DGArc *));
+	    tmpnd->inDegree += nd->inDegree;
+	  } 	
+	  if ( nd->outDegree > 0 ) {
+	    tmpnd->maxOutDegree += nd->maxOutDegree;
+            ar =(DGArc **) calloc(tmpnd->maxOutDegree,sizeof(DGArc*));
+	    memcpy(ar,tmpnd->outArc,(tmpnd->outDegree)*sizeof(DGArc*));
+	    free(tmpnd->outArc);
+	    tmpnd->outArc=ar;
+	    for (j = 0; j < nd->outDegree; j++ ) {
+	      nd->outArc[ j]->tail = tmpnd;
+	    }			
+	    memcpy( &(tmpnd->outArc[tmpnd->outDegree]),nd->outArc,nd->outDegree*sizeof( DGArc *));
+	    tmpnd->outDegree += nd->outDegree;
+	  } 
+	  free(nd); 
+	  return i;
+	}
+	nd->id = dg->numNodes;
+	dg->node[dg->numNodes] = nd;
+	dg->numNodes++;
+return nd->id;
+}
+int AttachArc(DGraph *dg,DGArc* nar){
+int	arcId = -1;
+int i=0,newNumber=0;
+DGNode	*head = nar->head,
+	*tail = nar->tail; 
+DGArc **ars=NULL,*probe=NULL;
+/*fprintf(stderr,"AttachArc %ld\n",dg->numArcs); */
+	if ( !tail || !head ) return arcId;
+	if ( dg->numArcs == dg->maxArcs-1 ) {
+	  dg->maxArcs += BLOCK_SIZE;
+          ars =(DGArc **) calloc(dg->maxArcs,sizeof(DGArc*));
+	  memcpy(ars,dg->arc,(dg->maxArcs-BLOCK_SIZE)*sizeof(DGArc*));
+	  free(dg->arc);
+	  dg->arc=ars;
+	}
+	for(i = 0; i < tail->outDegree; i++ ) { /* parallel arc */
+	  probe = tail->outArc[ i];
+	  if(probe->head == head
+	     &&
+	     probe->length == nar->length
+            ){
+            free(nar);
+	    return probe->id;   
+	  }
+	}
+	
+	nar->id = dg->numArcs;
+	arcId=dg->numArcs;
+	dg->arc[dg->numArcs] = nar;
+	dg->numArcs++;
+	
+	head->inArc[ head->inDegree] = nar;
+	head->inDegree++;
+	if ( head->inDegree >= head->maxInDegree ) {
+	  newNumber = head->maxInDegree + SMALL_BLOCK_SIZE;
+          ars =(DGArc **) calloc(newNumber,sizeof(DGArc*));
+	  memcpy(ars,head->inArc,(head->inDegree)*sizeof(DGArc*));
+	  free(head->inArc);
+	  head->inArc=ars;
+	  head->maxInDegree = newNumber;
+	}
+	tail->outArc[ tail->outDegree] = nar;
+	tail->outDegree++;
+	if(tail->outDegree >= tail->maxOutDegree ) {
+	  newNumber = tail->maxOutDegree + SMALL_BLOCK_SIZE;
+          ars =(DGArc **) calloc(newNumber,sizeof(DGArc*));
+	  memcpy(ars,tail->outArc,(tail->outDegree)*sizeof(DGArc*));
+	  free(tail->outArc);
+	  tail->outArc=ars;
+	  tail->maxOutDegree = newNumber;
+	}
+/*fprintf(stderr,"AttachArc: head->in=%d tail->out=%ld\n",head->inDegree,tail->outDegree);*/
+return arcId;
+}
+void graphShow(DGraph *dg,int DetailsLevel){
+  int i=0,j=0;
+  fprintf(stderr,"%d.%s: (%d,%d)\n",dg->id,dg->name,dg->numNodes,dg->numArcs);
+  if ( DetailsLevel < 1) return;
+  for (i = 0; i < dg->numNodes; i++ ) {
+    DGNode *focusNode = dg->node[ i];
+    if(DetailsLevel >= 2) {
+      for (j = 0; j < focusNode->inDegree; j++ ) {
+	fprintf(stderr,"\t ");
+	nodeShow(focusNode->inArc[ j]->tail);
+      }
+    }
+    nodeShow(focusNode);
+    if ( DetailsLevel < 2) continue;
+    for (j = 0; j < focusNode->outDegree; j++ ) {
+      fprintf(stderr, "\t ");
+      nodeShow(focusNode->outArc[ j]->head);
+    }	
+    fprintf(stderr, "---\n");
+  }
+  fprintf(stderr,"----------------------------------------\n");
+  if ( DetailsLevel < 3) return;
+}
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/DT/DGraph.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/DT/DGraph.h
new file mode 100644
index 0000000..f38f898
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/DT/DGraph.h
@@ -0,0 +1,43 @@
+#ifndef _DGRAPH
+#define _DGRAPH
+
+#define BLOCK_SIZE  128
+#define SMALL_BLOCK_SIZE 32
+
+typedef struct{
+  int id;
+  void *tail,*head;
+  int length,width,attribute,maxWidth;
+}DGArc;
+
+typedef struct{
+  int maxInDegree,maxOutDegree;
+  int inDegree,outDegree;
+  int id;
+  char *name;
+  DGArc **inArc,**outArc;
+  int depth,height,width;
+  int color,attribute,address,verified;
+  void *feat;
+}DGNode;
+
+typedef struct{
+  int maxNodes,maxArcs;
+  int id;
+  char *name;
+  int numNodes,numArcs;
+  DGNode **node;
+  DGArc **arc;
+} DGraph;
+
+DGArc *newArc(DGNode *tl,DGNode *hd);
+void arcShow(DGArc *ar);
+DGNode *newNode(char *nm);
+void nodeShow(DGNode* nd);
+
+DGraph* newDGraph(char *nm);
+int AttachNode(DGraph *dg,DGNode *nd);
+int AttachArc(DGraph *dg,DGArc* nar);
+void graphShow(DGraph *dg,int DetailsLevel);
+
+#endif
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/DT/Makefile b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/DT/Makefile
new file mode 100644
index 0000000..687ac33
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/DT/Makefile
@@ -0,0 +1,26 @@
+SHELL=/bin/sh
+BENCHMARK=dt
+BENCHMARKU=DT
+
+include ../config/make.def
+
+include ../sys/make.common
+#Override PROGRAM
+DTPROGRAM  = $(BINDIR)/$(BENCHMARK).$(CLASS).x
+
+OBJS = dt.o DGraph.o \
+	${COMMON}/c_print_results.o ${COMMON}/c_timers.o ${COMMON}/c_randdp.o
+
+
+${PROGRAM}: config ${OBJS}
+	${CLINK} ${CLINKFLAGS} -o ${DTPROGRAM} ${OBJS} ${CMPI_LIB}
+
+.c.o:
+	${CCOMPILE} $<
+
+dt.o:             dt.c  npbparams.h
+DGraph.o:	DGraph.c DGraph.h
+
+clean:
+	- rm -f *.o *~ mputil*
+	- rm -f dt npbparams.h core
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/DT/README b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/DT/README
new file mode 100644
index 0000000..873e3ae
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/DT/README
@@ -0,0 +1,22 @@
+Data Traffic benchmark DT is new in the NPB suite 
+(released as part of NPB3.x-MPI package).
+----------------------------------------------------
+
+DT is written in C, and the same executable can run on any number of processors,
+provided this number is not less than the number of nodes in the communication
+graph.  The DT benchmark takes one argument: BH, WH, or SH. This argument
+specifies the communication graph Black Hole, White Hole, or SHuffle 
+respectively. The current release contains verification numbers for 
+CLASSES S, W, A, and B only.  Classes C and D are defined, but verification 
+numbers are not provided in this release.
+
+The following table summarizes the number of nodes in the communication
+graph based on CLASS and graph TYPE.
+
+CLASS  N_Source N_Nodes(BH,WH) N_Nodes(SH)
+ S      4        5              12
+ W      8        11             32
+ A      16       21             80
+ B      32       43             192
+ C      64       85             448
+ D      128      171            1024
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/DT/dt.c b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/DT/dt.c
new file mode 100644
index 0000000..1ee85f6
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/DT/dt.c
@@ -0,0 +1,759 @@
+/*************************************************************************
+ *                                                                       * 
+ *        N  A  S     P A R A L L E L     B E N C H M A R K S  3.3       *
+ *                                                                       * 
+ *                                  D T                                  * 
+ *                                                                       * 
+ ************************************************************************* 
+ *                                                                       * 
+ *   This benchmark is part of the NAS Parallel Benchmark 3.3 suite.     *
+ *                                                                       * 
+ *   Permission to use, copy, distribute and modify this software        * 
+ *   for any purpose with or without fee is hereby granted.  We          * 
+ *   request, however, that all derived work reference the NAS           * 
+ *   Parallel Benchmarks 3.3. This software is provided "as is"          *
+ *   without express or implied warranty.                                * 
+ *                                                                       * 
+ *   Information on NPB 3.3, including the technical report, the         *
+ *   original specifications, source code, results and information       * 
+ *   on how to submit new results, is available at:                      * 
+ *                                                                       * 
+ *          http://www.nas.nasa.gov/Software/NPB                         * 
+ *                                                                       * 
+ *   Send comments or suggestions to  npb@nas.nasa.gov                   * 
+ *   Send bug reports to              npb-bugs@nas.nasa.gov              * 
+ *                                                                       * 
+ *         NAS Parallel Benchmarks Group                                 * 
+ *         NASA Ames Research Center                                     * 
+ *         Mail Stop: T27A-1                                             * 
+ *         Moffett Field, CA   94035-1000                                * 
+ *                                                                       * 
+ *         E-mail:  npb@nas.nasa.gov                                     * 
+ *         Fax:     (650) 604-3957                                       * 
+ *                                                                       * 
+ ************************************************************************* 
+ *                                                                       * 
+ *   Author: M. Frumkin                                                  * 
+ *                                                                       * 
+ *************************************************************************/
+
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+
+#include "mpi.h"
+#include "npbparams.h"
+
+#ifndef CLASS
+#define CLASS 'S'
+#define NUM_PROCS            1                 
+#endif
+
+int      passed_verification;
+extern double randlc( double *X, double *A );
+extern
+void c_print_results( char   *name,
+                      char   class,
+                      int    n1, 
+                      int    n2,
+                      int    n3,
+                      int    niter,
+                      int    nprocs_compiled,
+                      int    nprocs_total,
+                      double t,
+                      double mops,
+		      char   *optype,
+                      int    passed_verification,
+                      char   *npbversion,
+                      char   *compiletime,
+                      char   *mpicc,
+                      char   *clink,
+                      char   *cmpi_lib,
+                      char   *cmpi_inc,
+                      char   *cflags,
+                      char   *clinkflags );
+		      
+void    timer_clear( int n );
+void    timer_start( int n );
+void    timer_stop( int n );
+double  timer_read( int n );
+int timer_on=0,timers_tot=64;
+
+int verify(char *bmname,double rnm2){
+    double verify_value=0.0;
+    double epsilon=1.0E-8;
+    char cls=CLASS;
+    int verified=-1;
+    if (cls != 'U') {
+       if(cls=='S') {
+         if(strstr(bmname,"BH")){
+           verify_value=30892725.0;
+         }else if(strstr(bmname,"WH")){
+           verify_value=67349758.0;
+         }else if(strstr(bmname,"SH")){
+           verify_value=58875767.0;
+         }else{
+           fprintf(stderr,"No such benchmark as %s.\n",bmname);
+         }
+         verified = 0;
+       }else if(cls=='W') {
+         if(strstr(bmname,"BH")){
+  	   verify_value = 4102461.0;
+         }else if(strstr(bmname,"WH")){
+  	   verify_value = 204280762.0;
+         }else if(strstr(bmname,"SH")){
+  	   verify_value = 186944764.0;
+         }else{
+           fprintf(stderr,"No such benchmark as %s.\n",bmname);
+         }
+         verified = 0;
+       }else if(cls=='A') {
+         if(strstr(bmname,"BH")){
+  	   verify_value = 17809491.0;
+         }else if(strstr(bmname,"WH")){
+  	   verify_value = 1289925229.0;
+         }else if(strstr(bmname,"SH")){
+  	   verify_value = 610856482.0;
+         }else{
+           fprintf(stderr,"No such benchmark as %s.\n",bmname);
+         }
+  	 verified = 0;
+       }else if(cls=='B') {
+         if(strstr(bmname,"BH")){
+  	   verify_value = 4317114.0;
+         }else if(strstr(bmname,"WH")){
+  	   verify_value = 7877279917.0;
+         }else if(strstr(bmname,"SH")){
+  	   verify_value = 1836863082.0;
+         }else{
+           fprintf(stderr,"No such benchmark as %s.\n",bmname);
+         }
+         verified = 0;
+       }else if(cls=='C') {
+         if(strstr(bmname,"BH")){
+  	   verify_value = 0.0;
+         }else if(strstr(bmname,"WH")){
+  	   verify_value = 0.0;
+         }else if(strstr(bmname,"SH")){
+  	   verify_value = 0.0;
+         }else{
+           fprintf(stderr,"No such benchmark as %s.\n",bmname);
+  	   verified = -1;
+         }
+       }else if(cls=='D') {
+         if(strstr(bmname,"BH")){
+  	   verify_value = 0.0;
+         }else if(strstr(bmname,"WH")){
+  	   verify_value = 0.0;
+         }else if(strstr(bmname,"SH")){
+  	   verify_value = 0.0;
+         }else{
+           fprintf(stderr,"No such benchmark as %s.\n",bmname);
+         }
+         verified = -1;
+       }else{
+         fprintf(stderr,"No such class as %c.\n",cls);
+       }
+       fprintf(stderr," %s L2 Norm = %f\n",bmname,rnm2);
+       if(verified==-1){
+  	 fprintf(stderr," No verification was performed.\n");
+       }else if( rnm2 - verify_value < epsilon &&
+                 rnm2 - verify_value > -epsilon) {  /* abs here does not work on ALTIX */
+  	  verified = 1;
+  	  fprintf(stderr," Deviation = %f\n",(rnm2 - verify_value));
+       }else{
+  	 verified = 0;
+  	 fprintf(stderr," The correct verification value = %f\n",verify_value);
+  	 fprintf(stderr," Got value = %f\n",rnm2);
+       }
+    }else{
+       verified = -1;
+    }
+    return  verified;  
+  }
+
+int ipowMod(int a,long long int n,int md){ 
+  int seed=1,q=a,r=1;
+  if(n<0){
+    fprintf(stderr,"ipowMod: exponent must be nonnegative exp=%lld\n",n);
+    n=-n; /* temp fix */
+/*    return 1; */
+  }
+  if(md<=0){
+    fprintf(stderr,"ipowMod: modulus must be positive mod=%d\n",md);
+    return 1;
+  }
+  if(n==0) return 1;
+  while(n>1){
+    long long int n2 = n/2;
+    if (n2*2==n){
+       seed = (q*q)%md;
+       q=seed;
+       n = n2;
+    }else{
+       seed = (r*q)%md;
+       r=seed;
+       n = n-1;
+    }
+  }
+  seed = (r*q)%md;
+  return seed;
+}
+
+#include "DGraph.h"
+DGraph *buildSH(char cls){
+/*
+  Nodes of the graph must be topologically sorted
+  to avoid MPI deadlock.
+*/
+  DGraph *dg;
+  int numSources=NUM_SOURCES; /* must be power of 2 */
+  int numOfLayers=0,tmpS=numSources>>1;
+  int firstLayerNode=0;
+  DGArc *ar=NULL;
+  DGNode *nd=NULL;
+  int mask=0x0,ndid=0,ndoff=0;
+  int i=0,j=0;
+  char nm[BLOCK_SIZE];
+  
+  sprintf(nm,"DT_SH.%c",cls);
+  dg=newDGraph(nm);
+
+  while(tmpS>1){
+    numOfLayers++;
+    tmpS>>=1;
+  }
+  for(i=0;i<numSources;i++){
+    sprintf(nm,"Source.%d",i);
+    nd=newNode(nm);
+    AttachNode(dg,nd);
+  }
+  for(j=0;j<numOfLayers;j++){
+    mask=0x00000001<<j;
+    for(i=0;i<numSources;i++){
+      sprintf(nm,"Comparator.%d",(i+j*firstLayerNode));
+      nd=newNode(nm);
+      AttachNode(dg,nd);
+      ndoff=i&(~mask);
+      ndid=firstLayerNode+ndoff;
+      ar=newArc(dg->node[ndid],nd);     
+      AttachArc(dg,ar);
+      ndoff+=mask;
+      ndid=firstLayerNode+ndoff;
+      ar=newArc(dg->node[ndid],nd);     
+      AttachArc(dg,ar);
+    }
+    firstLayerNode+=numSources;
+  }
+  mask=0x00000001<<numOfLayers;
+  for(i=0;i<numSources;i++){
+    sprintf(nm,"Sink.%d",i);
+    nd=newNode(nm);
+    AttachNode(dg,nd);
+    ndoff=i&(~mask);
+    ndid=firstLayerNode+ndoff;
+    ar=newArc(dg->node[ndid],nd);     
+    AttachArc(dg,ar);
+    ndoff+=mask;
+    ndid=firstLayerNode+ndoff;
+    ar=newArc(dg->node[ndid],nd);     
+    AttachArc(dg,ar);
+  }
+return dg;
+}
+DGraph *buildWH(char cls){
+/*
+  Nodes of the graph must be topologically sorted
+  to avoid MPI deadlock.
+*/
+  int i=0,j=0;
+  int numSources=NUM_SOURCES,maxInDeg=4;
+  int numLayerNodes=numSources,firstLayerNode=0;
+  int totComparators=0;
+  int numPrevLayerNodes=numLayerNodes;
+  int id=0,sid=0;
+  DGraph *dg;
+  DGNode *nd=NULL,*source=NULL,*tmp=NULL,*snd=NULL;
+  DGArc *ar=NULL;
+  char nm[BLOCK_SIZE];
+
+  sprintf(nm,"DT_WH.%c",cls);
+  dg=newDGraph(nm);
+
+  for(i=0;i<numSources;i++){
+    sprintf(nm,"Sink.%d",i);
+    nd=newNode(nm);
+    AttachNode(dg,nd);
+  }
+  totComparators=0;
+  numPrevLayerNodes=numLayerNodes;
+  while(numLayerNodes>maxInDeg){
+    numLayerNodes=numLayerNodes/maxInDeg;
+    if(numLayerNodes*maxInDeg<numPrevLayerNodes)numLayerNodes++;
+    for(i=0;i<numLayerNodes;i++){
+      sprintf(nm,"Comparator.%d",totComparators);
+      totComparators++;
+      nd=newNode(nm);
+      id=AttachNode(dg,nd);
+      for(j=0;j<maxInDeg;j++){
+        sid=i*maxInDeg+j;
+	if(sid>=numPrevLayerNodes) break;
+        snd=dg->node[firstLayerNode+sid];
+        ar=newArc(dg->node[id],snd);
+        AttachArc(dg,ar);
+      }
+    }
+    firstLayerNode+=numPrevLayerNodes;
+    numPrevLayerNodes=numLayerNodes;
+  }
+  source=newNode("Source");
+  AttachNode(dg,source);   
+  for(i=0;i<numPrevLayerNodes;i++){
+    nd=dg->node[firstLayerNode+i];
+    ar=newArc(source,nd);
+    AttachArc(dg,ar);
+  }
+
+  for(i=0;i<dg->numNodes/2;i++){  /* Topological sorting */
+    tmp=dg->node[i];
+    dg->node[i]=dg->node[dg->numNodes-1-i];
+    dg->node[i]->id=i;
+    dg->node[dg->numNodes-1-i]=tmp;
+    dg->node[dg->numNodes-1-i]->id=dg->numNodes-1-i;
+  }
+return dg;
+}
+DGraph *buildBH(char cls){
+/*
+  Nodes of the graph must be topologically sorted
+  to avoid MPI deadlock.
+*/
+  int i=0,j=0;
+  int numSources=NUM_SOURCES,maxInDeg=4;
+  int numLayerNodes=numSources,firstLayerNode=0;
+  DGraph *dg;
+  DGNode *nd=NULL, *snd=NULL, *sink=NULL;
+  DGArc *ar=NULL;
+  int totComparators=0;
+  int numPrevLayerNodes=numLayerNodes;
+  int id=0, sid=0;
+  char nm[BLOCK_SIZE];
+
+  sprintf(nm,"DT_BH.%c",cls);
+  dg=newDGraph(nm);
+
+  for(i=0;i<numSources;i++){
+    sprintf(nm,"Source.%d",i);
+    nd=newNode(nm);
+    AttachNode(dg,nd);
+  }
+  while(numLayerNodes>maxInDeg){
+    numLayerNodes=numLayerNodes/maxInDeg;
+    if(numLayerNodes*maxInDeg<numPrevLayerNodes)numLayerNodes++;
+    for(i=0;i<numLayerNodes;i++){
+      sprintf(nm,"Comparator.%d",totComparators);
+      totComparators++;
+      nd=newNode(nm);
+      id=AttachNode(dg,nd);
+      for(j=0;j<maxInDeg;j++){
+        sid=i*maxInDeg+j;
+	if(sid>=numPrevLayerNodes) break;
+        snd=dg->node[firstLayerNode+sid];
+        ar=newArc(snd,dg->node[id]);
+        AttachArc(dg,ar);
+      }
+    }
+    firstLayerNode+=numPrevLayerNodes;
+    numPrevLayerNodes=numLayerNodes;
+  }
+  sink=newNode("Sink");
+  AttachNode(dg,sink);   
+  for(i=0;i<numPrevLayerNodes;i++){
+    nd=dg->node[firstLayerNode+i];
+    ar=newArc(nd,sink);
+    AttachArc(dg,ar);
+  }
+return dg;
+}
+
+typedef struct{
+  int len;
+  double* val;
+} Arr;
+Arr *newArr(int len){
+  Arr *arr=(Arr *)malloc(sizeof(Arr));
+  arr->len=len;
+  arr->val=(double *)malloc(len*sizeof(double));
+  return arr;
+}
+void arrShow(Arr* a){
+  if(!a) fprintf(stderr,"-- NULL array\n");
+  else{
+    fprintf(stderr,"-- length=%d\n",a->len);
+  }
+}
+double CheckVal(Arr *feat){
+  double csum=0.0;
+  int i=0;
+  for(i=0;i<feat->len;i++){
+    csum+=feat->val[i]*feat->val[i]/feat->len; /* The truncation does not work since 
+                                                  result will be 0 for large len  */
+  }
+   return csum;
+}
+int GetFNumDPar(int* mean, int* stdev){
+  *mean=NUM_SAMPLES;
+  *stdev=STD_DEVIATION;
+  return 0;
+}
+int GetFeatureNum(char *mbname,int id){
+  double tran=314159265.0;
+  double A=2*id+1;
+  double denom=randlc(&tran,&A);
+  char cval='S';
+  int mean=NUM_SAMPLES,stdev=128;
+  int rtfs=0,len=0;
+  GetFNumDPar(&mean,&stdev);
+  rtfs=ipowMod((int)(1/denom)*(int)cval,(long long int) (2*id+1),2*stdev);
+  if(rtfs<0) rtfs=-rtfs;
+  len=mean-stdev+rtfs;
+  return len;
+}
+Arr* RandomFeatures(char *bmname,int fdim,int id){
+  int len=GetFeatureNum(bmname,id)*fdim;
+  Arr* feat=newArr(len);
+  int nxg=2,nyg=2,nzg=2,nfg=5;
+  int nx=421,ny=419,nz=1427,nf=3527;
+  long long int expon=(len*(id+1))%3141592;
+  int seedx=ipowMod(nxg,expon,nx),
+      seedy=ipowMod(nyg,expon,ny),
+      seedz=ipowMod(nzg,expon,nz),
+      seedf=ipowMod(nfg,expon,nf);
+  int i=0;
+  if(timer_on){
+    timer_clear(id+1);
+    timer_start(id+1);
+  }
+  for(i=0;i<len;i+=fdim){
+    seedx=(seedx*nxg)%nx;
+    seedy=(seedy*nyg)%ny;
+    seedz=(seedz*nzg)%nz;
+    seedf=(seedf*nfg)%nf;
+    feat->val[i]=seedx;
+    feat->val[i+1]=seedy;
+    feat->val[i+2]=seedz;
+    feat->val[i+3]=seedf;
+  }
+  if(timer_on){
+    timer_stop(id+1);
+    fprintf(stderr,"** RandomFeatures time in node %d = %f\n",id,timer_read(id+1));
+  }
+  return feat;   
+}
+void Resample(Arr *a,int blen){
+    long long int i=0,j=0,jlo=0,jhi=0;
+    double avval=0.0;
+    double *nval=(double *)malloc(blen*sizeof(double));
+    Arr *tmp=newArr(10);
+    for(i=0;i<blen;i++) nval[i]=0.0;
+    for(i=1;i<a->len-1;i++){
+      jlo=(int)(0.5*(2*i-1)*(blen/a->len)); 
+      jhi=(int)(0.5*(2*i+1)*(blen/a->len));
+
+      avval=a->val[i]/(jhi-jlo+1);    
+      for(j=jlo;j<=jhi;j++){
+        nval[j]+=avval;
+      }
+    }
+    nval[0]=a->val[0];
+    nval[blen-1]=a->val[a->len-1];
+    free(a->val);
+    a->val=nval;
+    a->len=blen;
+}
+#define fielddim 4
+Arr* WindowFilter(Arr *a, Arr* b,int w){
+  int i=0,j=0,k=0;
+  double rms0=0.0,rms1=0.0,rmsm1=0.0;
+  double weight=((double) (w+1))/(w+2);
+ 
+  w+=1;
+  if(timer_on){
+    timer_clear(w);
+    timer_start(w);
+  }
+  if(a->len<b->len) Resample(a,b->len);
+  if(a->len>b->len) Resample(b,a->len);
+  for(i=fielddim;i<a->len-fielddim;i+=fielddim){
+    rms0=(a->val[i]-b->val[i])*(a->val[i]-b->val[i])
+	+(a->val[i+1]-b->val[i+1])*(a->val[i+1]-b->val[i+1])
+	+(a->val[i+2]-b->val[i+2])*(a->val[i+2]-b->val[i+2])
+	+(a->val[i+3]-b->val[i+3])*(a->val[i+3]-b->val[i+3]);
+    j=i+fielddim;
+    rms1=(a->val[j]-b->val[j])*(a->val[j]-b->val[j])
+    	+(a->val[j+1]-b->val[j+1])*(a->val[j+1]-b->val[j+1])
+    	+(a->val[j+2]-b->val[j+2])*(a->val[j+2]-b->val[j+2])
+    	+(a->val[j+3]-b->val[j+3])*(a->val[j+3]-b->val[j+3]);
+    j=i-fielddim;
+    rmsm1=(a->val[j]-b->val[j])*(a->val[j]-b->val[j])
+	 +(a->val[j+1]-b->val[j+1])*(a->val[j+1]-b->val[j+1])
+	 +(a->val[j+2]-b->val[j+2])*(a->val[j+2]-b->val[j+2])
+	 +(a->val[j+3]-b->val[j+3])*(a->val[j+3]-b->val[j+3]);
+    k=0;
+    if(rms1<rms0){
+      k=1;
+      rms0=rms1;
+    }
+    if(rmsm1<rms0) k=-1;
+    if(k==0){
+      j=i+fielddim;
+      a->val[i]=weight*b->val[i];
+      a->val[i+1]=weight*b->val[i+1];
+      a->val[i+2]=weight*b->val[i+2];
+      a->val[i+3]=weight*b->val[i+3];  
+    }else if(k==1){
+      j=i+fielddim;
+      a->val[i]=weight*b->val[j];
+      a->val[i+1]=weight*b->val[j+1];
+      a->val[i+2]=weight*b->val[j+2];
+      a->val[i+3]=weight*b->val[j+3];  
+    }else { /*if(k==-1)*/
+      j=i-fielddim;
+      a->val[i]=weight*b->val[j];
+      a->val[i+1]=weight*b->val[j+1];
+      a->val[i+2]=weight*b->val[j+2];
+      a->val[i+3]=weight*b->val[j+3];  
+    }	   
+  }
+  if(timer_on){
+    timer_stop(w);
+    fprintf(stderr,"** WindowFilter time in node %d = %f\n",(w-1),timer_read(w));
+  }
+  return a;
+}
+
+int SendResults(DGraph *dg,DGNode *nd,Arr *feat){
+  int i=0,tag=0;
+  DGArc *ar=NULL;
+  DGNode *head=NULL;
+  if(!feat) return 0;
+  for(i=0;i<nd->outDegree;i++){
+    ar=nd->outArc[i];
+    if(ar->tail!=nd) continue;
+    head=ar->head;
+    tag=ar->id;
+    if(head->address!=nd->address){
+      MPI_Send(&feat->len,1,MPI_INT,head->address,tag,MPI_COMM_WORLD);
+      MPI_Send(feat->val,feat->len,MPI_DOUBLE,head->address,tag,MPI_COMM_WORLD);
+    }
+  }
+  return 1;
+}
+Arr* CombineStreams(DGraph *dg,DGNode *nd){
+  Arr *resfeat=newArr(NUM_SAMPLES*fielddim);
+  int i=0,len=0,tag=0;
+  DGArc *ar=NULL;
+  DGNode *tail=NULL;
+  MPI_Status status;
+  Arr *feat=NULL,*featp=NULL;
+
+  if(nd->inDegree==0) return NULL;
+  for(i=0;i<nd->inDegree;i++){
+    ar=nd->inArc[i];
+    if(ar->head!=nd) continue;
+    tail=ar->tail;
+    if(tail->address!=nd->address){
+      len=0;
+      tag=ar->id;
+      MPI_Recv(&len,1,MPI_INT,tail->address,tag,MPI_COMM_WORLD,&status);
+      feat=newArr(len);
+      MPI_Recv(feat->val,feat->len,MPI_DOUBLE,tail->address,tag,MPI_COMM_WORLD,&status);
+      resfeat=WindowFilter(resfeat,feat,nd->id);
+      free(feat);
+    }else{
+      featp=(Arr *)tail->feat;
+      feat=newArr(featp->len);
+      memcpy(feat->val,featp->val,featp->len*sizeof(double));
+      resfeat=WindowFilter(resfeat,feat,nd->id);  
+      free(feat);
+    }
+  }
+  for(i=0;i<resfeat->len;i++) resfeat->val[i]=((int)resfeat->val[i])/nd->inDegree;
+  nd->feat=resfeat;
+  return nd->feat;
+}
+double Reduce(Arr *a,int w){
+  double retv=0.0;
+  if(timer_on){
+    timer_clear(w);
+    timer_start(w);
+  }
+  retv=(int)(w*CheckVal(a));/* The cast is needed for node-
+                               and array-dependent verification */
+  if(timer_on){
+    timer_stop(w);
+    fprintf(stderr,"** Reduce time in node %d = %f\n",(w-1),timer_read(w));
+  }
+  return retv;
+}
+
+double ReduceStreams(DGraph *dg,DGNode *nd){
+  double csum=0.0;
+  int i=0,len=0,tag=0;
+  DGArc *ar=NULL;
+  DGNode *tail=NULL;
+  Arr *feat=NULL;
+  double retv=0.0;
+
+  for(i=0;i<nd->inDegree;i++){
+    ar=nd->inArc[i];
+    if(ar->head!=nd) continue;
+    tail=ar->tail;
+    if(tail->address!=nd->address){
+      MPI_Status status;
+      len=0;
+      tag=ar->id;
+      MPI_Recv(&len,1,MPI_INT,tail->address,tag,MPI_COMM_WORLD,&status);
+      feat=newArr(len);
+      MPI_Recv(feat->val,feat->len,MPI_DOUBLE,tail->address,tag,MPI_COMM_WORLD,&status);
+      csum+=Reduce(feat,(nd->id+1));  
+      free(feat);
+    }else{
+      csum+=Reduce(tail->feat,(nd->id+1));  
+    }
+  }
+  if(nd->inDegree>0)csum=(((long long int)csum)/nd->inDegree);
+  retv=(nd->id+1)*csum;
+  return retv;
+}
+
+int ProcessNodes(DGraph *dg,int me){
+  double chksum=0.0;
+  Arr *feat=NULL;
+  int i=0,verified=0,tag;
+  DGNode *nd=NULL;
+  double rchksum=0.0;
+  MPI_Status status;
+
+  for(i=0;i<dg->numNodes;i++){
+    nd=dg->node[i];
+    if(nd->address!=me) continue;
+    if(strstr(nd->name,"Source")){
+      nd->feat=RandomFeatures(dg->name,fielddim,nd->id); 
+      SendResults(dg,nd,nd->feat);
+    }else if(strstr(nd->name,"Sink")){
+      chksum=ReduceStreams(dg,nd);
+      tag=dg->numArcs+nd->id; /* offset by numArcs to avoid clashing with arc tags */
+      MPI_Send(&chksum,1,MPI_DOUBLE,0,tag,MPI_COMM_WORLD);
+    }else{
+      feat=CombineStreams(dg,nd);
+      SendResults(dg,nd,feat);
+    }
+  }
+  if(me==0){ /* Report node */
+    rchksum=0.0;
+    chksum=0.0;
+    for(i=0;i<dg->numNodes;i++){
+      nd=dg->node[i];
+      if(!strstr(nd->name,"Sink")) continue;
+      tag=dg->numArcs+nd->id; /* offset by numArcs to avoid clashing with arc tags */
+      MPI_Recv(&rchksum,1,MPI_DOUBLE,nd->address,tag,MPI_COMM_WORLD,&status);
+      chksum+=rchksum;
+    }
+    verified=verify(dg->name,chksum);
+  }
+return verified;
+}
+
+int main(int argc,char **argv ){
+  int my_rank,comm_size;
+  int i;
+  DGraph *dg=NULL;
+  int verified=0, featnum=0;
+  double bytes_sent=2.0,tot_time=0.0;
+
+    MPI_Init( &argc, &argv );
+    MPI_Comm_rank( MPI_COMM_WORLD, &my_rank );
+    MPI_Comm_size( MPI_COMM_WORLD, &comm_size );
+
+     if(argc!=2||
+                (  strncmp(argv[1],"BH",2)!=0
+                 &&strncmp(argv[1],"WH",2)!=0
+                 &&strncmp(argv[1],"SH",2)!=0
+                )
+      ){
+      if(my_rank==0){
+        fprintf(stderr,"** Usage: mpirun -np N ../bin/dt.S GraphName\n");
+        fprintf(stderr,"** Where \n   - N is integer number of MPI processes\n");
+        fprintf(stderr,"   - S is the class S, W, or A \n");
+        fprintf(stderr,"   - GraphName is the communication graph name BH, WH, or SH.\n");
+        fprintf(stderr,"   - the number of MPI processes N should not be less than \n");
+        fprintf(stderr,"     the number of nodes in the graph\n");
+      }
+      MPI_Finalize();
+      exit(0);
+    } 
+   if(strncmp(argv[1],"BH",2)==0){
+      dg=buildBH(CLASS);
+    }else if(strncmp(argv[1],"WH",2)==0){
+      dg=buildWH(CLASS);
+    }else if(strncmp(argv[1],"SH",2)==0){
+      dg=buildSH(CLASS);
+    }
+
+    if(timer_on&&dg->numNodes+1>timers_tot){
+      timer_on=0;
+      if(my_rank==0)
+        fprintf(stderr,"Not enough timers. Node timing is off. \n");
+    }
+    if(dg->numNodes>comm_size){
+      if(my_rank==0){
+        fprintf(stderr,"**  The number of MPI processes should not be less than \n");
+        fprintf(stderr,"**  the number of nodes in the graph\n");
+        fprintf(stderr,"**  Number of MPI processes = %d\n",comm_size);
+        fprintf(stderr,"**  Number of nodes in the graph = %d\n",dg->numNodes);
+      }
+      MPI_Finalize();
+      exit(0);
+    }
+    for(i=0;i<dg->numNodes;i++){ 
+      dg->node[i]->address=i;
+    }
+    if( my_rank == 0 ){
+      printf( "\n\n NAS Parallel Benchmarks 3.3 -- DT Benchmark\n\n" );
+      graphShow(dg,0);
+      timer_clear(0);
+      timer_start(0);
+    }
+    verified=ProcessNodes(dg,my_rank);
+    
+    featnum=NUM_SAMPLES*fielddim;
+    bytes_sent=featnum*dg->numArcs;
+    bytes_sent/=1048576;
+    if(my_rank==0){
+      timer_stop(0);
+      tot_time=timer_read(0);
+      c_print_results( dg->name,
+        	       CLASS,
+        	       featnum,
+        	       0,
+        	       0,
+        	       dg->numNodes,
+        	       0,
+        	       comm_size,
+        	       tot_time,
+        	       bytes_sent/tot_time,
+        	       "bytes transmitted", 
+        	       verified,
+        	       NPBVERSION,
+        	       COMPILETIME,
+        	       MPICC,
+        	       CLINK,
+        	       CMPI_LIB,
+        	       CMPI_INC,
+        	       CFLAGS,
+        	       CLINKFLAGS );
+    }          
+    MPI_Finalize();
+  return 1;
+}
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/EP/Makefile b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/EP/Makefile
new file mode 100644
index 0000000..5fa8cc3
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/EP/Makefile
@@ -0,0 +1,23 @@
+SHELL=/bin/sh
+BENCHMARK=ep
+BENCHMARKU=EP
+
+include ../config/make.def
+
+OBJS = ep.o ${COMMON}/print_results.o ${COMMON}/${RAND}.o ${COMMON}/timers.o
+
+include ../sys/make.common
+
+${PROGRAM}: config ${OBJS}
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} ${FMPI_LIB}
+
+
+ep.o:		ep.f  mpinpb.h npbparams.h
+	${FCOMPILE} ep.f
+
+clean:
+	- rm -f *.o *~ 
+	- rm -f npbparams.h core
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/EP/README b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/EP/README
new file mode 100644
index 0000000..6eb3657
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/EP/README
@@ -0,0 +1,6 @@
+This code implements the random-number generator described in the
+NAS Parallel Benchmark document RNR Technical Report RNR-94-007.
+The code is "embarrassingly" parallel in that no communication is
+required for the generation of the random numbers itself. There is
+no special requirement on the number of processors used for running
+the benchmark.
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/EP/ep.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/EP/ep.f
new file mode 100644
index 0000000..c112100
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/EP/ep.f
@@ -0,0 +1,359 @@
+!-------------------------------------------------------------------------!
+!                                                                         !
+!        N  A  S     P A R A L L E L     B E N C H M A R K S  3.3         !
+!                                                                         !
+!                                   E P                                   !
+!                                                                         !
+!-------------------------------------------------------------------------!
+!                                                                         !
+!    This benchmark is part of the NAS Parallel Benchmark 3.3 suite.      !
+!    It is described in NAS Technical Reports 95-020 and 02-007           !
+!                                                                         !
+!    Permission to use, copy, distribute and modify this software         !
+!    for any purpose with or without fee is hereby granted.  We           !
+!    request, however, that all derived work reference the NAS            !
+!    Parallel Benchmarks 3.3. This software is provided "as is"           !
+!    without express or implied warranty.                                 !
+!                                                                         !
+!    Information on NPB 3.3, including the technical report, the          !
+!    original specifications, source code, results and information        !
+!    on how to submit new results, is available at:                       !
+!                                                                         !
+!           http://www.nas.nasa.gov/Software/NPB/                         !
+!                                                                         !
+!    Send comments or suggestions to  npb@nas.nasa.gov                    !
+!                                                                         !
+!          NAS Parallel Benchmarks Group                                  !
+!          NASA Ames Research Center                                      !
+!          Mail Stop: T27A-1                                              !
+!          Moffett Field, CA   94035-1000                                 !
+!                                                                         !
+!          E-mail:  npb@nas.nasa.gov                                      !
+!          Fax:     (650) 604-3957                                        !
+!                                                                         !
+!-------------------------------------------------------------------------!
+
+
+c---------------------------------------------------------------------
+c
+c Authors: P. O. Frederickson 
+c          D. H. Bailey
+c          A. C. Woo
+c          R. F. Van der Wijngaart
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+      program EMBAR
+c---------------------------------------------------------------------
+C
+c   This is the MPI version of the APP Benchmark 1,
+c   the "embarrassingly parallel" benchmark.
+c
+c
+c   M is the Log_2 of the number of complex pairs of uniform (0, 1) random
+c   numbers.  MK is the Log_2 of the size of each batch of uniform random
+c   numbers.  MK can be set for convenience on a given system, since it does
+c   not affect the results.
+
+      implicit none
+
+      include 'npbparams.h'
+      include 'mpinpb.h'
+
+      double precision Mops, epsilon, a, s, t1, t2, t3, t4, x, x1, 
+     >                 x2, q, sx, sy, tm, an, tt, gc, dum(3),
+     >                 timer_read
+      double precision sx_verify_value, sy_verify_value, sx_err, sy_err
+      integer          mk, mm, nn, nk, nq, np, ierr, node, no_nodes, 
+     >                 i, ik, kk, l, k, nit, ierrcode, no_large_nodes,
+     >                 np_add, k_offset, j
+      logical          verified, timers_enabled
+      external         randlc, timer_read
+      double precision randlc, qq
+      character*15     size
+
+      integer          fstatus
+      integer          t_total, t_gpairs, t_randn, t_rcomm, t_last
+      parameter (t_total=1, t_gpairs=2, t_randn=3, t_rcomm=4, t_last=4)
+      double precision tsum(t_last+2), t1m(t_last+2),
+     >                 tming(t_last+2), tmaxg(t_last+2)
+      character        t_recs(t_last+2)*8
+
+      parameter (mk = 16, mm = m - mk, nn = 2 ** mm,
+     >           nk = 2 ** mk, nq = 10, epsilon=1.d-8,
+     >           a = 1220703125.d0, s = 271828183.d0)
+
+      common/storage/ x(2*nk), q(0:nq-1), qq(10000)
+      data             dum /1.d0, 1.d0, 1.d0/
+
+      data t_recs/'total', 'gpairs', 'randn', 'rcomm',
+     >            ' totcomp', ' totcomm'/
+
+
+      call mpi_init(ierr)
+      call mpi_comm_rank(MPI_COMM_WORLD,node,ierr)
+      call mpi_comm_size(MPI_COMM_WORLD,no_nodes,ierr)
+
+      root = 0
+
+      if (.not. convertdouble) then
+         dp_type = MPI_DOUBLE_PRECISION
+      else
+         dp_type = MPI_REAL
+      endif
+
+      if (node.eq.root)  then
+
+c   Because the size of the problem is too large to store in a 32-bit
+c   integer for some classes, we put it into a string (for printing).
+c   Have to strip off the decimal point put in there by the floating
+c   point print statement (internal file)
+
+          write(*, 1000)
+          write(size, '(f15.0)' ) 2.d0**(m+1)
+          j = 15
+          if (size(j:j) .eq. '.') j = j - 1
+          write (*,1001) size(1:j)
+          write(*, 1003) no_nodes
+
+ 1000 format(/,' NAS Parallel Benchmarks 3.3 -- EP Benchmark',/)
+ 1001     format(' Number of random numbers generated: ', a15)
+ 1003     format(' Number of active processes:         ', 2x, i13, /)
+
+          open (unit=2,file='timer.flag',status='old',iostat=fstatus)
+          timers_enabled = .false.
+          if (fstatus .eq. 0) then
+             timers_enabled = .true.
+             close(2)
+          endif
+      endif
+
+      call mpi_bcast(timers_enabled, 1, MPI_LOGICAL, root, 
+     >               MPI_COMM_WORLD, ierr)
+
+      verified = .false.
+
+c   Compute the number of "batches" of random number pairs generated 
+c   per processor. Adjust if the number of processors does not evenly 
+c   divide the total number
+
+      np = nn / no_nodes
+      no_large_nodes = mod(nn, no_nodes)
+      if (node .lt. no_large_nodes) then
+         np_add = 1
+      else
+         np_add = 0
+      endif
+      np = np + np_add
+
+      if (np .eq. 0) then
+         write (6, 1) no_nodes, nn
+ 1       format ('Too many nodes:',2i6)
+         ierrcode = 1
+         call mpi_abort(MPI_COMM_WORLD,ierrcode,ierr)
+         stop
+      endif
+
+c   Call the random number generator functions and initialize
+c   the x-array to reduce the effects of paging on the timings.
+c   Also, call all mathematical functions that are used. Make
+c   sure these initializations cannot be eliminated as dead code.
+
+      call vranlc(0, dum(1), dum(2), dum(3))
+      dum(1) = randlc(dum(2), dum(3))
+      do 5    i = 1, 2*nk
+         x(i) = -1.d99
+ 5    continue
+      Mops = log(sqrt(abs(max(1.d0,1.d0))))
+
+c---------------------------------------------------------------------
+c      Synchronize before placing time stamp
+c---------------------------------------------------------------------
+      do i = 1, t_last
+         call timer_clear(i)
+      end do
+      call mpi_barrier(MPI_COMM_WORLD, ierr)
+      call timer_start(1)
+
+      t1 = a
+      call vranlc(0, t1, a, x)
+
+c   Compute AN = A ^ (2 * NK) (mod 2^46).
+
+      t1 = a
+
+      do 100 i = 1, mk + 1
+         t2 = randlc(t1, t1)
+ 100  continue
+
+      an = t1
+      tt = s
+      gc = 0.d0
+      sx = 0.d0
+      sy = 0.d0
+
+      do 110 i = 0, nq - 1
+         q(i) = 0.d0
+ 110  continue
+
+c   Each instance of this loop may be performed independently. We compute
+c   the k offsets separately to take into account the fact that some nodes
+c   have more numbers to generate than others
+
+      if (np_add .eq. 1) then
+         k_offset = node * np -1
+      else
+         k_offset = no_large_nodes*(np+1) + (node-no_large_nodes)*np -1
+      endif
+
+      do 150 k = 1, np
+         kk = k_offset + k 
+         t1 = s
+         t2 = an
+
+c        Find starting seed t1 for this kk.
+
+         do 120 i = 1, 100
+            ik = kk / 2
+            if (2 * ik .ne. kk) t3 = randlc(t1, t2)
+            if (ik .eq. 0) goto 130
+            t3 = randlc(t2, t2)
+            kk = ik
+ 120     continue
+
+c        Compute uniform pseudorandom numbers.
+ 130     continue
+
+         if (timers_enabled) call timer_start(t_randn)
+         call vranlc(2 * nk, t1, a, x)
+         if (timers_enabled) call timer_stop(t_randn)
+
+c        Compute Gaussian deviates by acceptance-rejection method and 
+c        tally counts in concentric square annuli.  This loop is not 
+c        vectorizable. 
+
+         if (timers_enabled) call timer_start(t_gpairs)
+
+         do 140 i = 1, nk
+            x1 = 2.d0 * x(2*i-1) - 1.d0
+            x2 = 2.d0 * x(2*i) - 1.d0
+            t1 = x1 ** 2 + x2 ** 2
+            if (t1 .le. 1.d0) then
+               t2   = sqrt(-2.d0 * log(t1) / t1)
+               t3   = (x1 * t2)
+               t4   = (x2 * t2)
+               l    = max(abs(t3), abs(t4))
+               q(l) = q(l) + 1.d0
+               sx   = sx + t3
+               sy   = sy + t4
+            endif
+ 140     continue
+
+         if (timers_enabled) call timer_stop(t_gpairs)
+
+ 150  continue
+
+      if (timers_enabled) call timer_start(t_rcomm)
+      call mpi_allreduce(sx, x, 1, dp_type,
+     >                   MPI_SUM, MPI_COMM_WORLD, ierr)
+      sx = x(1)
+      call mpi_allreduce(sy, x, 1, dp_type,
+     >                   MPI_SUM, MPI_COMM_WORLD, ierr)
+      sy = x(1)
+      call mpi_allreduce(q, x, nq, dp_type,
+     >                   MPI_SUM, MPI_COMM_WORLD, ierr)
+      if (timers_enabled) call timer_stop(t_rcomm)
+
+      do i = 1, nq
+         q(i-1) = x(i)
+      enddo
+
+      do 160 i = 0, nq - 1
+        gc = gc + q(i)
+ 160  continue
+
+      call timer_stop(1)
+      tm  = timer_read(1)
+
+      call mpi_allreduce(tm, x, 1, dp_type,
+     >                   MPI_MAX, MPI_COMM_WORLD, ierr)
+      tm = x(1)
+
+      if (node.eq.root) then
+         nit=0
+         verified = .true.
+         if (m.eq.24) then
+            sx_verify_value = -3.247834652034740D+3
+            sy_verify_value = -6.958407078382297D+3
+         elseif (m.eq.25) then
+            sx_verify_value = -2.863319731645753D+3
+            sy_verify_value = -6.320053679109499D+3
+         elseif (m.eq.28) then
+            sx_verify_value = -4.295875165629892D+3
+            sy_verify_value = -1.580732573678431D+4
+         elseif (m.eq.30) then
+            sx_verify_value =  4.033815542441498D+4
+            sy_verify_value = -2.660669192809235D+4
+         elseif (m.eq.32) then
+            sx_verify_value =  4.764367927995374D+4
+            sy_verify_value = -8.084072988043731D+4
+         elseif (m.eq.36) then
+            sx_verify_value =  1.982481200946593D+5
+            sy_verify_value = -1.020596636361769D+5
+         elseif (m.eq.40) then
+            sx_verify_value = -5.319717441530D+05
+            sy_verify_value = -3.688834557731D+05
+         else
+            verified = .false.
+         endif
+         if (verified) then
+            sx_err = abs((sx - sx_verify_value)/sx_verify_value)
+            sy_err = abs((sy - sy_verify_value)/sy_verify_value)
+            verified = ((sx_err.le.epsilon) .and. (sy_err.le.epsilon))
+         endif
+         Mops = 2.d0**(m+1)/tm/1000000.d0
+
+         write (6,11) tm, m, gc, sx, sy, (i, q(i), i = 0, nq - 1)
+ 11      format ('EP Benchmark Results:'//'CPU Time =',f10.4/'N = 2^',
+     >           i5/'No. Gaussian Pairs =',f15.0/'Sums = ',1p,2d25.15/
+     >           'Counts:'/(i3,0p,f15.0))
+
+         call print_results('EP', class, m+1, 0, 0, nit, npm, 
+     >                      no_nodes, tm, Mops, 
+     >                      'Random numbers generated', 
+     >                      verified, npbversion, compiletime, cs1,
+     >                      cs2, cs3, cs4, cs5, cs6, cs7)
+
+      endif
+
+
+      if (.not.timers_enabled) goto 999
+
+      do i = 1, t_last
+         t1m(i) = timer_read(i)
+      end do
+      t1m(t_last+2) = t1m(t_rcomm)
+      t1m(t_last+1) = t1m(t_total) - t1m(t_last+2)
+
+      call MPI_Reduce(t1m, tsum,  t_last+2, dp_type, MPI_SUM, 
+     >                0, MPI_COMM_WORLD, ierr)
+      call MPI_Reduce(t1m, tming, t_last+2, dp_type, MPI_MIN, 
+     >                0, MPI_COMM_WORLD, ierr)
+      call MPI_Reduce(t1m, tmaxg, t_last+2, dp_type, MPI_MAX, 
+     >                0, MPI_COMM_WORLD, ierr)
+
+      if (node .eq. 0) then
+         write(*, 800) no_nodes
+         do i = 1, t_last+2
+            tsum(i) = tsum(i) / no_nodes
+            write(*, 810) i, t_recs(i), tming(i), tmaxg(i), tsum(i)
+         end do
+      endif
+ 800  format(' nprocs =', i6, 11x, 'minimum', 5x, 'maximum', 
+     >       5x, 'average')
+ 810  format(' timer ', i2, '(', A8, ') :', 3(2x,f10.4))
+
+ 999  continue
+      call mpi_finalize(ierr)
+
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/EP/mpinpb.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/EP/mpinpb.h
new file mode 100644
index 0000000..1f13637
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/EP/mpinpb.h
@@ -0,0 +1,9 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      include 'mpif.h'
+
+      integer           me, nprocs, root, dp_type
+      common /mpistuff/ me, nprocs, root, dp_type
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/FT/Makefile b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/FT/Makefile
new file mode 100644
index 0000000..1cc6e14
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/FT/Makefile
@@ -0,0 +1,23 @@
+SHELL=/bin/sh
+BENCHMARK=ft
+BENCHMARKU=FT
+
+include ../config/make.def
+
+include ../sys/make.common
+
+OBJS = ft.o ${COMMON}/${RAND}.o ${COMMON}/print_results.o ${COMMON}/timers.o
+
+${PROGRAM}: config ${OBJS}
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} ${FMPI_LIB}
+
+
+
+.f.o:
+	${FCOMPILE} $<
+
+ft.o:             ft.f  global.h mpinpb.h npbparams.h
+
+clean:
+	- rm -f *.o *~ mputil*
+	- rm -f ft npbparams.h core
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/FT/README b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/FT/README
new file mode 100644
index 0000000..ab08b36
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/FT/README
@@ -0,0 +1,5 @@
+This code implements the time integration of a three-dimensional
+partial differential equation using the Fast Fourier Transform.
+Some of the dimension statements are not F77 conforming and will
+not work using the g77 compiler. All dimension statements,
+however, are legal F90.
\ No newline at end of file
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/FT/ft.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/FT/ft.f
new file mode 100644
index 0000000..8c46f14
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/FT/ft.f
@@ -0,0 +1,2034 @@
+!-------------------------------------------------------------------------!
+!                                                                         !
+!        N  A  S     P A R A L L E L     B E N C H M A R K S  3.3         !
+!                                                                         !
+!                                   F T                                   !
+!                                                                         !
+!-------------------------------------------------------------------------!
+!                                                                         !
+!    This benchmark is part of the NAS Parallel Benchmark 3.3 suite.      !
+!    It is described in NAS Technical Reports 95-020 and 02-007           !
+!                                                                         !
+!    Permission to use, copy, distribute and modify this software         !
+!    for any purpose with or without fee is hereby granted.  We           !
+!    request, however, that all derived work reference the NAS            !
+!    Parallel Benchmarks 3.3. This software is provided "as is"           !
+!    without express or implied warranty.                                 !
+!                                                                         !
+!    Information on NPB 3.3, including the technical report, the          !
+!    original specifications, source code, results and information        !
+!    on how to submit new results, is available at:                       !
+!                                                                         !
+!           http://www.nas.nasa.gov/Software/NPB/                         !
+!                                                                         !
+!    Send comments or suggestions to  npb@nas.nasa.gov                    !
+!                                                                         !
+!          NAS Parallel Benchmarks Group                                  !
+!          NASA Ames Research Center                                      !
+!          Mail Stop: T27A-1                                              !
+!          Moffett Field, CA   94035-1000                                 !
+!                                                                         !
+!          E-mail:  npb@nas.nasa.gov                                      !
+!          Fax:     (650) 604-3957                                        !
+!                                                                         !
+!-------------------------------------------------------------------------!
+
+!TO REDUCE THE AMOUNT OF MEMORY REQUIRED BY THE BENCHMARK WE NO LONGER
+!STORE THE ENTIRE TIME EVOLUTION ARRAY "EX" FOR ALL TIME STEPS, BUT
+!JUST FOR THE FIRST. ALSO, IT IS STORED ONLY FOR THE PART OF THE GRID
+!FOR WHICH THE CALLING PROCESSOR IS RESPONSIBLE, SO THAT THE MEMORY 
+!USAGE BECOMES SCALABLE. THIS NEW ARRAY IS CALLED "TWIDDLE" (SEE
+!NPB3.0-SER)
+
+!TO AVOID PROBLEMS WITH VERY LARGE ARRAY SIZES THAT ARE COMPUTED BY
+!MULTIPLYING GRID DIMENSIONS (CAUSING INTEGER OVERFLOW IN THE VARIABLE
+!NTOTAL) AND SUBSEQUENTLY DIVIDING BY THE NUMBER OF PROCESSORS, WE
+!COMPUTE THE SIZE OF ARRAY PARTITIONS MORE CONSERVATIVELY AS
+!((NX*NY)/NP)*NZ, WHERE NX, NY, AND NZ ARE GRID DIMENSIONS AND NP IS
+!THE NUMBER OF PROCESSORS, THE RESULT IS STORED IN "NTDIVNP". FOR THE 
+!PERFORMANCE CALCULATION WE STORE THE TOTAL NUMBER OF GRID POINTS IN A 
+!FLOATING POINT NUMBER "NTOTAL_F" INSTEAD OF AN INTEGER.
+!THIS FIX WILL FAIL IF THE NUMBER OF PROCESSORS IS SMALL.
+
+!UGLY HACK OF SUBROUTINE IPOW46: FOR VERY LARGE GRIDS THE SINGLE EXPONENT
+!FROM NPB2.3 MAY NOT FIT IN A 32-BIT INTEGER. HOWEVER, WE KNOW THAT THE
+!"EXPONENT" ARGUMENT OF THIS ROUTINE CAN ALWAYS BE FACTORED INTO A TERM 
+!DIVISIBLE BY NX (EXP_1) AND ANOTHER TERM (EXP_2). NX IS USUALLY A POWER
+!OF TWO, SO WE CAN KEEP HALVING IT UNTIL THE PRODUCT OF EXP_1
+!AND EXP_2 IS SMALL ENOUGH (NAMELY EXP_2 ITSELF). THIS UPDATED VERSION
+!OF IPOW46, WHICH NOW TAKES THE TWO FACTORS OF "EXPONENT" AS SEPARATE
+!ARGUMENTS, MAY BREAK DOWN IF EXP_1 DOES NOT CONTAIN A LARGE POWER OF TWO.
+
+c---------------------------------------------------------------------
+c
+c Authors: D. Bailey
+c          W. Saphir
+c          R. F. Van der Wijngaart
+c
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c FT benchmark
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      program ft
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'mpif.h'
+      include 'global.h'
+      integer i, ierr
+      
+c---------------------------------------------------------------------
+c u0, u1, u2 are the main arrays in the problem. 
+c Depending on the decomposition, these arrays will have different 
+c dimensions. To accommodate all possibilities, we allocate them as 
+c one-dimensional arrays and pass them to subroutines for different 
+c views
+c  - u0 contains the (transformed) initial condition
+c  - u1 and u2 are working arrays
+c---------------------------------------------------------------------
+
+      double complex   u0(ntdivnp), 
+     >                 u1(ntdivnp), 
+     >                 u2(ntdivnp)
+      double precision twiddle(ntdivnp)
+c---------------------------------------------------------------------
+c Large arrays are in common so that they are allocated on the
+c heap rather than the stack. This common block is not
+c referenced directly anywhere else. Padding is to avoid accidental 
+c cache problems, since all array sizes are powers of two.
+c---------------------------------------------------------------------
+
+      double complex pad1(3), pad2(3), pad3(3)
+      common /bigarrays/ u0, pad1, u1, pad2, u2, pad3, twiddle
+
+      integer iter
+      double precision total_time, mflops
+      logical verified
+      character class
+
+      call MPI_Init(ierr)
+
+c---------------------------------------------------------------------
+c Run the entire problem once to make sure all data is touched. 
+c This reduces variable startup costs, which is important for such a 
+c short benchmark. The other NPB 2 implementations are similar. 
+c---------------------------------------------------------------------
+      do i = 1, t_max
+         call timer_clear(i)
+      end do
+
+      call timer_start(T_init)
+      call setup()
+      call compute_indexmap(twiddle, dims(1,3), dims(2,3), dims(3,3))
+      call compute_initial_conditions(u1, dims(1,1), dims(2,1), 
+     >                                dims(3,1))
+      call fft_init (dims(1,1))
+      call fft(1, u1, u0)
+      call timer_stop(T_init)
+      if (me .eq. 0) then
+         print *,'Initialization time =',timer_read(T_init)
+      endif
+
+c---------------------------------------------------------------------
+c Start over from the beginning. Note that all operations must
+c be timed, in contrast to other benchmarks. 
+c---------------------------------------------------------------------
+      do i = 1, t_max
+         call timer_clear(i)
+      end do
+      call MPI_Barrier(MPI_COMM_WORLD, ierr)
+
+      call timer_start(T_total)
+      if (timers_enabled) call timer_start(T_setup)
+
+      call compute_indexmap(twiddle, dims(1,3), dims(2,3), dims(3,3))
+      call compute_initial_conditions(u1, dims(1,1), dims(2,1), 
+     >                                dims(3,1))
+      call fft_init (dims(1,1))
+
+!      if (timers_enabled) call synchup()
+      if (timers_enabled) call timer_stop(T_setup)
+
+      if (timers_enabled) call timer_start(T_fft)
+      call fft(1, u1, u0)
+      if (timers_enabled) call timer_stop(T_fft)
+
+      do iter = 1, niter
+         if (timers_enabled) call timer_start(T_evolve)
+         call evolve(u0, u1, twiddle, dims(1,1), dims(2,1), dims(3,1))
+         if (timers_enabled) call timer_stop(T_evolve)
+         if (timers_enabled) call timer_start(T_fft)
+         call fft(-1, u1, u2)
+         if (timers_enabled) call timer_stop(T_fft)
+!         if (timers_enabled) call synchup()
+         if (timers_enabled) call timer_start(T_checksum)
+         call checksum(iter, u2, dims(1,1), dims(2,1), dims(3,1))
+         if (timers_enabled) call timer_stop(T_checksum)
+      end do
+
+      call verify(nx, ny, nz, niter, verified, class)
+      call timer_stop(t_total)
+      if (np .ne. np_min) verified = .false.
+      total_time = timer_read(t_total)
+
+      if( total_time .ne. 0. ) then
+         mflops = 1.0d-6*ntotal_f *
+     >             (14.8157+7.19641*log(ntotal_f)
+     >          +  (5.23518+7.21113*log(ntotal_f))*niter)
+     >                 /total_time
+      else
+         mflops = 0.0
+      endif
+      if (me .eq. 0) then
+         call print_results('FT', class, nx, ny, nz, niter, np_min, np,
+     >     total_time, mflops, '          floating point', verified, 
+     >     npbversion, compiletime, cs1, cs2, cs3, cs4, cs5, cs6, cs7)
+      endif
+      if (timers_enabled) call print_timers()
+      call MPI_Finalize(ierr)
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine evolve(u0, u1, twiddle, d1, d2, d3)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c evolve u0 -> u1 (t time steps) in Fourier space
+c---------------------------------------------------------------------
+
+      implicit none
+      include 'global.h'
+      integer d1, d2, d3
+      double precision exi
+      double complex u0(d1,d2,d3)
+      double complex u1(d1,d2,d3)
+      double precision twiddle(d1,d2,d3)
+      integer i, j, k
+
+      do k = 1, d3
+         do j = 1, d2
+            do i = 1, d1
+               u0(i,j,k) = u0(i,j,k)*(twiddle(i,j,k))
+               u1(i,j,k) = u0(i,j,k)
+            end do
+         end do
+      end do
+
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine compute_initial_conditions(u0, d1, d2, d3)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c Fill in array u0 with initial conditions from 
+c random number generator 
+c---------------------------------------------------------------------
+      implicit none
+      include 'global.h'
+      integer d1, d2, d3
+      double complex u0(d1, d2, d3)
+      integer k
+      double precision x0, start, an, dummy
+      
+c---------------------------------------------------------------------
+c 0-D and 1-D layouts are easy because each processor gets a contiguous
+c chunk of the array, in the Fortran ordering sense. 
+c For a 2-D layout, it's a bit more complicated. We always
+c have entire x-lines (contiguous) in processor. 
+c We can do ny/np1 of them at a time since we have
+c ny/np1 contiguous in y-direction. But then we jump
+c by z-planes (nz/np2 of them, total). 
+c For the 0-D and 1-D layouts we could do larger chunks, but
+c this turns out to have no measurable impact on performance. 
+c---------------------------------------------------------------------
+
+
+      start = seed                                    
+c---------------------------------------------------------------------
+c Jump to the starting element for our first plane.
+c---------------------------------------------------------------------
+      call ipow46(a, 2*nx, (zstart(1)-1)*ny + (ystart(1)-1), an)
+      dummy = randlc(start, an)
+      call ipow46(a, 2*nx, ny, an)
+      
+c---------------------------------------------------------------------
+c Go through by z planes filling in one square at a time.
+c---------------------------------------------------------------------
+      do k = 1, dims(3, 1) ! nz/np2
+         x0 = start
+         call vranlc(2*nx*dims(2, 1), x0, a, u0(1, 1, k))
+         if (k .ne. dims(3, 1)) dummy = randlc(start, an)
+      end do
+
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine ipow46(a, exp_1, exp_2, result)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c compute a^exponent mod 2^46
+c---------------------------------------------------------------------
+
+      implicit none
+      double precision a, result, dummy, q, r
+      integer exp_1, exp_2, n, n2, ierr
+      external randlc
+      double precision randlc
+      logical  two_pow
+c---------------------------------------------------------------------
+c Use
+c   a^n = a^(n/2)*a^(n/2) if n even else
+c   a^n = a*a^(n-1)       if n odd
+c---------------------------------------------------------------------
+      result = 1
+      if (exp_2 .eq. 0 .or. exp_1 .eq. 0) return
+      q = a
+      r = 1
+      n = exp_1
+      two_pow = .true.
+
+      do while (two_pow)
+         n2 = n/2
+         if (n2 * 2 .eq. n) then
+            dummy = randlc(q, q)
+            n = n2
+         else
+            n = n * exp_2
+            two_pow = .false.
+         endif
+      end do
+
+      do while (n .gt. 1)
+         n2 = n/2
+         if (n2 * 2 .eq. n) then
+            dummy = randlc(q, q) 
+            n = n2
+         else
+            dummy = randlc(r, q)
+            n = n-1
+         endif
+      end do
+      dummy = randlc(r, q)
+      result = r
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine setup
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+      include 'mpinpb.h'
+      include 'global.h'
+
+      integer ierr, i, j, fstatus
+      debug = .FALSE.
+      
+      call MPI_Comm_size(MPI_COMM_WORLD, np, ierr)
+      call MPI_Comm_rank(MPI_COMM_WORLD, me, ierr)
+
+      if (.not. convertdouble) then
+         dc_type = MPI_DOUBLE_COMPLEX
+      else
+         dc_type = MPI_COMPLEX
+      endif
+
+
+      if (me .eq. 0) then
+         write(*, 1000)
+
+         open (unit=2,file='timer.flag',status='old',iostat=fstatus)
+         timers_enabled = .false.
+         if (fstatus .eq. 0) then
+            timers_enabled = .true.
+            close(2)
+         endif
+
+         open (unit=2,file='inputft.data',status='old', iostat=fstatus)
+
+         if (fstatus .eq. 0) then
+            write(*,233) 
+ 233        format(' Reading from input file inputft.data')
+            read (2,*) niter
+            read (2,*) layout_type
+            read (2,*) np1, np2
+            close(2)
+
+c---------------------------------------------------------------------
+c check to make sure input data is consistent
+c---------------------------------------------------------------------
+
+    
+c---------------------------------------------------------------------
+c 1. product of processor grid dims must equal number of processors
+c---------------------------------------------------------------------
+
+            if (np1 * np2 .ne. np) then
+               write(*, 238)
+ 238           format(' np1 and np2 given in input file are not valid.')
+               write(*, 239) np1*np2, np
+ 239           format(' Product is ', i5, ' and should be ', i5)
+               call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
+            endif
+
+c---------------------------------------------------------------------
+c 2. layout type must be valid
+c---------------------------------------------------------------------
+
+            if (layout_type .ne. layout_0D .and.
+     >          layout_type .ne. layout_1D .and.
+     >          layout_type .ne. layout_2D) then
+               write(*, 240)
+ 240           format(' Layout type specified in inputft.data is 
+     >                  invalid ')
+               call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
+            endif
+
+c---------------------------------------------------------------------
+c 3. 0D layout must be 1x1 grid
+c---------------------------------------------------------------------
+
+            if (layout_type .eq. layout_0D .and.
+     >            (np1 .ne.1 .or. np2 .ne. 1)) then
+               write(*, 241)
+ 241           format(' For 0D layout, both np1 and np2 must be 1 ')
+               call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
+            endif
+c---------------------------------------------------------------------
+c 4. 1D layout must be 1xN grid
+c---------------------------------------------------------------------
+
+            if (layout_type .eq. layout_1D .and. np1 .ne. 1) then
+               write(*, 242)
+ 242           format(' For 1D layout, np1 must be 1 ')
+               call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
+            endif
+
+         else
+            write(*,234) 
+            niter = niter_default
+            if (np .eq. 1) then
+               np1 = 1
+               np2 = 1
+               layout_type = layout_0D
+            else if (np .le. nz) then
+               np1 = 1
+               np2 = np
+               layout_type = layout_1D
+            else
+               np1 = nz
+               np2 = np/nz
+               layout_type = layout_2D
+            endif
+         endif
+
+         if (np .lt. np_min) then
+            write(*, 10) np_min
+ 10         format(' Error: Compiled for ', I5, ' processors. ')
+            write(*, 11) np
+ 11         format(' Only ',  i5, ' processors found ')
+            call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
+         endif
+
+ 234     format(' No input file inputft.data. Using compiled defaults')
+         write(*, 1001) nx, ny, nz
+         write(*, 1002) niter
+         write(*, 1004) np
+         write(*, 1005) np1, np2
+         if (np .ne. np_min) write(*, 1006) np_min
+
+         if (layout_type .eq. layout_0D) then
+            write(*, 1010) '0D'
+         else if (layout_type .eq. layout_1D) then
+            write(*, 1010) '1D'
+         else
+            write(*, 1010) '2D'
+         endif
+
+ 1000 format(//,' NAS Parallel Benchmarks 3.3 -- FT Benchmark',/)
+ 1001    format(' Size                : ', i4, 'x', i4, 'x', i4)
+ 1002    format(' Iterations          : ', 7x, i7)
+ 1004    format(' Number of processes : ', 7x, i7)
+ 1005    format(' Processor array     : ', 5x, i4, 'x', i4)
+ 1006    format(' WARNING: compiled for ', i5, ' processes. ',
+     >          ' Will not verify. ')
+ 1010    format(' Layout type         : ', 9x, A5)
+      endif
+
+
+c---------------------------------------------------------------------
+c Broadcast parameters 
+c---------------------------------------------------------------------
+      call MPI_BCAST(np1, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
+      call MPI_BCAST(np2, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
+      call MPI_BCAST(layout_type, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, 
+     &               ierr)
+      call MPI_BCAST(niter, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
+      call MPI_BCAST(timers_enabled, 1, MPI_LOGICAL, 0, MPI_COMM_WORLD,
+     &               ierr)
+
+      if (np1 .eq. 1 .and. np2 .eq. 1) then
+        layout_type = layout_0D
+      else if (np1 .eq. 1) then
+         layout_type = layout_1D
+      else
+         layout_type = layout_2D
+      endif
+
+      if (layout_type .eq. layout_0D) then
+         do i = 1, 3
+            dims(1, i) = nx
+            dims(2, i) = ny
+            dims(3, i) = nz
+         end do
+      else if (layout_type .eq. layout_1D) then
+         dims(1, 1) = nx
+         dims(2, 1) = ny
+         dims(3, 1) = nz
+
+         dims(1, 2) = nx
+         dims(2, 2) = ny
+         dims(3, 2) = nz
+
+         dims(1, 3) = nz
+         dims(2, 3) = nx
+         dims(3, 3) = ny
+      else if (layout_type .eq. layout_2D) then
+         dims(1, 1) = nx
+         dims(2, 1) = ny
+         dims(3, 1) = nz
+
+         dims(1, 2) = ny
+         dims(2, 2) = nx
+         dims(3, 2) = nz
+
+         dims(1, 3) = nz
+         dims(2, 3) = nx
+         dims(3, 3) = ny
+
+      endif
+      do i = 1, 3
+         dims(2, i) = dims(2, i) / np1
+         dims(3, i) = dims(3, i) / np2
+      end do
+
+
+c---------------------------------------------------------------------
+c Determine processor coordinates of this processor
+c Processor grid is np1xnp2. 
+c Arrays are always (n1, n2/np1, n3/np2)
+c Processor coords are zero-based. 
+c---------------------------------------------------------------------
+      me2 = mod(me, np2)  ! goes from 0...np2-1
+      me1 = me/np2        ! goes from 0...np1-1
+c---------------------------------------------------------------------
+c Communicators for rows/columns of processor grid. 
+c commslice1 is communicator of all procs with same me1, ranked as me2
+c commslice2 is communicator of all procs with same me2, ranked as me1
+c mpi_comm_split(comm, color, key, ...)
+c---------------------------------------------------------------------
+      call MPI_Comm_split(MPI_COMM_WORLD, me1, me2, commslice1, ierr)
+      call MPI_Comm_split(MPI_COMM_WORLD, me2, me1, commslice2, ierr)
+!      if (timers_enabled) call synchup()
+
+      if (debug) print *, 'proc coords: ', me, me1, me2
+
+c---------------------------------------------------------------------
+c Determine which section of the grid is owned by this
+c processor. 
+c---------------------------------------------------------------------
+      if (layout_type .eq. layout_0d) then
+
+         do i = 1, 3
+            xstart(i) = 1
+            xend(i)   = nx
+            ystart(i) = 1
+            yend(i)   = ny
+            zstart(i) = 1
+            zend(i)   = nz
+         end do
+
+      else if (layout_type .eq. layout_1d) then
+
+         xstart(1) = 1
+         xend(1)   = nx
+         ystart(1) = 1
+         yend(1)   = ny
+         zstart(1) = 1 + me2 * nz/np2
+         zend(1)   = (me2+1) * nz/np2
+
+         xstart(2) = 1
+         xend(2)   = nx
+         ystart(2) = 1
+         yend(2)   = ny
+         zstart(2) = 1 + me2 * nz/np2
+         zend(2)   = (me2+1) * nz/np2
+
+         xstart(3) = 1
+         xend(3)   = nx
+         ystart(3) = 1 + me2 * ny/np2
+         yend(3)   = (me2+1) * ny/np2
+         zstart(3) = 1
+         zend(3)   = nz
+
+      else if (layout_type .eq. layout_2d) then
+
+         xstart(1) = 1
+         xend(1)   = nx
+         ystart(1) = 1 + me1 * ny/np1
+         yend(1)   = (me1+1) * ny/np1
+         zstart(1) = 1 + me2 * nz/np2
+         zend(1)   = (me2+1) * nz/np2
+
+         xstart(2) = 1 + me1 * nx/np1
+         xend(2)   = (me1+1)*nx/np1
+         ystart(2) = 1
+         yend(2)   = ny
+         zstart(2) = zstart(1)
+         zend(2)   = zend(1)
+
+         xstart(3) = xstart(2)
+         xend(3)   = xend(2)
+         ystart(3) = 1 + me2 *ny/np2
+         yend(3)   = (me2+1)*ny/np2
+         zstart(3) = 1
+         zend(3)   = nz
+      endif
+
+c---------------------------------------------------------------------
+c Set up info for blocking of ffts and transposes.  This improves
+c performance on cache-based systems. Blocking involves
+c working on a chunk of the problem at a time, taking chunks
+c along the first, second, or third dimension. 
+c
+c - In cffts1 blocking is on 2nd dimension (with fft on 1st dim)
+c - In cffts2/3 blocking is on 1st dimension (with fft on 2nd and 3rd dims)
+
+c Since 1st dim is always in processor, we'll assume it's long enough 
+c (default blocking factor is 16 so min size for 1st dim is 16)
+c The only case we have to worry about is cffts1 in a 2d decomposition,
+c so the blocking factor should not be larger than the 2nd dimension. 
+c---------------------------------------------------------------------
+
+      fftblock = fftblock_default
+      fftblockpad = fftblockpad_default
+
+      if (layout_type .eq. layout_2d) then
+         if (dims(2, 1) .lt. fftblock) fftblock = dims(2, 1)
+         if (dims(2, 2) .lt. fftblock) fftblock = dims(2, 2)
+         if (dims(2, 3) .lt. fftblock) fftblock = dims(2, 3)
+      endif
+      
+      if (fftblock .ne. fftblock_default) fftblockpad = fftblock+3
+
+      return
+      end
+
+      
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine compute_indexmap(twiddle, d1, d2, d3)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c compute function from local (i,j,k) to ibar^2+jbar^2+kbar^2 
+c for time evolution exponent. 
+c---------------------------------------------------------------------
+
+      implicit none
+      include 'mpinpb.h'
+      include 'global.h'
+      integer d1, d2, d3
+      integer i, j, k, ii, ii2, jj, ij2, kk
+      double precision ap, twiddle(d1, d2, d3)
+
+c---------------------------------------------------------------------
+c this function is very different depending on whether 
+c we are in the 0d, 1d or 2d layout. Compute separately. 
+c basically we want to convert the fortran indices 
+c   1 2 3 4 5 6 7 8 
+c to 
+c   0 1 2 3 -4 -3 -2 -1
+c The following magic formula does the trick:
+c mod(i-1+n/2, n) - n/2
+c---------------------------------------------------------------------
+
+      ap = - 4.d0 * alpha * pi *pi
+
+      if (layout_type .eq. layout_0d) then ! xyz layout
+         do i = 1, dims(1,3)
+            ii =  mod(i+xstart(3)-2+nx/2, nx) - nx/2
+            ii2 = ii*ii
+            do j = 1, dims(2,3)
+               jj = mod(j+ystart(3)-2+ny/2, ny) - ny/2
+               ij2 = jj*jj+ii2
+               do k = 1, dims(3,3)
+                  kk = mod(k+zstart(3)-2+nz/2, nz) - nz/2
+                  twiddle(i,j,k) = dexp(ap*dfloat(kk*kk+ij2))
+               end do
+            end do
+         end do
+      else if (layout_type .eq. layout_1d) then ! zxy layout 
+         do i = 1,dims(2,3)
+            ii =  mod(i+xstart(3)-2+nx/2, nx) - nx/2
+            ii2 = ii*ii
+            do j = 1,dims(3,3)
+               jj = mod(j+ystart(3)-2+ny/2, ny) - ny/2
+               ij2 = jj*jj+ii2
+               do k = 1,dims(1,3)
+                  kk = mod(k+zstart(3)-2+nz/2, nz) - nz/2
+                  twiddle(k,i,j) = dexp(ap*dfloat(kk*kk+ij2))
+               end do
+            end do
+         end do
+      else if (layout_type .eq. layout_2d) then ! zxy layout
+         do i = 1,dims(2,3)
+            ii =  mod(i+xstart(3)-2+nx/2, nx) - nx/2
+            ii2 = ii*ii
+            do j = 1, dims(3,3)
+               jj = mod(j+ystart(3)-2+ny/2, ny) - ny/2
+               ij2 = jj*jj+ii2
+               do k =1,dims(1,3)
+                  kk = mod(k+zstart(3)-2+nz/2, nz) - nz/2
+                  twiddle(k,i,j) = dexp(ap*dfloat(kk*kk+ij2))
+               end do
+            end do
+         end do
+      else
+         print *, ' Unknown layout type ', layout_type
+         stop
+      endif
+
+      return
+      end
+
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine print_timers()
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+      integer i, ierr
+      include 'global.h'
+      include 'mpinpb.h'
+      character*25 tstrings(T_max+2)
+      double precision t1(T_max+2), tsum(T_max+2),
+     >                 tming(T_max+2), tmaxg(T_max+2)
+      data tstrings / '          total ', 
+     >                '          setup ', 
+     >                '            fft ', 
+     >                '         evolve ', 
+     >                '       checksum ', 
+     >                '         fftlow ', 
+     >                '        fftcopy ', 
+     >                '      transpose ', 
+     >                ' transpose1_loc ', 
+     >                ' transpose1_glo ', 
+     >                ' transpose1_fin ', 
+     >                ' transpose2_loc ', 
+     >                ' transpose2_glo ', 
+     >                ' transpose2_fin ', 
+     >                '           sync ',
+     >                '           init ',
+     >                '        totcomp ',
+     >                '        totcomm ' /
+
+      do i = 1, t_max
+         t1(i) = timer_read(i)
+      end do
+      t1(t_max+2) = t1(t_transxzglo) + t1(t_transxyglo) + t1(t_synch)
+      t1(t_max+1) = t1(t_total) - t1(t_max+2)
+
+      call MPI_Reduce(t1, tsum,  t_max+2, MPI_DOUBLE_PRECISION, 
+     >                MPI_SUM, 0, MPI_COMM_WORLD, ierr)
+      call MPI_Reduce(t1, tming, t_max+2, MPI_DOUBLE_PRECISION, 
+     >                MPI_MIN, 0, MPI_COMM_WORLD, ierr)
+      call MPI_Reduce(t1, tmaxg, t_max+2, MPI_DOUBLE_PRECISION, 
+     >                MPI_MAX, 0, MPI_COMM_WORLD, ierr)
+
+      if (me .ne. 0) return
+      write(*, 800) np
+      do i = 1, t_max+2
+         if (tsum(i) .ne. 0.0d0) then
+            write(*, 810) i, tstrings(i), tming(i), tmaxg(i), tsum(i)/np
+         endif
+      end do
+ 800  format(' nprocs =', i6, 19x, 'minimum', 5x, 'maximum', 
+     >       5x, 'average')
+ 810  format(' timer ', i2, '(', A16, ') :', 3(2X,F10.4))
+      return
+      end
+
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine fft(dir, x1, x2)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+      include 'global.h'
+      integer dir
+      double complex x1(ntdivnp), x2(ntdivnp)
+
+      double complex scratch(fftblockpad_default*maxdim*2)
+
+c---------------------------------------------------------------------
+c note: args x1, x2 must be different arrays
+c note: args for cfftsx are (direction, layout, xin, xout, scratch)
+c       xin/xout may be the same and it can be somewhat faster
+c       if they are
+c note: args for transpose are (layout1, layout2, xin, xout)
+c       xin/xout must be different
+c---------------------------------------------------------------------
+
+      if (dir .eq. 1) then
+         if (layout_type .eq. layout_0d) then
+            call cffts1(1, dims(1,1), dims(2,1), dims(3,1), 
+     >                  x1, x1, scratch)
+            call cffts2(1, dims(1,2), dims(2,2), dims(3,2), 
+     >                  x1, x1, scratch)
+            call cffts3(1, dims(1,3), dims(2,3), dims(3,3), 
+     >                  x1, x2, scratch)
+         else if (layout_type .eq. layout_1d) then
+            call cffts1(1, dims(1,1), dims(2,1), dims(3,1), 
+     >                  x1, x1, scratch)
+            call cffts2(1, dims(1,2), dims(2,2), dims(3,2), 
+     >                  x1, x1, scratch)
+            if (timers_enabled) call timer_start(T_transpose)
+            call transpose_xy_z(2, 3, x1, x2)
+            if (timers_enabled) call timer_stop(T_transpose)
+            call cffts1(1, dims(1,3), dims(2,3), dims(3,3), 
+     >                  x2, x2, scratch)
+         else if (layout_type .eq. layout_2d) then
+            call cffts1(1, dims(1,1), dims(2,1), dims(3,1), 
+     >                  x1, x1, scratch)
+            if (timers_enabled) call timer_start(T_transpose)
+            call transpose_x_y(1, 2, x1, x2)
+            if (timers_enabled) call timer_stop(T_transpose)
+            call cffts1(1, dims(1,2), dims(2,2), dims(3,2), 
+     >                  x2, x2, scratch)
+            if (timers_enabled) call timer_start(T_transpose)
+            call transpose_x_z(2, 3, x2, x1)
+            if (timers_enabled) call timer_stop(T_transpose)
+            call cffts1(1, dims(1,3), dims(2,3), dims(3,3), 
+     >                  x1, x2, scratch)
+         endif
+      else
+         if (layout_type .eq. layout_0d) then
+            call cffts3(-1, dims(1,3), dims(2,3), dims(3,3), 
+     >                  x1, x1, scratch)
+            call cffts2(-1, dims(1,2), dims(2,2), dims(3,2), 
+     >                  x1, x1, scratch)
+            call cffts1(-1, dims(1,1), dims(2,1), dims(3,1), 
+     >                  x1, x2, scratch)
+         else if (layout_type .eq. layout_1d) then
+            call cffts1(-1, dims(1,3), dims(2,3), dims(3,3), 
+     >                  x1, x1, scratch)
+            if (timers_enabled) call timer_start(T_transpose)
+            call transpose_x_yz(3, 2, x1, x2)
+            if (timers_enabled) call timer_stop(T_transpose)
+            call cffts2(-1, dims(1,2), dims(2,2), dims(3,2), 
+     >                  x2, x2, scratch)
+            call cffts1(-1, dims(1,1), dims(2,1), dims(3,1), 
+     >                  x2, x2, scratch)
+         else if (layout_type .eq. layout_2d) then
+            call cffts1(-1, dims(1,3), dims(2,3), dims(3,3), 
+     >                  x1, x1, scratch)
+            if (timers_enabled) call timer_start(T_transpose)
+            call transpose_x_z(3, 2, x1, x2)
+            if (timers_enabled) call timer_stop(T_transpose)
+            call cffts1(-1, dims(1,2), dims(2,2), dims(3,2), 
+     >                  x2, x2, scratch)
+            if (timers_enabled) call timer_start(T_transpose)
+            call transpose_x_y(2, 1, x2, x1)
+            if (timers_enabled) call timer_stop(T_transpose)
+            call cffts1(-1, dims(1,1), dims(2,1), dims(3,1), 
+     >                  x1, x2, scratch)
+         endif
+      endif
+      return
+      end
+
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine cffts1(is, d1, d2, d3, x, xout, y)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'global.h'
+      integer is, d1, d2, d3, logd1
+      double complex x(d1,d2,d3)
+      double complex xout(d1,d2,d3)
+      double complex y(fftblockpad, d1, 2) 
+      integer i, j, k, jj
+
+      logd1 = ilog2(d1)
+
+      do k = 1, d3
+         do jj = 0, d2 - fftblock, fftblock
+            if (timers_enabled) call timer_start(T_fftcopy)
+            do j = 1, fftblock
+               do i = 1, d1
+                  y(j,i,1) = x(i,j+jj,k)
+               enddo
+            enddo
+            if (timers_enabled) call timer_stop(T_fftcopy)
+            
+            if (timers_enabled) call timer_start(T_fftlow)
+            call cfftz (is, logd1, d1, y, y(1,1,2))
+            if (timers_enabled) call timer_stop(T_fftlow)
+
+            if (timers_enabled) call timer_start(T_fftcopy)
+            do j = 1, fftblock
+               do i = 1, d1
+                  xout(i,j+jj,k) = y(j,i,1)
+               enddo
+            enddo
+            if (timers_enabled) call timer_stop(T_fftcopy)
+         enddo
+      enddo
+
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine cffts2(is, d1, d2, d3, x, xout, y)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'global.h'
+      integer is, d1, d2, d3, logd2
+      double complex x(d1,d2,d3)
+      double complex xout(d1,d2,d3)
+      double complex y(fftblockpad, d2, 2) 
+      integer i, j, k, ii
+
+      logd2 = ilog2(d2)
+
+      do k = 1, d3
+        do ii = 0, d1 - fftblock, fftblock
+           if (timers_enabled) call timer_start(T_fftcopy)
+           do j = 1, d2
+              do i = 1, fftblock
+                 y(i,j,1) = x(i+ii,j,k)
+              enddo
+           enddo
+           if (timers_enabled) call timer_stop(T_fftcopy)
+
+           if (timers_enabled) call timer_start(T_fftlow)
+           call cfftz (is, logd2, d2, y, y(1, 1, 2))
+           if (timers_enabled) call timer_stop(T_fftlow)
+
+           if (timers_enabled) call timer_start(T_fftcopy)
+           do j = 1, d2
+              do i = 1, fftblock
+                 xout(i+ii,j,k) = y(i,j,1)
+              enddo
+           enddo
+           if (timers_enabled) call timer_stop(T_fftcopy)
+        enddo
+      enddo
+
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine cffts3(is, d1, d2, d3, x, xout, y)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'global.h'
+      integer is, d1, d2, d3, logd3
+      double complex x(d1,d2,d3)
+      double complex xout(d1,d2,d3)
+      double complex y(fftblockpad, d3, 2) 
+      integer i, j, k, ii
+
+      logd3 = ilog2(d3)
+
+      do j = 1, d2
+        do ii = 0, d1 - fftblock, fftblock
+           if (timers_enabled) call timer_start(T_fftcopy)
+           do k = 1, d3
+              do i = 1, fftblock
+                 y(i,k,1) = x(i+ii,j,k)
+              enddo
+           enddo
+           if (timers_enabled) call timer_stop(T_fftcopy)
+
+           if (timers_enabled) call timer_start(T_fftlow)
+           call cfftz (is, logd3, d3, y, y(1, 1, 2))
+           if (timers_enabled) call timer_stop(T_fftlow)
+
+           if (timers_enabled) call timer_start(T_fftcopy)
+           do k = 1, d3
+              do i = 1, fftblock
+                 xout(i+ii,j,k) = y(i,k,1)
+              enddo
+           enddo
+           if (timers_enabled) call timer_stop(T_fftcopy)
+        enddo
+      enddo
+
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine fft_init (n)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c compute the roots-of-unity array that will be used for subsequent FFTs. 
+c---------------------------------------------------------------------
+
+      implicit none
+      include 'global.h'
+
+      integer m,n,nu,ku,i,j,ln
+      double precision t, ti
+
+
+c---------------------------------------------------------------------
+c   Initialize the U array with sines and cosines in a manner that permits
+c   stride one access at each FFT iteration.
+c---------------------------------------------------------------------
+      nu = n
+      m = ilog2(n)
+      u(1) = m
+      ku = 2
+      ln = 1
+
+      do j = 1, m
+         t = pi / ln
+         
+         do i = 0, ln - 1
+            ti = i * t
+            u(i+ku) = dcmplx (cos (ti), sin(ti))
+         enddo
+         
+         ku = ku + ln
+         ln = 2 * ln
+      enddo
+      
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine cfftz (is, m, n, x, y)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   Computes NY N-point complex-to-complex FFTs of X using an algorithm due
+c   to Swarztrauber.  X is both the input and the output array, while Y is a 
+c   scratch array.  It is assumed that N = 2^M.  Before calling CFFTZ to 
+c   perform FFTs, the array U must be initialized by calling FFT_INIT with 
+c   N set to 2^MX, where MX is the maximum value of M for any 
+c   subsequent call.
+c---------------------------------------------------------------------
+
+      implicit none
+      include 'global.h'
+
+      integer is,m,n,i,j,l,mx
+      double complex x, y
+
+      dimension x(fftblockpad,n), y(fftblockpad,n)
+
+c---------------------------------------------------------------------
+c   Check if input parameters are invalid.
+c---------------------------------------------------------------------
+      mx = u(1)
+      if ((is .ne. 1 .and. is .ne. -1) .or. m .lt. 1 .or. m .gt. mx)    
+     >  then
+        write (*, 1)  is, m, mx
+ 1      format ('CFFTZ: Either U has not been initialized, or else'/    
+     >    'one of the input parameters is invalid', 3I5)
+        stop
+      endif
+
+c---------------------------------------------------------------------
+c   Perform one variant of the Stockham FFT.
+c---------------------------------------------------------------------
+      do l = 1, m, 2
+        call fftz2 (is, l, m, n, fftblock, fftblockpad, u, x, y)
+        if (l .eq. m) goto 160
+        call fftz2 (is, l + 1, m, n, fftblock, fftblockpad, u, y, x)
+      enddo
+
+      goto 180
+
+c---------------------------------------------------------------------
+c   Copy Y to X.
+c---------------------------------------------------------------------
+ 160  do j = 1, n
+        do i = 1, fftblock
+          x(i,j) = y(i,j)
+        enddo
+      enddo
+
+ 180  continue
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine fftz2 (is, l, m, n, ny, ny1, u, x, y)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   Performs the L-th iteration of the second variant of the Stockham FFT.
+c---------------------------------------------------------------------
+
+      implicit none
+
+      integer is,k,l,m,n,ny,ny1,n1,li,lj,lk,ku,i,j,i11,i12,i21,i22
+      double complex u,x,y,u1,x11,x21
+      dimension u(n), x(ny1,n), y(ny1,n)
+
+
+c---------------------------------------------------------------------
+c   Set initial parameters.
+c---------------------------------------------------------------------
+
+      n1 = n / 2
+      lk = 2 ** (l - 1)
+      li = 2 ** (m - l)
+      lj = 2 * lk
+      ku = li + 1
+
+      do i = 0, li - 1
+        i11 = i * lk + 1
+        i12 = i11 + n1
+        i21 = i * lj + 1
+        i22 = i21 + lk
+        if (is .ge. 1) then
+          u1 = u(ku+i)
+        else
+          u1 = dconjg (u(ku+i))
+        endif
+
+c---------------------------------------------------------------------
+c   This loop is vectorizable.
+c---------------------------------------------------------------------
+        do k = 0, lk - 1
+          do j = 1, ny
+            x11 = x(j,i11+k)
+            x21 = x(j,i12+k)
+            y(j,i21+k) = x11 + x21
+            y(j,i22+k) = u1 * (x11 - x21)
+          enddo
+        enddo
+      enddo
+
+      return
+      end
+
+c---------------------------------------------------------------------
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      integer function ilog2(n)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+      integer n, nn, lg
+      if (n .eq. 1) then
+         ilog2=0
+         return
+      endif
+      lg = 1
+      nn = 2
+      do while (nn .lt. n)
+         nn = nn*2
+         lg = lg+1
+      end do
+      ilog2 = lg
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine transpose_x_yz(l1, l2, xin, xout)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+      include 'global.h'
+      integer l1, l2
+      double complex xin(ntdivnp), xout(ntdivnp)
+
+      call transpose2_local(dims(1,l1),dims(2, l1)*dims(3, l1),
+     >                          xin, xout)
+
+      call transpose2_global(xout, xin)
+
+      call transpose2_finish(dims(1,l1),dims(2, l1)*dims(3, l1),
+     >                          xin, xout)
+
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine transpose_xy_z(l1, l2, xin, xout)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+      include 'global.h'
+      integer l1, l2
+      double complex xin(ntdivnp), xout(ntdivnp)
+
+      call transpose2_local(dims(1,l1)*dims(2, l1),dims(3, l1),
+     >                          xin, xout)
+      call transpose2_global(xout, xin)
+      call transpose2_finish(dims(1,l1)*dims(2, l1),dims(3, l1),
+     >                          xin, xout)
+
+      return
+      end
+
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine transpose2_local(n1, n2, xin, xout)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+      include 'mpinpb.h'
+      include 'global.h'
+      integer n1, n2
+      double complex xin(n1, n2), xout(n2, n1)
+      
+      double complex z(transblockpad, transblock)
+
+      integer i, j, ii, jj
+
+      if (timers_enabled) call timer_start(T_transxzloc)
+
+c---------------------------------------------------------------------
+c If possible, block the transpose for cache memory systems. 
+c How much does this help? Example: R8000 Power Challenge (90 MHz)
+c Blocked version decreases time spent in this routine 
+c from 14 seconds to 5.2 seconds on 8 nodes class A.
+c---------------------------------------------------------------------
+
+      if (n1 .lt. transblock .or. n2 .lt. transblock) then
+         if (n1 .ge. n2) then 
+            do j = 1, n2
+               do i = 1, n1
+                  xout(j, i) = xin(i, j)
+               end do
+            end do
+         else
+            do i = 1, n1
+               do j = 1, n2
+                  xout(j, i) = xin(i, j)
+               end do
+            end do
+         endif
+      else
+         do j = 0, n2-1, transblock
+            do i = 0, n1-1, transblock
+               
+c---------------------------------------------------------------------
+c Note: compiler should be able to take j+jj out of inner loop
+c---------------------------------------------------------------------
+               do jj = 1, transblock
+                  do ii = 1, transblock
+                     z(jj,ii) = xin(i+ii, j+jj)
+                  end do
+               end do
+               
+               do ii = 1, transblock
+                  do jj = 1, transblock
+                     xout(j+jj, i+ii) = z(jj,ii)
+                  end do
+               end do
+               
+            end do
+         end do
+      endif
+      if (timers_enabled) call timer_stop(T_transxzloc)
+
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine transpose2_global(xin, xout)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+      include 'global.h'
+      include 'mpinpb.h'
+      double complex xin(ntdivnp)
+      double complex xout(ntdivnp) 
+      integer ierr
+
+!      if (timers_enabled) call synchup()
+
+      if (timers_enabled) call timer_start(T_transxzglo)
+      call mpi_alltoall(xin, ntdivnp/np, dc_type,
+     >                  xout, ntdivnp/np, dc_type,
+     >                  commslice1, ierr)
+      if (timers_enabled) call timer_stop(T_transxzglo)
+
+      return
+      end
+
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine transpose2_finish(n1, n2, xin, xout)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+      include 'global.h'
+      integer n1, n2, ioff
+      double complex xin(n2, n1/np2, 0:np2-1), xout(n2*np2, n1/np2)
+      
+      integer i, j, p
+
+      if (timers_enabled) call timer_start(T_transxzfin)
+      do p = 0, np2-1
+         ioff = p*n2
+         do j = 1, n1/np2
+            do i = 1, n2
+               xout(i+ioff, j) = xin(i, j, p)
+            end do
+         end do
+      end do
+      if (timers_enabled) call timer_stop(T_transxzfin)
+
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine transpose_x_z(l1, l2, xin, xout)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+      include 'global.h'
+      integer l1, l2
+      double complex xin(ntdivnp), xout(ntdivnp)
+
+      call transpose_x_z_local(dims(1,l1),dims(2,l1),dims(3,l1),
+     >                         xin, xout)
+      call transpose_x_z_global(dims(1,l1),dims(2,l1),dims(3,l1), 
+     >                          xout, xin)
+      call transpose_x_z_finish(dims(1,l2),dims(2,l2),dims(3,l2), 
+     >                          xin, xout)
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine transpose_x_z_local(d1, d2, d3, xin, xout)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+      include 'global.h'
+      integer d1, d2, d3
+      double complex xin(d1,d2,d3)
+      double complex xout(d3,d2,d1)
+      integer block1, block3
+      integer i, j, k, kk, ii, i1, k1
+
+      double complex buf(transblockpad, maxdim)
+      if (timers_enabled) call timer_start(T_transxzloc)
+      if (d1 .lt. 32) goto 100
+      block3 = d3
+      if (block3 .eq. 1)  goto 100
+      if (block3 .gt. transblock) block3 = transblock
+      block1 = d1
+      if (block1*block3 .gt. transblock*transblock) 
+     >          block1 = transblock*transblock/block3
+c---------------------------------------------------------------------
+c blocked transpose
+c---------------------------------------------------------------------
+      do j = 1, d2
+         do kk = 0, d3-block3, block3
+            do ii = 0, d1-block1, block1
+               
+               do k = 1, block3
+                  k1 = k + kk
+                  do i = 1, block1
+                     buf(k, i) = xin(i+ii, j, k1)
+                  end do
+               end do
+
+               do i = 1, block1
+                  i1 = i + ii
+                  do k = 1, block3
+                     xout(k+kk, j, i1) = buf(k, i)
+                  end do
+               end do
+
+            end do
+         end do
+      end do
+      goto 200
+      
+
+c---------------------------------------------------------------------
+c basic transpose
+c---------------------------------------------------------------------
+ 100  continue
+      
+      do j = 1, d2
+         do k = 1, d3
+            do i = 1, d1
+               xout(k, j, i) = xin(i, j, k)
+            end do
+         end do
+      end do
+
+c---------------------------------------------------------------------
+c all done
+c---------------------------------------------------------------------
+ 200  continue
+
+      if (timers_enabled) call timer_stop(T_transxzloc)
+      return 
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine transpose_x_z_global(d1, d2, d3, xin, xout)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+      include 'global.h'
+      include 'mpinpb.h'
+      integer d1, d2, d3
+      double complex xin(d3,d2,d1)
+      double complex xout(d3,d2,d1) ! not real layout, but right size
+      integer ierr
+
+!      if (timers_enabled) call synchup()
+
+c---------------------------------------------------------------------
+c do transpose among all processes with same 1-coord (me1)
+c---------------------------------------------------------------------
+      if (timers_enabled) call timer_start(T_transxzglo)
+      call mpi_alltoall(xin, d1*d2*d3/np2, dc_type,
+     >                  xout, d1*d2*d3/np2, dc_type,
+     >                  commslice1, ierr)
+      if (timers_enabled) call timer_stop(T_transxzglo)
+      return
+      end
+      
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine transpose_x_z_finish(d1, d2, d3, xin, xout)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+      include 'global.h'
+      integer d1, d2, d3
+      double complex xin(d1/np2, d2, d3, 0:np2-1)
+      double complex xout(d1,d2,d3)
+      integer i, j, k, p, ioff
+      if (timers_enabled) call timer_start(T_transxzfin)
+c---------------------------------------------------------------------
+c this is the most straightforward way of doing it. the
+c calculation in the inner loop doesn't help. 
+c      do i = 1, d1/np2
+c         do j = 1, d2
+c            do k = 1, d3
+c               do p = 0, np2-1
+c                  ii = i + p*d1/np2
+c                  xout(ii, j, k) = xin(i, j, k, p)
+c               end do
+c            end do
+c         end do
+c      end do
+c---------------------------------------------------------------------
+
+      do p = 0, np2-1
+         ioff = p*d1/np2
+         do k = 1, d3
+            do j = 1, d2
+               do i = 1, d1/np2
+                  xout(i+ioff, j, k) = xin(i, j, k, p)
+               end do
+            end do
+         end do
+      end do
+      if (timers_enabled) call timer_stop(T_transxzfin)
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine transpose_x_y(l1, l2, xin, xout)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+      include 'global.h'
+      integer l1, l2
+      double complex xin(ntdivnp), xout(ntdivnp)
+
+c---------------------------------------------------------------------
+c xy transpose is a little tricky, since we don't want
+c to touch 3rd axis. But alltoall must involve 3rd axis (most 
+c slowly varying) to be efficient. So we do
+c (nx, ny/np1, nz/np2) -> (ny/np1, nz/np2, nx) (local)
+c (ny/np1, nz/np2, nx) -> ((ny/np1*nz/np2)*np1, nx/np1) (global)
+c then local finish. 
+c---------------------------------------------------------------------
+
+
+      call transpose_x_y_local(dims(1,l1),dims(2,l1),dims(3,l1),
+     >                         xin, xout)
+      call transpose_x_y_global(dims(1,l1),dims(2,l1),dims(3,l1), 
+     >                          xout, xin)
+      call transpose_x_y_finish(dims(1,l2),dims(2,l2),dims(3,l2), 
+     >                          xin, xout)
+
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine transpose_x_y_local(d1, d2, d3, xin, xout)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+      include 'global.h'
+      integer d1, d2, d3
+      double complex xin(d1, d2, d3)
+      double complex xout(d2, d3, d1)
+      integer i, j, k
+      if (timers_enabled) call timer_start(T_transxyloc)
+
+      do k = 1, d3
+         do i = 1, d1
+            do j = 1, d2
+               xout(j,k,i)=xin(i,j,k)
+            end do
+         end do
+      end do
+      if (timers_enabled) call timer_stop(T_transxyloc)
+      return 
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine transpose_x_y_global(d1, d2, d3, xin, xout)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+      include 'global.h'
+      include 'mpinpb.h'
+      integer d1, d2, d3
+c---------------------------------------------------------------------
+c array is in form (ny/np1, nz/np2, nx)
+c---------------------------------------------------------------------
+      double complex xin(d2,d3,d1)
+      double complex xout(d2,d3,d1) ! not real layout but right size
+      integer ierr
+
+!      if (timers_enabled) call synchup()
+
+c---------------------------------------------------------------------
+c do transpose among all processes with same 1-coord (me1)
+c---------------------------------------------------------------------
+      if (timers_enabled) call timer_start(T_transxyglo)
+      call mpi_alltoall(xin, d1*d2*d3/np1, dc_type,
+     >                  xout, d1*d2*d3/np1, dc_type,
+     >                  commslice2, ierr)
+      if (timers_enabled) call timer_stop(T_transxyglo)
+
+      return
+      end
+      
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine transpose_x_y_finish(d1, d2, d3, xin, xout)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+      include 'global.h'
+      integer d1, d2, d3
+      double complex xin(d1/np1, d3, d2, 0:np1-1)
+      double complex xout(d1,d2,d3)
+      integer i, j, k, p, ioff
+      if (timers_enabled) call timer_start(T_transxyfin)
+c---------------------------------------------------------------------
+c this is the most straightforward way of doing it. the
+c calculation in the inner loop doesn't help. 
+c      do i = 1, d1/np1
+c         do j = 1, d2
+c            do k = 1, d3
+c               do p = 0, np1-1
+c                  ii = i + p*d1/np1
+c note: index order is swapped because we have (ny/np1, nz/np2, nx) -> (ny, nx/np1, nz/np2)
+c                  xout(ii, j, k) = xin(i, k, j, p)
+c               end do
+c            end do
+c         end do
+c      end do
+c---------------------------------------------------------------------
+
+      do p = 0, np1-1
+         ioff = p*d1/np1
+         do k = 1, d3
+            do j = 1, d2
+               do i = 1, d1/np1
+                  xout(i+ioff, j, k) = xin(i, k, j, p)
+               end do
+            end do
+         end do
+      end do
+      if (timers_enabled) call timer_stop(T_transxyfin)
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine checksum(i, u1, d1, d2, d3)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+      include 'global.h'
+      include 'mpinpb.h'
+      integer i, d1, d2, d3
+      double complex u1(d1, d2, d3)
+      integer j, q,r,s, ierr
+      double complex chk,allchk
+      chk = (0.0,0.0)
+
+      do j=1,1024
+         q = mod(j, nx)+1
+         if (q .ge. xstart(1) .and. q .le. xend(1)) then
+            r = mod(3*j,ny)+1
+            if (r .ge. ystart(1) .and. r .le. yend(1)) then
+               s = mod(5*j,nz)+1
+               if (s .ge. zstart(1) .and. s .le. zend(1)) then
+                  chk=chk+u1(q-xstart(1)+1,r-ystart(1)+1,s-zstart(1)+1)
+               end if
+            end if
+         end if
+      end do
+      chk = chk/ntotal_f
+
+      if (timers_enabled) call timer_start(T_synch)
+      call MPI_Reduce(chk, allchk, 1, dc_type, MPI_SUM, 
+     >                0, MPI_COMM_WORLD, ierr)      
+      if (timers_enabled) call timer_stop(T_synch)
+      if (me .eq. 0) then
+            write (*, 30) i, allchk
+ 30         format (' T =',I5,5X,'Checksum =',1P2D22.12)
+      endif
+
+c      sums(i) = allchk
+c     If we compute the checksum for diagnostic purposes, we let i be
+c     negative, so the result will not be stored in an array
+      if (i .gt. 0) sums(i) = allchk
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine synchup
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+      include 'global.h'
+      include 'mpinpb.h'
+      integer ierr
+      call timer_start(T_synch)
+      call mpi_barrier(MPI_COMM_WORLD, ierr)
+      call timer_stop(T_synch)
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine verify (d1, d2, d3, nt, verified, class)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+      include 'global.h'
+      include 'mpinpb.h'
+      integer d1, d2, d3, nt
+      character class
+      logical verified
+      integer ierr, size, i
+      double precision err, epsilon
+
+c---------------------------------------------------------------------
+c   Reference checksums
+c---------------------------------------------------------------------
+      double complex csum_ref(25)
+
+
+      class = 'U'
+
+      if (me .ne. 0) return
+
+      epsilon = 1.0d-12
+      verified = .FALSE.
+
+      if (d1 .eq. 64 .and.
+     >    d2 .eq. 64 .and.
+     >    d3 .eq. 64 .and.
+     >    nt .eq. 6) then
+c---------------------------------------------------------------------
+c   Sample size reference checksums
+c---------------------------------------------------------------------
+         class = 'S'
+         csum_ref(1) = dcmplx(5.546087004964D+02, 4.845363331978D+02)
+         csum_ref(2) = dcmplx(5.546385409189D+02, 4.865304269511D+02)
+         csum_ref(3) = dcmplx(5.546148406171D+02, 4.883910722336D+02)
+         csum_ref(4) = dcmplx(5.545423607415D+02, 4.901273169046D+02)
+         csum_ref(5) = dcmplx(5.544255039624D+02, 4.917475857993D+02)
+         csum_ref(6) = dcmplx(5.542683411902D+02, 4.932597244941D+02)
+
+      else if (d1 .eq. 128 .and.
+     >    d2 .eq. 128 .and.
+     >    d3 .eq. 32 .and.
+     >    nt .eq. 6) then
+c---------------------------------------------------------------------
+c   Class W size reference checksums
+c---------------------------------------------------------------------
+         class = 'W'
+         csum_ref(1) = dcmplx(5.673612178944D+02, 5.293246849175D+02)
+         csum_ref(2) = dcmplx(5.631436885271D+02, 5.282149986629D+02)
+         csum_ref(3) = dcmplx(5.594024089970D+02, 5.270996558037D+02)
+         csum_ref(4) = dcmplx(5.560698047020D+02, 5.260027904925D+02)
+         csum_ref(5) = dcmplx(5.530898991250D+02, 5.249400845633D+02)
+         csum_ref(6) = dcmplx(5.504159734538D+02, 5.239212247086D+02)
+
+      else if (d1 .eq. 256 .and.
+     >    d2 .eq. 256 .and.
+     >    d3 .eq. 128 .and.
+     >    nt .eq. 6) then
+c---------------------------------------------------------------------
+c   Class A size reference checksums
+c---------------------------------------------------------------------
+         class = 'A'
+         csum_ref(1) = dcmplx(5.046735008193D+02, 5.114047905510D+02)
+         csum_ref(2) = dcmplx(5.059412319734D+02, 5.098809666433D+02)
+         csum_ref(3) = dcmplx(5.069376896287D+02, 5.098144042213D+02)
+         csum_ref(4) = dcmplx(5.077892868474D+02, 5.101336130759D+02)
+         csum_ref(5) = dcmplx(5.085233095391D+02, 5.104914655194D+02)
+         csum_ref(6) = dcmplx(5.091487099959D+02, 5.107917842803D+02)
+      
+      else if (d1 .eq. 512 .and.
+     >    d2 .eq. 256 .and.
+     >    d3 .eq. 256 .and.
+     >    nt .eq. 20) then
+c---------------------------------------------------------------------
+c   Class B size reference checksums
+c---------------------------------------------------------------------
+         class = 'B'
+         csum_ref(1)  = dcmplx(5.177643571579D+02, 5.077803458597D+02)
+         csum_ref(2)  = dcmplx(5.154521291263D+02, 5.088249431599D+02)
+         csum_ref(3)  = dcmplx(5.146409228649D+02, 5.096208912659D+02)
+         csum_ref(4)  = dcmplx(5.142378756213D+02, 5.101023387619D+02)
+         csum_ref(5)  = dcmplx(5.139626667737D+02, 5.103976610617D+02)
+         csum_ref(6)  = dcmplx(5.137423460082D+02, 5.105948019802D+02)
+         csum_ref(7)  = dcmplx(5.135547056878D+02, 5.107404165783D+02)
+         csum_ref(8)  = dcmplx(5.133910925466D+02, 5.108576573661D+02)
+         csum_ref(9)  = dcmplx(5.132470705390D+02, 5.109577278523D+02)
+         csum_ref(10) = dcmplx(5.131197729984D+02, 5.110460304483D+02)
+         csum_ref(11) = dcmplx(5.130070319283D+02, 5.111252433800D+02)
+         csum_ref(12) = dcmplx(5.129070537032D+02, 5.111968077718D+02)
+         csum_ref(13) = dcmplx(5.128182883502D+02, 5.112616233064D+02)
+         csum_ref(14) = dcmplx(5.127393733383D+02, 5.113203605551D+02)
+         csum_ref(15) = dcmplx(5.126691062020D+02, 5.113735928093D+02)
+         csum_ref(16) = dcmplx(5.126064276004D+02, 5.114218460548D+02)
+         csum_ref(17) = dcmplx(5.125504076570D+02, 5.114656139760D+02)
+         csum_ref(18) = dcmplx(5.125002331720D+02, 5.115053595966D+02)
+         csum_ref(19) = dcmplx(5.124551951846D+02, 5.115415130407D+02)
+         csum_ref(20) = dcmplx(5.124146770029D+02, 5.115744692211D+02)
+
+      else if (d1 .eq. 512 .and.
+     >    d2 .eq. 512 .and.
+     >    d3 .eq. 512 .and.
+     >    nt .eq. 20) then
+c---------------------------------------------------------------------
+c   Class C size reference checksums
+c---------------------------------------------------------------------
+         class = 'C'
+         csum_ref(1)  = dcmplx(5.195078707457D+02, 5.149019699238D+02)
+         csum_ref(2)  = dcmplx(5.155422171134D+02, 5.127578201997D+02)
+         csum_ref(3)  = dcmplx(5.144678022222D+02, 5.122251847514D+02)
+         csum_ref(4)  = dcmplx(5.140150594328D+02, 5.121090289018D+02)
+         csum_ref(5)  = dcmplx(5.137550426810D+02, 5.121143685824D+02)
+         csum_ref(6)  = dcmplx(5.135811056728D+02, 5.121496764568D+02)
+         csum_ref(7)  = dcmplx(5.134569343165D+02, 5.121870921893D+02)
+         csum_ref(8)  = dcmplx(5.133651975661D+02, 5.122193250322D+02)
+         csum_ref(9)  = dcmplx(5.132955192805D+02, 5.122454735794D+02)
+         csum_ref(10) = dcmplx(5.132410471738D+02, 5.122663649603D+02)
+         csum_ref(11) = dcmplx(5.131971141679D+02, 5.122830879827D+02)
+         csum_ref(12) = dcmplx(5.131605205716D+02, 5.122965869718D+02)
+         csum_ref(13) = dcmplx(5.131290734194D+02, 5.123075927445D+02)
+         csum_ref(14) = dcmplx(5.131012720314D+02, 5.123166486553D+02)
+         csum_ref(15) = dcmplx(5.130760908195D+02, 5.123241541685D+02)
+         csum_ref(16) = dcmplx(5.130528295923D+02, 5.123304037599D+02)
+         csum_ref(17) = dcmplx(5.130310107773D+02, 5.123356167976D+02)
+         csum_ref(18) = dcmplx(5.130103090133D+02, 5.123399592211D+02)
+         csum_ref(19) = dcmplx(5.129905029333D+02, 5.123435588985D+02)
+         csum_ref(20) = dcmplx(5.129714421109D+02, 5.123465164008D+02)
+
+      else if (d1 .eq. 2048 .and.
+     >    d2 .eq. 1024 .and.
+     >    d3 .eq. 1024 .and.
+     >    nt .eq. 25) then
+c---------------------------------------------------------------------
+c   Class D size reference checksums
+c---------------------------------------------------------------------
+         class = 'D'
+         csum_ref(1)  = dcmplx(5.122230065252D+02, 5.118534037109D+02)
+         csum_ref(2)  = dcmplx(5.120463975765D+02, 5.117061181082D+02)
+         csum_ref(3)  = dcmplx(5.119865766760D+02, 5.117096364601D+02)
+         csum_ref(4)  = dcmplx(5.119518799488D+02, 5.117373863950D+02)
+         csum_ref(5)  = dcmplx(5.119269088223D+02, 5.117680347632D+02)
+         csum_ref(6)  = dcmplx(5.119082416858D+02, 5.117967875532D+02)
+         csum_ref(7)  = dcmplx(5.118943814638D+02, 5.118225281841D+02)
+         csum_ref(8)  = dcmplx(5.118842385057D+02, 5.118451629348D+02)
+         csum_ref(9)  = dcmplx(5.118769435632D+02, 5.118649119387D+02)
+         csum_ref(10) = dcmplx(5.118718203448D+02, 5.118820803844D+02)
+         csum_ref(11) = dcmplx(5.118683569061D+02, 5.118969781011D+02)
+         csum_ref(12) = dcmplx(5.118661708593D+02, 5.119098918835D+02)
+         csum_ref(13) = dcmplx(5.118649768950D+02, 5.119210777066D+02)
+         csum_ref(14) = dcmplx(5.118645605626D+02, 5.119307604484D+02)
+         csum_ref(15) = dcmplx(5.118647586618D+02, 5.119391362671D+02)
+         csum_ref(16) = dcmplx(5.118654451572D+02, 5.119463757241D+02)
+         csum_ref(17) = dcmplx(5.118665212451D+02, 5.119526269238D+02)
+         csum_ref(18) = dcmplx(5.118679083821D+02, 5.119580184108D+02)
+         csum_ref(19) = dcmplx(5.118695433664D+02, 5.119626617538D+02)
+         csum_ref(20) = dcmplx(5.118713748264D+02, 5.119666538138D+02)
+         csum_ref(21) = dcmplx(5.118733606701D+02, 5.119700787219D+02)
+         csum_ref(22) = dcmplx(5.118754661974D+02, 5.119730095953D+02)
+         csum_ref(23) = dcmplx(5.118776626738D+02, 5.119755100241D+02)
+         csum_ref(24) = dcmplx(5.118799262314D+02, 5.119776353561D+02)
+         csum_ref(25) = dcmplx(5.118822370068D+02, 5.119794338060D+02)
+
+      else if (d1 .eq. 4096 .and.
+     >    d2 .eq. 2048 .and.
+     >    d3 .eq. 2048 .and.
+     >    nt .eq. 25) then
+c---------------------------------------------------------------------
+c   Class E size reference checksums
+c---------------------------------------------------------------------
+         class = 'E'
+         csum_ref(1)  = dcmplx(5.121601045346D+02, 5.117395998266D+02)
+         csum_ref(2)  = dcmplx(5.120905403678D+02, 5.118614716182D+02)
+         csum_ref(3)  = dcmplx(5.120623229306D+02, 5.119074203747D+02)
+         csum_ref(4)  = dcmplx(5.120438418997D+02, 5.119345900733D+02)
+         csum_ref(5)  = dcmplx(5.120311521872D+02, 5.119551325550D+02)
+         csum_ref(6)  = dcmplx(5.120226088809D+02, 5.119720179919D+02)
+         csum_ref(7)  = dcmplx(5.120169296534D+02, 5.119861371665D+02)
+         csum_ref(8)  = dcmplx(5.120131225172D+02, 5.119979364402D+02)
+         csum_ref(9)  = dcmplx(5.120104767108D+02, 5.120077674092D+02)
+         csum_ref(10) = dcmplx(5.120085127969D+02, 5.120159443121D+02)
+         csum_ref(11) = dcmplx(5.120069224127D+02, 5.120227453670D+02)
+         csum_ref(12) = dcmplx(5.120055158164D+02, 5.120284096041D+02)
+         csum_ref(13) = dcmplx(5.120041820159D+02, 5.120331373793D+02)
+         csum_ref(14) = dcmplx(5.120028605402D+02, 5.120370938679D+02)
+         csum_ref(15) = dcmplx(5.120015223011D+02, 5.120404138831D+02)
+         csum_ref(16) = dcmplx(5.120001570022D+02, 5.120432068837D+02)
+         csum_ref(17) = dcmplx(5.119987650555D+02, 5.120455615860D+02)
+         csum_ref(18) = dcmplx(5.119973525091D+02, 5.120475499442D+02)
+         csum_ref(19) = dcmplx(5.119959279472D+02, 5.120492304629D+02)
+         csum_ref(20) = dcmplx(5.119945006558D+02, 5.120506508902D+02)
+         csum_ref(21) = dcmplx(5.119930795911D+02, 5.120518503782D+02)
+         csum_ref(22) = dcmplx(5.119916728462D+02, 5.120528612016D+02)
+         csum_ref(23) = dcmplx(5.119902874185D+02, 5.120537101195D+02)
+         csum_ref(24) = dcmplx(5.119889291565D+02, 5.120544194514D+02)
+         csum_ref(25) = dcmplx(5.119876028049D+02, 5.120550079284D+02)
+
+      endif
+
+
+      if (class .ne. 'U') then
+
+         do i = 1, nt
+            err = abs( (sums(i) - csum_ref(i)) / csum_ref(i) )
+            if (.not.(err .le. epsilon)) goto 100
+         end do
+         verified = .TRUE.
+ 100     continue
+
+      endif
+
+      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierr)
+      if (size .ne. np) then
+         write(*, 4010) np
+         write(*, 4011)
+         write(*, 4012)
+c---------------------------------------------------------------------
+c multiple statements because some Fortran compilers have
+c problems with long strings. 
+c---------------------------------------------------------------------
+ 4010    format( ' Warning: benchmark was compiled for ', i5, 
+     >           ' processors')
+ 4011    format( ' Must be run on this many processors for official',
+     >           ' verification')
+ 4012    format( ' so memory access is repeatable')
+         verified = .false.
+      endif
+         
+      if (class .ne. 'U') then
+         if (verified) then
+            write(*,2000)
+ 2000       format(' Result verification successful')
+         else
+            write(*,2001)
+ 2001       format(' Result verification failed')
+         endif
+      endif
+      print *, 'class = ', class
+
+      return
+      end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/FT/global.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/FT/global.h
new file mode 100644
index 0000000..29a8656
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/FT/global.h
@@ -0,0 +1,134 @@
+      include 'npbparams.h'
+
+c 2D processor array -> 2D grid decomposition (by pencils)
+c If processor array is 1xN -> 1D grid decomposition (by planes)
+c If processor array is 1x1 -> 0D grid decomposition
+c For simplicity, do not treat Nx1 (np2 = 1) specially
+      integer np1, np2, np
+
+c basic decomposition strategy
+      integer layout_type
+      integer layout_0D, layout_1D, layout_2D
+      parameter (layout_0D = 0, layout_1D = 1, layout_2D = 2)
+
+      common /procgrid/ np1, np2, layout_type, np
+
+
+c Cache blocking params. These values are good for most
+c RISC processors.  
+c FFT parameters:
+c  fftblock controls how many ffts are done at a time. 
+c  The default is appropriate for most cache-based machines
+c  On vector machines, the FFT can be vectorized with vector
+c  length equal to the block size, so the block size should
+c  be as large as possible. This is the size of the smallest
+c  dimension of the problem: 128 for class A, 256 for class B and
+c  512 for class C.
+c Transpose parameters:
+c  transblock is the blocking factor for the transposes when there
+c  is a 1-D layout. On vector machines it should probably be
+c  large (largest dimension of the problem).
+
+
+      integer fftblock_default, fftblockpad_default
+      parameter (fftblock_default=16, fftblockpad_default=18)
+      integer transblock, transblockpad
+      parameter(transblock=32, transblockpad=34)
+      
+      integer fftblock, fftblockpad
+      common /blockinfo/ fftblock, fftblockpad
+
+c we need a bunch of logic to keep track of how
+c arrays are laid out. 
+c coords of this processor
+      integer me, me1, me2
+      common /coords/ me, me1, me2
+c need a communicator for row/col in processor grid
+      integer commslice1, commslice2
+      common /comms/ commslice1, commslice2
+
+
+
+c There are basically three stages
+c 1: x-y-z layout
+c 2: after x-transform (before y)
+c 3: after y-transform (before z)
+c The computation proceeds logically as
+
+c set up initial conditions
+c fftx(1)
+c transpose (1->2)
+c ffty(2)
+c transpose (2->3)
+c fftz(3)
+c time evolution
+c fftz(3)
+c transpose (3->2)
+c ffty(2)
+c transpose (2->1)
+c fftx(1)
+c compute residual(1)
+
+c for the 0D, 1D, 2D strategies, the layouts look like this:
+c        
+c            0D        1D        2D
+c 1:        xyz       xyz       xyz
+c 2:        xyz       xyz       yxz
+c 3:        xyz       zyx       zxy
+
+c the array dimensions are stored in dims(coord, phase)
+      integer dims(3, 3)
+      integer xstart(3), ystart(3), zstart(3)
+      integer xend(3), yend(3), zend(3)
+      common /layout/ dims,
+     >                xstart, ystart, zstart, 
+     >                xend, yend, zend
+
+      integer T_total, T_setup, T_fft, T_evolve, T_checksum, 
+     >        T_fftlow, T_fftcopy, T_transpose, 
+     >        T_transxzloc, T_transxzglo, T_transxzfin, 
+     >        T_transxyloc, T_transxyglo, T_transxyfin, 
+     >        T_synch, T_init, T_max
+      parameter (T_total = 1, T_setup = 2, T_fft = 3, 
+     >           T_evolve = 4, T_checksum = 5, 
+     >           T_fftlow = 6, T_fftcopy = 7, T_transpose = 8,
+     >           T_transxzloc = 9, T_transxzglo = 10, T_transxzfin = 11, 
+     >           T_transxyloc = 12, T_transxyglo = 13, 
+     >           T_transxyfin = 14,  T_synch = 15, T_init = 16,
+     >           T_max = 16)
+
+
+
+      logical timers_enabled
+
+
+      external timer_read
+      double precision timer_read
+      external ilog2
+      integer ilog2
+
+      external randlc
+      double precision randlc
+
+
+c other stuff
+      logical debug, debugsynch
+      common /dbg/ debug, debugsynch, timers_enabled
+
+      double precision seed, a, pi, alpha
+      parameter (seed = 314159265.d0, a = 1220703125.d0, 
+     >  pi = 3.141592653589793238d0, alpha=1.0d-6)
+
+c roots of unity array
+c relies on x being largest dimension?
+      double complex u(nx)
+      common /ucomm/ u
+
+
+c for checksum data
+      double complex sums(0:niter_default)
+      common /sumcomm/ sums
+
+c number of iterations
+      integer niter
+      common /iter/ niter
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/FT/inputft.data.sample b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/FT/inputft.data.sample
new file mode 100644
index 0000000..448ac42
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/FT/inputft.data.sample
@@ -0,0 +1,3 @@
+6   ! number of iterations
+2   ! layout type. 0 = 0d, 1 = 1d, 2 = 2d
+2 4 ! processor layout. 0d must be "1 1"; 1d must be "1 N"
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/FT/mpinpb.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/FT/mpinpb.h
new file mode 100644
index 0000000..e43e552
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/FT/mpinpb.h
@@ -0,0 +1,4 @@
+      include 'mpif.h'
+c mpi data types
+      integer dc_type
+      common /mpistuff/ dc_type
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/IS/Makefile b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/IS/Makefile
new file mode 100644
index 0000000..0ac4ae9
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/IS/Makefile
@@ -0,0 +1,23 @@
+SHELL=/bin/sh
+BENCHMARK=is
+BENCHMARKU=IS
+
+include ../config/make.def
+
+include ../sys/make.common
+
+OBJS = is.o ${COMMON}/c_print_results.o ${COMMON}/c_timers.o
+
+
+${PROGRAM}: config ${OBJS}
+	${CLINK} ${CLINKFLAGS} -o ${PROGRAM} ${OBJS} ${CMPI_LIB}
+
+.c.o:
+	${CCOMPILE} $<
+
+is.o:             is.c  npbparams.h
+
+
+clean:
+	- rm -f *.o *~ mputil*
+	- rm -f is npbparams.h core
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/IS/is.c b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/IS/is.c
new file mode 100644
index 0000000..39e64ab
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/IS/is.c
@@ -0,0 +1,1150 @@
+/*************************************************************************
+ *                                                                       * 
+ *        N  A  S     P A R A L L E L     B E N C H M A R K S  3.3       *
+ *                                                                       * 
+ *                                  I S                                  * 
+ *                                                                       * 
+ ************************************************************************* 
+ *                                                                       * 
+ *   This benchmark is part of the NAS Parallel Benchmark 3.3 suite.     *
+ *   It is described in NAS Technical Report 95-020.                     * 
+ *                                                                       * 
+ *   Permission to use, copy, distribute and modify this software        * 
+ *   for any purpose with or without fee is hereby granted.  We          * 
+ *   request, however, that all derived work reference the NAS           * 
+ *   Parallel Benchmarks 3.3. This software is provided "as is"          *
+ *   without express or implied warranty.                                * 
+ *                                                                       * 
+ *   Information on NPB 3.3, including the technical report, the         *
+ *   original specifications, source code, results and information       * 
+ *   on how to submit new results, is available at:                      * 
+ *                                                                       * 
+ *          http://www.nas.nasa.gov/Software/NPB                         * 
+ *                                                                       * 
+ *   Send comments or suggestions to  npb@nas.nasa.gov                   * 
+ *   Send bug reports to              npb-bugs@nas.nasa.gov              * 
+ *                                                                       * 
+ *         NAS Parallel Benchmarks Group                                 * 
+ *         NASA Ames Research Center                                     * 
+ *         Mail Stop: T27A-1                                             * 
+ *         Moffett Field, CA   94035-1000                                * 
+ *                                                                       * 
+ *         E-mail:  npb@nas.nasa.gov                                     * 
+ *         Fax:     (650) 604-3957                                       * 
+ *                                                                       * 
+ ************************************************************************* 
+ *                                                                       * 
+ *   Author: M. Yarrow                                                   * 
+ *           H. Jin                                                      * 
+ *                                                                       * 
+ *************************************************************************/
+
+#include "mpi.h"
+#include "npbparams.h"
+#include <stdlib.h>
+#include <stdio.h>
+
+/******************/
+/* default values */
+/******************/
+#ifndef CLASS
+#define CLASS 'S'
+#define NUM_PROCS            1                 
+#endif
+#define MIN_PROCS            1
+
+
+/*************/
+/*  CLASS S  */
+/*************/
+#if CLASS == 'S'
+#define  TOTAL_KEYS_LOG_2    16
+#define  MAX_KEY_LOG_2       11
+#define  NUM_BUCKETS_LOG_2   9
+#endif
+
+
+/*************/
+/*  CLASS W  */
+/*************/
+#if CLASS == 'W'
+#define  TOTAL_KEYS_LOG_2    20
+#define  MAX_KEY_LOG_2       16
+#define  NUM_BUCKETS_LOG_2   10
+#endif
+
+/*************/
+/*  CLASS A  */
+/*************/
+#if CLASS == 'A'
+#define  TOTAL_KEYS_LOG_2    23
+#define  MAX_KEY_LOG_2       19
+#define  NUM_BUCKETS_LOG_2   10
+#endif
+
+
+/*************/
+/*  CLASS B  */
+/*************/
+#if CLASS == 'B'
+#define  TOTAL_KEYS_LOG_2    25
+#define  MAX_KEY_LOG_2       21
+#define  NUM_BUCKETS_LOG_2   10
+#endif
+
+
+/*************/
+/*  CLASS C  */
+/*************/
+#if CLASS == 'C'
+#define  TOTAL_KEYS_LOG_2    27
+#define  MAX_KEY_LOG_2       23
+#define  NUM_BUCKETS_LOG_2   10
+#endif
+
+
+/*************/
+/*  CLASS D  */
+/*************/
+#if CLASS == 'D'
+#define  TOTAL_KEYS_LOG_2    29
+#define  MAX_KEY_LOG_2       27
+#define  NUM_BUCKETS_LOG_2   10
+#undef   MIN_PROCS
+#define  MIN_PROCS           4
+#endif
+
+
+#define  TOTAL_KEYS          (1 << TOTAL_KEYS_LOG_2)
+#define  MAX_KEY             (1 << MAX_KEY_LOG_2)
+#define  NUM_BUCKETS         (1 << NUM_BUCKETS_LOG_2)
+#define  NUM_KEYS            (TOTAL_KEYS/NUM_PROCS*MIN_PROCS)
+
+/*****************************************************************/
+/* On larger number of processors, since the keys are (roughly)  */ 
+/* gaussian distributed, the first and last processor sort keys  */ 
+/* in a large interval, requiring array sizes to be larger. Note */
+/* that for large NUM_PROCS, NUM_KEYS is, however, a small number*/
+/* The required array size also depends on the bucket size used. */
+/* The following values are validated for the 1024-bucket setup. */
+/*****************************************************************/
+#if   NUM_PROCS < 256
+#define  SIZE_OF_BUFFERS     3*NUM_KEYS/2
+#elif NUM_PROCS < 512
+#define  SIZE_OF_BUFFERS     5*NUM_KEYS/2
+#elif NUM_PROCS < 1024
+#define  SIZE_OF_BUFFERS     4*NUM_KEYS
+#else
+#define  SIZE_OF_BUFFERS     13*NUM_KEYS/2
+#endif
+
+/*****************************************************************/
+/* NOTE: THIS CODE CANNOT BE RUN ON ARBITRARILY LARGE NUMBERS OF */
+/* PROCESSORS. THE LARGEST VERIFIED NUMBER IS 1024. INCREASE     */
+/* MAX_PROCS AT YOUR PERIL                                       */
+/*****************************************************************/
+#if CLASS == 'S'
+#define  MAX_PROCS           128
+#else
+#define  MAX_PROCS           1024
+#endif
+
+#define  MAX_ITERATIONS      10
+#define  TEST_ARRAY_SIZE     5
+
+
+/***********************************/
+/* Enable separate communication,  */
+/* computation timing and printout */
+/***********************************/
+#define  TIMING_ENABLED
+#ifdef NO_MTIMERS
+#undef TIMING_ENABLED
+#define TIMER_START( x )
+#define TIMER_STOP( x )
+#else
+#define TIMER_START( x ) if (timeron) timer_start( x )
+#define TIMER_STOP( x ) if (timeron) timer_stop( x )
+#define T_TOTAL  0
+#define T_RANK   1
+#define T_RCOMM  2
+#define T_VERIFY 3
+#define T_LAST   3
+#endif
+int timeron;
+
+
+/*************************************/
+/* Typedef: if necessary, change the */
+/* size of int here by changing the  */
+/* int type to, say, long            */
+/*************************************/
+typedef  int  INT_TYPE;
+typedef  long INT_TYPE2;
+#define MP_KEY_TYPE MPI_INT
+
+
+
+/********************/
+/* MPI properties:  */
+/********************/
+int      my_rank,
+         comm_size;
+
+
+/********************/
+/* Some global info */
+/********************/
+INT_TYPE *key_buff_ptr_global,         /* used by full_verify to get */
+         total_local_keys,             /* copies of rank info        */
+         total_lesser_keys;
+
+
+int      passed_verification;
+                                 
+
+
+/************************************/
+/* These are the three main arrays. */
+/* See SIZE_OF_BUFFERS def above    */
+/************************************/
+INT_TYPE key_array[SIZE_OF_BUFFERS],    
+         key_buff1[SIZE_OF_BUFFERS],    
+         key_buff2[SIZE_OF_BUFFERS],
+         bucket_size[NUM_BUCKETS+TEST_ARRAY_SIZE],     /* Top 5 elements for */
+         bucket_size_totals[NUM_BUCKETS+TEST_ARRAY_SIZE], /* part. ver. vals */
+         bucket_ptrs[NUM_BUCKETS],
+         process_bucket_distrib_ptr1[NUM_BUCKETS+TEST_ARRAY_SIZE],   
+         process_bucket_distrib_ptr2[NUM_BUCKETS+TEST_ARRAY_SIZE];   
+int      send_count[MAX_PROCS], recv_count[MAX_PROCS],
+         send_displ[MAX_PROCS], recv_displ[MAX_PROCS];
+
+
+/**********************/
+/* Partial verif info */
+/**********************/
+INT_TYPE2 test_index_array[TEST_ARRAY_SIZE],
+         test_rank_array[TEST_ARRAY_SIZE],
+
+         S_test_index_array[TEST_ARRAY_SIZE] = 
+                             {48427,17148,23627,62548,4431},
+         S_test_rank_array[TEST_ARRAY_SIZE] = 
+                             {0,18,346,64917,65463},
+
+         W_test_index_array[TEST_ARRAY_SIZE] = 
+                             {357773,934767,875723,898999,404505},
+         W_test_rank_array[TEST_ARRAY_SIZE] = 
+                             {1249,11698,1039987,1043896,1048018},
+
+         A_test_index_array[TEST_ARRAY_SIZE] = 
+                             {2112377,662041,5336171,3642833,4250760},
+         A_test_rank_array[TEST_ARRAY_SIZE] = 
+                             {104,17523,123928,8288932,8388264},
+
+         B_test_index_array[TEST_ARRAY_SIZE] = 
+                             {41869,812306,5102857,18232239,26860214},
+         B_test_rank_array[TEST_ARRAY_SIZE] = 
+                             {33422937,10244,59149,33135281,99}, 
+
+         C_test_index_array[TEST_ARRAY_SIZE] = 
+                             {44172927,72999161,74326391,129606274,21736814},
+         C_test_rank_array[TEST_ARRAY_SIZE] = 
+                             {61147,882988,266290,133997595,133525895},
+
+         D_test_index_array[TEST_ARRAY_SIZE] = 
+                             {1317351170,995930646,1157283250,1503301535,1453734525},
+         D_test_rank_array[TEST_ARRAY_SIZE] = 
+                             {1,36538729,1978098519,2145192618,2147425337};
+
+
+
+/***********************/
+/* function prototypes */
+/***********************/
+double	randlc( double *X, double *A );
+
+void full_verify( void );
+
+void c_print_results( char   *name,
+                      char   class,
+                      int    n1, 
+                      int    n2,
+                      int    n3,
+                      int    niter,
+                      int    nprocs_compiled,
+                      int    nprocs_total,
+                      double t,
+                      double mops,
+		      char   *optype,
+                      int    passed_verification,
+                      char   *npbversion,
+                      char   *compiletime,
+                      char   *mpicc,
+                      char   *clink,
+                      char   *cmpi_lib,
+                      char   *cmpi_inc,
+                      char   *cflags,
+                      char   *clinkflags );
+
+void    timer_clear( int n );
+void    timer_start( int n );
+void    timer_stop( int n );
+double  timer_read( int n );
+
+
+
+/*
+ *    FUNCTION RANDLC (X, A)
+ *
+ *  This routine returns a uniform pseudorandom double precision number in the
+ *  range (0, 1) by using the linear congruential generator
+ *
+ *  x_{k+1} = a x_k  (mod 2^46)
+ *
+ *  where 0 < x_k < 2^46 and 0 < a < 2^46.  This scheme generates 2^44 numbers
+ *  before repeating.  The argument A is the same as 'a' in the above formula,
+ *  and X is the same as x_0.  A and X must be odd double precision integers
+ *  in the range (1, 2^46).  The returned value RANDLC is normalized to be
+ *  between 0 and 1, i.e. RANDLC = 2^(-46) * x_1.  X is updated to contain
+ *  the new seed x_1, so that subsequent calls to RANDLC using the same
+ *  arguments will generate a continuous sequence.
+ *
+ *  This routine should produce the same results on any computer with at least
+ *  48 mantissa bits in double precision floating point data.  On Cray systems,
+ *  double precision should be disabled.
+ *
+ *  David H. Bailey     October 26, 1990
+ *
+ *     IMPLICIT DOUBLE PRECISION (A-H, O-Z)
+ *     SAVE KS, R23, R46, T23, T46
+ *     DATA KS/0/
+ *
+ *  If this is the first call to RANDLC, compute R23 = 2 ^ -23, R46 = 2 ^ -46,
+ *  T23 = 2 ^ 23, and T46 = 2 ^ 46.  These are computed in loops, rather than
+ *  by merely using the ** operator, in order to ensure that the results are
+ *  exact on all systems.  This code assumes that 0.5D0 is represented exactly.
+ */
+
+
+/*****************************************************************/
+/*************           R  A  N  D  L  C             ************/
+/*************                                        ************/
+/*************    portable random number generator    ************/
+/*****************************************************************/
+
+double	randlc( double *X, double *A )
+{
+      static int        KS=0;
+      static double	R23, R46, T23, T46;
+      double		T1, T2, T3, T4;
+      double		A1;
+      double		A2;
+      double		X1;
+      double		X2;
+      double		Z;
+      int     		i, j;
+
+      if (KS == 0) 
+      {
+        R23 = 1.0;
+        R46 = 1.0;
+        T23 = 1.0;
+        T46 = 1.0;
+    
+        for (i=1; i<=23; i++)
+        {
+          R23 = 0.50 * R23;
+          T23 = 2.0 * T23;
+        }
+        for (i=1; i<=46; i++)
+        {
+          R46 = 0.50 * R46;
+          T46 = 2.0 * T46;
+        }
+        KS = 1;
+      }
+
+/*  Break A into two parts such that A = 2^23 * A1 + A2 and set X = N.  */
+
+      T1 = R23 * *A;
+      j  = T1;
+      A1 = j;
+      A2 = *A - T23 * A1;
+
+/*  Break X into two parts such that X = 2^23 * X1 + X2, compute
+    Z = A1 * X2 + A2 * X1  (mod 2^23), and then
+    X = 2^23 * Z + A2 * X2  (mod 2^46).                            */
+
+      T1 = R23 * *X;
+      j  = T1;
+      X1 = j;
+      X2 = *X - T23 * X1;
+      T1 = A1 * X2 + A2 * X1;
+      
+      j  = R23 * T1;
+      T2 = j;
+      Z = T1 - T23 * T2;
+      T3 = T23 * Z + A2 * X2;
+      j  = R46 * T3;
+      T4 = j;
+      *X = T3 - T46 * T4;
+      return(R46 * *X);
+} 
+
+
+
+/*****************************************************************/
+/************   F  I  N  D  _  M  Y  _  S  E  E  D    ************/
+/************                                         ************/
+/************ returns parallel random number seq seed ************/
+/*****************************************************************/
+
+/*
+ * Create a random number sequence of total length nn residing
+ * on np number of processors.  Each processor will therefore have a 
+ * subsequence of length nn/np.  This routine returns that random 
+ * number which is the first random number for the subsequence belonging
+ * to processor rank kn, and which is used as seed for proc kn ran # gen.
+ */
+
+double   find_my_seed( int  kn,       /* my processor rank, 0<=kn<num procs  */
+                       int  np,       /* np = num procs                      */
+                       long nn,       /* total num of ran numbers, all procs */
+                       double s,      /* Ran num seed, for ex.: 314159265.00 */
+                       double a )     /* Ran num gen mult, try 1220703125.00 */
+{
+
+  long   i;
+
+  double t1,t2,t3,an;
+  long   mq,nq,kk,ik;
+
+
+
+      nq = nn / np;
+
+      for( mq=0; nq>1; mq++,nq/=2 )
+          ;
+
+      t1 = a;
+
+      for( i=1; i<=mq; i++ )
+        t2 = randlc( &t1, &t1 );
+
+      an = t1;
+
+      kk = kn;
+      t1 = s;
+      t2 = an;
+
+      for( i=1; i<=100; i++ )
+      {
+        ik = kk / 2;
+        if( 2 * ik !=  kk ) 
+            t3 = randlc( &t1, &t2 );
+        if( ik == 0 ) 
+            break;
+        t3 = randlc( &t2, &t2 );
+        kk = ik;
+      }
+
+      return( t1 );
+
+}
+
+
+
+
+/*****************************************************************/
+/*************      C  R  E  A  T  E  _  S  E  Q      ************/
+/*****************************************************************/
+
+void	create_seq( double seed, double a )
+{
+	double x;
+	int    i, k;
+
+        k = MAX_KEY/4;
+
+	for (i=0; i<NUM_KEYS; i++)
+	{
+	    x = randlc(&seed, &a);
+	    x += randlc(&seed, &a);
+    	    x += randlc(&seed, &a);
+	    x += randlc(&seed, &a);  
+
+            key_array[i] = k*x;
+	}
+}
+
+
+
+
+/*****************************************************************/
+/*************    F  U  L  L  _  V  E  R  I  F  Y     ************/
+/*****************************************************************/
+
+
+void full_verify( void )
+{
+    MPI_Status  status;
+    MPI_Request request;
+    
+    INT_TYPE    i, j;
+    INT_TYPE    k, last_local_key;
+
+    
+    TIMER_START( T_VERIFY );
+
+/*  Now, finally, sort the keys:  */
+    for( i=0; i<total_local_keys; i++ )
+        key_array[--key_buff_ptr_global[key_buff2[i]]-
+                                 total_lesser_keys] = key_buff2[i];
+    last_local_key = (total_local_keys<1)? 0 : (total_local_keys-1);
+
+/*  Send largest key value to next processor  */
+    if( my_rank > 0 )
+        MPI_Irecv( &k,
+                   1,
+                   MP_KEY_TYPE,
+                   my_rank-1,
+                   1000,
+                   MPI_COMM_WORLD,
+                   &request );                   
+    if( my_rank < comm_size-1 )
+        MPI_Send( &key_array[last_local_key],
+                  1,
+                  MP_KEY_TYPE,
+                  my_rank+1,
+                  1000,
+                  MPI_COMM_WORLD );
+    if( my_rank > 0 )
+        MPI_Wait( &request, &status );
+
+/*  Confirm that neighbor's greatest key value 
+    is not greater than my least key value       */              
+    j = 0;
+    if( my_rank > 0 && total_local_keys > 0 )
+        if( k > key_array[0] )
+            j++;
+
+
+/*  Confirm keys correctly sorted: count incorrectly sorted keys, if any */
+    for( i=1; i<total_local_keys; i++ )
+        if( key_array[i-1] > key_array[i] )
+            j++;
+
+
+    if( j != 0 )
+    {
+        printf( "Processor %d:  Full_verify: number of keys out of sort: %d\n",
+                my_rank, j );
+    }
+    else
+        passed_verification++;
+           
+    TIMER_STOP( T_VERIFY );
+
+}
+
+
+
+
+/*****************************************************************/
+/*************             R  A  N  K             ****************/
+/*****************************************************************/
+
+
+void rank( int iteration )
+{
+
+    INT_TYPE    i, k;
+
+    INT_TYPE    shift = MAX_KEY_LOG_2 - NUM_BUCKETS_LOG_2;
+    INT_TYPE    key;
+    INT_TYPE2   bucket_sum_accumulator, j, m;
+    INT_TYPE    local_bucket_sum_accumulator;
+    INT_TYPE    min_key_val, max_key_val;
+    INT_TYPE    *key_buff_ptr;
+
+
+
+    TIMER_START( T_RANK );
+
+/*  Iteration alteration of keys */  
+    if(my_rank == 0 )                    
+    {
+      key_array[iteration] = iteration;
+      key_array[iteration+MAX_ITERATIONS] = MAX_KEY - iteration;
+    }
+
+
+/*  Initialize */
+    for( i=0; i<NUM_BUCKETS+TEST_ARRAY_SIZE; i++ )  
+    {
+        bucket_size[i] = 0;
+        bucket_size_totals[i] = 0;
+        process_bucket_distrib_ptr1[i] = 0;
+        process_bucket_distrib_ptr2[i] = 0;
+    }
+
+
+/*  Determine where the partial verify test keys are, load into  */
+/*  top of array bucket_size                                     */
+    for( i=0; i<TEST_ARRAY_SIZE; i++ )
+        if( (test_index_array[i]/NUM_KEYS) == my_rank )
+            bucket_size[NUM_BUCKETS+i] = 
+                          key_array[test_index_array[i] % NUM_KEYS];
+
+
+/*  Determine the number of keys in each bucket */
+    for( i=0; i<NUM_KEYS; i++ )
+        bucket_size[key_array[i] >> shift]++;
+
+
+/*  Cumulative bucket sizes are the bucket pointers */
+    bucket_ptrs[0] = 0;
+    for( i=1; i< NUM_BUCKETS; i++ )  
+        bucket_ptrs[i] = bucket_ptrs[i-1] + bucket_size[i-1];
+
+
+/*  Sort into appropriate bucket */
+    for( i=0; i<NUM_KEYS; i++ )  
+    {
+        key = key_array[i];
+        key_buff1[bucket_ptrs[key >> shift]++] = key;
+    }
+
+    TIMER_STOP( T_RANK );
+    TIMER_START( T_RCOMM );
+
+/*  Get the bucket size totals for the entire problem. These 
+    will be used to determine the redistribution of keys      */
+    MPI_Allreduce( bucket_size, 
+                   bucket_size_totals, 
+                   NUM_BUCKETS+TEST_ARRAY_SIZE, 
+                   MP_KEY_TYPE,
+                   MPI_SUM,
+                   MPI_COMM_WORLD );
+
+    TIMER_STOP( T_RCOMM );
+    TIMER_START( T_RANK );
+
+/*  Determine Redistribution of keys: accumulate the bucket size totals 
+    till this number surpasses NUM_KEYS (which is the average number of keys
+    per processor).  Then all keys in these buckets go to processor 0.
+    Continue accumulating again until surpassing 2*NUM_KEYS. All keys
+    in these buckets go to processor 1, etc.  This algorithm guarantees
+    that all processors have work ranking; no processors are left idle.
+    The optimum number of buckets, however, does not result in as high
+    a degree of load balancing (as even a distribution of keys as is
+    possible) as is obtained from increasing the number of buckets, but
+    more buckets results in more computation per processor so that the
+    optimum number of buckets turns out to be 1024 for machines tested.
+    Note that process_bucket_distrib_ptr1 and ..._ptr2 hold the bucket
+    number of first and last bucket which each processor will have after   
+    the redistribution is done.                                          */
+
+    bucket_sum_accumulator = 0;
+    local_bucket_sum_accumulator = 0;
+    send_displ[0] = 0;
+    process_bucket_distrib_ptr1[0] = 0;
+    for( i=0, j=0; i<NUM_BUCKETS; i++ )  
+    {
+        bucket_sum_accumulator       += bucket_size_totals[i];
+        local_bucket_sum_accumulator += bucket_size[i];
+        if( bucket_sum_accumulator >= (j+1)*NUM_KEYS )  
+        {
+            send_count[j] = local_bucket_sum_accumulator;
+            if( j != 0 )
+            {
+                send_displ[j] = send_displ[j-1] + send_count[j-1];
+                process_bucket_distrib_ptr1[j] = 
+                                        process_bucket_distrib_ptr2[j-1]+1;
+            }
+            process_bucket_distrib_ptr2[j++] = i;
+            local_bucket_sum_accumulator = 0;
+        }
+    }
+
+/*  When NUM_PROCS approaches NUM_BUCKETS, it is quite possible
+    that the last few processors don't get any buckets.  So, we
+    need to set counts properly in this case to avoid any problems.    */
+    while( j < comm_size )
+    {
+        send_count[j] = 0;
+        process_bucket_distrib_ptr1[j] = 1;
+        j++;
+    }
+
+    TIMER_STOP( T_RANK );
+    TIMER_START( T_RCOMM ); 
+
+/*  This is the redistribution section:  first find out how many keys
+    each processor will send to every other processor:                 */
+    MPI_Alltoall( send_count,
+                  1,
+                  MPI_INT,
+                  recv_count,
+                  1,
+                  MPI_INT,
+                  MPI_COMM_WORLD );
+
+/*  Determine the receive array displacements for the buckets */    
+    recv_displ[0] = 0;
+    for( i=1; i<comm_size; i++ )
+        recv_displ[i] = recv_displ[i-1] + recv_count[i-1];
+
+
+/*  Now send the keys to respective processors  */    
+    MPI_Alltoallv( key_buff1,
+                   send_count,
+                   send_displ,
+                   MP_KEY_TYPE,
+                   key_buff2,
+                   recv_count,
+                   recv_displ,
+                   MP_KEY_TYPE,
+                   MPI_COMM_WORLD );
+
+    TIMER_STOP( T_RCOMM ); 
+    TIMER_START( T_RANK );
+
+/*  The starting and ending bucket numbers on each processor are
+    multiplied by the interval size of the buckets to obtain the 
+    smallest possible min and greatest possible max value of any 
+    key on each processor                                          */
+    min_key_val = process_bucket_distrib_ptr1[my_rank] << shift;
+    max_key_val = ((process_bucket_distrib_ptr2[my_rank] + 1) << shift)-1;
+
+/*  Clear the work array */
+    for( i=0; i<max_key_val-min_key_val+1; i++ )
+        key_buff1[i] = 0;
+
+/*  Determine the total number of keys on all other 
+    processors holding keys of lesser value         */
+    m = 0;
+    for( k=0; k<my_rank; k++ )
+        for( i= process_bucket_distrib_ptr1[k];
+             i<=process_bucket_distrib_ptr2[k];
+             i++ )  
+            m += bucket_size_totals[i]; /*  m has total # of lesser keys */
+
+/*  Determine total number of keys on this processor */
+    j = 0;                                 
+    for( i= process_bucket_distrib_ptr1[my_rank];
+         i<=process_bucket_distrib_ptr2[my_rank];
+         i++ )  
+        j += bucket_size_totals[i];     /* j has total # of local keys   */
+
+
+/*  Ranking of all keys occurs in this section:                 */
+/*  shift it backwards so no subtractions are necessary in loop */
+    key_buff_ptr = key_buff1 - min_key_val;
+
+/*  In this section, the keys themselves are used as their 
+    own indexes to determine how many of each there are: their
+    individual population                                       */
+    for( i=0; i<j; i++ )
+        key_buff_ptr[key_buff2[i]]++;  /* Now they have individual key   */
+                                       /* population                     */
+
+/*  To obtain ranks of each key, successively add the individual key
+    population, not forgetting the total of lesser keys, m.
+    NOTE: Since the total of lesser keys would be subtracted later 
+    in verification, it is no longer added to the first key population 
+    here, but still needed during the partial verify test.  This is to 
+    ensure that 32-bit key_buff can still be used for class D.           */
+/*    key_buff_ptr[min_key_val] += m;    */
+    for( i=min_key_val; i<max_key_val; i++ )   
+        key_buff_ptr[i+1] += key_buff_ptr[i];  
+
+
+/* This is the partial verify test section */
+/* Observe that test_rank_array vals are   */
+/* shifted differently for different cases */
+    for( i=0; i<TEST_ARRAY_SIZE; i++ )
+    {                                             
+        k = bucket_size_totals[i+NUM_BUCKETS];    /* Keys were hidden here */
+        if( min_key_val <= k  &&  k <= max_key_val )
+        {
+            /* Add the total of lesser keys, m, here */
+            INT_TYPE2 key_rank = key_buff_ptr[k-1] + m;
+            int failed = 0;
+
+            switch( CLASS )
+            {
+                case 'S':
+                    if( i <= 2 )
+                    {
+                        if( key_rank != test_rank_array[i]+iteration )
+                            failed = 1;
+                        else
+                            passed_verification++;
+                    }
+                    else
+                    {
+                        if( key_rank != test_rank_array[i]-iteration )
+                            failed = 1;
+                        else
+                            passed_verification++;
+                    }
+                    break;
+                case 'W':
+                    if( i < 2 )
+                    {
+                        if( key_rank != test_rank_array[i]+(iteration-2) )
+                            failed = 1;
+                        else
+                            passed_verification++;
+                    }
+                    else
+                    {
+                        if( key_rank != test_rank_array[i]-iteration )
+                            failed = 1;
+                        else
+                            passed_verification++;
+                    }
+                    break;
+                case 'A':
+                    if( i <= 2 )
+        	    {
+                        if( key_rank != test_rank_array[i]+(iteration-1) )
+                            failed = 1;
+                        else
+                            passed_verification++;
+        	    }
+                    else
+                    {
+                        if( key_rank != test_rank_array[i]-(iteration-1) )
+                            failed = 1;
+                        else
+                            passed_verification++;
+                    }
+                    break;
+                case 'B':
+                    if( i == 1 || i == 2 || i == 4 )
+        	    {
+                        if( key_rank != test_rank_array[i]+iteration )
+                            failed = 1;
+                        else
+                            passed_verification++;
+        	    }
+                    else
+                    {
+                        if( key_rank != test_rank_array[i]-iteration )
+                            failed = 1;
+                        else
+                            passed_verification++;
+                    }
+                    break;
+                case 'C':
+                    if( i <= 2 )
+        	    {
+                        if( key_rank != test_rank_array[i]+iteration )
+                            failed = 1;
+                        else
+                            passed_verification++;
+        	    }
+                    else
+                    {
+                        if( key_rank != test_rank_array[i]-iteration )
+                            failed = 1;
+                        else
+                            passed_verification++;
+                    }
+                    break;
+                case 'D':
+                    if( i < 2 )
+        	    {
+                        if( key_rank != test_rank_array[i]+iteration )
+                            failed = 1;
+                        else
+                            passed_verification++;
+        	    }
+                    else
+                    {
+                        if( key_rank != test_rank_array[i]-iteration )
+                            failed = 1;
+                        else
+                            passed_verification++;
+                    }
+                    break;
+            }
+            if( failed == 1 )
+                printf( "Failed partial verification: "
+                        "iteration %d, processor %d, test key %d\n", 
+                         iteration, my_rank, (int)i );
+        }
+    }
+
+
+    TIMER_STOP( T_RANK ); 
+
+
+/*  Make copies of rank info for use by full_verify: these variables
+    in rank are local; making them global slows down the code, probably
+    because the compiler cannot keep globals in registers                 */
+
+    if( iteration == MAX_ITERATIONS ) 
+    {
+        key_buff_ptr_global = key_buff_ptr;
+        total_local_keys    = j;
+        total_lesser_keys   = 0;  /* no longer set to 'm', see note above */
+    }
+
+}      
+
+
+/*****************************************************************/
+/*************             M  A  I  N             ****************/
+/*****************************************************************/
+
+int main( int argc, char **argv )
+{
+
+    int             i, iteration, itemp;
+
+    double          timecounter, maxtime;
+
+
+/*  Initialize MPI */
+    MPI_Init( &argc, &argv );
+    MPI_Comm_rank( MPI_COMM_WORLD, &my_rank );
+    MPI_Comm_size( MPI_COMM_WORLD, &comm_size );
+
+
+/*  Initialize the verification arrays if a valid class */
+    for( i=0; i<TEST_ARRAY_SIZE; i++ )
+        switch( CLASS )
+        {
+            case 'S':
+                test_index_array[i] = S_test_index_array[i];
+                test_rank_array[i]  = S_test_rank_array[i];
+                break;
+            case 'A':
+                test_index_array[i] = A_test_index_array[i];
+                test_rank_array[i]  = A_test_rank_array[i];
+                break;
+            case 'W':
+                test_index_array[i] = W_test_index_array[i];
+                test_rank_array[i]  = W_test_rank_array[i];
+                break;
+            case 'B':
+                test_index_array[i] = B_test_index_array[i];
+                test_rank_array[i]  = B_test_rank_array[i];
+                break;
+            case 'C':
+                test_index_array[i] = C_test_index_array[i];
+                test_rank_array[i]  = C_test_rank_array[i];
+                break;
+            case 'D':
+                test_index_array[i] = D_test_index_array[i];
+                test_rank_array[i]  = D_test_rank_array[i];
+                break;
+        };
+
+        
+
+/*  Printout initial NPB info */
+    if( my_rank == 0 )
+    {
+        FILE *fp;
+        printf( "\n\n NAS Parallel Benchmarks 3.3 -- IS Benchmark\n\n" );
+        printf( " Size:  %ld  (class %c)\n", (long)TOTAL_KEYS*MIN_PROCS, CLASS );
+        printf( " Iterations:   %d\n", MAX_ITERATIONS );
+        printf( " Number of processes:     %d\n", comm_size );
+
+        fp = fopen("timer.flag", "r");
+        timeron = 0;
+        if (fp) {
+            timeron = 1;
+            fclose(fp);
+        }
+    }
+
+/*  Check that actual and compiled number of processors agree */
+    if( comm_size != NUM_PROCS )
+    {
+        if( my_rank == 0 )
+            printf( "\n ERROR: compiled for %d processes\n"
+                    " Number of active processes: %d\n"
+                    " Exiting program!\n\n", NUM_PROCS, comm_size );
+        MPI_Finalize();
+        exit( 1 );
+    }
+
+/*  Check to see whether total number of processes is within bounds.
+    This could in principle be checked in setparams.c, but it is more
+    convenient to do it here                                               */
+    if( comm_size < MIN_PROCS || comm_size > MAX_PROCS)
+    {
+       if( my_rank == 0 )
+           printf( "\n ERROR: number of processes %d not within range %d-%d"
+                   "\n Exiting program!\n\n", comm_size, MIN_PROCS, MAX_PROCS);
+       MPI_Finalize();
+       exit( 1 );
+    }
+
+    MPI_Bcast(&timeron, 1, MPI_INT, 0, MPI_COMM_WORLD);
+
+#ifdef  TIMING_ENABLED 
+    for( i=1; i<=T_LAST; i++ ) timer_clear( i );
+#endif
+
+/*  Generate random number sequence and subsequent keys on all procs */
+    create_seq( find_my_seed( my_rank, 
+                              comm_size, 
+                              4*(long)TOTAL_KEYS*MIN_PROCS,
+                              314159265.00,      /* Random number gen seed */
+                              1220703125.00 ),   /* Random number gen mult */
+                1220703125.00 );                 /* Random number gen mult */
+
+
+/*  Do one iteration for free (i.e., untimed) to guarantee initialization of
+    all data and code pages and respective tables */
+    rank( 1 );  
+
+/*  Start verification counter */
+    passed_verification = 0;
+
+    if( my_rank == 0 && CLASS != 'S' ) printf( "\n   iteration\n" );
+
+/*  Initialize timer  */             
+    timer_clear( 0 );
+
+/*  Initialize separate communication, computation timing */
+#ifdef  TIMING_ENABLED 
+    for( i=1; i<=T_LAST; i++ ) timer_clear( i );
+#endif
+
+/*  Start timer  */             
+    timer_start( 0 );
+
+
+/*  This is the main iteration */
+    for( iteration=1; iteration<=MAX_ITERATIONS; iteration++ )
+    {
+        if( my_rank == 0 && CLASS != 'S' ) printf( "        %d\n", iteration );
+        rank( iteration );
+    }
+
+
+/*  Stop timer, obtain time for processors */
+    timer_stop( 0 );
+
+    timecounter = timer_read( 0 );
+
+/*  End of timing, obtain maximum time of all processors */
+    MPI_Reduce( &timecounter,
+                &maxtime,
+                1,
+                MPI_DOUBLE,
+                MPI_MAX,
+                0,
+                MPI_COMM_WORLD );
+
+
+/*  This tests that keys are in sequence: sorting of last ranked key seq
+    occurs here, but is an untimed operation                             */
+    full_verify();
+
+
+/*  Obtain verification counter sum */
+    itemp = passed_verification;
+    MPI_Reduce( &itemp,
+                &passed_verification,
+                1,
+                MPI_INT,
+                MPI_SUM,
+                0,
+                MPI_COMM_WORLD );
+
+
+
+/*  The final printout  */
+    if( my_rank == 0 )
+    {
+        if( passed_verification != 5*MAX_ITERATIONS + comm_size )
+            passed_verification = 0;
+        c_print_results( "IS",
+                         CLASS,
+                         (int)(TOTAL_KEYS),
+                         MIN_PROCS,
+                         0,
+                         MAX_ITERATIONS,
+                         NUM_PROCS,
+                         comm_size,
+                         maxtime,
+                         ((double) (MAX_ITERATIONS)*TOTAL_KEYS*MIN_PROCS)
+                                                      /maxtime/1000000.,
+                         "keys ranked", 
+                         passed_verification,
+                         NPBVERSION,
+                         COMPILETIME,
+                         MPICC,
+                         CLINK,
+                         CMPI_LIB,
+                         CMPI_INC,
+                         CFLAGS,
+                         CLINKFLAGS );
+    }
+                    
+
+#ifdef  TIMING_ENABLED
+    if (timeron)
+    {
+        double    t1[T_LAST+1], tmin[T_LAST+1], tsum[T_LAST+1], tmax[T_LAST+1];
+        char      t_recs[T_LAST+1][9];
+    
+        for( i=0; i<=T_LAST; i++ )
+            t1[i] = timer_read( i );
+
+        MPI_Reduce( t1,
+                    tmin,
+                    T_LAST+1,
+                    MPI_DOUBLE,
+                    MPI_MIN,
+                    0,
+                    MPI_COMM_WORLD );
+        MPI_Reduce( t1,
+                    tsum,
+                    T_LAST+1,
+                    MPI_DOUBLE,
+                    MPI_SUM,
+                    0,
+                    MPI_COMM_WORLD );
+        MPI_Reduce( t1,
+                    tmax,
+                    T_LAST+1,
+                    MPI_DOUBLE,
+                    MPI_MAX,
+                    0,
+                    MPI_COMM_WORLD );
+
+        if( my_rank == 0 )
+        {
+            strcpy( t_recs[T_TOTAL],  "total" );
+            strcpy( t_recs[T_RANK],   "rcomp" );
+            strcpy( t_recs[T_RCOMM],  "rcomm" );
+            strcpy( t_recs[T_VERIFY], "verify");
+            printf( " nprocs = %6d     ", comm_size);
+            printf( "     minimum     maximum     average\n" );
+            for( i=0; i<=T_LAST; i++ )
+            {
+                printf( " timer %2d (%-8s):  %10.4f  %10.4f  %10.4f\n",
+                        i+1, t_recs[i], tmin[i], tmax[i], 
+                        tsum[i]/((double) comm_size) );
+            }
+            printf( "\n" );
+        }
+    }
+#endif
+
+    MPI_Finalize();
+
+
+    return 0;
+         /**************************/
+}        /*  E N D  P R O G R A M  */
+         /**************************/
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/Makefile b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/Makefile
new file mode 100644
index 0000000..62891f8
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/Makefile
@@ -0,0 +1,73 @@
+SHELL=/bin/sh
+BENCHMARK=lu
+BENCHMARKU=LU
+VEC=
+
+include ../config/make.def
+
+OBJS = lu.o init_comm.o read_input.o bcast_inputs.o proc_grid.o neighbors.o \
+       nodedim.o subdomain.o setcoeff.o setbv.o exact.o setiv.o \
+       erhs.o ssor.o exchange_1.o exchange_3.o exchange_4.o exchange_5.o \
+       exchange_6.o rhs.o l2norm.o jacld.o blts$(VEC).o jacu.o buts$(VEC).o \
+       error.o pintgr.o verify.o ${COMMON}/print_results.o ${COMMON}/timers.o
+
+include ../sys/make.common
+
+
+# npbparams.h is included by applu.incl
+# The following rule should do the trick but many make programs (not gmake)
+# will do the wrong thing and rebuild the world every time (because the
+# mod time on header.h is not changed). One solution would be to
+# touch header.h but this might cause confusion if someone has
+# accidentally deleted it. Instead, make the dependency on npbparams.h
+# explicit in all the lines below (even though dependence is indirect). 
+
+# applu.incl: npbparams.h
+
+${PROGRAM}: config
+	@if [ x$(VERSION) = xvec ] ; then	\
+		${MAKE} VEC=_vec exec;		\
+	elif [ x$(VERSION) = xVEC ] ; then	\
+		${MAKE} VEC=_vec exec;		\
+	else					\
+		${MAKE} exec;			\
+	fi
+
+exec: $(OBJS)
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} ${FMPI_LIB}
+
+.f.o :
+	${FCOMPILE} $<
+
+lu.o:		lu.f applu.incl npbparams.h
+bcast_inputs.o:	bcast_inputs.f applu.incl npbparams.h mpinpb.h
+blts$(VEC).o:	blts$(VEC).f timing.h
+buts$(VEC).o:	buts$(VEC).f timing.h
+erhs.o:		erhs.f applu.incl npbparams.h
+error.o:	error.f applu.incl npbparams.h mpinpb.h
+exact.o:	exact.f applu.incl npbparams.h
+exchange_1.o:	exchange_1.f applu.incl npbparams.h mpinpb.h
+exchange_3.o:	exchange_3.f applu.incl npbparams.h mpinpb.h
+exchange_4.o:	exchange_4.f applu.incl npbparams.h mpinpb.h
+exchange_5.o:	exchange_5.f applu.incl npbparams.h mpinpb.h
+exchange_6.o:	exchange_6.f applu.incl npbparams.h mpinpb.h
+init_comm.o:	init_comm.f applu.incl npbparams.h mpinpb.h 
+jacld.o:	jacld.f applu.incl npbparams.h
+jacu.o:		jacu.f applu.incl npbparams.h
+l2norm.o:	l2norm.f mpinpb.h timing.h
+neighbors.o:	neighbors.f applu.incl npbparams.h
+nodedim.o:	nodedim.f
+pintgr.o:	pintgr.f applu.incl npbparams.h mpinpb.h
+proc_grid.o:	proc_grid.f applu.incl npbparams.h
+read_input.o:	read_input.f applu.incl npbparams.h mpinpb.h
+rhs.o:		rhs.f applu.incl npbparams.h
+setbv.o:	setbv.f applu.incl npbparams.h
+setiv.o:	setiv.f applu.incl npbparams.h
+setcoeff.o:	setcoeff.f applu.incl npbparams.h
+ssor.o:		ssor.f applu.incl npbparams.h mpinpb.h
+subdomain.o:	subdomain.f applu.incl npbparams.h mpinpb.h
+verify.o:	verify.f applu.incl npbparams.h
+
+clean:
+	- /bin/rm -f npbparams.h
+	- /bin/rm -f *.o *~
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/applu.incl b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/applu.incl
new file mode 100644
index 0000000..f2eb6b9
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/applu.incl
@@ -0,0 +1,147 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+c---  applu.incl   
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   npbparams.h defines parameters that depend on the class and 
+c   number of nodes
+c---------------------------------------------------------------------
+
+      include 'npbparams.h'
+
+c---------------------------------------------------------------------
+c   parameters which can be overridden in runtime config file
+c   (in addition to size of problem - isiz01,02,03 give the maximum size)
+c   ipr = 1 to print out verbose information
+c   omega = 2.0 is correct for all classes
+c   tolrsd holds the tolerance levels for steady state residuals
+c---------------------------------------------------------------------
+      integer ipr_default
+      parameter (ipr_default = 1)
+      double precision omega_default
+      parameter (omega_default = 1.2d0)
+      double precision tolrsd1_def, tolrsd2_def, tolrsd3_def, 
+     >                 tolrsd4_def, tolrsd5_def
+      parameter (tolrsd1_def=1.0e-08, 
+     >          tolrsd2_def=1.0e-08, tolrsd3_def=1.0e-08, 
+     >          tolrsd4_def=1.0e-08, tolrsd5_def=1.0e-08)
+
+      double precision c1, c2, c3, c4, c5
+      parameter( c1 = 1.40d+00, c2 = 0.40d+00,
+     >           c3 = 1.00d-01, c4 = 1.00d+00,
+     >           c5 = 1.40d+00 )
+
+c---------------------------------------------------------------------
+c   grid
+c---------------------------------------------------------------------
+      integer nx, ny, nz
+      integer nx0, ny0, nz0
+      integer ipt, ist, iend
+      integer jpt, jst, jend
+      integer ii1, ii2
+      integer ji1, ji2
+      integer ki1, ki2
+      double precision  dxi, deta, dzeta
+      double precision  tx1, tx2, tx3
+      double precision  ty1, ty2, ty3
+      double precision  tz1, tz2, tz3
+
+      common/cgcon/ dxi, deta, dzeta,
+     >              tx1, tx2, tx3,
+     >              ty1, ty2, ty3,
+     >              tz1, tz2, tz3,
+     >              nx, ny, nz, 
+     >              nx0, ny0, nz0,
+     >              ipt, ist, iend,
+     >              jpt, jst, jend,
+     >              ii1, ii2, 
+     >              ji1, ji2, 
+     >              ki1, ki2
+
+c---------------------------------------------------------------------
+c   dissipation
+c---------------------------------------------------------------------
+      double precision dx1, dx2, dx3, dx4, dx5
+      double precision dy1, dy2, dy3, dy4, dy5
+      double precision dz1, dz2, dz3, dz4, dz5
+      double precision dssp
+
+      common/disp/ dx1,dx2,dx3,dx4,dx5,
+     >             dy1,dy2,dy3,dy4,dy5,
+     >             dz1,dz2,dz3,dz4,dz5,
+     >             dssp
+
+c---------------------------------------------------------------------
+c   field variables and residuals
+c---------------------------------------------------------------------
+      double precision u(5,-1:isiz1+2,-1:isiz2+2,isiz3),
+     >       rsd(5,-1:isiz1+2,-1:isiz2+2,isiz3),
+     >       frct(5,-1:isiz1+2,-1:isiz2+2,isiz3),
+     >       flux(5,0:isiz1+1,0:isiz2+1,isiz3)
+
+      common/cvar/ u,
+     >             rsd,
+     >             frct,
+     >             flux
+
+
+c---------------------------------------------------------------------
+c   output control parameters
+c---------------------------------------------------------------------
+      integer ipr, inorm
+
+      common/cprcon/ ipr, inorm
+
+c---------------------------------------------------------------------
+c   newton-raphson iteration control parameters
+c---------------------------------------------------------------------
+      integer itmax, invert
+      double precision  dt, omega, tolrsd(5),
+     >        rsdnm(5), errnm(5), frc, ttotal
+
+      common/ctscon/ dt, omega, tolrsd,
+     >               rsdnm, errnm, frc, ttotal,
+     >               itmax, invert
+
+      double precision a(5,5,isiz1,isiz2),
+     >       b(5,5,isiz1,isiz2),
+     >       c(5,5,isiz1,isiz2),
+     >       d(5,5,isiz1,isiz2)
+
+      common/cjac/ a, b, c, d
+
+c---------------------------------------------------------------------
+c   coefficients of the exact solution
+c---------------------------------------------------------------------
+      double precision ce(5,13)
+
+      common/cexact/ ce
+
+c---------------------------------------------------------------------
+c   multi-processor common blocks
+c---------------------------------------------------------------------
+      integer id, ndim, num, xdim, ydim, row, col
+      common/dim/ id,ndim,num,xdim,ydim,row,col
+
+      integer north,south,east,west
+      common/neigh/ north,south,east, west
+
+      integer from_s,from_n,from_e,from_w
+      parameter (from_s=1,from_n=2,from_e=3,from_w=4)
+
+      double precision  buf(5,2*isiz2*isiz3),
+     >                  buf1(5,2*isiz2*isiz3)
+
+      common/comm/ buf, buf1
+
+c---------------------------------------------------------------------
+
+      include 'timing.h'
+
+
+c---------------------------------------------------------------------
+c   end of include file
+c---------------------------------------------------------------------
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/bcast_inputs.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/bcast_inputs.f
new file mode 100644
index 0000000..a6810b2
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/bcast_inputs.f
@@ -0,0 +1,43 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine bcast_inputs
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'mpinpb.h'
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer ierr
+
+c---------------------------------------------------------------------
+c   root broadcasts the data
+c   The data isn't contiguous or of the same type, so it's not
+c   clear how to send it in the "MPI" way. 
+c   We could pack the info into a buffer or we could create
+c   an obscene datatype to handle it all at once. Since we only
+c   broadcast the data once, just use a separate broadcast for
+c   each piece. 
+c---------------------------------------------------------------------
+      call MPI_BCAST(ipr, 1, MPI_INTEGER, root, MPI_COMM_WORLD, ierr)
+      call MPI_BCAST(inorm, 1, MPI_INTEGER, root, MPI_COMM_WORLD, ierr)
+      call MPI_BCAST(itmax, 1, MPI_INTEGER, root, MPI_COMM_WORLD, ierr)
+      call MPI_BCAST(dt, 1, dp_type, root, MPI_COMM_WORLD, ierr)
+      call MPI_BCAST(omega, 1, dp_type, root, MPI_COMM_WORLD, ierr)
+      call MPI_BCAST(tolrsd, 5, dp_type, root, MPI_COMM_WORLD, ierr)
+      call MPI_BCAST(nx0, 1, MPI_INTEGER, root, MPI_COMM_WORLD, ierr)
+      call MPI_BCAST(ny0, 1, MPI_INTEGER, root, MPI_COMM_WORLD, ierr)
+      call MPI_BCAST(nz0, 1, MPI_INTEGER, root, MPI_COMM_WORLD, ierr)
+      call MPI_BCAST(timeron, 1, MPI_LOGICAL, root, MPI_COMM_WORLD, 
+     &               ierr)
+
+      return
+      end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/blts.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/blts.f
new file mode 100644
index 0000000..89aada3
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/blts.f
@@ -0,0 +1,269 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine blts ( ldmx, ldmy, ldmz,
+     >                  nx, ny, nz, k,
+     >                  omega,
+     >                  v,
+     >                  ldz, ldy, ldx, d,
+     >                  ist, iend, jst, jend,
+     >                  nx0, ny0, ipt, jpt)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c
+c   compute the regular-sparse, block lower triangular solution:
+c
+c                     v <-- ( L-inv ) * v
+c
+c---------------------------------------------------------------------
+
+      implicit none
+
+c---------------------------------------------------------------------
+c  input parameters
+c---------------------------------------------------------------------
+      integer ldmx, ldmy, ldmz
+      integer nx, ny, nz
+      integer k
+      double precision  omega
+      double precision  v( 5, -1:ldmx+2, -1:ldmy+2, *),
+     >        ldz( 5, 5, ldmx, ldmy),
+     >        ldy( 5, 5, ldmx, ldmy),
+     >        ldx( 5, 5, ldmx, ldmy),
+     >        d( 5, 5, ldmx, ldmy)
+      integer ist, iend
+      integer jst, jend
+      integer nx0, ny0
+      integer ipt, jpt
+
+      include 'timing.h'
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j, m
+      integer iex
+      double precision  tmp, tmp1
+      double precision  tmat(5,5)
+
+
+c---------------------------------------------------------------------
+c   receive data from north and west
+c---------------------------------------------------------------------
+      if (timeron) call timer_start(t_lcomm)
+      iex = 0
+      call exchange_1( v,k,iex )
+      if (timeron) call timer_stop(t_lcomm)
+
+
+      if (timeron) call timer_start(t_blts)
+      do j = jst, jend
+         do i = ist, iend
+            do m = 1, 5
+
+                  v( m, i, j, k ) =  v( m, i, j, k )
+     >    - omega * (  ldz( m, 1, i, j ) * v( 1, i, j, k-1 )
+     >               + ldz( m, 2, i, j ) * v( 2, i, j, k-1 )
+     >               + ldz( m, 3, i, j ) * v( 3, i, j, k-1 )
+     >               + ldz( m, 4, i, j ) * v( 4, i, j, k-1 )
+     >               + ldz( m, 5, i, j ) * v( 5, i, j, k-1 )  )
+
+            end do
+         end do
+      end do
+
+
+      do j=jst,jend
+        do i = ist, iend
+
+            do m = 1, 5
+
+                  v( m, i, j, k ) =  v( m, i, j, k )
+     > - omega * ( ldy( m, 1, i, j ) * v( 1, i, j-1, k )
+     >           + ldx( m, 1, i, j ) * v( 1, i-1, j, k )
+     >           + ldy( m, 2, i, j ) * v( 2, i, j-1, k )
+     >           + ldx( m, 2, i, j ) * v( 2, i-1, j, k )
+     >           + ldy( m, 3, i, j ) * v( 3, i, j-1, k )
+     >           + ldx( m, 3, i, j ) * v( 3, i-1, j, k )
+     >           + ldy( m, 4, i, j ) * v( 4, i, j-1, k )
+     >           + ldx( m, 4, i, j ) * v( 4, i-1, j, k )
+     >           + ldy( m, 5, i, j ) * v( 5, i, j-1, k )
+     >           + ldx( m, 5, i, j ) * v( 5, i-1, j, k ) )
+
+            end do
+       
+c---------------------------------------------------------------------
+c   diagonal block inversion
+c
+c   forward elimination
+c---------------------------------------------------------------------
+            do m = 1, 5
+               tmat( m, 1 ) = d( m, 1, i, j )
+               tmat( m, 2 ) = d( m, 2, i, j )
+               tmat( m, 3 ) = d( m, 3, i, j )
+               tmat( m, 4 ) = d( m, 4, i, j )
+               tmat( m, 5 ) = d( m, 5, i, j )
+            end do
+
+            tmp1 = 1.0d+00 / tmat( 1, 1 )
+            tmp = tmp1 * tmat( 2, 1 )
+            tmat( 2, 2 ) =  tmat( 2, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 2, 3 ) =  tmat( 2, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 2, 4 ) =  tmat( 2, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 2, 5 ) =  tmat( 2, 5 )
+     >           - tmp * tmat( 1, 5 )
+            v( 2, i, j, k ) = v( 2, i, j, k )
+     >        - v( 1, i, j, k ) * tmp
+
+            tmp = tmp1 * tmat( 3, 1 )
+            tmat( 3, 2 ) =  tmat( 3, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 3, 3 ) =  tmat( 3, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 3, 4 ) =  tmat( 3, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 3, 5 ) =  tmat( 3, 5 )
+     >           - tmp * tmat( 1, 5 )
+            v( 3, i, j, k ) = v( 3, i, j, k )
+     >        - v( 1, i, j, k ) * tmp
+
+            tmp = tmp1 * tmat( 4, 1 )
+            tmat( 4, 2 ) =  tmat( 4, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 4, 3 ) =  tmat( 4, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )
+     >           - tmp * tmat( 1, 5 )
+            v( 4, i, j, k ) = v( 4, i, j, k )
+     >        - v( 1, i, j, k ) * tmp
+
+            tmp = tmp1 * tmat( 5, 1 )
+            tmat( 5, 2 ) =  tmat( 5, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 5, 3 ) =  tmat( 5, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 1, 5 )
+            v( 5, i, j, k ) = v( 5, i, j, k )
+     >        - v( 1, i, j, k ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 2, 2 )
+            tmp = tmp1 * tmat( 3, 2 )
+            tmat( 3, 3 ) =  tmat( 3, 3 )
+     >           - tmp * tmat( 2, 3 )
+            tmat( 3, 4 ) =  tmat( 3, 4 )
+     >           - tmp * tmat( 2, 4 )
+            tmat( 3, 5 ) =  tmat( 3, 5 )
+     >           - tmp * tmat( 2, 5 )
+            v( 3, i, j, k ) = v( 3, i, j, k )
+     >        - v( 2, i, j, k ) * tmp
+
+            tmp = tmp1 * tmat( 4, 2 )
+            tmat( 4, 3 ) =  tmat( 4, 3 )
+     >           - tmp * tmat( 2, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )
+     >           - tmp * tmat( 2, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )
+     >           - tmp * tmat( 2, 5 )
+            v( 4, i, j, k ) = v( 4, i, j, k )
+     >        - v( 2, i, j, k ) * tmp
+
+            tmp = tmp1 * tmat( 5, 2 )
+            tmat( 5, 3 ) =  tmat( 5, 3 )
+     >           - tmp * tmat( 2, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )
+     >           - tmp * tmat( 2, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 2, 5 )
+            v( 5, i, j, k ) = v( 5, i, j, k )
+     >        - v( 2, i, j, k ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 3, 3 )
+            tmp = tmp1 * tmat( 4, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )
+     >           - tmp * tmat( 3, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )
+     >           - tmp * tmat( 3, 5 )
+            v( 4, i, j, k ) = v( 4, i, j, k )
+     >        - v( 3, i, j, k ) * tmp
+
+            tmp = tmp1 * tmat( 5, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )
+     >           - tmp * tmat( 3, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 3, 5 )
+            v( 5, i, j, k ) = v( 5, i, j, k )
+     >        - v( 3, i, j, k ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 4, 4 )
+            tmp = tmp1 * tmat( 5, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 4, 5 )
+            v( 5, i, j, k ) = v( 5, i, j, k )
+     >        - v( 4, i, j, k ) * tmp
+
+c---------------------------------------------------------------------
+c   back substitution
+c---------------------------------------------------------------------
+            v( 5, i, j, k ) = v( 5, i, j, k )
+     >                      / tmat( 5, 5 )
+
+            v( 4, i, j, k ) = v( 4, i, j, k )
+     >           - tmat( 4, 5 ) * v( 5, i, j, k )
+            v( 4, i, j, k ) = v( 4, i, j, k )
+     >                      / tmat( 4, 4 )
+
+            v( 3, i, j, k ) = v( 3, i, j, k )
+     >           - tmat( 3, 4 ) * v( 4, i, j, k )
+     >           - tmat( 3, 5 ) * v( 5, i, j, k )
+            v( 3, i, j, k ) = v( 3, i, j, k )
+     >                      / tmat( 3, 3 )
+
+            v( 2, i, j, k ) = v( 2, i, j, k )
+     >           - tmat( 2, 3 ) * v( 3, i, j, k )
+     >           - tmat( 2, 4 ) * v( 4, i, j, k )
+     >           - tmat( 2, 5 ) * v( 5, i, j, k )
+            v( 2, i, j, k ) = v( 2, i, j, k )
+     >                      / tmat( 2, 2 )
+
+            v( 1, i, j, k ) = v( 1, i, j, k )
+     >           - tmat( 1, 2 ) * v( 2, i, j, k )
+     >           - tmat( 1, 3 ) * v( 3, i, j, k )
+     >           - tmat( 1, 4 ) * v( 4, i, j, k )
+     >           - tmat( 1, 5 ) * v( 5, i, j, k )
+            v( 1, i, j, k ) = v( 1, i, j, k )
+     >                      / tmat( 1, 1 )
+
+
+        enddo
+      enddo
+      if (timeron) call timer_stop(t_blts)
+
+c---------------------------------------------------------------------
+c   send data to east and south
+c---------------------------------------------------------------------
+      if (timeron) call timer_start(t_lcomm)
+      iex = 2
+      call exchange_1( v,k,iex )
+      if (timeron) call timer_stop(t_lcomm)
+
+      return
+      end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/blts_vec.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/blts_vec.f
new file mode 100644
index 0000000..3b2c9d0
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/blts_vec.f
@@ -0,0 +1,342 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine blts ( ldmx, ldmy, ldmz,
+     >                  nx, ny, nz, k,
+     >                  omega,
+     >                  v,
+     >                  ldz, ldy, ldx, d,
+     >                  ist, iend, jst, jend,
+     >                  nx0, ny0, ipt, jpt)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c
+c   compute the regular-sparse, block lower triangular solution:
+c
+c                     v <-- ( L-inv ) * v
+c
+c---------------------------------------------------------------------
+
+      implicit none
+
+c---------------------------------------------------------------------
+c  input parameters
+c---------------------------------------------------------------------
+      integer ldmx, ldmy, ldmz
+      integer nx, ny, nz
+      integer k
+      double precision  omega
+      double precision  v( 5, -1:ldmx+2, -1:ldmy+2, *),
+     >        ldz( 5, 5, ldmx, ldmy),
+     >        ldy( 5, 5, ldmx, ldmy),
+     >        ldx( 5, 5, ldmx, ldmy),
+     >        d( 5, 5, ldmx, ldmy)
+      integer ist, iend
+      integer jst, jend
+      integer nx0, ny0
+      integer ipt, jpt
+
+      include 'timing.h'
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j, m, l, istp, iendp
+      integer iex
+      double precision  tmp, tmp1
+      double precision  tmat(5,5)
+
+
+c---------------------------------------------------------------------
+c   receive data from north and west
+c---------------------------------------------------------------------
+      if (timeron) call timer_start(t_lcomm)
+      iex = 0
+      call exchange_1( v,k,iex )
+      if (timeron) call timer_stop(t_lcomm)
+
+
+      if (timeron) call timer_start(t_blts)
+      do j = jst, jend
+         do i = ist, iend
+            do m = 1, 5
+
+                  v( m, i, j, k ) =  v( m, i, j, k )
+     >    - omega * (  ldz( m, 1, i, j ) * v( 1, i, j, k-1 )
+     >               + ldz( m, 2, i, j ) * v( 2, i, j, k-1 )
+     >               + ldz( m, 3, i, j ) * v( 3, i, j, k-1 )
+     >               + ldz( m, 4, i, j ) * v( 4, i, j, k-1 )
+     >               + ldz( m, 5, i, j ) * v( 5, i, j, k-1 )  )
+
+            end do
+         end do
+      end do
+
+
+      do l = ist+jst, iend+jend
+         istp  = max(l - jend, ist)
+         iendp = min(l - jst, iend)
+
+!dir$ ivdep
+         do i = istp, iendp
+            j = l - i
+
+!!dir$ unroll 5
+!   manually unroll the loop
+!            do m = 1, 5
+
+                  v( 1, i, j, k ) =  v( 1, i, j, k )
+     > - omega * ( ldy( 1, 1, i, j ) * v( 1, i, j-1, k )
+     >           + ldx( 1, 1, i, j ) * v( 1, i-1, j, k )
+     >           + ldy( 1, 2, i, j ) * v( 2, i, j-1, k )
+     >           + ldx( 1, 2, i, j ) * v( 2, i-1, j, k )
+     >           + ldy( 1, 3, i, j ) * v( 3, i, j-1, k )
+     >           + ldx( 1, 3, i, j ) * v( 3, i-1, j, k )
+     >           + ldy( 1, 4, i, j ) * v( 4, i, j-1, k )
+     >           + ldx( 1, 4, i, j ) * v( 4, i-1, j, k )
+     >           + ldy( 1, 5, i, j ) * v( 5, i, j-1, k )
+     >           + ldx( 1, 5, i, j ) * v( 5, i-1, j, k ) )
+                  v( 2, i, j, k ) =  v( 2, i, j, k )
+     > - omega * ( ldy( 2, 1, i, j ) * v( 1, i, j-1, k )
+     >           + ldx( 2, 1, i, j ) * v( 1, i-1, j, k )
+     >           + ldy( 2, 2, i, j ) * v( 2, i, j-1, k )
+     >           + ldx( 2, 2, i, j ) * v( 2, i-1, j, k )
+     >           + ldy( 2, 3, i, j ) * v( 3, i, j-1, k )
+     >           + ldx( 2, 3, i, j ) * v( 3, i-1, j, k )
+     >           + ldy( 2, 4, i, j ) * v( 4, i, j-1, k )
+     >           + ldx( 2, 4, i, j ) * v( 4, i-1, j, k )
+     >           + ldy( 2, 5, i, j ) * v( 5, i, j-1, k )
+     >           + ldx( 2, 5, i, j ) * v( 5, i-1, j, k ) )
+                  v( 3, i, j, k ) =  v( 3, i, j, k )
+     > - omega * ( ldy( 3, 1, i, j ) * v( 1, i, j-1, k )
+     >           + ldx( 3, 1, i, j ) * v( 1, i-1, j, k )
+     >           + ldy( 3, 2, i, j ) * v( 2, i, j-1, k )
+     >           + ldx( 3, 2, i, j ) * v( 2, i-1, j, k )
+     >           + ldy( 3, 3, i, j ) * v( 3, i, j-1, k )
+     >           + ldx( 3, 3, i, j ) * v( 3, i-1, j, k )
+     >           + ldy( 3, 4, i, j ) * v( 4, i, j-1, k )
+     >           + ldx( 3, 4, i, j ) * v( 4, i-1, j, k )
+     >           + ldy( 3, 5, i, j ) * v( 5, i, j-1, k )
+     >           + ldx( 3, 5, i, j ) * v( 5, i-1, j, k ) )
+                  v( 4, i, j, k ) =  v( 4, i, j, k )
+     > - omega * ( ldy( 4, 1, i, j ) * v( 1, i, j-1, k )
+     >           + ldx( 4, 1, i, j ) * v( 1, i-1, j, k )
+     >           + ldy( 4, 2, i, j ) * v( 2, i, j-1, k )
+     >           + ldx( 4, 2, i, j ) * v( 2, i-1, j, k )
+     >           + ldy( 4, 3, i, j ) * v( 3, i, j-1, k )
+     >           + ldx( 4, 3, i, j ) * v( 3, i-1, j, k )
+     >           + ldy( 4, 4, i, j ) * v( 4, i, j-1, k )
+     >           + ldx( 4, 4, i, j ) * v( 4, i-1, j, k )
+     >           + ldy( 4, 5, i, j ) * v( 5, i, j-1, k )
+     >           + ldx( 4, 5, i, j ) * v( 5, i-1, j, k ) )
+                  v( 5, i, j, k ) =  v( 5, i, j, k )
+     > - omega * ( ldy( 5, 1, i, j ) * v( 1, i, j-1, k )
+     >           + ldx( 5, 1, i, j ) * v( 1, i-1, j, k )
+     >           + ldy( 5, 2, i, j ) * v( 2, i, j-1, k )
+     >           + ldx( 5, 2, i, j ) * v( 2, i-1, j, k )
+     >           + ldy( 5, 3, i, j ) * v( 3, i, j-1, k )
+     >           + ldx( 5, 3, i, j ) * v( 3, i-1, j, k )
+     >           + ldy( 5, 4, i, j ) * v( 4, i, j-1, k )
+     >           + ldx( 5, 4, i, j ) * v( 4, i-1, j, k )
+     >           + ldy( 5, 5, i, j ) * v( 5, i, j-1, k )
+     >           + ldx( 5, 5, i, j ) * v( 5, i-1, j, k ) )
+
+!            end do
+
+c---------------------------------------------------------------------
+c   diagonal block inversion
+c
+c   forward elimination
+c---------------------------------------------------------------------
+!!dir$ unroll 5
+!   manually unroll the loop
+!            do m = 1, 5
+               tmat( 1, 1 ) = d( 1, 1, i, j )
+               tmat( 1, 2 ) = d( 1, 2, i, j )
+               tmat( 1, 3 ) = d( 1, 3, i, j )
+               tmat( 1, 4 ) = d( 1, 4, i, j )
+               tmat( 1, 5 ) = d( 1, 5, i, j )
+               tmat( 2, 1 ) = d( 2, 1, i, j )
+               tmat( 2, 2 ) = d( 2, 2, i, j )
+               tmat( 2, 3 ) = d( 2, 3, i, j )
+               tmat( 2, 4 ) = d( 2, 4, i, j )
+               tmat( 2, 5 ) = d( 2, 5, i, j )
+               tmat( 3, 1 ) = d( 3, 1, i, j )
+               tmat( 3, 2 ) = d( 3, 2, i, j )
+               tmat( 3, 3 ) = d( 3, 3, i, j )
+               tmat( 3, 4 ) = d( 3, 4, i, j )
+               tmat( 3, 5 ) = d( 3, 5, i, j )
+               tmat( 4, 1 ) = d( 4, 1, i, j )
+               tmat( 4, 2 ) = d( 4, 2, i, j )
+               tmat( 4, 3 ) = d( 4, 3, i, j )
+               tmat( 4, 4 ) = d( 4, 4, i, j )
+               tmat( 4, 5 ) = d( 4, 5, i, j )
+               tmat( 5, 1 ) = d( 5, 1, i, j )
+               tmat( 5, 2 ) = d( 5, 2, i, j )
+               tmat( 5, 3 ) = d( 5, 3, i, j )
+               tmat( 5, 4 ) = d( 5, 4, i, j )
+               tmat( 5, 5 ) = d( 5, 5, i, j )
+!            end do
+
+            tmp1 = 1.0d+00 / tmat( 1, 1 )
+            tmp = tmp1 * tmat( 2, 1 )
+            tmat( 2, 2 ) =  tmat( 2, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 2, 3 ) =  tmat( 2, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 2, 4 ) =  tmat( 2, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 2, 5 ) =  tmat( 2, 5 )
+     >           - tmp * tmat( 1, 5 )
+            v( 2, i, j, k ) = v( 2, i, j, k )
+     >        - v( 1, i, j, k ) * tmp
+
+            tmp = tmp1 * tmat( 3, 1 )
+            tmat( 3, 2 ) =  tmat( 3, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 3, 3 ) =  tmat( 3, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 3, 4 ) =  tmat( 3, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 3, 5 ) =  tmat( 3, 5 )
+     >           - tmp * tmat( 1, 5 )
+            v( 3, i, j, k ) = v( 3, i, j, k )
+     >        - v( 1, i, j, k ) * tmp
+
+            tmp = tmp1 * tmat( 4, 1 )
+            tmat( 4, 2 ) =  tmat( 4, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 4, 3 ) =  tmat( 4, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )
+     >           - tmp * tmat( 1, 5 )
+            v( 4, i, j, k ) = v( 4, i, j, k )
+     >        - v( 1, i, j, k ) * tmp
+
+            tmp = tmp1 * tmat( 5, 1 )
+            tmat( 5, 2 ) =  tmat( 5, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 5, 3 ) =  tmat( 5, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 1, 5 )
+            v( 5, i, j, k ) = v( 5, i, j, k )
+     >        - v( 1, i, j, k ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 2, 2 )
+            tmp = tmp1 * tmat( 3, 2 )
+            tmat( 3, 3 ) =  tmat( 3, 3 )
+     >           - tmp * tmat( 2, 3 )
+            tmat( 3, 4 ) =  tmat( 3, 4 )
+     >           - tmp * tmat( 2, 4 )
+            tmat( 3, 5 ) =  tmat( 3, 5 )
+     >           - tmp * tmat( 2, 5 )
+            v( 3, i, j, k ) = v( 3, i, j, k )
+     >        - v( 2, i, j, k ) * tmp
+
+            tmp = tmp1 * tmat( 4, 2 )
+            tmat( 4, 3 ) =  tmat( 4, 3 )
+     >           - tmp * tmat( 2, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )
+     >           - tmp * tmat( 2, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )
+     >           - tmp * tmat( 2, 5 )
+            v( 4, i, j, k ) = v( 4, i, j, k )
+     >        - v( 2, i, j, k ) * tmp
+
+            tmp = tmp1 * tmat( 5, 2 )
+            tmat( 5, 3 ) =  tmat( 5, 3 )
+     >           - tmp * tmat( 2, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )
+     >           - tmp * tmat( 2, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 2, 5 )
+            v( 5, i, j, k ) = v( 5, i, j, k )
+     >        - v( 2, i, j, k ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 3, 3 )
+            tmp = tmp1 * tmat( 4, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )
+     >           - tmp * tmat( 3, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )
+     >           - tmp * tmat( 3, 5 )
+            v( 4, i, j, k ) = v( 4, i, j, k )
+     >        - v( 3, i, j, k ) * tmp
+
+            tmp = tmp1 * tmat( 5, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )
+     >           - tmp * tmat( 3, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 3, 5 )
+            v( 5, i, j, k ) = v( 5, i, j, k )
+     >        - v( 3, i, j, k ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 4, 4 )
+            tmp = tmp1 * tmat( 5, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 4, 5 )
+            v( 5, i, j, k ) = v( 5, i, j, k )
+     >        - v( 4, i, j, k ) * tmp
+
+c---------------------------------------------------------------------
+c   back substitution
+c---------------------------------------------------------------------
+            v( 5, i, j, k ) = v( 5, i, j, k )
+     >                      / tmat( 5, 5 )
+
+            v( 4, i, j, k ) = v( 4, i, j, k )
+     >           - tmat( 4, 5 ) * v( 5, i, j, k )
+            v( 4, i, j, k ) = v( 4, i, j, k )
+     >                      / tmat( 4, 4 )
+
+            v( 3, i, j, k ) = v( 3, i, j, k )
+     >           - tmat( 3, 4 ) * v( 4, i, j, k )
+     >           - tmat( 3, 5 ) * v( 5, i, j, k )
+            v( 3, i, j, k ) = v( 3, i, j, k )
+     >                      / tmat( 3, 3 )
+
+            v( 2, i, j, k ) = v( 2, i, j, k )
+     >           - tmat( 2, 3 ) * v( 3, i, j, k )
+     >           - tmat( 2, 4 ) * v( 4, i, j, k )
+     >           - tmat( 2, 5 ) * v( 5, i, j, k )
+            v( 2, i, j, k ) = v( 2, i, j, k )
+     >                      / tmat( 2, 2 )
+
+            v( 1, i, j, k ) = v( 1, i, j, k )
+     >           - tmat( 1, 2 ) * v( 2, i, j, k )
+     >           - tmat( 1, 3 ) * v( 3, i, j, k )
+     >           - tmat( 1, 4 ) * v( 4, i, j, k )
+     >           - tmat( 1, 5 ) * v( 5, i, j, k )
+            v( 1, i, j, k ) = v( 1, i, j, k )
+     >                      / tmat( 1, 1 )
+
+
+        enddo
+      enddo
+      if (timeron) call timer_stop(t_blts)
+
+c---------------------------------------------------------------------
+c   send data to east and south
+c---------------------------------------------------------------------
+      if (timeron) call timer_start(t_lcomm)
+      iex = 2
+      call exchange_1( v,k,iex )
+      if (timeron) call timer_stop(t_lcomm)
+
+      return
+      end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/buts.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/buts.f
new file mode 100644
index 0000000..b86e313
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/buts.f
@@ -0,0 +1,267 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine buts( ldmx, ldmy, ldmz,
+     >                 nx, ny, nz, k,
+     >                 omega,
+     >                 v, tv,
+     >                 d, udx, udy, udz,
+     >                 ist, iend, jst, jend,
+     >                 nx0, ny0, ipt, jpt )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c
+c   compute the regular-sparse, block upper triangular solution:
+c
+c                     v <-- ( U-inv ) * v
+c
+c---------------------------------------------------------------------
+
+      implicit none
+
+c---------------------------------------------------------------------
+c  input parameters
+c---------------------------------------------------------------------
+      integer ldmx, ldmy, ldmz
+      integer nx, ny, nz
+      integer k
+      double precision  omega
+      double precision  v( 5, -1:ldmx+2, -1:ldmy+2, *),
+     >        tv(5, ldmx, ldmy),
+     >        d( 5, 5, ldmx, ldmy),
+     >        udx( 5, 5, ldmx, ldmy),
+     >        udy( 5, 5, ldmx, ldmy),
+     >        udz( 5, 5, ldmx, ldmy )
+      integer ist, iend
+      integer jst, jend
+      integer nx0, ny0
+      integer ipt, jpt
+
+      include 'timing.h'
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j, m
+      integer iex
+      double precision  tmp, tmp1
+      double precision  tmat(5,5)
+
+
+c---------------------------------------------------------------------
+c   receive data from south and east
+c---------------------------------------------------------------------
+      if (timeron) call timer_start(t_ucomm)
+      iex = 1
+      call exchange_1( v,k,iex )
+      if (timeron) call timer_stop(t_ucomm)
+
+      if (timeron) call timer_start(t_buts)
+      do j = jend, jst, -1
+         do i = iend, ist, -1
+            do m = 1, 5
+                  tv( m, i, j ) =
+     >      omega * (  udz( m, 1, i, j ) * v( 1, i, j, k+1 )
+     >               + udz( m, 2, i, j ) * v( 2, i, j, k+1 )
+     >               + udz( m, 3, i, j ) * v( 3, i, j, k+1 )
+     >               + udz( m, 4, i, j ) * v( 4, i, j, k+1 )
+     >               + udz( m, 5, i, j ) * v( 5, i, j, k+1 ) )
+            end do
+         end do
+      end do
+
+
+      do j = jend,jst,-1
+        do i = iend,ist,-1
+
+            do m = 1, 5
+                  tv( m, i, j ) = tv( m, i, j )
+     > + omega * ( udy( m, 1, i, j ) * v( 1, i, j+1, k )
+     >           + udx( m, 1, i, j ) * v( 1, i+1, j, k )
+     >           + udy( m, 2, i, j ) * v( 2, i, j+1, k )
+     >           + udx( m, 2, i, j ) * v( 2, i+1, j, k )
+     >           + udy( m, 3, i, j ) * v( 3, i, j+1, k )
+     >           + udx( m, 3, i, j ) * v( 3, i+1, j, k )
+     >           + udy( m, 4, i, j ) * v( 4, i, j+1, k )
+     >           + udx( m, 4, i, j ) * v( 4, i+1, j, k )
+     >           + udy( m, 5, i, j ) * v( 5, i, j+1, k )
+     >           + udx( m, 5, i, j ) * v( 5, i+1, j, k ) )
+            end do
+
+c---------------------------------------------------------------------
+c   diagonal block inversion
+c---------------------------------------------------------------------
+            do m = 1, 5
+               tmat( m, 1 ) = d( m, 1, i, j )
+               tmat( m, 2 ) = d( m, 2, i, j )
+               tmat( m, 3 ) = d( m, 3, i, j )
+               tmat( m, 4 ) = d( m, 4, i, j )
+               tmat( m, 5 ) = d( m, 5, i, j )
+            end do
+
+            tmp1 = 1.0d+00 / tmat( 1, 1 )
+            tmp = tmp1 * tmat( 2, 1 )
+            tmat( 2, 2 ) =  tmat( 2, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 2, 3 ) =  tmat( 2, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 2, 4 ) =  tmat( 2, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 2, 5 ) =  tmat( 2, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 2, i, j ) = tv( 2, i, j )
+     >        - tv( 1, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 3, 1 )
+            tmat( 3, 2 ) =  tmat( 3, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 3, 3 ) =  tmat( 3, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 3, 4 ) =  tmat( 3, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 3, 5 ) =  tmat( 3, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 3, i, j ) = tv( 3, i, j )
+     >        - tv( 1, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 4, 1 )
+            tmat( 4, 2 ) =  tmat( 4, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 4, 3 ) =  tmat( 4, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 4, i, j ) = tv( 4, i, j )
+     >        - tv( 1, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 5, 1 )
+            tmat( 5, 2 ) =  tmat( 5, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 5, 3 ) =  tmat( 5, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 5, i, j ) = tv( 5, i, j )
+     >        - tv( 1, i, j ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 2, 2 )
+            tmp = tmp1 * tmat( 3, 2 )
+            tmat( 3, 3 ) =  tmat( 3, 3 )
+     >           - tmp * tmat( 2, 3 )
+            tmat( 3, 4 ) =  tmat( 3, 4 )
+     >           - tmp * tmat( 2, 4 )
+            tmat( 3, 5 ) =  tmat( 3, 5 )
+     >           - tmp * tmat( 2, 5 )
+            tv( 3, i, j ) = tv( 3, i, j )
+     >        - tv( 2, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 4, 2 )
+            tmat( 4, 3 ) =  tmat( 4, 3 )
+     >           - tmp * tmat( 2, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )
+     >           - tmp * tmat( 2, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )
+     >           - tmp * tmat( 2, 5 )
+            tv( 4, i, j ) = tv( 4, i, j )
+     >        - tv( 2, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 5, 2 )
+            tmat( 5, 3 ) =  tmat( 5, 3 )
+     >           - tmp * tmat( 2, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )
+     >           - tmp * tmat( 2, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 2, 5 )
+            tv( 5, i, j ) = tv( 5, i, j )
+     >        - tv( 2, i, j ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 3, 3 )
+            tmp = tmp1 * tmat( 4, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )
+     >           - tmp * tmat( 3, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )
+     >           - tmp * tmat( 3, 5 )
+            tv( 4, i, j ) = tv( 4, i, j )
+     >        - tv( 3, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 5, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )
+     >           - tmp * tmat( 3, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 3, 5 )
+            tv( 5, i, j ) = tv( 5, i, j )
+     >        - tv( 3, i, j ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 4, 4 )
+            tmp = tmp1 * tmat( 5, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 4, 5 )
+            tv( 5, i, j ) = tv( 5, i, j )
+     >        - tv( 4, i, j ) * tmp
+
+c---------------------------------------------------------------------
+c   back substitution
+c---------------------------------------------------------------------
+            tv( 5, i, j ) = tv( 5, i, j )
+     >                      / tmat( 5, 5 )
+
+            tv( 4, i, j ) = tv( 4, i, j )
+     >           - tmat( 4, 5 ) * tv( 5, i, j )
+            tv( 4, i, j ) = tv( 4, i, j )
+     >                      / tmat( 4, 4 )
+
+            tv( 3, i, j ) = tv( 3, i, j )
+     >           - tmat( 3, 4 ) * tv( 4, i, j )
+     >           - tmat( 3, 5 ) * tv( 5, i, j )
+            tv( 3, i, j ) = tv( 3, i, j )
+     >                      / tmat( 3, 3 )
+
+            tv( 2, i, j ) = tv( 2, i, j )
+     >           - tmat( 2, 3 ) * tv( 3, i, j )
+     >           - tmat( 2, 4 ) * tv( 4, i, j )
+     >           - tmat( 2, 5 ) * tv( 5, i, j )
+            tv( 2, i, j ) = tv( 2, i, j )
+     >                      / tmat( 2, 2 )
+
+            tv( 1, i, j ) = tv( 1, i, j )
+     >           - tmat( 1, 2 ) * tv( 2, i, j )
+     >           - tmat( 1, 3 ) * tv( 3, i, j )
+     >           - tmat( 1, 4 ) * tv( 4, i, j )
+     >           - tmat( 1, 5 ) * tv( 5, i, j )
+            tv( 1, i, j ) = tv( 1, i, j )
+     >                      / tmat( 1, 1 )
+
+            v( 1, i, j, k ) = v( 1, i, j, k ) - tv( 1, i, j )
+            v( 2, i, j, k ) = v( 2, i, j, k ) - tv( 2, i, j )
+            v( 3, i, j, k ) = v( 3, i, j, k ) - tv( 3, i, j )
+            v( 4, i, j, k ) = v( 4, i, j, k ) - tv( 4, i, j )
+            v( 5, i, j, k ) = v( 5, i, j, k ) - tv( 5, i, j )
+
+
+        enddo
+      end do
+      if (timeron) call timer_stop(t_buts)
+
+c---------------------------------------------------------------------
+c   send data to north and west
+c---------------------------------------------------------------------
+      if (timeron) call timer_start(t_ucomm)
+      iex = 3
+      call exchange_1( v,k,iex )
+      if (timeron) call timer_stop(t_ucomm)
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/buts_vec.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/buts_vec.f
new file mode 100644
index 0000000..c7571a4
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/buts_vec.f
@@ -0,0 +1,340 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine buts( ldmx, ldmy, ldmz,
+     >                 nx, ny, nz, k,
+     >                 omega,
+     >                 v, tv,
+     >                 d, udx, udy, udz,
+     >                 ist, iend, jst, jend,
+     >                 nx0, ny0, ipt, jpt )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c
+c   compute the regular-sparse, block upper triangular solution:
+c
+c                     v <-- ( U-inv ) * v
+c
+c---------------------------------------------------------------------
+
+      implicit none
+
+c---------------------------------------------------------------------
+c  input parameters
+c---------------------------------------------------------------------
+      integer ldmx, ldmy, ldmz
+      integer nx, ny, nz
+      integer k
+      double precision  omega
+      double precision  v( 5, -1:ldmx+2, -1:ldmy+2, *),
+     >        tv(5, ldmx, ldmy),
+     >        d( 5, 5, ldmx, ldmy),
+     >        udx( 5, 5, ldmx, ldmy),
+     >        udy( 5, 5, ldmx, ldmy),
+     >        udz( 5, 5, ldmx, ldmy )
+      integer ist, iend
+      integer jst, jend
+      integer nx0, ny0
+      integer ipt, jpt
+
+      include 'timing.h'
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j, m, l, istp, iendp
+      integer iex
+      double precision  tmp, tmp1
+      double precision  tmat(5,5)
+
+
+c---------------------------------------------------------------------
+c   receive data from south and east
+c---------------------------------------------------------------------
+      if (timeron) call timer_start(t_ucomm)
+      iex = 1
+      call exchange_1( v,k,iex )
+      if (timeron) call timer_stop(t_ucomm)
+
+      if (timeron) call timer_start(t_buts)
+      do j = jend, jst, -1
+         do i = iend, ist, -1
+            do m = 1, 5
+                  tv( m, i, j ) =
+     >      omega * (  udz( m, 1, i, j ) * v( 1, i, j, k+1 )
+     >               + udz( m, 2, i, j ) * v( 2, i, j, k+1 )
+     >               + udz( m, 3, i, j ) * v( 3, i, j, k+1 )
+     >               + udz( m, 4, i, j ) * v( 4, i, j, k+1 )
+     >               + udz( m, 5, i, j ) * v( 5, i, j, k+1 ) )
+            end do
+         end do
+      end do
+
+
+      do l = iend+jend, ist+jst, -1
+         istp  = max(l - jend, ist)
+         iendp = min(l - jst, iend)
+
+!dir$ ivdep
+         do i = istp, iendp
+            j = l - i
+
+!!dir$ unroll 5
+!   manually unroll the loop
+!            do m = 1, 5
+                  tv( 1, i, j ) = tv( 1, i, j )
+     > + omega * ( udy( 1, 1, i, j ) * v( 1, i, j+1, k )
+     >           + udx( 1, 1, i, j ) * v( 1, i+1, j, k )
+     >           + udy( 1, 2, i, j ) * v( 2, i, j+1, k )
+     >           + udx( 1, 2, i, j ) * v( 2, i+1, j, k )
+     >           + udy( 1, 3, i, j ) * v( 3, i, j+1, k )
+     >           + udx( 1, 3, i, j ) * v( 3, i+1, j, k )
+     >           + udy( 1, 4, i, j ) * v( 4, i, j+1, k )
+     >           + udx( 1, 4, i, j ) * v( 4, i+1, j, k )
+     >           + udy( 1, 5, i, j ) * v( 5, i, j+1, k )
+     >           + udx( 1, 5, i, j ) * v( 5, i+1, j, k ) )
+                  tv( 2, i, j ) = tv( 2, i, j )
+     > + omega * ( udy( 2, 1, i, j ) * v( 1, i, j+1, k )
+     >           + udx( 2, 1, i, j ) * v( 1, i+1, j, k )
+     >           + udy( 2, 2, i, j ) * v( 2, i, j+1, k )
+     >           + udx( 2, 2, i, j ) * v( 2, i+1, j, k )
+     >           + udy( 2, 3, i, j ) * v( 3, i, j+1, k )
+     >           + udx( 2, 3, i, j ) * v( 3, i+1, j, k )
+     >           + udy( 2, 4, i, j ) * v( 4, i, j+1, k )
+     >           + udx( 2, 4, i, j ) * v( 4, i+1, j, k )
+     >           + udy( 2, 5, i, j ) * v( 5, i, j+1, k )
+     >           + udx( 2, 5, i, j ) * v( 5, i+1, j, k ) )
+                  tv( 3, i, j ) = tv( 3, i, j )
+     > + omega * ( udy( 3, 1, i, j ) * v( 1, i, j+1, k )
+     >           + udx( 3, 1, i, j ) * v( 1, i+1, j, k )
+     >           + udy( 3, 2, i, j ) * v( 2, i, j+1, k )
+     >           + udx( 3, 2, i, j ) * v( 2, i+1, j, k )
+     >           + udy( 3, 3, i, j ) * v( 3, i, j+1, k )
+     >           + udx( 3, 3, i, j ) * v( 3, i+1, j, k )
+     >           + udy( 3, 4, i, j ) * v( 4, i, j+1, k )
+     >           + udx( 3, 4, i, j ) * v( 4, i+1, j, k )
+     >           + udy( 3, 5, i, j ) * v( 5, i, j+1, k )
+     >           + udx( 3, 5, i, j ) * v( 5, i+1, j, k ) )
+                  tv( 4, i, j ) = tv( 4, i, j )
+     > + omega * ( udy( 4, 1, i, j ) * v( 1, i, j+1, k )
+     >           + udx( 4, 1, i, j ) * v( 1, i+1, j, k )
+     >           + udy( 4, 2, i, j ) * v( 2, i, j+1, k )
+     >           + udx( 4, 2, i, j ) * v( 2, i+1, j, k )
+     >           + udy( 4, 3, i, j ) * v( 3, i, j+1, k )
+     >           + udx( 4, 3, i, j ) * v( 3, i+1, j, k )
+     >           + udy( 4, 4, i, j ) * v( 4, i, j+1, k )
+     >           + udx( 4, 4, i, j ) * v( 4, i+1, j, k )
+     >           + udy( 4, 5, i, j ) * v( 5, i, j+1, k )
+     >           + udx( 4, 5, i, j ) * v( 5, i+1, j, k ) )
+                  tv( 5, i, j ) = tv( 5, i, j )
+     > + omega * ( udy( 5, 1, i, j ) * v( 1, i, j+1, k )
+     >           + udx( 5, 1, i, j ) * v( 1, i+1, j, k )
+     >           + udy( 5, 2, i, j ) * v( 2, i, j+1, k )
+     >           + udx( 5, 2, i, j ) * v( 2, i+1, j, k )
+     >           + udy( 5, 3, i, j ) * v( 3, i, j+1, k )
+     >           + udx( 5, 3, i, j ) * v( 3, i+1, j, k )
+     >           + udy( 5, 4, i, j ) * v( 4, i, j+1, k )
+     >           + udx( 5, 4, i, j ) * v( 4, i+1, j, k )
+     >           + udy( 5, 5, i, j ) * v( 5, i, j+1, k )
+     >           + udx( 5, 5, i, j ) * v( 5, i+1, j, k ) )
+!            end do
+
+c---------------------------------------------------------------------
+c   diagonal block inversion
+c---------------------------------------------------------------------
+!!dir$ unroll 5
+!   manually unroll the loop
+!            do m = 1, 5
+               tmat( 1, 1 ) = d( 1, 1, i, j )
+               tmat( 1, 2 ) = d( 1, 2, i, j )
+               tmat( 1, 3 ) = d( 1, 3, i, j )
+               tmat( 1, 4 ) = d( 1, 4, i, j )
+               tmat( 1, 5 ) = d( 1, 5, i, j )
+               tmat( 2, 1 ) = d( 2, 1, i, j )
+               tmat( 2, 2 ) = d( 2, 2, i, j )
+               tmat( 2, 3 ) = d( 2, 3, i, j )
+               tmat( 2, 4 ) = d( 2, 4, i, j )
+               tmat( 2, 5 ) = d( 2, 5, i, j )
+               tmat( 3, 1 ) = d( 3, 1, i, j )
+               tmat( 3, 2 ) = d( 3, 2, i, j )
+               tmat( 3, 3 ) = d( 3, 3, i, j )
+               tmat( 3, 4 ) = d( 3, 4, i, j )
+               tmat( 3, 5 ) = d( 3, 5, i, j )
+               tmat( 4, 1 ) = d( 4, 1, i, j )
+               tmat( 4, 2 ) = d( 4, 2, i, j )
+               tmat( 4, 3 ) = d( 4, 3, i, j )
+               tmat( 4, 4 ) = d( 4, 4, i, j )
+               tmat( 4, 5 ) = d( 4, 5, i, j )
+               tmat( 5, 1 ) = d( 5, 1, i, j )
+               tmat( 5, 2 ) = d( 5, 2, i, j )
+               tmat( 5, 3 ) = d( 5, 3, i, j )
+               tmat( 5, 4 ) = d( 5, 4, i, j )
+               tmat( 5, 5 ) = d( 5, 5, i, j )
+!            end do
+
+            tmp1 = 1.0d+00 / tmat( 1, 1 )
+            tmp = tmp1 * tmat( 2, 1 )
+            tmat( 2, 2 ) =  tmat( 2, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 2, 3 ) =  tmat( 2, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 2, 4 ) =  tmat( 2, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 2, 5 ) =  tmat( 2, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 2, i, j ) = tv( 2, i, j )
+     >        - tv( 1, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 3, 1 )
+            tmat( 3, 2 ) =  tmat( 3, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 3, 3 ) =  tmat( 3, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 3, 4 ) =  tmat( 3, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 3, 5 ) =  tmat( 3, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 3, i, j ) = tv( 3, i, j )
+     >        - tv( 1, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 4, 1 )
+            tmat( 4, 2 ) =  tmat( 4, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 4, 3 ) =  tmat( 4, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 4, i, j ) = tv( 4, i, j )
+     >        - tv( 1, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 5, 1 )
+            tmat( 5, 2 ) =  tmat( 5, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 5, 3 ) =  tmat( 5, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 5, i, j ) = tv( 5, i, j )
+     >        - tv( 1, i, j ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 2, 2 )
+            tmp = tmp1 * tmat( 3, 2 )
+            tmat( 3, 3 ) =  tmat( 3, 3 )
+     >           - tmp * tmat( 2, 3 )
+            tmat( 3, 4 ) =  tmat( 3, 4 )
+     >           - tmp * tmat( 2, 4 )
+            tmat( 3, 5 ) =  tmat( 3, 5 )
+     >           - tmp * tmat( 2, 5 )
+            tv( 3, i, j ) = tv( 3, i, j )
+     >        - tv( 2, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 4, 2 )
+            tmat( 4, 3 ) =  tmat( 4, 3 )
+     >           - tmp * tmat( 2, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )
+     >           - tmp * tmat( 2, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )
+     >           - tmp * tmat( 2, 5 )
+            tv( 4, i, j ) = tv( 4, i, j )
+     >        - tv( 2, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 5, 2 )
+            tmat( 5, 3 ) =  tmat( 5, 3 )
+     >           - tmp * tmat( 2, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )
+     >           - tmp * tmat( 2, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 2, 5 )
+            tv( 5, i, j ) = tv( 5, i, j )
+     >        - tv( 2, i, j ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 3, 3 )
+            tmp = tmp1 * tmat( 4, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )
+     >           - tmp * tmat( 3, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )
+     >           - tmp * tmat( 3, 5 )
+            tv( 4, i, j ) = tv( 4, i, j )
+     >        - tv( 3, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 5, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )
+     >           - tmp * tmat( 3, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 3, 5 )
+            tv( 5, i, j ) = tv( 5, i, j )
+     >        - tv( 3, i, j ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 4, 4 )
+            tmp = tmp1 * tmat( 5, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 4, 5 )
+            tv( 5, i, j ) = tv( 5, i, j )
+     >        - tv( 4, i, j ) * tmp
+
+c---------------------------------------------------------------------
+c   back substitution
+c---------------------------------------------------------------------
+            tv( 5, i, j ) = tv( 5, i, j )
+     >                      / tmat( 5, 5 )
+
+            tv( 4, i, j ) = tv( 4, i, j )
+     >           - tmat( 4, 5 ) * tv( 5, i, j )
+            tv( 4, i, j ) = tv( 4, i, j )
+     >                      / tmat( 4, 4 )
+
+            tv( 3, i, j ) = tv( 3, i, j )
+     >           - tmat( 3, 4 ) * tv( 4, i, j )
+     >           - tmat( 3, 5 ) * tv( 5, i, j )
+            tv( 3, i, j ) = tv( 3, i, j )
+     >                      / tmat( 3, 3 )
+
+            tv( 2, i, j ) = tv( 2, i, j )
+     >           - tmat( 2, 3 ) * tv( 3, i, j )
+     >           - tmat( 2, 4 ) * tv( 4, i, j )
+     >           - tmat( 2, 5 ) * tv( 5, i, j )
+            tv( 2, i, j ) = tv( 2, i, j )
+     >                      / tmat( 2, 2 )
+
+            tv( 1, i, j ) = tv( 1, i, j )
+     >           - tmat( 1, 2 ) * tv( 2, i, j )
+     >           - tmat( 1, 3 ) * tv( 3, i, j )
+     >           - tmat( 1, 4 ) * tv( 4, i, j )
+     >           - tmat( 1, 5 ) * tv( 5, i, j )
+            tv( 1, i, j ) = tv( 1, i, j )
+     >                      / tmat( 1, 1 )
+
+            v( 1, i, j, k ) = v( 1, i, j, k ) - tv( 1, i, j )
+            v( 2, i, j, k ) = v( 2, i, j, k ) - tv( 2, i, j )
+            v( 3, i, j, k ) = v( 3, i, j, k ) - tv( 3, i, j )
+            v( 4, i, j, k ) = v( 4, i, j, k ) - tv( 4, i, j )
+            v( 5, i, j, k ) = v( 5, i, j, k ) - tv( 5, i, j )
+
+
+        enddo
+      end do
+      if (timeron) call timer_stop(t_buts)
+
+c---------------------------------------------------------------------
+c   send data to north and west
+c---------------------------------------------------------------------
+      if (timeron) call timer_start(t_ucomm)
+      iex = 3
+      call exchange_1( v,k,iex )
+      if (timeron) call timer_stop(t_ucomm)
+ 
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/erhs.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/erhs.f
new file mode 100644
index 0000000..928e2a9
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/erhs.f
@@ -0,0 +1,536 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine erhs
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c
+c   compute the right hand side based on exact solution
+c
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j, k, m
+      integer iglob, jglob
+      integer iex
+      integer L1, L2
+      integer ist1, iend1
+      integer jst1, jend1
+      double precision  dsspm
+      double precision  xi, eta, zeta
+      double precision  q
+      double precision  u21, u31, u41
+      double precision  tmp
+      double precision  u21i, u31i, u41i, u51i
+      double precision  u21j, u31j, u41j, u51j
+      double precision  u21k, u31k, u41k, u51k
+      double precision  u21im1, u31im1, u41im1, u51im1
+      double precision  u21jm1, u31jm1, u41jm1, u51jm1
+      double precision  u21km1, u31km1, u41km1, u51km1
+
+      dsspm = dssp
+
+
+      do k = 1, nz
+         do j = 1, ny
+            do i = 1, nx
+               do m = 1, 5
+                  frct( m, i, j, k ) = 0.0d+00
+               end do
+            end do
+         end do
+      end do
+
+      do k = 1, nz
+         zeta = ( dble(k-1) ) / ( nz - 1 )
+         do j = 1, ny
+            jglob = jpt + j
+            eta = ( dble(jglob-1) ) / ( ny0 - 1 )
+            do i = 1, nx
+               iglob = ipt + i
+               xi = ( dble(iglob-1) ) / ( nx0 - 1 )
+               do m = 1, 5
+                  rsd(m,i,j,k) =  ce(m,1)
+     >                 + ce(m,2) * xi
+     >                 + ce(m,3) * eta
+     >                 + ce(m,4) * zeta
+     >                 + ce(m,5) * xi * xi
+     >                 + ce(m,6) * eta * eta
+     >                 + ce(m,7) * zeta * zeta
+     >                 + ce(m,8) * xi * xi * xi
+     >                 + ce(m,9) * eta * eta * eta
+     >                 + ce(m,10) * zeta * zeta * zeta
+     >                 + ce(m,11) * xi * xi * xi * xi
+     >                 + ce(m,12) * eta * eta * eta * eta
+     >                 + ce(m,13) * zeta * zeta * zeta * zeta
+               end do
+            end do
+         end do
+      end do
+
+c---------------------------------------------------------------------
+c   xi-direction flux differences
+c---------------------------------------------------------------------
+c
+c   iex = flag : iex = 0  north/south communication
+c              : iex = 1  east/west communication
+c
+c---------------------------------------------------------------------
+      iex   = 0
+
+c---------------------------------------------------------------------
+c   communicate and receive/send two rows of data
+c---------------------------------------------------------------------
+      call exchange_3 (rsd,iex)
+
+      L1 = 0
+      if (north.eq.-1) L1 = 1
+      L2 = nx + 1
+      if (south.eq.-1) L2 = nx
+
+      ist1 = 1
+      iend1 = nx
+      if (north.eq.-1) ist1 = 4
+      if (south.eq.-1) iend1 = nx - 3
+
+      do k = 2, nz - 1
+         do j = jst, jend
+            do i = L1, L2
+               flux(1,i,j,k) = rsd(2,i,j,k)
+               u21 = rsd(2,i,j,k) / rsd(1,i,j,k)
+               q = 0.50d+00 * (  rsd(2,i,j,k) * rsd(2,i,j,k)
+     >                         + rsd(3,i,j,k) * rsd(3,i,j,k)
+     >                         + rsd(4,i,j,k) * rsd(4,i,j,k) )
+     >                      / rsd(1,i,j,k)
+               flux(2,i,j,k) = rsd(2,i,j,k) * u21 + c2 * 
+     >                         ( rsd(5,i,j,k) - q )
+               flux(3,i,j,k) = rsd(3,i,j,k) * u21
+               flux(4,i,j,k) = rsd(4,i,j,k) * u21
+               flux(5,i,j,k) = ( c1 * rsd(5,i,j,k) - c2 * q ) * u21
+            end do
+         end do
+      end do 
+
+      do k = 2, nz - 1
+         do j = jst, jend
+            do i = ist, iend
+               do m = 1, 5
+                  frct(m,i,j,k) =  frct(m,i,j,k)
+     >                   - tx2 * ( flux(m,i+1,j,k) - flux(m,i-1,j,k) )
+               end do
+            end do
+            do i = ist, L2
+               tmp = 1.0d+00 / rsd(1,i,j,k)
+
+               u21i = tmp * rsd(2,i,j,k)
+               u31i = tmp * rsd(3,i,j,k)
+               u41i = tmp * rsd(4,i,j,k)
+               u51i = tmp * rsd(5,i,j,k)
+
+               tmp = 1.0d+00 / rsd(1,i-1,j,k)
+
+               u21im1 = tmp * rsd(2,i-1,j,k)
+               u31im1 = tmp * rsd(3,i-1,j,k)
+               u41im1 = tmp * rsd(4,i-1,j,k)
+               u51im1 = tmp * rsd(5,i-1,j,k)
+
+               flux(2,i,j,k) = (4.0d+00/3.0d+00) * tx3 * 
+     >                        ( u21i - u21im1 )
+               flux(3,i,j,k) = tx3 * ( u31i - u31im1 )
+               flux(4,i,j,k) = tx3 * ( u41i - u41im1 )
+               flux(5,i,j,k) = 0.50d+00 * ( 1.0d+00 - c1*c5 )
+     >              * tx3 * ( ( u21i  **2 + u31i  **2 + u41i  **2 )
+     >                      - ( u21im1**2 + u31im1**2 + u41im1**2 ) )
+     >              + (1.0d+00/6.0d+00)
+     >              * tx3 * ( u21i**2 - u21im1**2 )
+     >              + c1 * c5 * tx3 * ( u51i - u51im1 )
+            end do
+
+            do i = ist, iend
+               frct(1,i,j,k) = frct(1,i,j,k)
+     >              + dx1 * tx1 * (            rsd(1,i-1,j,k)
+     >                             - 2.0d+00 * rsd(1,i,j,k)
+     >                             +           rsd(1,i+1,j,k) )
+               frct(2,i,j,k) = frct(2,i,j,k)
+     >           + tx3 * c3 * c4 * ( flux(2,i+1,j,k) - flux(2,i,j,k) )
+     >              + dx2 * tx1 * (            rsd(2,i-1,j,k)
+     >                             - 2.0d+00 * rsd(2,i,j,k)
+     >                             +           rsd(2,i+1,j,k) )
+               frct(3,i,j,k) = frct(3,i,j,k)
+     >           + tx3 * c3 * c4 * ( flux(3,i+1,j,k) - flux(3,i,j,k) )
+     >              + dx3 * tx1 * (            rsd(3,i-1,j,k)
+     >                             - 2.0d+00 * rsd(3,i,j,k)
+     >                             +           rsd(3,i+1,j,k) )
+               frct(4,i,j,k) = frct(4,i,j,k)
+     >            + tx3 * c3 * c4 * ( flux(4,i+1,j,k) - flux(4,i,j,k) )
+     >              + dx4 * tx1 * (            rsd(4,i-1,j,k)
+     >                             - 2.0d+00 * rsd(4,i,j,k)
+     >                             +           rsd(4,i+1,j,k) )
+               frct(5,i,j,k) = frct(5,i,j,k)
+     >           + tx3 * c3 * c4 * ( flux(5,i+1,j,k) - flux(5,i,j,k) )
+     >              + dx5 * tx1 * (            rsd(5,i-1,j,k)
+     >                             - 2.0d+00 * rsd(5,i,j,k)
+     >                             +           rsd(5,i+1,j,k) )
+            end do
+
+c---------------------------------------------------------------------
+c   fourth-order dissipation
+c---------------------------------------------------------------------
+            IF (north.eq.-1) then
+             do m = 1, 5
+               frct(m,2,j,k) = frct(m,2,j,k)
+     >           - dsspm * ( + 5.0d+00 * rsd(m,2,j,k)
+     >                       - 4.0d+00 * rsd(m,3,j,k)
+     >                       +           rsd(m,4,j,k) )
+               frct(m,3,j,k) = frct(m,3,j,k)
+     >           - dsspm * ( - 4.0d+00 * rsd(m,2,j,k)
+     >                       + 6.0d+00 * rsd(m,3,j,k)
+     >                       - 4.0d+00 * rsd(m,4,j,k)
+     >                       +           rsd(m,5,j,k) )
+             end do
+            END IF
+
+            do i = ist1,iend1
+               do m = 1, 5
+                  frct(m,i,j,k) = frct(m,i,j,k)
+     >              - dsspm * (            rsd(m,i-2,j,k)
+     >                         - 4.0d+00 * rsd(m,i-1,j,k)
+     >                         + 6.0d+00 * rsd(m,i,j,k)
+     >                         - 4.0d+00 * rsd(m,i+1,j,k)
+     >                         +           rsd(m,i+2,j,k) )
+               end do
+            end do
+
+            IF (south.eq.-1) then
+             do m = 1, 5
+               frct(m,nx-2,j,k) = frct(m,nx-2,j,k)
+     >           - dsspm * (             rsd(m,nx-4,j,k)
+     >                       - 4.0d+00 * rsd(m,nx-3,j,k)
+     >                       + 6.0d+00 * rsd(m,nx-2,j,k)
+     >                       - 4.0d+00 * rsd(m,nx-1,j,k)  )
+               frct(m,nx-1,j,k) = frct(m,nx-1,j,k)
+     >           - dsspm * (             rsd(m,nx-3,j,k)
+     >                       - 4.0d+00 * rsd(m,nx-2,j,k)
+     >                       + 5.0d+00 * rsd(m,nx-1,j,k) )
+             end do
+            END IF
+
+         end do
+      end do
+
+c---------------------------------------------------------------------
+c   eta-direction flux differences
+c---------------------------------------------------------------------
+c
+c   iex = flag : iex = 0  north/south communication
+c              : iex = 1  east/west communication
+c
+c---------------------------------------------------------------------
+      iex   = 1
+
+c---------------------------------------------------------------------
+c   communicate and receive/send two rows of data
+c---------------------------------------------------------------------
+      call exchange_3 (rsd,iex)
+
+      L1 = 0
+      if (west.eq.-1) L1 = 1
+      L2 = ny + 1
+      if (east.eq.-1) L2 = ny
+
+      jst1 = 1
+      jend1 = ny
+      if (west.eq.-1) jst1 = 4
+      if (east.eq.-1) jend1 = ny - 3
+
+      do k = 2, nz - 1
+         do j = L1, L2
+            do i = ist, iend
+               flux(1,i,j,k) = rsd(3,i,j,k)
+               u31 = rsd(3,i,j,k) / rsd(1,i,j,k)
+               q = 0.50d+00 * (  rsd(2,i,j,k) * rsd(2,i,j,k)
+     >                         + rsd(3,i,j,k) * rsd(3,i,j,k)
+     >                         + rsd(4,i,j,k) * rsd(4,i,j,k) )
+     >                      / rsd(1,i,j,k)
+               flux(2,i,j,k) = rsd(2,i,j,k) * u31 
+               flux(3,i,j,k) = rsd(3,i,j,k) * u31 + c2 * 
+     >                       ( rsd(5,i,j,k) - q )
+               flux(4,i,j,k) = rsd(4,i,j,k) * u31
+               flux(5,i,j,k) = ( c1 * rsd(5,i,j,k) - c2 * q ) * u31
+            end do
+         end do
+      end do
+
+      do k = 2, nz - 1
+         do i = ist, iend
+            do j = jst, jend
+               do m = 1, 5
+                  frct(m,i,j,k) =  frct(m,i,j,k)
+     >                 - ty2 * ( flux(m,i,j+1,k) - flux(m,i,j-1,k) )
+               end do
+            end do
+         end do
+
+         do j = jst, L2
+            do i = ist, iend
+               tmp = 1.0d+00 / rsd(1,i,j,k)
+
+               u21j = tmp * rsd(2,i,j,k)
+               u31j = tmp * rsd(3,i,j,k)
+               u41j = tmp * rsd(4,i,j,k)
+               u51j = tmp * rsd(5,i,j,k)
+
+               tmp = 1.0d+00 / rsd(1,i,j-1,k)
+
+               u21jm1 = tmp * rsd(2,i,j-1,k)
+               u31jm1 = tmp * rsd(3,i,j-1,k)
+               u41jm1 = tmp * rsd(4,i,j-1,k)
+               u51jm1 = tmp * rsd(5,i,j-1,k)
+
+               flux(2,i,j,k) = ty3 * ( u21j - u21jm1 )
+               flux(3,i,j,k) = (4.0d+00/3.0d+00) * ty3 * 
+     >                       ( u31j - u31jm1 )
+               flux(4,i,j,k) = ty3 * ( u41j - u41jm1 )
+               flux(5,i,j,k) = 0.50d+00 * ( 1.0d+00 - c1*c5 )
+     >              * ty3 * ( ( u21j  **2 + u31j  **2 + u41j  **2 )
+     >                      - ( u21jm1**2 + u31jm1**2 + u41jm1**2 ) )
+     >              + (1.0d+00/6.0d+00)
+     >              * ty3 * ( u31j**2 - u31jm1**2 )
+     >              + c1 * c5 * ty3 * ( u51j - u51jm1 )
+            end do
+         end do
+
+         do j = jst, jend
+            do i = ist, iend
+               frct(1,i,j,k) = frct(1,i,j,k)
+     >              + dy1 * ty1 * (            rsd(1,i,j-1,k)
+     >                             - 2.0d+00 * rsd(1,i,j,k)
+     >                             +           rsd(1,i,j+1,k) )
+               frct(2,i,j,k) = frct(2,i,j,k)
+     >          + ty3 * c3 * c4 * ( flux(2,i,j+1,k) - flux(2,i,j,k) )
+     >              + dy2 * ty1 * (            rsd(2,i,j-1,k)
+     >                             - 2.0d+00 * rsd(2,i,j,k)
+     >                             +           rsd(2,i,j+1,k) )
+               frct(3,i,j,k) = frct(3,i,j,k)
+     >          + ty3 * c3 * c4 * ( flux(3,i,j+1,k) - flux(3,i,j,k) )
+     >              + dy3 * ty1 * (            rsd(3,i,j-1,k)
+     >                             - 2.0d+00 * rsd(3,i,j,k)
+     >                             +           rsd(3,i,j+1,k) )
+               frct(4,i,j,k) = frct(4,i,j,k)
+     >          + ty3 * c3 * c4 * ( flux(4,i,j+1,k) - flux(4,i,j,k) )
+     >              + dy4 * ty1 * (            rsd(4,i,j-1,k)
+     >                             - 2.0d+00 * rsd(4,i,j,k)
+     >                             +           rsd(4,i,j+1,k) )
+               frct(5,i,j,k) = frct(5,i,j,k)
+     >          + ty3 * c3 * c4 * ( flux(5,i,j+1,k) - flux(5,i,j,k) )
+     >              + dy5 * ty1 * (            rsd(5,i,j-1,k)
+     >                             - 2.0d+00 * rsd(5,i,j,k)
+     >                             +           rsd(5,i,j+1,k) )
+            end do
+         end do
+
+c---------------------------------------------------------------------
+c   fourth-order dissipation
+c---------------------------------------------------------------------
+         IF (west.eq.-1) then
+            do i = ist, iend
+             do m = 1, 5
+               frct(m,i,2,k) = frct(m,i,2,k)
+     >           - dsspm * ( + 5.0d+00 * rsd(m,i,2,k)
+     >                       - 4.0d+00 * rsd(m,i,3,k)
+     >                       +           rsd(m,i,4,k) )
+               frct(m,i,3,k) = frct(m,i,3,k)
+     >           - dsspm * ( - 4.0d+00 * rsd(m,i,2,k)
+     >                       + 6.0d+00 * rsd(m,i,3,k)
+     >                       - 4.0d+00 * rsd(m,i,4,k)
+     >                       +           rsd(m,i,5,k) )
+             end do
+            end do
+         END IF
+
+         do j = jst1, jend1
+            do i = ist, iend
+               do m = 1, 5
+                  frct(m,i,j,k) = frct(m,i,j,k)
+     >              - dsspm * (            rsd(m,i,j-2,k)
+     >                        - 4.0d+00 * rsd(m,i,j-1,k)
+     >                        + 6.0d+00 * rsd(m,i,j,k)
+     >                        - 4.0d+00 * rsd(m,i,j+1,k)
+     >                        +           rsd(m,i,j+2,k) )
+               end do
+            end do
+         end do
+
+         IF (east.eq.-1) then
+            do i = ist, iend
+             do m = 1, 5
+               frct(m,i,ny-2,k) = frct(m,i,ny-2,k)
+     >           - dsspm * (             rsd(m,i,ny-4,k)
+     >                       - 4.0d+00 * rsd(m,i,ny-3,k)
+     >                       + 6.0d+00 * rsd(m,i,ny-2,k)
+     >                       - 4.0d+00 * rsd(m,i,ny-1,k)  )
+               frct(m,i,ny-1,k) = frct(m,i,ny-1,k)
+     >           - dsspm * (             rsd(m,i,ny-3,k)
+     >                       - 4.0d+00 * rsd(m,i,ny-2,k)
+     >                       + 5.0d+00 * rsd(m,i,ny-1,k)  )
+             end do
+            end do
+         END IF
+
+      end do
+
+c---------------------------------------------------------------------
+c   zeta-direction flux differences
+c---------------------------------------------------------------------
+      do k = 1, nz
+         do j = jst, jend
+            do i = ist, iend
+               flux(1,i,j,k) = rsd(4,i,j,k)
+               u41 = rsd(4,i,j,k) / rsd(1,i,j,k)
+               q = 0.50d+00 * (  rsd(2,i,j,k) * rsd(2,i,j,k)
+     >                         + rsd(3,i,j,k) * rsd(3,i,j,k)
+     >                         + rsd(4,i,j,k) * rsd(4,i,j,k) )
+     >                      / rsd(1,i,j,k)
+               flux(2,i,j,k) = rsd(2,i,j,k) * u41 
+               flux(3,i,j,k) = rsd(3,i,j,k) * u41 
+               flux(4,i,j,k) = rsd(4,i,j,k) * u41 + c2 * 
+     >                         ( rsd(5,i,j,k) - q )
+               flux(5,i,j,k) = ( c1 * rsd(5,i,j,k) - c2 * q ) * u41
+            end do
+         end do
+      end do
+
+      do k = 2, nz - 1
+         do j = jst, jend
+            do i = ist, iend
+               do m = 1, 5
+                  frct(m,i,j,k) =  frct(m,i,j,k)
+     >                  - tz2 * ( flux(m,i,j,k+1) - flux(m,i,j,k-1) )
+               end do
+            end do
+         end do
+      end do
+
+      do k = 2, nz
+         do j = jst, jend
+            do i = ist, iend
+               tmp = 1.0d+00 / rsd(1,i,j,k)
+
+               u21k = tmp * rsd(2,i,j,k)
+               u31k = tmp * rsd(3,i,j,k)
+               u41k = tmp * rsd(4,i,j,k)
+               u51k = tmp * rsd(5,i,j,k)
+
+               tmp = 1.0d+00 / rsd(1,i,j,k-1)
+
+               u21km1 = tmp * rsd(2,i,j,k-1)
+               u31km1 = tmp * rsd(3,i,j,k-1)
+               u41km1 = tmp * rsd(4,i,j,k-1)
+               u51km1 = tmp * rsd(5,i,j,k-1)
+
+               flux(2,i,j,k) = tz3 * ( u21k - u21km1 )
+               flux(3,i,j,k) = tz3 * ( u31k - u31km1 )
+               flux(4,i,j,k) = (4.0d+00/3.0d+00) * tz3 * ( u41k 
+     >                       - u41km1 )
+               flux(5,i,j,k) = 0.50d+00 * ( 1.0d+00 - c1*c5 )
+     >              * tz3 * ( ( u21k  **2 + u31k  **2 + u41k  **2 )
+     >                      - ( u21km1**2 + u31km1**2 + u41km1**2 ) )
+     >              + (1.0d+00/6.0d+00)
+     >              * tz3 * ( u41k**2 - u41km1**2 )
+     >              + c1 * c5 * tz3 * ( u51k - u51km1 )
+            end do
+         end do
+      end do
+
+      do k = 2, nz - 1
+         do j = jst, jend
+            do i = ist, iend
+               frct(1,i,j,k) = frct(1,i,j,k)
+     >              + dz1 * tz1 * (            rsd(1,i,j,k+1)
+     >                             - 2.0d+00 * rsd(1,i,j,k)
+     >                             +           rsd(1,i,j,k-1) )
+               frct(2,i,j,k) = frct(2,i,j,k)
+     >          + tz3 * c3 * c4 * ( flux(2,i,j,k+1) - flux(2,i,j,k) )
+     >              + dz2 * tz1 * (            rsd(2,i,j,k+1)
+     >                             - 2.0d+00 * rsd(2,i,j,k)
+     >                             +           rsd(2,i,j,k-1) )
+               frct(3,i,j,k) = frct(3,i,j,k)
+     >          + tz3 * c3 * c4 * ( flux(3,i,j,k+1) - flux(3,i,j,k) )
+     >              + dz3 * tz1 * (            rsd(3,i,j,k+1)
+     >                             - 2.0d+00 * rsd(3,i,j,k)
+     >                             +           rsd(3,i,j,k-1) )
+               frct(4,i,j,k) = frct(4,i,j,k)
+     >          + tz3 * c3 * c4 * ( flux(4,i,j,k+1) - flux(4,i,j,k) )
+     >              + dz4 * tz1 * (            rsd(4,i,j,k+1)
+     >                             - 2.0d+00 * rsd(4,i,j,k)
+     >                             +           rsd(4,i,j,k-1) )
+               frct(5,i,j,k) = frct(5,i,j,k)
+     >          + tz3 * c3 * c4 * ( flux(5,i,j,k+1) - flux(5,i,j,k) )
+     >              + dz5 * tz1 * (            rsd(5,i,j,k+1)
+     >                             - 2.0d+00 * rsd(5,i,j,k)
+     >                             +           rsd(5,i,j,k-1) )
+            end do
+         end do
+      end do
+
+c---------------------------------------------------------------------
+c   fourth-order dissipation
+c---------------------------------------------------------------------
+      do j = jst, jend
+         do i = ist, iend
+            do m = 1, 5
+               frct(m,i,j,2) = frct(m,i,j,2)
+     >           - dsspm * ( + 5.0d+00 * rsd(m,i,j,2)
+     >                       - 4.0d+00 * rsd(m,i,j,3)
+     >                       +           rsd(m,i,j,4) )
+               frct(m,i,j,3) = frct(m,i,j,3)
+     >           - dsspm * (- 4.0d+00 * rsd(m,i,j,2)
+     >                      + 6.0d+00 * rsd(m,i,j,3)
+     >                      - 4.0d+00 * rsd(m,i,j,4)
+     >                      +           rsd(m,i,j,5) )
+            end do
+         end do
+      end do
+
+      do k = 4, nz - 3
+         do j = jst, jend
+            do i = ist, iend
+               do m = 1, 5
+                  frct(m,i,j,k) = frct(m,i,j,k)
+     >              - dsspm * (           rsd(m,i,j,k-2)
+     >                        - 4.0d+00 * rsd(m,i,j,k-1)
+     >                        + 6.0d+00 * rsd(m,i,j,k)
+     >                        - 4.0d+00 * rsd(m,i,j,k+1)
+     >                        +           rsd(m,i,j,k+2) )
+               end do
+            end do
+         end do
+      end do
+
+      do j = jst, jend
+         do i = ist, iend
+            do m = 1, 5
+               frct(m,i,j,nz-2) = frct(m,i,j,nz-2)
+     >           - dsspm * (            rsd(m,i,j,nz-4)
+     >                      - 4.0d+00 * rsd(m,i,j,nz-3)
+     >                      + 6.0d+00 * rsd(m,i,j,nz-2)
+     >                      - 4.0d+00 * rsd(m,i,j,nz-1)  )
+               frct(m,i,j,nz-1) = frct(m,i,j,nz-1)
+     >           - dsspm * (             rsd(m,i,j,nz-3)
+     >                       - 4.0d+00 * rsd(m,i,j,nz-2)
+     >                       + 5.0d+00 * rsd(m,i,j,nz-1)  )
+            end do
+         end do
+      end do
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/error.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/error.f
new file mode 100644
index 0000000..e83f749
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/error.f
@@ -0,0 +1,81 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine error
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c
+c   compute the solution error
+c
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'mpinpb.h'
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j, k, m
+      integer iglob, jglob
+      double precision  tmp
+      double precision  u000ijk(5), dummy(5)
+
+      integer IERROR
+
+
+      do m = 1, 5
+         errnm(m) = 0.0d+00
+         dummy(m) = 0.0d+00
+      end do
+
+      do k = 2, nz-1
+         do j = jst, jend
+            jglob = jpt + j
+            do i = ist, iend
+               iglob = ipt + i
+               call exact( iglob, jglob, k, u000ijk )
+               do m = 1, 5
+                  tmp = ( u000ijk(m) - u(m,i,j,k) )
+                  dummy(m) = dummy(m) + tmp ** 2
+               end do
+            end do
+         end do
+      end do
+
+c---------------------------------------------------------------------
+c   compute the global sum of the local sums of squared errors.
+c---------------------------------------------------------------------
+      call MPI_ALLREDUCE( dummy,
+     >                    errnm,
+     >                    5,
+     >                    dp_type,
+     >                    MPI_SUM,
+     >                    MPI_COMM_WORLD,
+     >                    IERROR )
+
+      do m = 1, 5
+         errnm(m) = sqrt ( errnm(m) / ( (nx0-2)*(ny0-2)*(nz0-2) ) )
+      end do
+
+c      if (id.eq.0) then
+c        write (*,1002) ( errnm(m), m = 1, 5 )
+c      end if
+
+ 1002 format (1x/1x,'RMS-norm of error in soln. to ',
+     > 'first pde  = ',1pe12.5/,
+     > 1x,'RMS-norm of error in soln. to ',
+     > 'second pde = ',1pe12.5/,
+     > 1x,'RMS-norm of error in soln. to ',
+     > 'third pde  = ',1pe12.5/,
+     > 1x,'RMS-norm of error in soln. to ',
+     > 'fourth pde = ',1pe12.5/,
+     > 1x,'RMS-norm of error in soln. to ',
+     > 'fifth pde  = ',1pe12.5)
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/exact.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/exact.f
new file mode 100644
index 0000000..19e14c3
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/exact.f
@@ -0,0 +1,53 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine exact( i, j, k, u000ijk )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c
+c   compute the exact solution at (i,j,k)
+c
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  input parameters
+c---------------------------------------------------------------------
+      integer i, j, k
+      double precision u000ijk(*)
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer m
+      double precision xi, eta, zeta
+
+      xi  = ( dble ( i - 1 ) ) / ( nx0 - 1 )
+      eta  = ( dble ( j - 1 ) ) / ( ny0 - 1 )
+      zeta = ( dble ( k - 1 ) ) / ( nz - 1 )
+
+
+      do m = 1, 5
+         u000ijk(m) =  ce(m,1)
+     >        + ce(m,2) * xi
+     >        + ce(m,3) * eta
+     >        + ce(m,4) * zeta
+     >        + ce(m,5) * xi * xi
+     >        + ce(m,6) * eta * eta
+     >        + ce(m,7) * zeta * zeta
+     >        + ce(m,8) * xi * xi * xi
+     >        + ce(m,9) * eta * eta * eta
+     >        + ce(m,10) * zeta * zeta * zeta
+     >        + ce(m,11) * xi * xi * xi * xi
+     >        + ce(m,12) * eta * eta * eta * eta
+     >        + ce(m,13) * zeta * zeta * zeta * zeta
+      end do
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/exchange_1.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/exchange_1.f
new file mode 100644
index 0000000..2bf7d28
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/exchange_1.f
@@ -0,0 +1,180 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine exchange_1( g,k,iex )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+
+      implicit none
+
+      include 'mpinpb.h'
+      include 'applu.incl'
+
+      double precision  g(5,-1:isiz1+2,-1:isiz2+2,isiz3)
+      integer k
+      integer iex
+      integer i, j
+      double precision dum(5,isiz1+isiz2), dum1(5,isiz1+isiz2)
+
+      integer STATUS(MPI_STATUS_SIZE)
+      integer IERROR
+
+
+
+      if( iex .eq. 0 ) then
+
+          if( north .ne. -1 ) then
+              call MPI_RECV( dum1(1,jst),
+     >                       5*(jend-jst+1),
+     >                       dp_type,
+     >                       north,
+     >                       from_n,
+     >                       MPI_COMM_WORLD,
+     >                       status,
+     >                       IERROR )
+              do j=jst,jend
+                  g(1,0,j,k) = dum1(1,j)
+                  g(2,0,j,k) = dum1(2,j)
+                  g(3,0,j,k) = dum1(3,j)
+                  g(4,0,j,k) = dum1(4,j)
+                  g(5,0,j,k) = dum1(5,j)
+              enddo
+          endif
+
+          if( west .ne. -1 ) then
+              call MPI_RECV( dum1(1,ist),
+     >                       5*(iend-ist+1),
+     >                       dp_type,
+     >                       west,
+     >                       from_w,
+     >                       MPI_COMM_WORLD,
+     >                       status,
+     >                       IERROR )
+              do i=ist,iend
+                  g(1,i,0,k) = dum1(1,i)
+                  g(2,i,0,k) = dum1(2,i)
+                  g(3,i,0,k) = dum1(3,i)
+                  g(4,i,0,k) = dum1(4,i)
+                  g(5,i,0,k) = dum1(5,i)
+              enddo
+          endif
+
+      else if( iex .eq. 1 ) then
+
+          if( south .ne. -1 ) then
+              call MPI_RECV( dum1(1,jst),
+     >                       5*(jend-jst+1),
+     >                       dp_type,
+     >                       south,
+     >                       from_s,
+     >                       MPI_COMM_WORLD,
+     >                       status,
+     >                       IERROR )
+              do j=jst,jend
+                  g(1,nx+1,j,k) = dum1(1,j)
+                  g(2,nx+1,j,k) = dum1(2,j)
+                  g(3,nx+1,j,k) = dum1(3,j)
+                  g(4,nx+1,j,k) = dum1(4,j)
+                  g(5,nx+1,j,k) = dum1(5,j)
+              enddo
+          endif
+
+          if( east .ne. -1 ) then
+              call MPI_RECV( dum1(1,ist),
+     >                       5*(iend-ist+1),
+     >                       dp_type,
+     >                       east,
+     >                       from_e,
+     >                       MPI_COMM_WORLD,
+     >                       status,
+     >                       IERROR )
+              do i=ist,iend
+                  g(1,i,ny+1,k) = dum1(1,i)
+                  g(2,i,ny+1,k) = dum1(2,i)
+                  g(3,i,ny+1,k) = dum1(3,i)
+                  g(4,i,ny+1,k) = dum1(4,i)
+                  g(5,i,ny+1,k) = dum1(5,i)
+              enddo
+          endif
+
+      else if( iex .eq. 2 ) then
+
+          if( south .ne. -1 ) then
+              do j=jst,jend
+                  dum(1,j) = g(1,nx,j,k) 
+                  dum(2,j) = g(2,nx,j,k) 
+                  dum(3,j) = g(3,nx,j,k) 
+                  dum(4,j) = g(4,nx,j,k) 
+                  dum(5,j) = g(5,nx,j,k) 
+              enddo
+              call MPI_SEND( dum(1,jst), 
+     >                       5*(jend-jst+1), 
+     >                       dp_type, 
+     >                       south, 
+     >                       from_n, 
+     >                       MPI_COMM_WORLD, 
+     >                       IERROR )
+          endif
+
+          if( east .ne. -1 ) then
+              do i=ist,iend
+                  dum(1,i) = g(1,i,ny,k)
+                  dum(2,i) = g(2,i,ny,k)
+                  dum(3,i) = g(3,i,ny,k)
+                  dum(4,i) = g(4,i,ny,k)
+                  dum(5,i) = g(5,i,ny,k)
+              enddo
+              call MPI_SEND( dum(1,ist), 
+     >                       5*(iend-ist+1), 
+     >                       dp_type, 
+     >                       east, 
+     >                       from_w, 
+     >                       MPI_COMM_WORLD, 
+     >                       IERROR )
+          endif
+
+      else
+
+          if( north .ne. -1 ) then
+              do j=jst,jend
+                  dum(1,j) = g(1,1,j,k)
+                  dum(2,j) = g(2,1,j,k)
+                  dum(3,j) = g(3,1,j,k)
+                  dum(4,j) = g(4,1,j,k)
+                  dum(5,j) = g(5,1,j,k)
+              enddo
+              call MPI_SEND( dum(1,jst), 
+     >                       5*(jend-jst+1), 
+     >                       dp_type, 
+     >                       north, 
+     >                       from_s, 
+     >                       MPI_COMM_WORLD, 
+     >                       IERROR )
+          endif
+
+          if( west .ne. -1 ) then
+              do i=ist,iend
+                  dum(1,i) = g(1,i,1,k)
+                  dum(2,i) = g(2,i,1,k)
+                  dum(3,i) = g(3,i,1,k)
+                  dum(4,i) = g(4,i,1,k)
+                  dum(5,i) = g(5,i,1,k)
+              enddo
+              call MPI_SEND( dum(1,ist), 
+     >                       5*(iend-ist+1), 
+     >                       dp_type, 
+     >                       west, 
+     >                       from_e, 
+     >                       MPI_COMM_WORLD, 
+     >                       IERROR )
+          endif
+
+      endif
+
+      end
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/exchange_3.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/exchange_3.f
new file mode 100644
index 0000000..ae050dd
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/exchange_3.f
@@ -0,0 +1,312 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine exchange_3(g,iex)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   exchange two ghost layers of g: north/south if iex=0, else east/west
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'mpinpb.h'
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  input parameters
+c---------------------------------------------------------------------
+      double precision  g(5,-1:isiz1+2,-1:isiz2+2,isiz3)
+      integer iex
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j, k
+      integer ipos1, ipos2
+
+      integer mid
+      integer STATUS(MPI_STATUS_SIZE)
+      integer IERROR
+
+
+
+      if (iex.eq.0) then
+c---------------------------------------------------------------------
+c   communicate in the south and north directions
+c---------------------------------------------------------------------
+      if (north.ne.-1) then
+          call MPI_IRECV( buf1,
+     >                    10*ny*nz,
+     >                    dp_type,
+     >                    north,
+     >                    from_n,
+     >                    MPI_COMM_WORLD,
+     >                    mid,
+     >                    IERROR )
+      end if
+
+c---------------------------------------------------------------------
+c   send south
+c---------------------------------------------------------------------
+      if (south.ne.-1) then
+          do k = 1,nz
+            do j = 1,ny
+              ipos1 = (k-1)*ny + j
+              ipos2 = ipos1 + ny*nz
+              buf(1,ipos1) = g(1,nx-1,j,k) 
+              buf(2,ipos1) = g(2,nx-1,j,k) 
+              buf(3,ipos1) = g(3,nx-1,j,k) 
+              buf(4,ipos1) = g(4,nx-1,j,k) 
+              buf(5,ipos1) = g(5,nx-1,j,k) 
+              buf(1,ipos2) = g(1,nx,j,k)
+              buf(2,ipos2) = g(2,nx,j,k)
+              buf(3,ipos2) = g(3,nx,j,k)
+              buf(4,ipos2) = g(4,nx,j,k)
+              buf(5,ipos2) = g(5,nx,j,k)
+            end do
+          end do
+
+          call MPI_SEND( buf,
+     >                   10*ny*nz,
+     >                   dp_type,
+     >                   south,
+     >                   from_n,
+     >                   MPI_COMM_WORLD,
+     >                   IERROR )
+        end if
+
+c---------------------------------------------------------------------
+c   receive from north
+c---------------------------------------------------------------------
+        if (north.ne.-1) then
+          call MPI_WAIT( mid, STATUS, IERROR )
+
+          do k = 1,nz
+            do j = 1,ny
+              ipos1 = (k-1)*ny + j
+              ipos2 = ipos1 + ny*nz
+              g(1,-1,j,k) = buf1(1,ipos1)
+              g(2,-1,j,k) = buf1(2,ipos1)
+              g(3,-1,j,k) = buf1(3,ipos1)
+              g(4,-1,j,k) = buf1(4,ipos1)
+              g(5,-1,j,k) = buf1(5,ipos1)
+              g(1,0,j,k) = buf1(1,ipos2)
+              g(2,0,j,k) = buf1(2,ipos2)
+              g(3,0,j,k) = buf1(3,ipos2)
+              g(4,0,j,k) = buf1(4,ipos2)
+              g(5,0,j,k) = buf1(5,ipos2)
+            end do
+          end do
+
+        end if
+
+      if (south.ne.-1) then
+          call MPI_IRECV( buf1,
+     >                    10*ny*nz,
+     >                    dp_type,
+     >                    south,
+     >                    from_s,
+     >                    MPI_COMM_WORLD,
+     >                    mid,
+     >                    IERROR )
+      end if
+
+c---------------------------------------------------------------------
+c   send north
+c---------------------------------------------------------------------
+        if (north.ne.-1) then
+          do k = 1,nz
+            do j = 1,ny
+              ipos1 = (k-1)*ny + j
+              ipos2 = ipos1 + ny*nz
+              buf(1,ipos1) = g(1,2,j,k)
+              buf(2,ipos1) = g(2,2,j,k)
+              buf(3,ipos1) = g(3,2,j,k)
+              buf(4,ipos1) = g(4,2,j,k)
+              buf(5,ipos1) = g(5,2,j,k)
+              buf(1,ipos2) = g(1,1,j,k)
+              buf(2,ipos2) = g(2,1,j,k)
+              buf(3,ipos2) = g(3,1,j,k)
+              buf(4,ipos2) = g(4,1,j,k)
+              buf(5,ipos2) = g(5,1,j,k)
+            end do
+          end do
+
+          call MPI_SEND( buf,
+     >                   10*ny*nz,
+     >                   dp_type,
+     >                   north,
+     >                   from_s,
+     >                   MPI_COMM_WORLD,
+     >                   IERROR )
+        end if
+
+c---------------------------------------------------------------------
+c   receive from south
+c---------------------------------------------------------------------
+        if (south.ne.-1) then
+          call MPI_WAIT( mid, STATUS, IERROR )
+
+          do k = 1,nz
+            do j = 1,ny
+              ipos1 = (k-1)*ny + j
+              ipos2 = ipos1 + ny*nz
+              g(1,nx+2,j,k)  = buf1(1,ipos1)
+              g(2,nx+2,j,k)  = buf1(2,ipos1)
+              g(3,nx+2,j,k)  = buf1(3,ipos1)
+              g(4,nx+2,j,k)  = buf1(4,ipos1)
+              g(5,nx+2,j,k)  = buf1(5,ipos1)
+              g(1,nx+1,j,k) = buf1(1,ipos2)
+              g(2,nx+1,j,k) = buf1(2,ipos2)
+              g(3,nx+1,j,k) = buf1(3,ipos2)
+              g(4,nx+1,j,k) = buf1(4,ipos2)
+              g(5,nx+1,j,k) = buf1(5,ipos2)
+            end do
+          end do
+        end if
+
+      else
+
+c---------------------------------------------------------------------
+c   communicate in the east and west directions
+c---------------------------------------------------------------------
+      if (west.ne.-1) then
+          call MPI_IRECV( buf1,
+     >                    10*nx*nz,
+     >                    dp_type,
+     >                    west,
+     >                    from_w,
+     >                    MPI_COMM_WORLD,
+     >                    mid,
+     >                    IERROR )
+      end if
+
+c---------------------------------------------------------------------
+c   send east
+c---------------------------------------------------------------------
+        if (east.ne.-1) then
+          do k = 1,nz
+            do i = 1,nx
+              ipos1 = (k-1)*nx + i
+              ipos2 = ipos1 + nx*nz
+              buf(1,ipos1) = g(1,i,ny-1,k)
+              buf(2,ipos1) = g(2,i,ny-1,k)
+              buf(3,ipos1) = g(3,i,ny-1,k)
+              buf(4,ipos1) = g(4,i,ny-1,k)
+              buf(5,ipos1) = g(5,i,ny-1,k)
+              buf(1,ipos2) = g(1,i,ny,k)
+              buf(2,ipos2) = g(2,i,ny,k)
+              buf(3,ipos2) = g(3,i,ny,k)
+              buf(4,ipos2) = g(4,i,ny,k)
+              buf(5,ipos2) = g(5,i,ny,k)
+            end do
+          end do
+
+          call MPI_SEND( buf,
+     >                   10*nx*nz,
+     >                   dp_type,
+     >                   east,
+     >                   from_w,
+     >                   MPI_COMM_WORLD,
+     >                   IERROR )
+        end if
+
+c---------------------------------------------------------------------
+c   receive from west
+c---------------------------------------------------------------------
+        if (west.ne.-1) then
+          call MPI_WAIT( mid, STATUS, IERROR )
+
+          do k = 1,nz
+            do i = 1,nx
+              ipos1 = (k-1)*nx + i
+              ipos2 = ipos1 + nx*nz
+              g(1,i,-1,k) = buf1(1,ipos1)
+              g(2,i,-1,k) = buf1(2,ipos1)
+              g(3,i,-1,k) = buf1(3,ipos1)
+              g(4,i,-1,k) = buf1(4,ipos1)
+              g(5,i,-1,k) = buf1(5,ipos1)
+              g(1,i,0,k) = buf1(1,ipos2)
+              g(2,i,0,k) = buf1(2,ipos2)
+              g(3,i,0,k) = buf1(3,ipos2)
+              g(4,i,0,k) = buf1(4,ipos2)
+              g(5,i,0,k) = buf1(5,ipos2)
+            end do
+          end do
+
+        end if
+
+      if (east.ne.-1) then
+          call MPI_IRECV( buf1,
+     >                    10*nx*nz,
+     >                    dp_type,
+     >                    east,
+     >                    from_e,
+     >                    MPI_COMM_WORLD,
+     >                    mid,
+     >                    IERROR )
+      end if
+
+c---------------------------------------------------------------------
+c   send west
+c---------------------------------------------------------------------
+      if (west.ne.-1) then
+          do k = 1,nz
+            do i = 1,nx
+              ipos1 = (k-1)*nx + i
+              ipos2 = ipos1 + nx*nz
+              buf(1,ipos1) = g(1,i,2,k)
+              buf(2,ipos1) = g(2,i,2,k)
+              buf(3,ipos1) = g(3,i,2,k)
+              buf(4,ipos1) = g(4,i,2,k)
+              buf(5,ipos1) = g(5,i,2,k)
+              buf(1,ipos2) = g(1,i,1,k)
+              buf(2,ipos2) = g(2,i,1,k)
+              buf(3,ipos2) = g(3,i,1,k)
+              buf(4,ipos2) = g(4,i,1,k)
+              buf(5,ipos2) = g(5,i,1,k)
+            end do
+          end do
+
+          call MPI_SEND( buf,
+     >                   10*nx*nz,
+     >                   dp_type,
+     >                   west,
+     >                   from_e,
+     >                   MPI_COMM_WORLD,
+     >                   IERROR )
+        end if
+
+c---------------------------------------------------------------------
+c   receive from east
+c---------------------------------------------------------------------
+        if (east.ne.-1) then
+          call MPI_WAIT( mid, STATUS, IERROR )
+
+          do k = 1,nz
+            do i = 1,nx
+              ipos1 = (k-1)*nx + i
+              ipos2 = ipos1 + nx*nz
+              g(1,i,ny+2,k)  = buf1(1,ipos1)
+              g(2,i,ny+2,k)  = buf1(2,ipos1)
+              g(3,i,ny+2,k)  = buf1(3,ipos1)
+              g(4,i,ny+2,k)  = buf1(4,ipos1)
+              g(5,i,ny+2,k)  = buf1(5,ipos1)
+              g(1,i,ny+1,k) = buf1(1,ipos2)
+              g(2,i,ny+1,k) = buf1(2,ipos2)
+              g(3,i,ny+1,k) = buf1(3,ipos2)
+              g(4,i,ny+1,k) = buf1(4,ipos2)
+              g(5,i,ny+1,k) = buf1(5,ipos2)
+            end do
+          end do
+
+        end if
+
+      end if
+
+      return
+      end     
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/exchange_4.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/exchange_4.f
new file mode 100644
index 0000000..d6dbb2e
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/exchange_4.f
@@ -0,0 +1,133 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine exchange_4(g,h,ibeg,ifin1,jbeg,jfin1)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   pass boundary values of g and h to the west and north neighbors
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'mpinpb.h'
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  input parameters
+c---------------------------------------------------------------------
+      double precision  g(0:isiz2+1,0:isiz3+1), 
+     >        h(0:isiz2+1,0:isiz3+1)
+      integer ibeg, ifin1
+      integer jbeg, jfin1
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j
+      integer ny2
+      double precision  dum(1024)
+
+      integer msgid1, msgid3
+      integer STATUS(MPI_STATUS_SIZE)
+      integer IERROR
+
+
+
+      ny2 = ny + 2
+
+c---------------------------------------------------------------------
+c   communicate in the east and west directions
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   receive from east
+c---------------------------------------------------------------------
+      if (jfin1.eq.ny) then
+        call MPI_IRECV( dum,
+     >                  2*nx,
+     >                  dp_type,
+     >                  east,
+     >                  from_e,
+     >                  MPI_COMM_WORLD,
+     >                  msgid3,
+     >                  IERROR )
+
+        call MPI_WAIT( msgid3, STATUS, IERROR )
+
+        do i = 1,nx
+          g(i,ny+1) = dum(i)
+          h(i,ny+1) = dum(i+nx)
+        end do
+
+      end if
+
+c---------------------------------------------------------------------
+c   send west
+c---------------------------------------------------------------------
+      if (jbeg.eq.1) then
+        do i = 1,nx
+          dum(i) = g(i,1)
+          dum(i+nx) = h(i,1)
+        end do
+
+        call MPI_SEND( dum,
+     >                 2*nx,
+     >                 dp_type,
+     >                 west,
+     >                 from_e,
+     >                 MPI_COMM_WORLD,
+     >                 IERROR )
+
+      end if
+
+c---------------------------------------------------------------------
+c   communicate in the south and north directions
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   receive from south
+c---------------------------------------------------------------------
+      if (ifin1.eq.nx) then
+        call MPI_IRECV( dum,
+     >                  2*ny2,
+     >                  dp_type,
+     >                  south,
+     >                  from_s,
+     >                  MPI_COMM_WORLD,
+     >                  msgid1,
+     >                  IERROR )
+
+        call MPI_WAIT( msgid1, STATUS, IERROR )
+
+        do j = 0,ny+1
+          g(nx+1,j) = dum(j+1)
+          h(nx+1,j) = dum(j+ny2+1)
+        end do
+
+      end if
+
+c---------------------------------------------------------------------
+c   send north
+c---------------------------------------------------------------------
+      if (ibeg.eq.1) then
+        do j = 0,ny+1
+          dum(j+1) = g(1,j)
+          dum(j+ny2+1) = h(1,j)
+        end do
+
+        call MPI_SEND( dum,
+     >                 2*ny2,
+     >                 dp_type,
+     >                 north,
+     >                 from_s,
+     >                 MPI_COMM_WORLD,
+     >                 IERROR )
+
+      end if
+
+      return
+      end     
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/exchange_5.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/exchange_5.f
new file mode 100644
index 0000000..2968544
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/exchange_5.f
@@ -0,0 +1,81 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine exchange_5(g,ibeg,ifin1)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   pass boundary values of g to the north neighbor
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'mpinpb.h'
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  input parameters
+c---------------------------------------------------------------------
+      double precision  g(0:isiz2+1,0:isiz3+1)
+      integer ibeg, ifin1
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer k
+      double precision  dum(1024)
+
+      integer msgid1
+      integer STATUS(MPI_STATUS_SIZE)
+      integer IERROR
+
+
+
+c---------------------------------------------------------------------
+c   communicate in the south and north directions
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   receive from south
+c---------------------------------------------------------------------
+      if (ifin1.eq.nx) then
+        call MPI_IRECV( dum,
+     >                  nz,
+     >                  dp_type,
+     >                  south,
+     >                  from_s,
+     >                  MPI_COMM_WORLD,
+     >                  msgid1,
+     >                  IERROR )
+
+        call MPI_WAIT( msgid1, STATUS, IERROR )
+
+        do k = 1,nz
+          g(nx+1,k) = dum(k)
+        end do
+
+      end if
+
+c---------------------------------------------------------------------
+c   send north
+c---------------------------------------------------------------------
+      if (ibeg.eq.1) then
+        do k = 1,nz
+          dum(k) = g(1,k)
+        end do
+
+        call MPI_SEND( dum,
+     >                 nz,
+     >                 dp_type,
+     >                 north,
+     >                 from_s,
+     >                 MPI_COMM_WORLD,
+     >                 IERROR )
+
+      end if
+
+      return
+      end     
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/exchange_6.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/exchange_6.f
new file mode 100644
index 0000000..c50cd63
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/exchange_6.f
@@ -0,0 +1,81 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine exchange_6(g,jbeg,jfin1)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   pass boundary values of g to the west neighbor
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'mpinpb.h'
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  input parameters
+c---------------------------------------------------------------------
+      double precision  g(0:isiz2+1,0:isiz3+1)
+      integer jbeg, jfin1
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer k
+      double precision  dum(1024)
+
+      integer msgid3
+      integer STATUS(MPI_STATUS_SIZE)
+      integer IERROR
+
+
+
+c---------------------------------------------------------------------
+c   communicate in the east and west directions
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   receive from east
+c---------------------------------------------------------------------
+      if (jfin1.eq.ny) then
+        call MPI_IRECV( dum,
+     >                  nz,
+     >                  dp_type,
+     >                  east,
+     >                  from_e,
+     >                  MPI_COMM_WORLD,
+     >                  msgid3,
+     >                  IERROR )
+
+        call MPI_WAIT( msgid3, STATUS, IERROR )
+
+        do k = 1,nz
+          g(ny+1,k) = dum(k)
+        end do
+
+      end if
+
+c---------------------------------------------------------------------
+c   send west
+c---------------------------------------------------------------------
+      if (jbeg.eq.1) then
+        do k = 1,nz
+          dum(k) = g(1,k)
+        end do
+
+        call MPI_SEND( dum,
+     >                 nz,
+     >                 dp_type,
+     >                 west,
+     >                 from_e,
+     >                 MPI_COMM_WORLD,
+     >                 IERROR )
+
+      end if
+
+      return
+      end     
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/init_comm.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/init_comm.f
new file mode 100644
index 0000000..d9abca1
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/init_comm.f
@@ -0,0 +1,64 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine init_comm 
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c
+c   initialize MPI and establish rank and size
+c
+c This is a module in the MPI implementation of LUSSOR
+c pseudo application from the NAS Parallel Benchmarks. 
+c
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'mpinpb.h'
+      include 'applu.incl'
+
+      integer nodedim
+      integer IERROR
+
+
+c---------------------------------------------------------------------
+c    initialize MPI communication
+c---------------------------------------------------------------------
+      call MPI_INIT( IERROR )
+
+c---------------------------------------------------------------------
+c   establish the global rank of this process
+c---------------------------------------------------------------------
+      call MPI_COMM_RANK( MPI_COMM_WORLD,
+     >                     id,
+     >                     IERROR )
+
+c---------------------------------------------------------------------
+c   establish the size of the global group
+c---------------------------------------------------------------------
+      call MPI_COMM_SIZE( MPI_COMM_WORLD,
+     >                     num,
+     >                     IERROR )
+
+      if (num .lt. nnodes_compiled) then
+         if (id .eq. 0) write (*,2000) num, nnodes_compiled
+2000     format(' Error: number of processes',i6,
+     >          ' less than compiled',i6)
+         CALL MPI_ABORT( MPI_COMM_WORLD, MPI_ERR_OTHER, IERROR )
+      endif
+
+      ndim   = nodedim(num)
+
+      if (.not. convertdouble) then
+         dp_type = MPI_DOUBLE_PRECISION
+      else
+         dp_type = MPI_REAL
+      endif
+
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/inputlu.data.sample b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/inputlu.data.sample
new file mode 100644
index 0000000..9ef5a7b
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/inputlu.data.sample
@@ -0,0 +1,24 @@
+c
+c***controls printing of the progress of iterations: ipr    inorm
+                                                      1      250
+c
+c***the maximum no. of pseudo-time steps to be performed: nitmax
+                                                             250
+c
+c***magnitude of the time step: dt 
+                               2.0e+00
+c
+c***relaxation factor for SSOR iterations: omega
+                                            1.2
+c
+c***tolerance levels for steady-state residuals: tolnwt(m),m=1,5
+                             1.0e-08   1.0e-08   1.0e-08  1.0e-08  1.0e-08 
+c
+c***number of grid points in xi and eta and zeta directions: nx   ny   nz
+                                                            64  64  64
+c
+
+
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/jacld.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/jacld.f
new file mode 100644
index 0000000..053de3c
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/jacld.f
@@ -0,0 +1,387 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine jacld(k)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+
+c---------------------------------------------------------------------
+c   compute the lower triangular part of the jacobian matrix
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  input parameters
+c---------------------------------------------------------------------
+      integer k
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j
+      double precision  r43
+      double precision  c1345
+      double precision  c34
+      double precision  tmp1, tmp2, tmp3
+
+
+      if (timeron) call timer_start(t_jacld)
+
+      r43 = ( 4.0d+00 / 3.0d+00 )
+      c1345 = c1 * c3 * c4 * c5
+      c34 = c3 * c4
+
+         do j = jst, jend
+            do i = ist, iend
+
+c---------------------------------------------------------------------
+c   form the block diagonal
+c---------------------------------------------------------------------
+               tmp1 = 1.0d+00 / u(1,i,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               d(1,1,i,j) =  1.0d+00
+     >                       + dt * 2.0d+00 * (   tx1 * dx1
+     >                                          + ty1 * dy1
+     >                                          + tz1 * dz1 )
+               d(1,2,i,j) =  0.0d+00
+               d(1,3,i,j) =  0.0d+00
+               d(1,4,i,j) =  0.0d+00
+               d(1,5,i,j) =  0.0d+00
+
+               d(2,1,i,j) =  dt * 2.0d+00
+     >          * (  tx1 * ( - r43 * c34 * tmp2 * u(2,i,j,k) )
+     >             + ty1 * ( -       c34 * tmp2 * u(2,i,j,k) )
+     >             + tz1 * ( -       c34 * tmp2 * u(2,i,j,k) ) )
+               d(2,2,i,j) =  1.0d+00
+     >          + dt * 2.0d+00 
+     >          * (  tx1 * r43 * c34 * tmp1
+     >             + ty1 *       c34 * tmp1
+     >             + tz1 *       c34 * tmp1 )
+     >          + dt * 2.0d+00 * (   tx1 * dx2
+     >                             + ty1 * dy2
+     >                             + tz1 * dz2  )
+               d(2,3,i,j) = 0.0d+00
+               d(2,4,i,j) = 0.0d+00
+               d(2,5,i,j) = 0.0d+00
+
+               d(3,1,i,j) = dt * 2.0d+00
+     >      * (  tx1 * ( -       c34 * tmp2 * u(3,i,j,k) )
+     >         + ty1 * ( - r43 * c34 * tmp2 * u(3,i,j,k) )
+     >         + tz1 * ( -       c34 * tmp2 * u(3,i,j,k) ) )
+               d(3,2,i,j) = 0.0d+00
+               d(3,3,i,j) = 1.0d+00
+     >         + dt * 2.0d+00
+     >              * (  tx1 *       c34 * tmp1
+     >                 + ty1 * r43 * c34 * tmp1
+     >                 + tz1 *       c34 * tmp1 )
+     >         + dt * 2.0d+00 * (  tx1 * dx3
+     >                           + ty1 * dy3
+     >                           + tz1 * dz3 )
+               d(3,4,i,j) = 0.0d+00
+               d(3,5,i,j) = 0.0d+00
+
+               d(4,1,i,j) = dt * 2.0d+00
+     >      * (  tx1 * ( -       c34 * tmp2 * u(4,i,j,k) )
+     >         + ty1 * ( -       c34 * tmp2 * u(4,i,j,k) )
+     >         + tz1 * ( - r43 * c34 * tmp2 * u(4,i,j,k) ) )
+               d(4,2,i,j) = 0.0d+00
+               d(4,3,i,j) = 0.0d+00
+               d(4,4,i,j) = 1.0d+00
+     >         + dt * 2.0d+00
+     >              * (  tx1 *       c34 * tmp1
+     >                 + ty1 *       c34 * tmp1
+     >                 + tz1 * r43 * c34 * tmp1 )
+     >         + dt * 2.0d+00 * (  tx1 * dx4
+     >                           + ty1 * dy4
+     >                           + tz1 * dz4 )
+               d(4,5,i,j) = 0.0d+00
+
+               d(5,1,i,j) = dt * 2.0d+00
+     > * ( tx1 * ( - ( r43*c34 - c1345 ) * tmp3 * ( u(2,i,j,k) ** 2 )
+     >             - ( c34 - c1345 ) * tmp3 * ( u(3,i,j,k) ** 2 )
+     >             - ( c34 - c1345 ) * tmp3 * ( u(4,i,j,k) ** 2 )
+     >             - ( c1345 ) * tmp2 * u(5,i,j,k) )
+     >   + ty1 * ( - ( c34 - c1345 ) * tmp3 * ( u(2,i,j,k) ** 2 )
+     >             - ( r43*c34 - c1345 ) * tmp3 * ( u(3,i,j,k) ** 2 )
+     >             - ( c34 - c1345 ) * tmp3 * ( u(4,i,j,k) ** 2 )
+     >             - ( c1345 ) * tmp2 * u(5,i,j,k) )
+     >   + tz1 * ( - ( c34 - c1345 ) * tmp3 * ( u(2,i,j,k) ** 2 )
+     >             - ( c34 - c1345 ) * tmp3 * ( u(3,i,j,k) ** 2 )
+     >             - ( r43*c34 - c1345 ) * tmp3 * ( u(4,i,j,k) ** 2 )
+     >             - ( c1345 ) * tmp2 * u(5,i,j,k) ) )
+               d(5,2,i,j) = dt * 2.0d+00
+     > * ( tx1 * ( r43*c34 - c1345 ) * tmp2 * u(2,i,j,k)
+     >   + ty1 * (     c34 - c1345 ) * tmp2 * u(2,i,j,k)
+     >   + tz1 * (     c34 - c1345 ) * tmp2 * u(2,i,j,k) )
+               d(5,3,i,j) = dt * 2.0d+00
+     > * ( tx1 * ( c34 - c1345 ) * tmp2 * u(3,i,j,k)
+     >   + ty1 * ( r43*c34 -c1345 ) * tmp2 * u(3,i,j,k)
+     >   + tz1 * ( c34 - c1345 ) * tmp2 * u(3,i,j,k) )
+               d(5,4,i,j) = dt * 2.0d+00
+     > * ( tx1 * ( c34 - c1345 ) * tmp2 * u(4,i,j,k)
+     >   + ty1 * ( c34 - c1345 ) * tmp2 * u(4,i,j,k)
+     >   + tz1 * ( r43*c34 - c1345 ) * tmp2 * u(4,i,j,k) )
+               d(5,5,i,j) = 1.0d+00
+     >   + dt * 2.0d+00 * ( tx1 * c1345 * tmp1
+     >                    + ty1 * c1345 * tmp1
+     >                    + tz1 * c1345 * tmp1 )
+     >   + dt * 2.0d+00 * (  tx1 * dx5
+     >                    +  ty1 * dy5
+     >                    +  tz1 * dz5 )
+
+c---------------------------------------------------------------------
+c   form the first block sub-diagonal
+c---------------------------------------------------------------------
+               tmp1 = 1.0d+00 / u(1,i,j,k-1)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               a(1,1,i,j) = - dt * tz1 * dz1
+               a(1,2,i,j) =   0.0d+00
+               a(1,3,i,j) =   0.0d+00
+               a(1,4,i,j) = - dt * tz2
+               a(1,5,i,j) =   0.0d+00
+
+               a(2,1,i,j) = - dt * tz2
+     >           * ( - ( u(2,i,j,k-1)*u(4,i,j,k-1) ) * tmp2 )
+     >           - dt * tz1 * ( - c34 * tmp2 * u(2,i,j,k-1) )
+               a(2,2,i,j) = - dt * tz2 * ( u(4,i,j,k-1) * tmp1 )
+     >           - dt * tz1 * c34 * tmp1
+     >           - dt * tz1 * dz2 
+               a(2,3,i,j) = 0.0d+00
+               a(2,4,i,j) = - dt * tz2 * ( u(2,i,j,k-1) * tmp1 )
+               a(2,5,i,j) = 0.0d+00
+
+               a(3,1,i,j) = - dt * tz2
+     >           * ( - ( u(3,i,j,k-1)*u(4,i,j,k-1) ) * tmp2 )
+     >           - dt * tz1 * ( - c34 * tmp2 * u(3,i,j,k-1) )
+               a(3,2,i,j) = 0.0d+00
+               a(3,3,i,j) = - dt * tz2 * ( u(4,i,j,k-1) * tmp1 )
+     >           - dt * tz1 * ( c34 * tmp1 )
+     >           - dt * tz1 * dz3
+               a(3,4,i,j) = - dt * tz2 * ( u(3,i,j,k-1) * tmp1 )
+               a(3,5,i,j) = 0.0d+00
+
+               a(4,1,i,j) = - dt * tz2
+     >        * ( - ( u(4,i,j,k-1) * tmp1 ) ** 2
+     >            + 0.50d+00 * c2
+     >            * ( ( u(2,i,j,k-1) * u(2,i,j,k-1)
+     >                + u(3,i,j,k-1) * u(3,i,j,k-1)
+     >                + u(4,i,j,k-1) * u(4,i,j,k-1) ) * tmp2 ) )
+     >        - dt * tz1 * ( - r43 * c34 * tmp2 * u(4,i,j,k-1) )
+               a(4,2,i,j) = - dt * tz2
+     >             * ( - c2 * ( u(2,i,j,k-1) * tmp1 ) )
+               a(4,3,i,j) = - dt * tz2
+     >             * ( - c2 * ( u(3,i,j,k-1) * tmp1 ) )
+               a(4,4,i,j) = - dt * tz2 * ( 2.0d+00 - c2 )
+     >             * ( u(4,i,j,k-1) * tmp1 )
+     >             - dt * tz1 * ( r43 * c34 * tmp1 )
+     >             - dt * tz1 * dz4
+               a(4,5,i,j) = - dt * tz2 * c2
+
+               a(5,1,i,j) = - dt * tz2
+     >     * ( ( c2 * (  u(2,i,j,k-1) * u(2,i,j,k-1)
+     >                 + u(3,i,j,k-1) * u(3,i,j,k-1)
+     >                 + u(4,i,j,k-1) * u(4,i,j,k-1) ) * tmp2
+     >       - c1 * ( u(5,i,j,k-1) * tmp1 ) )
+     >            * ( u(4,i,j,k-1) * tmp1 ) )
+     >       - dt * tz1
+     >       * ( - ( c34 - c1345 ) * tmp3 * (u(2,i,j,k-1)**2)
+     >           - ( c34 - c1345 ) * tmp3 * (u(3,i,j,k-1)**2)
+     >           - ( r43*c34 - c1345 )* tmp3 * (u(4,i,j,k-1)**2)
+     >          - c1345 * tmp2 * u(5,i,j,k-1) )
+               a(5,2,i,j) = - dt * tz2
+     >       * ( - c2 * ( u(2,i,j,k-1)*u(4,i,j,k-1) ) * tmp2 )
+     >       - dt * tz1 * ( c34 - c1345 ) * tmp2 * u(2,i,j,k-1)
+               a(5,3,i,j) = - dt * tz2
+     >       * ( - c2 * ( u(3,i,j,k-1)*u(4,i,j,k-1) ) * tmp2 )
+     >       - dt * tz1 * ( c34 - c1345 ) * tmp2 * u(3,i,j,k-1)
+               a(5,4,i,j) = - dt * tz2
+     >       * ( c1 * ( u(5,i,j,k-1) * tmp1 )
+     >       - 0.50d+00 * c2
+     >       * ( (  u(2,i,j,k-1)*u(2,i,j,k-1)
+     >            + u(3,i,j,k-1)*u(3,i,j,k-1)
+     >            + 3.0d+00*u(4,i,j,k-1)*u(4,i,j,k-1) ) * tmp2 ) )
+     >       - dt * tz1 * ( r43*c34 - c1345 ) * tmp2 * u(4,i,j,k-1)
+               a(5,5,i,j) = - dt * tz2
+     >       * ( c1 * ( u(4,i,j,k-1) * tmp1 ) )
+     >       - dt * tz1 * c1345 * tmp1
+     >       - dt * tz1 * dz5
+
+c---------------------------------------------------------------------
+c   form the second block sub-diagonal
+c---------------------------------------------------------------------
+               tmp1 = 1.0d+00 / u(1,i,j-1,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               b(1,1,i,j) = - dt * ty1 * dy1
+               b(1,2,i,j) =   0.0d+00
+               b(1,3,i,j) = - dt * ty2
+               b(1,4,i,j) =   0.0d+00
+               b(1,5,i,j) =   0.0d+00
+
+               b(2,1,i,j) = - dt * ty2
+     >           * ( - ( u(2,i,j-1,k)*u(3,i,j-1,k) ) * tmp2 )
+     >           - dt * ty1 * ( - c34 * tmp2 * u(2,i,j-1,k) )
+               b(2,2,i,j) = - dt * ty2 * ( u(3,i,j-1,k) * tmp1 )
+     >          - dt * ty1 * ( c34 * tmp1 )
+     >          - dt * ty1 * dy2
+               b(2,3,i,j) = - dt * ty2 * ( u(2,i,j-1,k) * tmp1 )
+               b(2,4,i,j) = 0.0d+00
+               b(2,5,i,j) = 0.0d+00
+
+               b(3,1,i,j) = - dt * ty2
+     >           * ( - ( u(3,i,j-1,k) * tmp1 ) ** 2
+     >      + 0.50d+00 * c2 * ( (  u(2,i,j-1,k) * u(2,i,j-1,k)
+     >                           + u(3,i,j-1,k) * u(3,i,j-1,k)
+     >                           + u(4,i,j-1,k) * u(4,i,j-1,k) )
+     >                          * tmp2 ) )
+     >       - dt * ty1 * ( - r43 * c34 * tmp2 * u(3,i,j-1,k) )
+               b(3,2,i,j) = - dt * ty2
+     >                   * ( - c2 * ( u(2,i,j-1,k) * tmp1 ) )
+               b(3,3,i,j) = - dt * ty2 * ( ( 2.0d+00 - c2 )
+     >                   * ( u(3,i,j-1,k) * tmp1 ) )
+     >       - dt * ty1 * ( r43 * c34 * tmp1 )
+     >       - dt * ty1 * dy3
+               b(3,4,i,j) = - dt * ty2
+     >                   * ( - c2 * ( u(4,i,j-1,k) * tmp1 ) )
+               b(3,5,i,j) = - dt * ty2 * c2
+
+               b(4,1,i,j) = - dt * ty2
+     >              * ( - ( u(3,i,j-1,k)*u(4,i,j-1,k) ) * tmp2 )
+     >       - dt * ty1 * ( - c34 * tmp2 * u(4,i,j-1,k) )
+               b(4,2,i,j) = 0.0d+00
+               b(4,3,i,j) = - dt * ty2 * ( u(4,i,j-1,k) * tmp1 )
+               b(4,4,i,j) = - dt * ty2 * ( u(3,i,j-1,k) * tmp1 )
+     >                        - dt * ty1 * ( c34 * tmp1 )
+     >                        - dt * ty1 * dy4
+               b(4,5,i,j) = 0.0d+00
+
+               b(5,1,i,j) = - dt * ty2
+     >          * ( ( c2 * (  u(2,i,j-1,k) * u(2,i,j-1,k)
+     >                      + u(3,i,j-1,k) * u(3,i,j-1,k)
+     >                      + u(4,i,j-1,k) * u(4,i,j-1,k) ) * tmp2
+     >               - c1 * ( u(5,i,j-1,k) * tmp1 ) )
+     >          * ( u(3,i,j-1,k) * tmp1 ) )
+     >          - dt * ty1
+     >          * ( - (     c34 - c1345 )*tmp3*(u(2,i,j-1,k)**2)
+     >              - ( r43*c34 - c1345 )*tmp3*(u(3,i,j-1,k)**2)
+     >              - (     c34 - c1345 )*tmp3*(u(4,i,j-1,k)**2)
+     >              - c1345*tmp2*u(5,i,j-1,k) )
+               b(5,2,i,j) = - dt * ty2
+     >          * ( - c2 * ( u(2,i,j-1,k)*u(3,i,j-1,k) ) * tmp2 )
+     >          - dt * ty1
+     >          * ( c34 - c1345 ) * tmp2 * u(2,i,j-1,k)
+               b(5,3,i,j) = - dt * ty2
+     >          * ( c1 * ( u(5,i,j-1,k) * tmp1 )
+     >          - 0.50d+00 * c2 
+     >          * ( (  u(2,i,j-1,k)*u(2,i,j-1,k)
+     >               + 3.0d+00 * u(3,i,j-1,k)*u(3,i,j-1,k)
+     >               + u(4,i,j-1,k)*u(4,i,j-1,k) ) * tmp2 ) )
+     >          - dt * ty1
+     >          * ( r43*c34 - c1345 ) * tmp2 * u(3,i,j-1,k)
+               b(5,4,i,j) = - dt * ty2
+     >          * ( - c2 * ( u(3,i,j-1,k)*u(4,i,j-1,k) ) * tmp2 )
+     >          - dt * ty1 * ( c34 - c1345 ) * tmp2 * u(4,i,j-1,k)
+               b(5,5,i,j) = - dt * ty2
+     >          * ( c1 * ( u(3,i,j-1,k) * tmp1 ) )
+     >          - dt * ty1 * c1345 * tmp1
+     >          - dt * ty1 * dy5
+
+c---------------------------------------------------------------------
+c   form the third block sub-diagonal
+c---------------------------------------------------------------------
+               tmp1 = 1.0d+00 / u(1,i-1,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               c(1,1,i,j) = - dt * tx1 * dx1
+               c(1,2,i,j) = - dt * tx2
+               c(1,3,i,j) =   0.0d+00
+               c(1,4,i,j) =   0.0d+00
+               c(1,5,i,j) =   0.0d+00
+
+               c(2,1,i,j) = - dt * tx2
+     >          * ( - ( u(2,i-1,j,k) * tmp1 ) ** 2
+     >     + c2 * 0.50d+00 * (  u(2,i-1,j,k) * u(2,i-1,j,k)
+     >                        + u(3,i-1,j,k) * u(3,i-1,j,k)
+     >                        + u(4,i-1,j,k) * u(4,i-1,j,k) ) * tmp2 )
+     >          - dt * tx1 * ( - r43 * c34 * tmp2 * u(2,i-1,j,k) )
+               c(2,2,i,j) = - dt * tx2
+     >          * ( ( 2.0d+00 - c2 ) * ( u(2,i-1,j,k) * tmp1 ) )
+     >          - dt * tx1 * ( r43 * c34 * tmp1 )
+     >          - dt * tx1 * dx2
+               c(2,3,i,j) = - dt * tx2
+     >              * ( - c2 * ( u(3,i-1,j,k) * tmp1 ) )
+               c(2,4,i,j) = - dt * tx2
+     >              * ( - c2 * ( u(4,i-1,j,k) * tmp1 ) )
+               c(2,5,i,j) = - dt * tx2 * c2 
+
+               c(3,1,i,j) = - dt * tx2
+     >              * ( - ( u(2,i-1,j,k) * u(3,i-1,j,k) ) * tmp2 )
+     >         - dt * tx1 * ( - c34 * tmp2 * u(3,i-1,j,k) )
+               c(3,2,i,j) = - dt * tx2 * ( u(3,i-1,j,k) * tmp1 )
+               c(3,3,i,j) = - dt * tx2 * ( u(2,i-1,j,k) * tmp1 )
+     >          - dt * tx1 * ( c34 * tmp1 )
+     >          - dt * tx1 * dx3
+               c(3,4,i,j) = 0.0d+00
+               c(3,5,i,j) = 0.0d+00
+
+               c(4,1,i,j) = - dt * tx2
+     >          * ( - ( u(2,i-1,j,k)*u(4,i-1,j,k) ) * tmp2 )
+     >          - dt * tx1 * ( - c34 * tmp2 * u(4,i-1,j,k) )
+               c(4,2,i,j) = - dt * tx2 * ( u(4,i-1,j,k) * tmp1 )
+               c(4,3,i,j) = 0.0d+00
+               c(4,4,i,j) = - dt * tx2 * ( u(2,i-1,j,k) * tmp1 )
+     >          - dt * tx1 * ( c34 * tmp1 )
+     >          - dt * tx1 * dx4
+               c(4,5,i,j) = 0.0d+00
+
+               c(5,1,i,j) = - dt * tx2
+     >          * ( ( c2 * (  u(2,i-1,j,k) * u(2,i-1,j,k)
+     >                      + u(3,i-1,j,k) * u(3,i-1,j,k)
+     >                      + u(4,i-1,j,k) * u(4,i-1,j,k) ) * tmp2
+     >              - c1 * ( u(5,i-1,j,k) * tmp1 ) )
+     >          * ( u(2,i-1,j,k) * tmp1 ) )
+     >          - dt * tx1
+     >          * ( - ( r43*c34 - c1345 ) * tmp3 * ( u(2,i-1,j,k)**2 )
+     >              - (     c34 - c1345 ) * tmp3 * ( u(3,i-1,j,k)**2 )
+     >              - (     c34 - c1345 ) * tmp3 * ( u(4,i-1,j,k)**2 )
+     >              - c1345 * tmp2 * u(5,i-1,j,k) )
+               c(5,2,i,j) = - dt * tx2
+     >          * ( c1 * ( u(5,i-1,j,k) * tmp1 )
+     >             - 0.50d+00 * c2
+     >             * ( (  3.0d+00*u(2,i-1,j,k)*u(2,i-1,j,k)
+     >                  + u(3,i-1,j,k)*u(3,i-1,j,k)
+     >                  + u(4,i-1,j,k)*u(4,i-1,j,k) ) * tmp2 ) )
+     >           - dt * tx1
+     >           * ( r43*c34 - c1345 ) * tmp2 * u(2,i-1,j,k)
+               c(5,3,i,j) = - dt * tx2
+     >           * ( - c2 * ( u(3,i-1,j,k)*u(2,i-1,j,k) ) * tmp2 )
+     >           - dt * tx1
+     >           * (  c34 - c1345 ) * tmp2 * u(3,i-1,j,k)
+               c(5,4,i,j) = - dt * tx2
+     >           * ( - c2 * ( u(4,i-1,j,k)*u(2,i-1,j,k) ) * tmp2 )
+     >           - dt * tx1
+     >           * (  c34 - c1345 ) * tmp2 * u(4,i-1,j,k)
+               c(5,5,i,j) = - dt * tx2
+     >           * ( c1 * ( u(2,i-1,j,k) * tmp1 ) )
+     >           - dt * tx1 * c1345 * tmp1
+     >           - dt * tx1 * dx5
+
+            end do
+         end do
+
+      if (timeron) call timer_stop(t_jacld)
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/jacu.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/jacu.f
new file mode 100644
index 0000000..1c6fc1d
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/jacu.f
@@ -0,0 +1,387 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine jacu(k)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   compute the upper triangular part of the jacobian matrix
+c---------------------------------------------------------------------
+
+
+      implicit none
+
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  input parameters
+c---------------------------------------------------------------------
+      integer k
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j
+      double precision  r43
+      double precision  c1345
+      double precision  c34
+      double precision  tmp1, tmp2, tmp3
+
+
+      if (timeron) call timer_start(t_jacu)
+
+      r43 = ( 4.0d+00 / 3.0d+00 )
+      c1345 = c1 * c3 * c4 * c5
+      c34 = c3 * c4
+
+         do j = jst, jend
+            do i = ist, iend
+
+c---------------------------------------------------------------------
+c   form the block diagonal
+c---------------------------------------------------------------------
+               tmp1 = 1.0d+00 / u(1,i,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               d(1,1,i,j) =  1.0d+00
+     >                       + dt * 2.0d+00 * (   tx1 * dx1
+     >                                          + ty1 * dy1
+     >                                          + tz1 * dz1 )
+               d(1,2,i,j) =  0.0d+00
+               d(1,3,i,j) =  0.0d+00
+               d(1,4,i,j) =  0.0d+00
+               d(1,5,i,j) =  0.0d+00
+
+               d(2,1,i,j) =  dt * 2.0d+00
+     >          * (  tx1 * ( - r43 * c34 * tmp2 * u(2,i,j,k) )
+     >             + ty1 * ( -       c34 * tmp2 * u(2,i,j,k) )
+     >             + tz1 * ( -       c34 * tmp2 * u(2,i,j,k) ) )
+               d(2,2,i,j) =  1.0d+00
+     >          + dt * 2.0d+00 
+     >          * (  tx1 * r43 * c34 * tmp1
+     >             + ty1 *       c34 * tmp1
+     >             + tz1 *       c34 * tmp1 )
+     >          + dt * 2.0d+00 * (   tx1 * dx2
+     >                             + ty1 * dy2
+     >                             + tz1 * dz2  )
+               d(2,3,i,j) = 0.0d+00
+               d(2,4,i,j) = 0.0d+00
+               d(2,5,i,j) = 0.0d+00
+
+               d(3,1,i,j) = dt * 2.0d+00
+     >      * (  tx1 * ( -       c34 * tmp2 * u(3,i,j,k) )
+     >         + ty1 * ( - r43 * c34 * tmp2 * u(3,i,j,k) )
+     >         + tz1 * ( -       c34 * tmp2 * u(3,i,j,k) ) )
+               d(3,2,i,j) = 0.0d+00
+               d(3,3,i,j) = 1.0d+00
+     >         + dt * 2.0d+00
+     >              * (  tx1 *       c34 * tmp1
+     >                 + ty1 * r43 * c34 * tmp1
+     >                 + tz1 *       c34 * tmp1 )
+     >         + dt * 2.0d+00 * (  tx1 * dx3
+     >                           + ty1 * dy3
+     >                           + tz1 * dz3 )
+               d(3,4,i,j) = 0.0d+00
+               d(3,5,i,j) = 0.0d+00
+
+               d(4,1,i,j) = dt * 2.0d+00
+     >      * (  tx1 * ( -       c34 * tmp2 * u(4,i,j,k) )
+     >         + ty1 * ( -       c34 * tmp2 * u(4,i,j,k) )
+     >         + tz1 * ( - r43 * c34 * tmp2 * u(4,i,j,k) ) )
+               d(4,2,i,j) = 0.0d+00
+               d(4,3,i,j) = 0.0d+00
+               d(4,4,i,j) = 1.0d+00
+     >         + dt * 2.0d+00
+     >              * (  tx1 *       c34 * tmp1
+     >                 + ty1 *       c34 * tmp1
+     >                 + tz1 * r43 * c34 * tmp1 )
+     >         + dt * 2.0d+00 * (  tx1 * dx4
+     >                           + ty1 * dy4
+     >                           + tz1 * dz4 )
+               d(4,5,i,j) = 0.0d+00
+
+               d(5,1,i,j) = dt * 2.0d+00
+     > * ( tx1 * ( - ( r43*c34 - c1345 ) * tmp3 * ( u(2,i,j,k) ** 2 )
+     >             - ( c34 - c1345 ) * tmp3 * ( u(3,i,j,k) ** 2 )
+     >             - ( c34 - c1345 ) * tmp3 * ( u(4,i,j,k) ** 2 )
+     >             - ( c1345 ) * tmp2 * u(5,i,j,k) )
+     >   + ty1 * ( - ( c34 - c1345 ) * tmp3 * ( u(2,i,j,k) ** 2 )
+     >             - ( r43*c34 - c1345 ) * tmp3 * ( u(3,i,j,k) ** 2 )
+     >             - ( c34 - c1345 ) * tmp3 * ( u(4,i,j,k) ** 2 )
+     >             - ( c1345 ) * tmp2 * u(5,i,j,k) )
+     >   + tz1 * ( - ( c34 - c1345 ) * tmp3 * ( u(2,i,j,k) ** 2 )
+     >             - ( c34 - c1345 ) * tmp3 * ( u(3,i,j,k) ** 2 )
+     >             - ( r43*c34 - c1345 ) * tmp3 * ( u(4,i,j,k) ** 2 )
+     >             - ( c1345 ) * tmp2 * u(5,i,j,k) ) )
+               d(5,2,i,j) = dt * 2.0d+00
+     > * ( tx1 * ( r43*c34 - c1345 ) * tmp2 * u(2,i,j,k)
+     >   + ty1 * (     c34 - c1345 ) * tmp2 * u(2,i,j,k)
+     >   + tz1 * (     c34 - c1345 ) * tmp2 * u(2,i,j,k) )
+               d(5,3,i,j) = dt * 2.0d+00
+     > * ( tx1 * ( c34 - c1345 ) * tmp2 * u(3,i,j,k)
+     >   + ty1 * ( r43*c34 -c1345 ) * tmp2 * u(3,i,j,k)
+     >   + tz1 * ( c34 - c1345 ) * tmp2 * u(3,i,j,k) )
+               d(5,4,i,j) = dt * 2.0d+00
+     > * ( tx1 * ( c34 - c1345 ) * tmp2 * u(4,i,j,k)
+     >   + ty1 * ( c34 - c1345 ) * tmp2 * u(4,i,j,k)
+     >   + tz1 * ( r43*c34 - c1345 ) * tmp2 * u(4,i,j,k) )
+               d(5,5,i,j) = 1.0d+00
+     >   + dt * 2.0d+00 * ( tx1 * c1345 * tmp1
+     >                    + ty1 * c1345 * tmp1
+     >                    + tz1 * c1345 * tmp1 )
+     >   + dt * 2.0d+00 * (  tx1 * dx5
+     >                    +  ty1 * dy5
+     >                    +  tz1 * dz5 )
+
+c---------------------------------------------------------------------
+c   form the first block super-diagonal
+c---------------------------------------------------------------------
+               tmp1 = 1.0d+00 / u(1,i+1,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               a(1,1,i,j) = - dt * tx1 * dx1
+               a(1,2,i,j) =   dt * tx2
+               a(1,3,i,j) =   0.0d+00
+               a(1,4,i,j) =   0.0d+00
+               a(1,5,i,j) =   0.0d+00
+
+               a(2,1,i,j) =  dt * tx2
+     >          * ( - ( u(2,i+1,j,k) * tmp1 ) ** 2
+     >     + c2 * 0.50d+00 * (  u(2,i+1,j,k) * u(2,i+1,j,k)
+     >                        + u(3,i+1,j,k) * u(3,i+1,j,k)
+     >                        + u(4,i+1,j,k) * u(4,i+1,j,k) ) * tmp2 )
+     >          - dt * tx1 * ( - r43 * c34 * tmp2 * u(2,i+1,j,k) )
+               a(2,2,i,j) =  dt * tx2
+     >          * ( ( 2.0d+00 - c2 ) * ( u(2,i+1,j,k) * tmp1 ) )
+     >          - dt * tx1 * ( r43 * c34 * tmp1 )
+     >          - dt * tx1 * dx2
+               a(2,3,i,j) =  dt * tx2
+     >              * ( - c2 * ( u(3,i+1,j,k) * tmp1 ) )
+               a(2,4,i,j) =  dt * tx2
+     >              * ( - c2 * ( u(4,i+1,j,k) * tmp1 ) )
+               a(2,5,i,j) =  dt * tx2 * c2 
+
+               a(3,1,i,j) =  dt * tx2
+     >              * ( - ( u(2,i+1,j,k) * u(3,i+1,j,k) ) * tmp2 )
+     >         - dt * tx1 * ( - c34 * tmp2 * u(3,i+1,j,k) )
+               a(3,2,i,j) =  dt * tx2 * ( u(3,i+1,j,k) * tmp1 )
+               a(3,3,i,j) =  dt * tx2 * ( u(2,i+1,j,k) * tmp1 )
+     >          - dt * tx1 * ( c34 * tmp1 )
+     >          - dt * tx1 * dx3
+               a(3,4,i,j) = 0.0d+00
+               a(3,5,i,j) = 0.0d+00
+
+               a(4,1,i,j) = dt * tx2
+     >          * ( - ( u(2,i+1,j,k)*u(4,i+1,j,k) ) * tmp2 )
+     >          - dt * tx1 * ( - c34 * tmp2 * u(4,i+1,j,k) )
+               a(4,2,i,j) = dt * tx2 * ( u(4,i+1,j,k) * tmp1 )
+               a(4,3,i,j) = 0.0d+00
+               a(4,4,i,j) = dt * tx2 * ( u(2,i+1,j,k) * tmp1 )
+     >          - dt * tx1 * ( c34 * tmp1 )
+     >          - dt * tx1 * dx4
+               a(4,5,i,j) = 0.0d+00
+
+               a(5,1,i,j) = dt * tx2
+     >          * ( ( c2 * (  u(2,i+1,j,k) * u(2,i+1,j,k)
+     >                      + u(3,i+1,j,k) * u(3,i+1,j,k)
+     >                      + u(4,i+1,j,k) * u(4,i+1,j,k) ) * tmp2
+     >              - c1 * ( u(5,i+1,j,k) * tmp1 ) )
+     >          * ( u(2,i+1,j,k) * tmp1 ) )
+     >          - dt * tx1
+     >          * ( - ( r43*c34 - c1345 ) * tmp3 * ( u(2,i+1,j,k)**2 )
+     >              - (     c34 - c1345 ) * tmp3 * ( u(3,i+1,j,k)**2 )
+     >              - (     c34 - c1345 ) * tmp3 * ( u(4,i+1,j,k)**2 )
+     >              - c1345 * tmp2 * u(5,i+1,j,k) )
+               a(5,2,i,j) = dt * tx2
+     >          * ( c1 * ( u(5,i+1,j,k) * tmp1 )
+     >             - 0.50d+00 * c2
+     >             * ( (  3.0d+00*u(2,i+1,j,k)*u(2,i+1,j,k)
+     >                  + u(3,i+1,j,k)*u(3,i+1,j,k)
+     >                  + u(4,i+1,j,k)*u(4,i+1,j,k) ) * tmp2 ) )
+     >           - dt * tx1
+     >           * ( r43*c34 - c1345 ) * tmp2 * u(2,i+1,j,k)
+               a(5,3,i,j) = dt * tx2
+     >           * ( - c2 * ( u(3,i+1,j,k)*u(2,i+1,j,k) ) * tmp2 )
+     >           - dt * tx1
+     >           * (  c34 - c1345 ) * tmp2 * u(3,i+1,j,k)
+               a(5,4,i,j) = dt * tx2
+     >           * ( - c2 * ( u(4,i+1,j,k)*u(2,i+1,j,k) ) * tmp2 )
+     >           - dt * tx1
+     >           * (  c34 - c1345 ) * tmp2 * u(4,i+1,j,k)
+               a(5,5,i,j) = dt * tx2
+     >           * ( c1 * ( u(2,i+1,j,k) * tmp1 ) )
+     >           - dt * tx1 * c1345 * tmp1
+     >           - dt * tx1 * dx5
+
+c---------------------------------------------------------------------
+c   form the second block super-diagonal
+c---------------------------------------------------------------------
+               tmp1 = 1.0d+00 / u(1,i,j+1,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               b(1,1,i,j) = - dt * ty1 * dy1
+               b(1,2,i,j) =   0.0d+00
+               b(1,3,i,j) =  dt * ty2
+               b(1,4,i,j) =   0.0d+00
+               b(1,5,i,j) =   0.0d+00
+
+               b(2,1,i,j) =  dt * ty2
+     >           * ( - ( u(2,i,j+1,k)*u(3,i,j+1,k) ) * tmp2 )
+     >           - dt * ty1 * ( - c34 * tmp2 * u(2,i,j+1,k) )
+               b(2,2,i,j) =  dt * ty2 * ( u(3,i,j+1,k) * tmp1 )
+     >          - dt * ty1 * ( c34 * tmp1 )
+     >          - dt * ty1 * dy2
+               b(2,3,i,j) =  dt * ty2 * ( u(2,i,j+1,k) * tmp1 )
+               b(2,4,i,j) = 0.0d+00
+               b(2,5,i,j) = 0.0d+00
+
+               b(3,1,i,j) =  dt * ty2
+     >           * ( - ( u(3,i,j+1,k) * tmp1 ) ** 2
+     >      + 0.50d+00 * c2 * ( (  u(2,i,j+1,k) * u(2,i,j+1,k)
+     >                           + u(3,i,j+1,k) * u(3,i,j+1,k)
+     >                           + u(4,i,j+1,k) * u(4,i,j+1,k) )
+     >                          * tmp2 ) )
+     >       - dt * ty1 * ( - r43 * c34 * tmp2 * u(3,i,j+1,k) )
+               b(3,2,i,j) =  dt * ty2
+     >                   * ( - c2 * ( u(2,i,j+1,k) * tmp1 ) )
+               b(3,3,i,j) =  dt * ty2 * ( ( 2.0d+00 - c2 )
+     >                   * ( u(3,i,j+1,k) * tmp1 ) )
+     >       - dt * ty1 * ( r43 * c34 * tmp1 )
+     >       - dt * ty1 * dy3
+               b(3,4,i,j) =  dt * ty2
+     >                   * ( - c2 * ( u(4,i,j+1,k) * tmp1 ) )
+               b(3,5,i,j) =  dt * ty2 * c2
+
+               b(4,1,i,j) =  dt * ty2
+     >              * ( - ( u(3,i,j+1,k)*u(4,i,j+1,k) ) * tmp2 )
+     >       - dt * ty1 * ( - c34 * tmp2 * u(4,i,j+1,k) )
+               b(4,2,i,j) = 0.0d+00
+               b(4,3,i,j) =  dt * ty2 * ( u(4,i,j+1,k) * tmp1 )
+               b(4,4,i,j) =  dt * ty2 * ( u(3,i,j+1,k) * tmp1 )
+     >                        - dt * ty1 * ( c34 * tmp1 )
+     >                        - dt * ty1 * dy4
+               b(4,5,i,j) = 0.0d+00
+
+               b(5,1,i,j) =  dt * ty2
+     >          * ( ( c2 * (  u(2,i,j+1,k) * u(2,i,j+1,k)
+     >                      + u(3,i,j+1,k) * u(3,i,j+1,k)
+     >                      + u(4,i,j+1,k) * u(4,i,j+1,k) ) * tmp2
+     >               - c1 * ( u(5,i,j+1,k) * tmp1 ) )
+     >          * ( u(3,i,j+1,k) * tmp1 ) )
+     >          - dt * ty1
+     >          * ( - (     c34 - c1345 )*tmp3*(u(2,i,j+1,k)**2)
+     >              - ( r43*c34 - c1345 )*tmp3*(u(3,i,j+1,k)**2)
+     >              - (     c34 - c1345 )*tmp3*(u(4,i,j+1,k)**2)
+     >              - c1345*tmp2*u(5,i,j+1,k) )
+               b(5,2,i,j) =  dt * ty2
+     >          * ( - c2 * ( u(2,i,j+1,k)*u(3,i,j+1,k) ) * tmp2 )
+     >          - dt * ty1
+     >          * ( c34 - c1345 ) * tmp2 * u(2,i,j+1,k)
+               b(5,3,i,j) =  dt * ty2
+     >          * ( c1 * ( u(5,i,j+1,k) * tmp1 )
+     >          - 0.50d+00 * c2 
+     >          * ( (  u(2,i,j+1,k)*u(2,i,j+1,k)
+     >               + 3.0d+00 * u(3,i,j+1,k)*u(3,i,j+1,k)
+     >               + u(4,i,j+1,k)*u(4,i,j+1,k) ) * tmp2 ) )
+     >          - dt * ty1
+     >          * ( r43*c34 - c1345 ) * tmp2 * u(3,i,j+1,k)
+               b(5,4,i,j) =  dt * ty2
+     >          * ( - c2 * ( u(3,i,j+1,k)*u(4,i,j+1,k) ) * tmp2 )
+     >          - dt * ty1 * ( c34 - c1345 ) * tmp2 * u(4,i,j+1,k)
+               b(5,5,i,j) =  dt * ty2
+     >          * ( c1 * ( u(3,i,j+1,k) * tmp1 ) )
+     >          - dt * ty1 * c1345 * tmp1
+     >          - dt * ty1 * dy5
+
+c---------------------------------------------------------------------
+c   form the third block super-diagonal
+c---------------------------------------------------------------------
+               tmp1 = 1.0d+00 / u(1,i,j,k+1)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               c(1,1,i,j) = - dt * tz1 * dz1
+               c(1,2,i,j) =   0.0d+00
+               c(1,3,i,j) =   0.0d+00
+               c(1,4,i,j) = dt * tz2
+               c(1,5,i,j) =   0.0d+00
+
+               c(2,1,i,j) = dt * tz2
+     >           * ( - ( u(2,i,j,k+1)*u(4,i,j,k+1) ) * tmp2 )
+     >           - dt * tz1 * ( - c34 * tmp2 * u(2,i,j,k+1) )
+               c(2,2,i,j) = dt * tz2 * ( u(4,i,j,k+1) * tmp1 )
+     >           - dt * tz1 * c34 * tmp1
+     >           - dt * tz1 * dz2 
+               c(2,3,i,j) = 0.0d+00
+               c(2,4,i,j) = dt * tz2 * ( u(2,i,j,k+1) * tmp1 )
+               c(2,5,i,j) = 0.0d+00
+
+               c(3,1,i,j) = dt * tz2
+     >           * ( - ( u(3,i,j,k+1)*u(4,i,j,k+1) ) * tmp2 )
+     >           - dt * tz1 * ( - c34 * tmp2 * u(3,i,j,k+1) )
+               c(3,2,i,j) = 0.0d+00
+               c(3,3,i,j) = dt * tz2 * ( u(4,i,j,k+1) * tmp1 )
+     >           - dt * tz1 * ( c34 * tmp1 )
+     >           - dt * tz1 * dz3
+               c(3,4,i,j) = dt * tz2 * ( u(3,i,j,k+1) * tmp1 )
+               c(3,5,i,j) = 0.0d+00
+
+               c(4,1,i,j) = dt * tz2
+     >        * ( - ( u(4,i,j,k+1) * tmp1 ) ** 2
+     >            + 0.50d+00 * c2
+     >            * ( ( u(2,i,j,k+1) * u(2,i,j,k+1)
+     >                + u(3,i,j,k+1) * u(3,i,j,k+1)
+     >                + u(4,i,j,k+1) * u(4,i,j,k+1) ) * tmp2 ) )
+     >        - dt * tz1 * ( - r43 * c34 * tmp2 * u(4,i,j,k+1) )
+               c(4,2,i,j) = dt * tz2
+     >             * ( - c2 * ( u(2,i,j,k+1) * tmp1 ) )
+               c(4,3,i,j) = dt * tz2
+     >             * ( - c2 * ( u(3,i,j,k+1) * tmp1 ) )
+               c(4,4,i,j) = dt * tz2 * ( 2.0d+00 - c2 )
+     >             * ( u(4,i,j,k+1) * tmp1 )
+     >             - dt * tz1 * ( r43 * c34 * tmp1 )
+     >             - dt * tz1 * dz4
+               c(4,5,i,j) = dt * tz2 * c2
+
+               c(5,1,i,j) = dt * tz2
+     >     * ( ( c2 * (  u(2,i,j,k+1) * u(2,i,j,k+1)
+     >                 + u(3,i,j,k+1) * u(3,i,j,k+1)
+     >                 + u(4,i,j,k+1) * u(4,i,j,k+1) ) * tmp2
+     >       - c1 * ( u(5,i,j,k+1) * tmp1 ) )
+     >            * ( u(4,i,j,k+1) * tmp1 ) )
+     >       - dt * tz1
+     >       * ( - ( c34 - c1345 ) * tmp3 * (u(2,i,j,k+1)**2)
+     >           - ( c34 - c1345 ) * tmp3 * (u(3,i,j,k+1)**2)
+     >           - ( r43*c34 - c1345 )* tmp3 * (u(4,i,j,k+1)**2)
+     >          - c1345 * tmp2 * u(5,i,j,k+1) )
+               c(5,2,i,j) = dt * tz2
+     >       * ( - c2 * ( u(2,i,j,k+1)*u(4,i,j,k+1) ) * tmp2 )
+     >       - dt * tz1 * ( c34 - c1345 ) * tmp2 * u(2,i,j,k+1)
+               c(5,3,i,j) = dt * tz2
+     >       * ( - c2 * ( u(3,i,j,k+1)*u(4,i,j,k+1) ) * tmp2 )
+     >       - dt * tz1 * ( c34 - c1345 ) * tmp2 * u(3,i,j,k+1)
+               c(5,4,i,j) = dt * tz2
+     >       * ( c1 * ( u(5,i,j,k+1) * tmp1 )
+     >       - 0.50d+00 * c2
+     >       * ( (  u(2,i,j,k+1)*u(2,i,j,k+1)
+     >            + u(3,i,j,k+1)*u(3,i,j,k+1)
+     >            + 3.0d+00*u(4,i,j,k+1)*u(4,i,j,k+1) ) * tmp2 ) )
+     >       - dt * tz1 * ( r43*c34 - c1345 ) * tmp2 * u(4,i,j,k+1)
+               c(5,5,i,j) = dt * tz2
+     >       * ( c1 * ( u(4,i,j,k+1) * tmp1 ) )
+     >       - dt * tz1 * c1345 * tmp1
+     >       - dt * tz1 * dz5
+
+            end do
+         end do
+
+      if (timeron) call timer_stop(t_jacu)
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/l2norm.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/l2norm.f
new file mode 100644
index 0000000..998687f
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/l2norm.f
@@ -0,0 +1,71 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      subroutine l2norm ( ldx, ldy, ldz, 
+     >                    nx0, ny0, nz0,
+     >                    ist, iend, 
+     >                    jst, jend,
+     >                    v, sum )
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   compute the l2-norm of vector v.
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'mpinpb.h'
+      include 'timing.h'
+
+c---------------------------------------------------------------------
+c  input parameters
+c---------------------------------------------------------------------
+      integer ldx, ldy, ldz
+      integer nx0, ny0, nz0
+      integer ist, iend
+      integer jst, jend
+      double precision  v(5,-1:ldx+2,-1:ldy+2,*), sum(5)
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j, k, m
+      double precision  dummy(5)
+
+      integer IERROR
+
+
+      do m = 1, 5
+         dummy(m) = 0.0d+00
+      end do
+
+      do k = 2, nz0-1
+         do j = jst, jend
+            do i = ist, iend
+               do m = 1, 5
+                  dummy(m) = dummy(m) + v(m,i,j,k) * v(m,i,j,k)
+               end do
+            end do
+         end do
+      end do
+
+c---------------------------------------------------------------------
+c   compute the global sum of individual contributions to dot product.
+c---------------------------------------------------------------------
+      if (timeron) call timer_start(t_rcomm)
+      call MPI_ALLREDUCE( dummy,
+     >                    sum,
+     >                    5,
+     >                    dp_type,
+     >                    MPI_SUM,
+     >                    MPI_COMM_WORLD,
+     >                    IERROR )
+      if (timeron) call timer_stop(t_rcomm)
+
+      do m = 1, 5
+         sum(m) = sqrt ( sum(m) / ( (nx0-2)*(ny0-2)*(nz0-2) ) )
+      end do
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/lu.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/lu.f
new file mode 100644
index 0000000..efd2bba
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/lu.f
@@ -0,0 +1,199 @@
+!-------------------------------------------------------------------------!
+!                                                                         !
+!        N  A  S     P A R A L L E L     B E N C H M A R K S  3.3         !
+!                                                                         !
+!                                   L U                                   !
+!                                                                         !
+!-------------------------------------------------------------------------!
+!                                                                         !
+!    This benchmark is part of the NAS Parallel Benchmark 3.3 suite.      !
+!    It is described in NAS Technical Reports 95-020 and 02-007           !
+!                                                                         !
+!    Permission to use, copy, distribute and modify this software         !
+!    for any purpose with or without fee is hereby granted.  We           !
+!    request, however, that all derived work reference the NAS            !
+!    Parallel Benchmarks 3.3. This software is provided "as is"           !
+!    without express or implied warranty.                                 !
+!                                                                         !
+!    Information on NPB 3.3, including the technical report, the          !
+!    original specifications, source code, results and information        !
+!    on how to submit new results, is available at:                       !
+!                                                                         !
+!           http://www.nas.nasa.gov/Software/NPB/                         !
+!                                                                         !
+!    Send comments or suggestions to  npb@nas.nasa.gov                    !
+!                                                                         !
+!          NAS Parallel Benchmarks Group                                  !
+!          NASA Ames Research Center                                      !
+!          Mail Stop: T27A-1                                              !
+!          Moffett Field, CA   94035-1000                                 !
+!                                                                         !
+!          E-mail:  npb@nas.nasa.gov                                      !
+!          Fax:     (650) 604-3957                                        !
+!                                                                         !
+!-------------------------------------------------------------------------!
+
+c---------------------------------------------------------------------
+c
+c Authors: S. Weeratunga
+c          V. Venkatakrishnan
+c          E. Barszcz
+c          M. Yarrow
+c
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+      program applu
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c
+c   driver for the performance evaluation of the solver for
+c   five coupled parabolic/elliptic partial differential equations.
+c
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'mpinpb.h'
+      include 'applu.incl'
+      character class
+      logical verified
+      double precision mflops, timer_read
+      integer i, ierr
+      double precision tsum(t_last+2), t1(t_last+2),
+     >                 tming(t_last+2), tmaxg(t_last+2)
+      character        t_recs(t_last+2)*8
+
+      data t_recs/'total', 'rhs', 'blts', 'buts', 'jacld', 'jacu', 
+     >            'exch', 'lcomm', 'ucomm', 'rcomm',
+     >            ' totcomp', ' totcomm'/
+
+c---------------------------------------------------------------------
+c   initialize communications
+c---------------------------------------------------------------------
+      call init_comm()
+
+c---------------------------------------------------------------------
+c   read input data
+c---------------------------------------------------------------------
+      call read_input()
+
+      do i = 1, t_last
+         call timer_clear(i)
+      end do
+
+c---------------------------------------------------------------------
+c   set up processor grid
+c---------------------------------------------------------------------
+      call proc_grid()
+
+c---------------------------------------------------------------------
+c   determine the neighbors
+c---------------------------------------------------------------------
+      call neighbors()
+
+c---------------------------------------------------------------------
+c   set up sub-domain sizes
+c---------------------------------------------------------------------
+      call subdomain()
+
+c---------------------------------------------------------------------
+c   set up coefficients
+c---------------------------------------------------------------------
+      call setcoeff()
+
+c---------------------------------------------------------------------
+c   set the boundary values for dependent variables
+c---------------------------------------------------------------------
+      call setbv()
+
+c---------------------------------------------------------------------
+c   set the initial values for dependent variables
+c---------------------------------------------------------------------
+      call setiv()
+
+c---------------------------------------------------------------------
+c   compute the forcing term based on prescribed exact solution
+c---------------------------------------------------------------------
+      call erhs()
+
+c---------------------------------------------------------------------
+c   perform one SSOR iteration to touch all data and program pages 
+c---------------------------------------------------------------------
+      call ssor(1)
+
+c---------------------------------------------------------------------
+c   reset the boundary and initial values
+c---------------------------------------------------------------------
+      call setbv()
+      call setiv()
+
+c---------------------------------------------------------------------
+c   perform the SSOR iterations
+c---------------------------------------------------------------------
+      call ssor(itmax)
+
+c---------------------------------------------------------------------
+c   compute the solution error
+c---------------------------------------------------------------------
+      call error()
+
+c---------------------------------------------------------------------
+c   compute the surface integral
+c---------------------------------------------------------------------
+      call pintgr()
+
+c---------------------------------------------------------------------
+c   verification test
+c---------------------------------------------------------------------
+      IF (id.eq.0) THEN
+         call verify ( rsdnm, errnm, frc, class, verified )
+         mflops = float(itmax)*(1984.77*float( nx0 )
+     >        *float( ny0 )
+     >        *float( nz0 )
+     >        -10923.3*(float( nx0+ny0+nz0 )/3.)**2 
+     >        +27770.9* float( nx0+ny0+nz0 )/3.
+     >        -144010.)
+     >        / (maxtime*1000000.)
+
+         call print_results('LU', class, nx0,
+     >     ny0, nz0, itmax, nnodes_compiled,
+     >     num, maxtime, mflops, '          floating point', verified, 
+     >     npbversion, compiletime, cs1, cs2, cs3, cs4, cs5, cs6, 
+     >     '(none)')
+
+      END IF
+
+      if (.not.timeron) goto 999
+
+      do i = 1, t_last
+         t1(i) = timer_read(i)
+      end do
+      t1(t_rhs) = t1(t_rhs) - t1(t_exch)
+      t1(t_last+2) = t1(t_lcomm)+t1(t_ucomm)+t1(t_rcomm)+t1(t_exch)
+      t1(t_last+1) = t1(t_total) - t1(t_last+2)
+
+      call MPI_Reduce(t1, tsum,  t_last+2, dp_type, MPI_SUM, 
+     >                0, MPI_COMM_WORLD, ierr)
+      call MPI_Reduce(t1, tming, t_last+2, dp_type, MPI_MIN, 
+     >                0, MPI_COMM_WORLD, ierr)
+      call MPI_Reduce(t1, tmaxg, t_last+2, dp_type, MPI_MAX, 
+     >                0, MPI_COMM_WORLD, ierr)
+
+      if (id .eq. 0) then
+         write(*, 800) num
+         do i = 1, t_last+2
+            tsum(i) = tsum(i) / num
+            write(*, 810) i, t_recs(i), tming(i), tmaxg(i), tsum(i)
+         end do
+      endif
+ 800  format(' nprocs =', i6, 11x, 'minimum', 5x, 'maximum', 
+     >       5x, 'average')
+ 810  format(' timer ', i2, '(', A8, ') :', 3(2x,f10.4))
+
+ 999  continue
+      call mpi_finalize(ierr)
+      end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/mpinpb.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/mpinpb.h
new file mode 100644
index 0000000..ddbf151
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/mpinpb.h
@@ -0,0 +1,11 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      include 'mpif.h'
+
+      integer           node, no_nodes, root, comm_setup, 
+     >                  comm_solve, comm_rhs, dp_type
+      common /mpistuff/ node, no_nodes, root, comm_setup, 
+     >                  comm_solve, comm_rhs, dp_type
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/neighbors.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/neighbors.f
new file mode 100644
index 0000000..ed8a312
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/neighbors.f
@@ -0,0 +1,48 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine neighbors ()
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c     figure out the neighbors and their wrap numbers for each processor
+c---------------------------------------------------------------------
+
+        south = -1
+        east  = -1
+        north = -1
+        west  = -1
+
+      if (row.gt.1) then
+              north = id -1
+      else
+              north = -1
+      end if
+
+      if (row.lt.xdim) then
+              south = id + 1
+      else
+              south = -1
+      end if
+
+      if (col.gt.1) then
+              west = id- xdim
+      else
+              west = -1
+      end if
+
+      if (col.lt.ydim) then
+              east = id + xdim
+      else 
+              east = -1
+      end if
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/nodedim.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/nodedim.f
new file mode 100644
index 0000000..f4def3a
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/nodedim.f
@@ -0,0 +1,36 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      integer function nodedim(num)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c
+c  compute the exponent where num = 2**nodedim
+c  NOTE: assumes a power-of-two number of nodes
+c
+c---------------------------------------------------------------------
+
+      implicit none
+
+c---------------------------------------------------------------------
+c  input parameters
+c---------------------------------------------------------------------
+      integer num
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      double precision fnum
+
+
+      fnum = dble(num)
+      nodedim = log(fnum)/log(2.0d+0) + 0.00001
+
+      return
+      end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/pintgr.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/pintgr.f
new file mode 100644
index 0000000..de514cc
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/pintgr.f
@@ -0,0 +1,288 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine pintgr
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'mpinpb.h'
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j, k
+      integer ibeg, ifin, ifin1
+      integer jbeg, jfin, jfin1
+      integer iglob, iglob1, iglob2
+      integer jglob, jglob1, jglob2
+      integer ind1, ind2
+      double precision  phi1(0:isiz2+1,0:isiz3+1),
+     >                  phi2(0:isiz2+1,0:isiz3+1)
+      double precision  frc1, frc2, frc3
+      double precision  dummy
+
+      integer IERROR
+
+
+c---------------------------------------------------------------------
+c   set up the sub-domains for integration in each processor
+c---------------------------------------------------------------------
+      ibeg = nx + 1
+      ifin = 0
+      iglob1 = ipt + 1
+      iglob2 = ipt + nx
+      if (iglob1.ge.ii1.and.iglob2.lt.ii2+nx) ibeg = 1
+      if (iglob1.gt.ii1-nx.and.iglob2.le.ii2) ifin = nx
+      if (ii1.ge.iglob1.and.ii1.le.iglob2) ibeg = ii1 - ipt
+      if (ii2.ge.iglob1.and.ii2.le.iglob2) ifin = ii2 - ipt
+      jbeg = ny + 1
+      jfin = 0
+      jglob1 = jpt + 1
+      jglob2 = jpt + ny
+      if (jglob1.ge.ji1.and.jglob2.lt.ji2+ny) jbeg = 1
+      if (jglob1.gt.ji1-ny.and.jglob2.le.ji2) jfin = ny
+      if (ji1.ge.jglob1.and.ji1.le.jglob2) jbeg = ji1 - jpt
+      if (ji2.ge.jglob1.and.ji2.le.jglob2) jfin = ji2 - jpt
+      ifin1 = ifin
+      jfin1 = jfin
+      if (ipt + ifin1.eq.ii2) ifin1 = ifin -1
+      if (jpt + jfin1.eq.ji2) jfin1 = jfin -1
+
+c---------------------------------------------------------------------
+c   initialize
+c---------------------------------------------------------------------
+      do i = 0,isiz2+1
+        do k = 0,isiz3+1
+          phi1(i,k) = 0.
+          phi2(i,k) = 0.
+        end do
+      end do
+
+      do j = jbeg,jfin
+         jglob = jpt + j
+         do i = ibeg,ifin
+            iglob = ipt + i
+
+            k = ki1
+
+            phi1(i,j) = c2*(  u(5,i,j,k)
+     >           - 0.50d+00 * (  u(2,i,j,k) ** 2
+     >                         + u(3,i,j,k) ** 2
+     >                         + u(4,i,j,k) ** 2 )
+     >                        / u(1,i,j,k) )
+
+            k = ki2
+
+            phi2(i,j) = c2*(  u(5,i,j,k)
+     >           - 0.50d+00 * (  u(2,i,j,k) ** 2
+     >                         + u(3,i,j,k) ** 2
+     >                         + u(4,i,j,k) ** 2 )
+     >                        / u(1,i,j,k) )
+         end do
+      end do
+
+c---------------------------------------------------------------------
+c  communicate in i and j directions
+c---------------------------------------------------------------------
+      call exchange_4(phi1,phi2,ibeg,ifin1,jbeg,jfin1)
+
+      frc1 = 0.0d+00
+
+      do j = jbeg,jfin1
+         do i = ibeg, ifin1
+            frc1 = frc1 + (  phi1(i,j)
+     >                     + phi1(i+1,j)
+     >                     + phi1(i,j+1)
+     >                     + phi1(i+1,j+1)
+     >                     + phi2(i,j)
+     >                     + phi2(i+1,j)
+     >                     + phi2(i,j+1)
+     >                     + phi2(i+1,j+1) )
+         end do
+      end do
+
+c---------------------------------------------------------------------
+c  compute the global sum of individual contributions to frc1
+c---------------------------------------------------------------------
+      dummy = frc1
+      call MPI_ALLREDUCE( dummy,
+     >                    frc1,
+     >                    1,
+     >                    dp_type,
+     >                    MPI_SUM,
+     >                    MPI_COMM_WORLD,
+     >                    IERROR )
+
+      frc1 = dxi * deta * frc1
+
+c---------------------------------------------------------------------
+c   initialize
+c---------------------------------------------------------------------
+      do i = 0,isiz2+1
+        do k = 0,isiz3+1
+          phi1(i,k) = 0.
+          phi2(i,k) = 0.
+        end do
+      end do
+      jglob = jpt + jbeg
+      ind1 = 0
+      if (jglob.eq.ji1) then
+        ind1 = 1
+        do k = ki1, ki2
+           do i = ibeg, ifin
+              iglob = ipt + i
+              phi1(i,k) = c2*(  u(5,i,jbeg,k)
+     >             - 0.50d+00 * (  u(2,i,jbeg,k) ** 2
+     >                           + u(3,i,jbeg,k) ** 2
+     >                           + u(4,i,jbeg,k) ** 2 )
+     >                          / u(1,i,jbeg,k) )
+           end do
+        end do
+      end if
+
+      jglob = jpt + jfin
+      ind2 = 0
+      if (jglob.eq.ji2) then
+        ind2 = 1
+        do k = ki1, ki2
+           do i = ibeg, ifin
+              iglob = ipt + i
+              phi2(i,k) = c2*(  u(5,i,jfin,k)
+     >             - 0.50d+00 * (  u(2,i,jfin,k) ** 2
+     >                           + u(3,i,jfin,k) ** 2
+     >                           + u(4,i,jfin,k) ** 2 )
+     >                          / u(1,i,jfin,k) )
+           end do
+        end do
+      end if
+
+c---------------------------------------------------------------------
+c  communicate in i direction
+c---------------------------------------------------------------------
+      if (ind1.eq.1) then
+        call exchange_5(phi1,ibeg,ifin1)
+      end if
+      if (ind2.eq.1) then
+        call exchange_5 (phi2,ibeg,ifin1)
+      end if
+
+      frc2 = 0.0d+00
+      do k = ki1, ki2-1
+         do i = ibeg, ifin1
+            frc2 = frc2 + (  phi1(i,k)
+     >                     + phi1(i+1,k)
+     >                     + phi1(i,k+1)
+     >                     + phi1(i+1,k+1)
+     >                     + phi2(i,k)
+     >                     + phi2(i+1,k)
+     >                     + phi2(i,k+1)
+     >                     + phi2(i+1,k+1) )
+         end do
+      end do
+
+c---------------------------------------------------------------------
+c  compute the global sum of individual contributions to frc2
+c---------------------------------------------------------------------
+      dummy = frc2
+      call MPI_ALLREDUCE( dummy,
+     >                    frc2,
+     >                    1,
+     >                    dp_type,
+     >                    MPI_SUM,
+     >                    MPI_COMM_WORLD,
+     >                    IERROR )
+
+      frc2 = dxi * dzeta * frc2
+
+c---------------------------------------------------------------------
+c   initialize
+c---------------------------------------------------------------------
+      do i = 0,isiz2+1
+        do k = 0,isiz3+1
+          phi1(i,k) = 0.
+          phi2(i,k) = 0.
+        end do
+      end do
+      iglob = ipt + ibeg
+      ind1 = 0
+      if (iglob.eq.ii1) then
+        ind1 = 1
+        do k = ki1, ki2
+           do j = jbeg, jfin
+              jglob = jpt + j
+              phi1(j,k) = c2*(  u(5,ibeg,j,k)
+     >             - 0.50d+00 * (  u(2,ibeg,j,k) ** 2
+     >                           + u(3,ibeg,j,k) ** 2
+     >                           + u(4,ibeg,j,k) ** 2 )
+     >                          / u(1,ibeg,j,k) )
+           end do
+        end do
+      end if
+
+      iglob = ipt + ifin
+      ind2 = 0
+      if (iglob.eq.ii2) then
+        ind2 = 1
+        do k = ki1, ki2
+           do j = jbeg, jfin
+              jglob = jpt + j
+              phi2(j,k) = c2*(  u(5,ifin,j,k)
+     >             - 0.50d+00 * (  u(2,ifin,j,k) ** 2
+     >                           + u(3,ifin,j,k) ** 2
+     >                           + u(4,ifin,j,k) ** 2 )
+     >                          / u(1,ifin,j,k) )
+           end do
+        end do
+      end if
+
+c---------------------------------------------------------------------
+c  communicate in j direction
+c---------------------------------------------------------------------
+      if (ind1.eq.1) then
+        call exchange_6(phi1,jbeg,jfin1)
+      end if
+      if (ind2.eq.1) then
+        call exchange_6(phi2,jbeg,jfin1)
+      end if
+
+      frc3 = 0.0d+00
+
+      do k = ki1, ki2-1
+         do j = jbeg, jfin1
+            frc3 = frc3 + (  phi1(j,k)
+     >                     + phi1(j+1,k)
+     >                     + phi1(j,k+1)
+     >                     + phi1(j+1,k+1)
+     >                     + phi2(j,k)
+     >                     + phi2(j+1,k)
+     >                     + phi2(j,k+1)
+     >                     + phi2(j+1,k+1) )
+         end do
+      end do
+
+c---------------------------------------------------------------------
+c  compute the global sum of individual contributions to frc3
+c---------------------------------------------------------------------
+      dummy = frc3
+      call MPI_ALLREDUCE( dummy,
+     >                    frc3,
+     >                    1,
+     >                    dp_type,
+     >                    MPI_SUM,
+     >                    MPI_COMM_WORLD,
+     >                    IERROR )
+
+      frc3 = deta * dzeta * frc3
+      frc = 0.25d+00 * ( frc1 + frc2 + frc3 )
+c      if (id.eq.0) write (*,1001) frc
+
+      return
+
+ 1001 format (//5x,'surface integral = ',1pe12.5//)
+
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/proc_grid.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/proc_grid.f
new file mode 100644
index 0000000..d0f5037
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/proc_grid.f
@@ -0,0 +1,55 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine proc_grid
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'mpinpb.h'
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer xdim0, ydim0, IERROR
+
+c---------------------------------------------------------------------
+c
+c   set up a two-d grid for processors: column-major ordering of unknowns
+c
+c---------------------------------------------------------------------
+
+      xdim0  = nnodes_xdim
+      ydim0  = nnodes_compiled/xdim0
+
+      ydim   = dsqrt(dble(num))+0.001d0
+      xdim   = num/ydim
+      do while (ydim .ge. ydim0 .and. xdim*ydim .ne. num)
+         ydim = ydim - 1
+         xdim = num/ydim
+      end do
+
+      if (xdim .lt. xdim0 .or. ydim .lt. ydim0 .or. 
+     &    xdim*ydim .ne. num) then
+         if (id .eq. 0) write(*,2000) num
+2000     format(' Error: couldn''t determine proper proc_grid',
+     &          ' for nprocs=', i6)
+         CALL MPI_ABORT( MPI_COMM_WORLD, MPI_ERR_OTHER, IERROR )
+      endif
+
+      if (id .eq. 0 .and. num .ne. 2**ndim)
+     &   write(*,2100) num, xdim, ydim
+2100  format(' Proc_grid for nprocs =',i6,':',i5,' x',i5)
+
+      row    = mod(id,xdim) + 1
+      col    = id/xdim + 1
+
+
+      return
+      end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/read_input.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/read_input.f
new file mode 100644
index 0000000..cda98bf
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/read_input.f
@@ -0,0 +1,134 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine read_input
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'mpinpb.h'
+      include 'applu.incl'
+
+      integer IERROR, fstatus, nnodes
+
+
+c---------------------------------------------------------------------
+c    only root reads the input file
+c    if input file does not exist, it uses defaults
+c       ipr = 1 for detailed progress output
+c       inorm = how often the norm is printed (once every inorm iterations)
+c       itmax = number of pseudo time steps
+c       dt = time step
+c       omega = over-relaxation factor for SSOR
+c       tolrsd = steady state residual tolerance levels
+c       nx, ny, nz = number of grid points in x, y, z directions
+c---------------------------------------------------------------------
+      ROOT = 0
+      if (id .eq. ROOT) then
+
+         write(*, 1000)
+
+         open (unit=3,file='timer.flag',status='old',iostat=fstatus)
+         timeron = .false.
+         if (fstatus .eq. 0) then
+            timeron = .true.
+            close(3)
+         endif
+
+         open (unit=3,file='inputlu.data',status='old',
+     >         access='sequential',form='formatted', iostat=fstatus)
+         if (fstatus .eq. 0) then
+
+            write(*, *) 'Reading from input file inputlu.data'
+
+            read (3,*)
+            read (3,*)
+            read (3,*) ipr, inorm
+            read (3,*)
+            read (3,*)
+            read (3,*) itmax
+            read (3,*)
+            read (3,*)
+            read (3,*) dt
+            read (3,*)
+            read (3,*)
+            read (3,*) omega
+            read (3,*)
+            read (3,*)
+            read (3,*) tolrsd(1),tolrsd(2),tolrsd(3),tolrsd(4),tolrsd(5)
+            read (3,*)
+            read (3,*)
+            read (3,*) nx0, ny0, nz0
+            close(3)
+         else
+            ipr = ipr_default
+            inorm = inorm_default
+            itmax = itmax_default
+            dt = dt_default
+            omega = omega_default
+            tolrsd(1) = tolrsd1_def
+            tolrsd(2) = tolrsd2_def
+            tolrsd(3) = tolrsd3_def
+            tolrsd(4) = tolrsd4_def
+            tolrsd(5) = tolrsd5_def
+            nx0 = isiz01
+            ny0 = isiz02
+            nz0 = isiz03
+         endif
+
+c---------------------------------------------------------------------
+c   check problem size
+c---------------------------------------------------------------------
+         call MPI_COMM_SIZE(MPI_COMM_WORLD, nnodes, ierror)
+         if (nnodes .ne. nnodes_compiled) then
+            write (*, 2000) nnodes, nnodes_compiled
+ 2000       format (5x,'Warning: program is running on',i5,' processors'
+     >             /5x,'but was compiled for ', i5)
+         endif
+
+         if ( ( nx0 .lt. 4 ) .or.
+     >        ( ny0 .lt. 4 ) .or.
+     >        ( nz0 .lt. 4 ) ) then
+
+            write (*,2001)
+ 2001       format (5x,'PROBLEM SIZE IS TOO SMALL - ',
+     >           /5x,'SET EACH OF NX, NY AND NZ AT LEAST EQUAL TO 4')
+            CALL MPI_ABORT( MPI_COMM_WORLD, MPI_ERR_OTHER, IERROR )
+
+         end if
+
+         if ( ( nx0 .gt. isiz01 ) .or.
+     >        ( ny0 .gt. isiz02 ) .or.
+     >        ( nz0 .gt. isiz03 ) ) then
+
+            write (*,2002)
+ 2002       format (5x,'PROBLEM SIZE IS TOO LARGE - ',
+     >           /5x,'NX, NY AND NZ SHOULD BE LESS THAN OR EQUAL TO ',
+     >           /5x,'ISIZ01, ISIZ02 AND ISIZ03 RESPECTIVELY')
+            CALL MPI_ABORT( MPI_COMM_WORLD, MPI_ERR_OTHER, IERROR )
+
+         end if
+
+
+         write(*, 1001) nx0, ny0, nz0
+         write(*, 1002) itmax
+         write(*, 1003) nnodes
+
+ 1000 format(//, ' NAS Parallel Benchmarks 3.3 -- LU Benchmark',/)
+ 1001    format(' Size: ', i4, 'x', i4, 'x', i4)
+ 1002    format(' Iterations: ', i4)
+ 1003    format(' Number of processes: ', i5, /)
+         
+
+
+      end if
+
+      call bcast_inputs
+
+      return
+      end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/rhs.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/rhs.f
new file mode 100644
index 0000000..e32df4d
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/rhs.f
@@ -0,0 +1,511 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine rhs
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   compute the right hand sides
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j, k, m
+      integer iex
+      integer L1, L2
+      integer ist1, iend1
+      integer jst1, jend1
+      double precision  q
+      double precision  u21, u31, u41
+      double precision  tmp
+      double precision  u21i, u31i, u41i, u51i
+      double precision  u21j, u31j, u41j, u51j
+      double precision  u21k, u31k, u41k, u51k
+      double precision  u21im1, u31im1, u41im1, u51im1
+      double precision  u21jm1, u31jm1, u41jm1, u51jm1
+      double precision  u21km1, u31km1, u41km1, u51km1
+
+
+      if (timeron) call timer_start(t_rhs)
+
+      do k = 1, nz
+         do j = 1, ny
+            do i = 1, nx
+               do m = 1, 5
+                  rsd(m,i,j,k) = - frct(m,i,j,k)
+               end do
+            end do
+         end do
+      end do
+
+c---------------------------------------------------------------------
+c   xi-direction flux differences
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   iex = flag : iex = 0  north/south communication
+c              : iex = 1  east/west communication
+c---------------------------------------------------------------------
+      iex   = 0
+
+c---------------------------------------------------------------------
+c   communicate and receive/send two rows of data
+c---------------------------------------------------------------------
+      if (timeron) call timer_start(t_exch)
+      call exchange_3(u,iex)
+      if (timeron) call timer_stop(t_exch)
+
+      L1 = 0
+      if (north.eq.-1) L1 = 1
+      L2 = nx + 1
+      if (south.eq.-1) L2 = nx
+
+      ist1 = 1
+      iend1 = nx
+      if (north.eq.-1) ist1 = 4
+      if (south.eq.-1) iend1 = nx - 3
+
+      do k = 2, nz - 1
+         do j = jst, jend
+            do i = L1, L2
+               flux(1,i,j,k) = u(2,i,j,k)
+               u21 = u(2,i,j,k) / u(1,i,j,k)
+
+               q = 0.50d+00 * (  u(2,i,j,k) * u(2,i,j,k)
+     >                         + u(3,i,j,k) * u(3,i,j,k)
+     >                         + u(4,i,j,k) * u(4,i,j,k) )
+     >                      / u(1,i,j,k)
+
+               flux(2,i,j,k) = u(2,i,j,k) * u21 + c2 * 
+     >                        ( u(5,i,j,k) - q )
+               flux(3,i,j,k) = u(3,i,j,k) * u21
+               flux(4,i,j,k) = u(4,i,j,k) * u21
+               flux(5,i,j,k) = ( c1 * u(5,i,j,k) - c2 * q ) * u21
+            end do
+
+            do i = ist, iend
+               do m = 1, 5
+                  rsd(m,i,j,k) =  rsd(m,i,j,k)
+     >                 - tx2 * ( flux(m,i+1,j,k) - flux(m,i-1,j,k) )
+               end do
+            end do
+
+            do i = ist, L2
+               tmp = 1.0d+00 / u(1,i,j,k)
+
+               u21i = tmp * u(2,i,j,k)
+               u31i = tmp * u(3,i,j,k)
+               u41i = tmp * u(4,i,j,k)
+               u51i = tmp * u(5,i,j,k)
+
+               tmp = 1.0d+00 / u(1,i-1,j,k)
+
+               u21im1 = tmp * u(2,i-1,j,k)
+               u31im1 = tmp * u(3,i-1,j,k)
+               u41im1 = tmp * u(4,i-1,j,k)
+               u51im1 = tmp * u(5,i-1,j,k)
+
+               flux(2,i,j,k) = (4.0d+00/3.0d+00) * tx3 * (u21i-u21im1)
+               flux(3,i,j,k) = tx3 * ( u31i - u31im1 )
+               flux(4,i,j,k) = tx3 * ( u41i - u41im1 )
+               flux(5,i,j,k) = 0.50d+00 * ( 1.0d+00 - c1*c5 )
+     >              * tx3 * ( ( u21i  **2 + u31i  **2 + u41i  **2 )
+     >                      - ( u21im1**2 + u31im1**2 + u41im1**2 ) )
+     >              + (1.0d+00/6.0d+00)
+     >              * tx3 * ( u21i**2 - u21im1**2 )
+     >              + c1 * c5 * tx3 * ( u51i - u51im1 )
+            end do
+
+            do i = ist, iend
+               rsd(1,i,j,k) = rsd(1,i,j,k)
+     >              + dx1 * tx1 * (            u(1,i-1,j,k)
+     >                             - 2.0d+00 * u(1,i,j,k)
+     >                             +           u(1,i+1,j,k) )
+               rsd(2,i,j,k) = rsd(2,i,j,k)
+     >          + tx3 * c3 * c4 * ( flux(2,i+1,j,k) - flux(2,i,j,k) )
+     >              + dx2 * tx1 * (            u(2,i-1,j,k)
+     >                             - 2.0d+00 * u(2,i,j,k)
+     >                             +           u(2,i+1,j,k) )
+               rsd(3,i,j,k) = rsd(3,i,j,k)
+     >          + tx3 * c3 * c4 * ( flux(3,i+1,j,k) - flux(3,i,j,k) )
+     >              + dx3 * tx1 * (            u(3,i-1,j,k)
+     >                             - 2.0d+00 * u(3,i,j,k)
+     >                             +           u(3,i+1,j,k) )
+               rsd(4,i,j,k) = rsd(4,i,j,k)
+     >          + tx3 * c3 * c4 * ( flux(4,i+1,j,k) - flux(4,i,j,k) )
+     >              + dx4 * tx1 * (            u(4,i-1,j,k)
+     >                             - 2.0d+00 * u(4,i,j,k)
+     >                             +           u(4,i+1,j,k) )
+               rsd(5,i,j,k) = rsd(5,i,j,k)
+     >          + tx3 * c3 * c4 * ( flux(5,i+1,j,k) - flux(5,i,j,k) )
+     >              + dx5 * tx1 * (            u(5,i-1,j,k)
+     >                             - 2.0d+00 * u(5,i,j,k)
+     >                             +           u(5,i+1,j,k) )
+            end do
+
+c---------------------------------------------------------------------
+c   Fourth-order dissipation
+c---------------------------------------------------------------------
+            IF (north.eq.-1) then
+             do m = 1, 5
+               rsd(m,2,j,k) = rsd(m,2,j,k)
+     >           - dssp * ( + 5.0d+00 * u(m,2,j,k)
+     >                      - 4.0d+00 * u(m,3,j,k)
+     >                      +           u(m,4,j,k) )
+               rsd(m,3,j,k) = rsd(m,3,j,k)
+     >           - dssp * ( - 4.0d+00 * u(m,2,j,k)
+     >                      + 6.0d+00 * u(m,3,j,k)
+     >                      - 4.0d+00 * u(m,4,j,k)
+     >                      +           u(m,5,j,k) )
+             end do
+            END IF
+
+            do i = ist1,iend1
+               do m = 1, 5
+                  rsd(m,i,j,k) = rsd(m,i,j,k)
+     >              - dssp * (            u(m,i-2,j,k)
+     >                        - 4.0d+00 * u(m,i-1,j,k)
+     >                        + 6.0d+00 * u(m,i,j,k)
+     >                        - 4.0d+00 * u(m,i+1,j,k)
+     >                        +           u(m,i+2,j,k) )
+               end do
+            end do
+
+            IF (south.eq.-1) then
+             do m = 1, 5
+               rsd(m,nx-2,j,k) = rsd(m,nx-2,j,k)
+     >           - dssp * (             u(m,nx-4,j,k)
+     >                      - 4.0d+00 * u(m,nx-3,j,k)
+     >                      + 6.0d+00 * u(m,nx-2,j,k)
+     >                      - 4.0d+00 * u(m,nx-1,j,k)  )
+               rsd(m,nx-1,j,k) = rsd(m,nx-1,j,k)
+     >           - dssp * (             u(m,nx-3,j,k)
+     >                      - 4.0d+00 * u(m,nx-2,j,k)
+     >                      + 5.0d+00 * u(m,nx-1,j,k) )
+             end do
+            END IF
+
+         end do
+      end do 
+
+c---------------------------------------------------------------------
+c   eta-direction flux differences
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   iex = flag : iex = 1  east/west communication
+c---------------------------------------------------------------------
+      iex   = 1
+
+c---------------------------------------------------------------------
+c   communicate and receive/send two rows of data
+c---------------------------------------------------------------------
+      if (timeron) call timer_start(t_exch)
+      call exchange_3(u,iex)
+      if (timeron) call timer_stop(t_exch)
+
+      L1 = 0
+      if (west.eq.-1) L1 = 1
+      L2 = ny + 1
+      if (east.eq.-1) L2 = ny
+
+      jst1 = 1
+      jend1 = ny
+      if (west.eq.-1) jst1 = 4
+      if (east.eq.-1) jend1 = ny - 3
+
+      do k = 2, nz - 1
+         do j = L1, L2
+            do i = ist, iend
+               flux(1,i,j,k) = u(3,i,j,k)
+               u31 = u(3,i,j,k) / u(1,i,j,k)
+
+               q = 0.50d+00 * (  u(2,i,j,k) * u(2,i,j,k)
+     >                         + u(3,i,j,k) * u(3,i,j,k)
+     >                         + u(4,i,j,k) * u(4,i,j,k) )
+     >                      / u(1,i,j,k)
+
+               flux(2,i,j,k) = u(2,i,j,k) * u31 
+               flux(3,i,j,k) = u(3,i,j,k) * u31 + c2 * (u(5,i,j,k)-q)
+               flux(4,i,j,k) = u(4,i,j,k) * u31
+               flux(5,i,j,k) = ( c1 * u(5,i,j,k) - c2 * q ) * u31
+            end do
+         end do
+
+         do j = jst, jend
+            do i = ist, iend
+               do m = 1, 5
+                  rsd(m,i,j,k) =  rsd(m,i,j,k)
+     >                   - ty2 * ( flux(m,i,j+1,k) - flux(m,i,j-1,k) )
+               end do
+            end do
+         end do
+
+         do j = jst, L2
+            do i = ist, iend
+               tmp = 1.0d+00 / u(1,i,j,k)
+
+               u21j = tmp * u(2,i,j,k)
+               u31j = tmp * u(3,i,j,k)
+               u41j = tmp * u(4,i,j,k)
+               u51j = tmp * u(5,i,j,k)
+
+               tmp = 1.0d+00 / u(1,i,j-1,k)
+               u21jm1 = tmp * u(2,i,j-1,k)
+               u31jm1 = tmp * u(3,i,j-1,k)
+               u41jm1 = tmp * u(4,i,j-1,k)
+               u51jm1 = tmp * u(5,i,j-1,k)
+
+               flux(2,i,j,k) = ty3 * ( u21j - u21jm1 )
+               flux(3,i,j,k) = (4.0d+00/3.0d+00) * ty3 * (u31j-u31jm1)
+               flux(4,i,j,k) = ty3 * ( u41j - u41jm1 )
+               flux(5,i,j,k) = 0.50d+00 * ( 1.0d+00 - c1*c5 )
+     >              * ty3 * ( ( u21j  **2 + u31j  **2 + u41j  **2 )
+     >                      - ( u21jm1**2 + u31jm1**2 + u41jm1**2 ) )
+     >              + (1.0d+00/6.0d+00)
+     >              * ty3 * ( u31j**2 - u31jm1**2 )
+     >              + c1 * c5 * ty3 * ( u51j - u51jm1 )
+            end do
+         end do
+
+         do j = jst, jend
+            do i = ist, iend
+
+               rsd(1,i,j,k) = rsd(1,i,j,k)
+     >              + dy1 * ty1 * (            u(1,i,j-1,k)
+     >                             - 2.0d+00 * u(1,i,j,k)
+     >                             +           u(1,i,j+1,k) )
+
+               rsd(2,i,j,k) = rsd(2,i,j,k)
+     >          + ty3 * c3 * c4 * ( flux(2,i,j+1,k) - flux(2,i,j,k) )
+     >              + dy2 * ty1 * (            u(2,i,j-1,k)
+     >                             - 2.0d+00 * u(2,i,j,k)
+     >                             +           u(2,i,j+1,k) )
+
+               rsd(3,i,j,k) = rsd(3,i,j,k)
+     >          + ty3 * c3 * c4 * ( flux(3,i,j+1,k) - flux(3,i,j,k) )
+     >              + dy3 * ty1 * (            u(3,i,j-1,k)
+     >                             - 2.0d+00 * u(3,i,j,k)
+     >                             +           u(3,i,j+1,k) )
+
+               rsd(4,i,j,k) = rsd(4,i,j,k)
+     >          + ty3 * c3 * c4 * ( flux(4,i,j+1,k) - flux(4,i,j,k) )
+     >              + dy4 * ty1 * (            u(4,i,j-1,k)
+     >                             - 2.0d+00 * u(4,i,j,k)
+     >                             +           u(4,i,j+1,k) )
+
+               rsd(5,i,j,k) = rsd(5,i,j,k)
+     >          + ty3 * c3 * c4 * ( flux(5,i,j+1,k) - flux(5,i,j,k) )
+     >              + dy5 * ty1 * (            u(5,i,j-1,k)
+     >                             - 2.0d+00 * u(5,i,j,k)
+     >                             +           u(5,i,j+1,k) )
+
+            end do
+         end do
+
+c---------------------------------------------------------------------
+c   fourth-order dissipation
+c---------------------------------------------------------------------
+         IF (west.eq.-1) then
+            do i = ist, iend
+             do m = 1, 5
+               rsd(m,i,2,k) = rsd(m,i,2,k)
+     >           - dssp * ( + 5.0d+00 * u(m,i,2,k)
+     >                      - 4.0d+00 * u(m,i,3,k)
+     >                      +           u(m,i,4,k) )
+               rsd(m,i,3,k) = rsd(m,i,3,k)
+     >           - dssp * ( - 4.0d+00 * u(m,i,2,k)
+     >                      + 6.0d+00 * u(m,i,3,k)
+     >                      - 4.0d+00 * u(m,i,4,k)
+     >                      +           u(m,i,5,k) )
+             end do
+            end do
+         END IF
+
+         do j = jst1, jend1
+            do i = ist, iend
+               do m = 1, 5
+                  rsd(m,i,j,k) = rsd(m,i,j,k)
+     >              - dssp * (            u(m,i,j-2,k)
+     >                        - 4.0d+00 * u(m,i,j-1,k)
+     >                        + 6.0d+00 * u(m,i,j,k)
+     >                        - 4.0d+00 * u(m,i,j+1,k)
+     >                        +           u(m,i,j+2,k) )
+               end do
+            end do
+         end do
+
+         IF (east.eq.-1) then
+            do i = ist, iend
+             do m = 1, 5
+               rsd(m,i,ny-2,k) = rsd(m,i,ny-2,k)
+     >           - dssp * (             u(m,i,ny-4,k)
+     >                      - 4.0d+00 * u(m,i,ny-3,k)
+     >                      + 6.0d+00 * u(m,i,ny-2,k)
+     >                      - 4.0d+00 * u(m,i,ny-1,k)  )
+               rsd(m,i,ny-1,k) = rsd(m,i,ny-1,k)
+     >           - dssp * (             u(m,i,ny-3,k)
+     >                      - 4.0d+00 * u(m,i,ny-2,k)
+     >                      + 5.0d+00 * u(m,i,ny-1,k) )
+             end do
+            end do
+         END IF
+
+      end do
+
+c---------------------------------------------------------------------
+c   zeta-direction flux differences
+c---------------------------------------------------------------------
+      do k = 1, nz
+         do j = jst, jend
+            do i = ist, iend
+               flux(1,i,j,k) = u(4,i,j,k)
+               u41 = u(4,i,j,k) / u(1,i,j,k)
+
+               q = 0.50d+00 * (  u(2,i,j,k) * u(2,i,j,k)
+     >                         + u(3,i,j,k) * u(3,i,j,k)
+     >                         + u(4,i,j,k) * u(4,i,j,k) )
+     >                      / u(1,i,j,k)
+
+               flux(2,i,j,k) = u(2,i,j,k) * u41 
+               flux(3,i,j,k) = u(3,i,j,k) * u41 
+               flux(4,i,j,k) = u(4,i,j,k) * u41 + c2 * (u(5,i,j,k)-q)
+               flux(5,i,j,k) = ( c1 * u(5,i,j,k) - c2 * q ) * u41
+            end do
+         end do
+      end do
+
+      do k = 2, nz - 1
+         do j = jst, jend
+            do i = ist, iend
+               do m = 1, 5
+                  rsd(m,i,j,k) =  rsd(m,i,j,k)
+     >                - tz2 * ( flux(m,i,j,k+1) - flux(m,i,j,k-1) )
+               end do
+            end do
+         end do
+      end do
+
+      do k = 2, nz
+         do j = jst, jend
+            do i = ist, iend
+               tmp = 1.0d+00 / u(1,i,j,k)
+
+               u21k = tmp * u(2,i,j,k)
+               u31k = tmp * u(3,i,j,k)
+               u41k = tmp * u(4,i,j,k)
+               u51k = tmp * u(5,i,j,k)
+
+               tmp = 1.0d+00 / u(1,i,j,k-1)
+
+               u21km1 = tmp * u(2,i,j,k-1)
+               u31km1 = tmp * u(3,i,j,k-1)
+               u41km1 = tmp * u(4,i,j,k-1)
+               u51km1 = tmp * u(5,i,j,k-1)
+
+               flux(2,i,j,k) = tz3 * ( u21k - u21km1 )
+               flux(3,i,j,k) = tz3 * ( u31k - u31km1 )
+               flux(4,i,j,k) = (4.0d+00/3.0d+00) * tz3 * (u41k-u41km1)
+               flux(5,i,j,k) = 0.50d+00 * ( 1.0d+00 - c1*c5 )
+     >              * tz3 * ( ( u21k  **2 + u31k  **2 + u41k  **2 )
+     >                      - ( u21km1**2 + u31km1**2 + u41km1**2 ) )
+     >              + (1.0d+00/6.0d+00)
+     >              * tz3 * ( u41k**2 - u41km1**2 )
+     >              + c1 * c5 * tz3 * ( u51k - u51km1 )
+            end do
+         end do
+      end do
+
+      do k = 2, nz - 1
+         do j = jst, jend
+            do i = ist, iend
+               rsd(1,i,j,k) = rsd(1,i,j,k)
+     >              + dz1 * tz1 * (            u(1,i,j,k-1)
+     >                             - 2.0d+00 * u(1,i,j,k)
+     >                             +           u(1,i,j,k+1) )
+               rsd(2,i,j,k) = rsd(2,i,j,k)
+     >          + tz3 * c3 * c4 * ( flux(2,i,j,k+1) - flux(2,i,j,k) )
+     >              + dz2 * tz1 * (            u(2,i,j,k-1)
+     >                             - 2.0d+00 * u(2,i,j,k)
+     >                             +           u(2,i,j,k+1) )
+               rsd(3,i,j,k) = rsd(3,i,j,k)
+     >          + tz3 * c3 * c4 * ( flux(3,i,j,k+1) - flux(3,i,j,k) )
+     >              + dz3 * tz1 * (            u(3,i,j,k-1)
+     >                             - 2.0d+00 * u(3,i,j,k)
+     >                             +           u(3,i,j,k+1) )
+               rsd(4,i,j,k) = rsd(4,i,j,k)
+     >          + tz3 * c3 * c4 * ( flux(4,i,j,k+1) - flux(4,i,j,k) )
+     >              + dz4 * tz1 * (            u(4,i,j,k-1)
+     >                             - 2.0d+00 * u(4,i,j,k)
+     >                             +           u(4,i,j,k+1) )
+               rsd(5,i,j,k) = rsd(5,i,j,k)
+     >          + tz3 * c3 * c4 * ( flux(5,i,j,k+1) - flux(5,i,j,k) )
+     >              + dz5 * tz1 * (            u(5,i,j,k-1)
+     >                             - 2.0d+00 * u(5,i,j,k)
+     >                             +           u(5,i,j,k+1) )
+            end do
+         end do
+      end do
+
+c---------------------------------------------------------------------
+c   fourth-order dissipation
+c---------------------------------------------------------------------
+      do j = jst, jend
+         do i = ist, iend
+            do m = 1, 5
+               rsd(m,i,j,2) = rsd(m,i,j,2)
+     >           - dssp * ( + 5.0d+00 * u(m,i,j,2)
+     >                      - 4.0d+00 * u(m,i,j,3)
+     >                      +           u(m,i,j,4) )
+               rsd(m,i,j,3) = rsd(m,i,j,3)
+     >           - dssp * ( - 4.0d+00 * u(m,i,j,2)
+     >                      + 6.0d+00 * u(m,i,j,3)
+     >                      - 4.0d+00 * u(m,i,j,4)
+     >                      +           u(m,i,j,5) )
+            end do
+         end do
+      end do
+
+      do k = 4, nz - 3
+         do j = jst, jend
+            do i = ist, iend
+               do m = 1, 5
+                  rsd(m,i,j,k) = rsd(m,i,j,k)
+     >              - dssp * (            u(m,i,j,k-2)
+     >                        - 4.0d+00 * u(m,i,j,k-1)
+     >                        + 6.0d+00 * u(m,i,j,k)
+     >                        - 4.0d+00 * u(m,i,j,k+1)
+     >                        +           u(m,i,j,k+2) )
+               end do
+            end do
+         end do
+      end do
+
+      do j = jst, jend
+         do i = ist, iend
+            do m = 1, 5
+               rsd(m,i,j,nz-2) = rsd(m,i,j,nz-2)
+     >           - dssp * (             u(m,i,j,nz-4)
+     >                      - 4.0d+00 * u(m,i,j,nz-3)
+     >                      + 6.0d+00 * u(m,i,j,nz-2)
+     >                      - 4.0d+00 * u(m,i,j,nz-1)  )
+               rsd(m,i,j,nz-1) = rsd(m,i,j,nz-1)
+     >           - dssp * (             u(m,i,j,nz-3)
+     >                      - 4.0d+00 * u(m,i,j,nz-2)
+     >                      + 5.0d+00 * u(m,i,j,nz-1) )
+            end do
+         end do
+      end do
+
+      if (timeron) call timer_stop(t_rhs)
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/setbv.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/setbv.f
new file mode 100644
index 0000000..56b0edf
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/setbv.f
@@ -0,0 +1,79 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine setbv
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   set the boundary values of dependent variables
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c   local variables
+c---------------------------------------------------------------------
+      integer i, j, k
+      integer iglob, jglob
+
+c---------------------------------------------------------------------
+c   set the dependent variable values along the top and bottom faces
+c---------------------------------------------------------------------
+      do j = 1, ny
+         jglob = jpt + j
+         do i = 1, nx
+            iglob = ipt + i
+            call exact( iglob, jglob, 1, u( 1, i, j, 1 ) )
+            call exact( iglob, jglob, nz, u( 1, i, j, nz ) )
+         end do
+      end do
+
+c---------------------------------------------------------------------
+c   set the dependent variable values along west and east faces
+c---------------------------------------------------------------------
+      IF (west.eq.-1) then
+         do k = 1, nz
+            do i = 1, nx
+               iglob = ipt + i
+               call exact( iglob, 1, k, u( 1, i, 1, k ) )
+            end do
+         end do
+      END IF
+
+      IF (east.eq.-1) then
+         do k = 1, nz
+            do i = 1, nx
+               iglob = ipt + i
+               call exact( iglob, ny0, k, u( 1, i, ny, k ) )
+            end do
+         end do
+      END IF
+
+c---------------------------------------------------------------------
+c   set the dependent variable values along north and south faces
+c---------------------------------------------------------------------
+      IF (north.eq.-1) then
+         do k = 1, nz
+            do j = 1, ny
+               jglob = jpt + j
+               call exact( 1, jglob, k, u( 1, 1, j, k ) )
+            end do
+         end do
+      END IF
+
+      IF (south.eq.-1) then
+         do k = 1, nz
+            do j = 1, ny
+               jglob = jpt + j
+               call exact( nx0, jglob, k, u( 1, nx, j, k ) )
+            end do
+         end do
+      END IF
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/setcoeff.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/setcoeff.f
new file mode 100644
index 0000000..8fc5c18
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/setcoeff.f
@@ -0,0 +1,159 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine setcoeff
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+
+
+c---------------------------------------------------------------------
+c   set up coefficients
+c---------------------------------------------------------------------
+      dxi = 1.0d+00 / ( nx0 - 1 )
+      deta = 1.0d+00 / ( ny0 - 1 )
+      dzeta = 1.0d+00 / ( nz0 - 1 )
+
+      tx1 = 1.0d+00 / ( dxi * dxi )
+      tx2 = 1.0d+00 / ( 2.0d+00 * dxi )
+      tx3 = 1.0d+00 / dxi
+
+      ty1 = 1.0d+00 / ( deta * deta )
+      ty2 = 1.0d+00 / ( 2.0d+00 * deta )
+      ty3 = 1.0d+00 / deta
+
+      tz1 = 1.0d+00 / ( dzeta * dzeta )
+      tz2 = 1.0d+00 / ( 2.0d+00 * dzeta )
+      tz3 = 1.0d+00 / dzeta
+
+      ii1 = 2
+      ii2 = nx0 - 1
+      ji1 = 2
+      ji2 = ny0 - 2
+      ki1 = 3
+      ki2 = nz0 - 1
+
+c---------------------------------------------------------------------
+c   diffusion coefficients
+c---------------------------------------------------------------------
+      dx1 = 0.75d+00
+      dx2 = dx1
+      dx3 = dx1
+      dx4 = dx1
+      dx5 = dx1
+
+      dy1 = 0.75d+00
+      dy2 = dy1
+      dy3 = dy1
+      dy4 = dy1
+      dy5 = dy1
+
+      dz1 = 1.00d+00
+      dz2 = dz1
+      dz3 = dz1
+      dz4 = dz1
+      dz5 = dz1
+
+c---------------------------------------------------------------------
+c   fourth difference dissipation
+c---------------------------------------------------------------------
+      dssp = ( max (dx1, dy1, dz1 ) ) / 4.0d+00
+
+c---------------------------------------------------------------------
+c   coefficients of the exact solution to the first pde
+c---------------------------------------------------------------------
+      ce(1,1) = 2.0d+00
+      ce(1,2) = 0.0d+00
+      ce(1,3) = 0.0d+00
+      ce(1,4) = 4.0d+00
+      ce(1,5) = 5.0d+00
+      ce(1,6) = 3.0d+00
+      ce(1,7) = 5.0d-01
+      ce(1,8) = 2.0d-02
+      ce(1,9) = 1.0d-02
+      ce(1,10) = 3.0d-02
+      ce(1,11) = 5.0d-01
+      ce(1,12) = 4.0d-01
+      ce(1,13) = 3.0d-01
+
+c---------------------------------------------------------------------
+c   coefficients of the exact solution to the second pde
+c---------------------------------------------------------------------
+      ce(2,1) = 1.0d+00
+      ce(2,2) = 0.0d+00
+      ce(2,3) = 0.0d+00
+      ce(2,4) = 0.0d+00
+      ce(2,5) = 1.0d+00
+      ce(2,6) = 2.0d+00
+      ce(2,7) = 3.0d+00
+      ce(2,8) = 1.0d-02
+      ce(2,9) = 3.0d-02
+      ce(2,10) = 2.0d-02
+      ce(2,11) = 4.0d-01
+      ce(2,12) = 3.0d-01
+      ce(2,13) = 5.0d-01
+
+c---------------------------------------------------------------------
+c   coefficients of the exact solution to the third pde
+c---------------------------------------------------------------------
+      ce(3,1) = 2.0d+00
+      ce(3,2) = 2.0d+00
+      ce(3,3) = 0.0d+00
+      ce(3,4) = 0.0d+00
+      ce(3,5) = 0.0d+00
+      ce(3,6) = 2.0d+00
+      ce(3,7) = 3.0d+00
+      ce(3,8) = 4.0d-02
+      ce(3,9) = 3.0d-02
+      ce(3,10) = 5.0d-02
+      ce(3,11) = 3.0d-01
+      ce(3,12) = 5.0d-01
+      ce(3,13) = 4.0d-01
+
+c---------------------------------------------------------------------
+c   coefficients of the exact solution to the fourth pde
+c---------------------------------------------------------------------
+      ce(4,1) = 2.0d+00
+      ce(4,2) = 2.0d+00
+      ce(4,3) = 0.0d+00
+      ce(4,4) = 0.0d+00
+      ce(4,5) = 0.0d+00
+      ce(4,6) = 2.0d+00
+      ce(4,7) = 3.0d+00
+      ce(4,8) = 3.0d-02
+      ce(4,9) = 5.0d-02
+      ce(4,10) = 4.0d-02
+      ce(4,11) = 2.0d-01
+      ce(4,12) = 1.0d-01
+      ce(4,13) = 3.0d-01
+
+c---------------------------------------------------------------------
+c   coefficients of the exact solution to the fifth pde
+c---------------------------------------------------------------------
+      ce(5,1) = 5.0d+00
+      ce(5,2) = 4.0d+00
+      ce(5,3) = 3.0d+00
+      ce(5,4) = 2.0d+00
+      ce(5,5) = 1.0d-01
+      ce(5,6) = 4.0d-01
+      ce(5,7) = 3.0d-01
+      ce(5,8) = 5.0d-02
+      ce(5,9) = 4.0d-02
+      ce(5,10) = 3.0d-02
+      ce(5,11) = 1.0d-01
+      ce(5,12) = 3.0d-01
+      ce(5,13) = 2.0d-01
+
+      return
+      end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/setiv.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/setiv.f
new file mode 100644
index 0000000..73725cb
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/setiv.f
@@ -0,0 +1,67 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      subroutine setiv
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c
+c   set the initial values of dependent variables based on tri-linear
+c   interpolation of boundary values in the computational space.
+c
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j, k, m
+      integer iglob, jglob
+      double precision  xi, eta, zeta
+      double precision  pxi, peta, pzeta
+      double precision  ue_1jk(5),ue_nx0jk(5),ue_i1k(5),
+     >        ue_iny0k(5),ue_ij1(5),ue_ijnz(5)
+
+
+      do k = 2, nz - 1
+         zeta = ( dble (k-1) ) / (nz-1)
+         do j = 1, ny
+          jglob = jpt + j
+          IF (jglob.ne.1.and.jglob.ne.ny0) then
+            eta = ( dble (jglob-1) ) / (ny0-1)
+            do i = 1, nx
+              iglob = ipt + i
+              IF (iglob.ne.1.and.iglob.ne.nx0) then
+               xi = ( dble (iglob-1) ) / (nx0-1)
+               call exact (1,jglob,k,ue_1jk)
+               call exact (nx0,jglob,k,ue_nx0jk)
+               call exact (iglob,1,k,ue_i1k)
+               call exact (iglob,ny0,k,ue_iny0k)
+               call exact (iglob,jglob,1,ue_ij1)
+               call exact (iglob,jglob,nz,ue_ijnz)
+               do m = 1, 5
+                  pxi =   ( 1.0d+00 - xi ) * ue_1jk(m)
+     >                              + xi   * ue_nx0jk(m)
+                  peta =  ( 1.0d+00 - eta ) * ue_i1k(m)
+     >                              + eta   * ue_iny0k(m)
+                  pzeta = ( 1.0d+00 - zeta ) * ue_ij1(m)
+     >                              + zeta   * ue_ijnz(m)
+
+                  u( m, i, j, k ) = pxi + peta + pzeta
+     >                 - pxi * peta - peta * pzeta - pzeta * pxi
+     >                 + pxi * peta * pzeta
+
+               end do
+              END IF
+            end do
+          END IF
+         end do
+      end do
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/ssor.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/ssor.f
new file mode 100644
index 0000000..5eaa936
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/ssor.f
@@ -0,0 +1,246 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine ssor(niter)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   to perform pseudo-time stepping SSOR iterations
+c   for five nonlinear pde's.
+c---------------------------------------------------------------------
+
+      implicit none
+      integer  niter
+
+      include 'mpinpb.h'
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j, k, m
+      integer istep
+      double precision  tmp
+      double precision  delunm(5), tv(5,isiz1,isiz2)
+
+      external timer_read
+      double precision wtime, timer_read
+
+      integer IERROR
+
+ 
+      ROOT = 0
+ 
+c---------------------------------------------------------------------
+c   begin pseudo-time stepping iterations
+c---------------------------------------------------------------------
+      tmp = 1.0d+00 / ( omega * ( 2.0d+00 - omega ) ) 
+
+c---------------------------------------------------------------------
+c   initialize a,b,c,d to zero (guarantees that page tables have been
+c   formed, if applicable on given architecture, before timestepping).
+c---------------------------------------------------------------------
+      do m=1,isiz2
+         do k=1,isiz1
+            do j=1,5
+               do i=1,5
+                  a(i,j,k,m) = 0.d0
+                  b(i,j,k,m) = 0.d0
+                  c(i,j,k,m) = 0.d0
+                  d(i,j,k,m) = 0.d0
+               enddo
+            enddo
+         enddo
+      enddo
+
+c---------------------------------------------------------------------
+c   compute the steady-state residuals
+c---------------------------------------------------------------------
+      call rhs
+ 
+c---------------------------------------------------------------------
+c   compute the L2 norms of newton iteration residuals
+c---------------------------------------------------------------------
+      call l2norm( isiz1, isiz2, isiz3, nx0, ny0, nz0,
+     >             ist, iend, jst, jend,
+     >             rsd, rsdnm )
+  
+      do i = 1, t_last
+         call timer_clear(i)
+      end do
+
+      call MPI_BARRIER( MPI_COMM_WORLD, IERROR )
+ 
+      call timer_clear(1)
+      call timer_start(1)
+
+c---------------------------------------------------------------------
+c   the timestep loop
+c---------------------------------------------------------------------
+      do istep = 1, niter
+
+         if (id .eq. 0) then
+            if (mod ( istep, 20) .eq. 0 .or.
+     >            istep .eq. itmax .or.
+     >            istep .eq. 1) then
+               if (niter .gt. 1) write( *, 200) istep
+ 200           format(' Time step ', i4)
+            endif
+         endif
+ 
+c---------------------------------------------------------------------
+c   perform SSOR iteration
+c---------------------------------------------------------------------
+         do k = 2, nz - 1
+            do j = jst, jend
+               do i = ist, iend
+                  do m = 1, 5
+                     rsd(m,i,j,k) = dt * rsd(m,i,j,k)
+                  end do
+               end do
+            end do
+         end do
+ 
+         DO k = 2, nz -1 
+c---------------------------------------------------------------------
+c   form the lower triangular part of the jacobian matrix
+c---------------------------------------------------------------------
+            call jacld(k)
+ 
+c---------------------------------------------------------------------
+c   perform the lower triangular solution
+c---------------------------------------------------------------------
+            call blts( isiz1, isiz2, isiz3,
+     >                 nx, ny, nz, k,
+     >                 omega,
+     >                 rsd,
+     >                 a, b, c, d,
+     >                 ist, iend, jst, jend, 
+     >                 nx0, ny0, ipt, jpt)
+          END DO
+ 
+          DO k = nz - 1, 2, -1
+c---------------------------------------------------------------------
+c   form the strictly upper triangular part of the jacobian matrix
+c---------------------------------------------------------------------
+            call jacu(k)
+
+c---------------------------------------------------------------------
+c   perform the upper triangular solution
+c---------------------------------------------------------------------
+            call buts( isiz1, isiz2, isiz3,
+     >                 nx, ny, nz, k,
+     >                 omega,
+     >                 rsd, tv,
+     >                 d, a, b, c,
+     >                 ist, iend, jst, jend,
+     >                 nx0, ny0, ipt, jpt)
+          END DO
+ 
+c---------------------------------------------------------------------
+c   update the variables
+c---------------------------------------------------------------------
+ 
+         do k = 2, nz-1
+            do j = jst, jend
+               do i = ist, iend
+                  do m = 1, 5
+                     u( m, i, j, k ) = u( m, i, j, k )
+     >                    + tmp * rsd( m, i, j, k )
+                  end do
+               end do
+            end do
+         end do
+ 
+c---------------------------------------------------------------------
+c   compute the L2 norms of newton iteration corrections
+c---------------------------------------------------------------------
+         if ( mod ( istep, inorm ) .eq. 0 ) then
+            call l2norm( isiz1, isiz2, isiz3, nx0, ny0, nz0,
+     >                   ist, iend, jst, jend,
+     >                   rsd, delunm )
+c            if ( ipr .eq. 1 .and. id .eq. 0 ) then
+c                write (*,1006) ( delunm(m), m = 1, 5 )
+c            else if ( ipr .eq. 2 .and. id .eq. 0 ) then
+c                write (*,'(i5,f15.6)') istep,delunm(5)
+c            end if
+         end if
+ 
+c---------------------------------------------------------------------
+c   compute the steady-state residuals
+c---------------------------------------------------------------------
+         call rhs
+ 
+c---------------------------------------------------------------------
+c   compute the L2 norms of newton iteration residuals
+c---------------------------------------------------------------------
+         if ( ( mod ( istep, inorm ) .eq. 0 ) .or.
+     >        ( istep .eq. itmax ) ) then
+            call l2norm( isiz1, isiz2, isiz3, nx0, ny0, nz0,
+     >                   ist, iend, jst, jend,
+     >                   rsd, rsdnm )
+c            if ( ipr .eq. 1.and.id.eq.0 ) then
+c                write (*,1007) ( rsdnm(m), m = 1, 5 )
+c            end if
+         end if
+
+c---------------------------------------------------------------------
+c   check the newton-iteration residuals against the tolerance levels
+c---------------------------------------------------------------------
+         if ( ( rsdnm(1) .lt. tolrsd(1) ) .and.
+     >        ( rsdnm(2) .lt. tolrsd(2) ) .and.
+     >        ( rsdnm(3) .lt. tolrsd(3) ) .and.
+     >        ( rsdnm(4) .lt. tolrsd(4) ) .and.
+     >        ( rsdnm(5) .lt. tolrsd(5) ) ) then
+            if (id.eq.0) then
+               write (*,1004) istep
+            end if
+            go to 900
+         end if
+ 
+      end do
+  900 continue
+ 
+      call timer_stop(1)
+      wtime = timer_read(1)
+ 
+
+      call MPI_ALLREDUCE( wtime, 
+     >                    maxtime, 
+     >                    1, 
+     >                    MPI_DOUBLE_PRECISION, 
+     >                    MPI_MAX, 
+     >                    MPI_COMM_WORLD,
+     >                    IERROR )
+ 
+
+
+      return
+      
+ 1001 format (1x/5x,'pseudo-time SSOR iteration no.=',i4/)
+ 1004 format (1x/1x,'convergence was achieved after ',i4,
+     >   ' pseudo-time steps' )
+ 1006 format (1x/1x,'RMS-norm of SSOR-iteration correction ',
+     > 'for first pde  = ',1pe12.5/,
+     > 1x,'RMS-norm of SSOR-iteration correction ',
+     > 'for second pde = ',1pe12.5/,
+     > 1x,'RMS-norm of SSOR-iteration correction ',
+     > 'for third pde  = ',1pe12.5/,
+     > 1x,'RMS-norm of SSOR-iteration correction ',
+     > 'for fourth pde = ',1pe12.5/,
+     > 1x,'RMS-norm of SSOR-iteration correction ',
+     > 'for fifth pde  = ',1pe12.5)
+ 1007 format (1x/1x,'RMS-norm of steady-state residual for ',
+     > 'first pde  = ',1pe12.5/,
+     > 1x,'RMS-norm of steady-state residual for ',
+     > 'second pde = ',1pe12.5/,
+     > 1x,'RMS-norm of steady-state residual for ',
+     > 'third pde  = ',1pe12.5/,
+     > 1x,'RMS-norm of steady-state residual for ',
+     > 'fourth pde = ',1pe12.5/,
+     > 1x,'RMS-norm of steady-state residual for ',
+     > 'fifth pde  = ',1pe12.5)
+ 
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/subdomain.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/subdomain.f
new file mode 100644
index 0000000..ab4f773
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/subdomain.f
@@ -0,0 +1,105 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine subdomain
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'mpinpb.h'
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer mm, ierror, errorcode
+
+
+c---------------------------------------------------------------------
+c
+c   set up the sub-domain sizes
+c
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   x dimension
+c---------------------------------------------------------------------
+      mm   = mod(nx0,xdim)
+      if (row.le.mm) then
+        nx = nx0/xdim + 1
+        ipt = (row-1)*nx
+      else
+        nx = nx0/xdim
+        ipt = (row-1)*nx + mm
+      end if
+
+c---------------------------------------------------------------------
+c   y dimension
+c---------------------------------------------------------------------
+      mm   = mod(ny0,ydim)
+      if (col.le.mm) then
+        ny = ny0/ydim + 1
+        jpt = (col-1)*ny
+      else
+        ny = ny0/ydim
+        jpt = (col-1)*ny + mm
+      end if
+
+c---------------------------------------------------------------------
+c   z dimension
+c---------------------------------------------------------------------
+      nz = nz0
+
+c---------------------------------------------------------------------
+c   check the sub-domain size
+c---------------------------------------------------------------------
+      if ( ( nx .lt. 3 ) .or.
+     >     ( ny .lt. 3 ) .or.
+     >     ( nz .lt. 3 ) ) then
+         write (*,2001) nx, ny, nz
+ 2001    format (5x,'SUBDOMAIN SIZE IS TOO SMALL - ',
+     >        /5x,'ADJUST PROBLEM SIZE OR NUMBER OF PROCESSORS',
+     >        /5x,'SO THAT NX, NY AND NZ ARE GREATER THAN OR EQUAL',
+     >        /5x,'TO 3.  THEY ARE CURRENTLY', 3I5)
+          ERRORCODE = 1
+          CALL MPI_ABORT( MPI_COMM_WORLD,
+     >                    ERRORCODE,
+     >                    IERROR )
+      end if
+
+      if ( ( nx .gt. isiz1 ) .or.
+     >     ( ny .gt. isiz2 ) .or.
+     >     ( nz .gt. isiz3 ) ) then
+         write (*,2002) nx, ny, nz
+ 2002    format (5x,'SUBDOMAIN SIZE IS TOO LARGE - ',
+     >        /5x,'ADJUST PROBLEM SIZE OR NUMBER OF PROCESSORS',
+     >        /5x,'SO THAT NX, NY AND NZ ARE LESS THAN OR EQUAL TO ',
+     >        /5x,'ISIZ1, ISIZ2 AND ISIZ3 RESPECTIVELY.  THEY ARE',
+     >        /5x,'CURRENTLY', 3I5)
+          ERRORCODE = 1
+          CALL MPI_ABORT( MPI_COMM_WORLD,
+     >                    ERRORCODE,
+     >                    IERROR )
+      end if
+
+
+c---------------------------------------------------------------------
+c   set up the start and end in i and j extents for all processors
+c---------------------------------------------------------------------
+      ist = 1
+      iend = nx
+      if (north.eq.-1) ist = 2
+      if (south.eq.-1) iend = nx - 1
+
+      jst = 1
+      jend = ny
+      if (west.eq.-1) jst = 2
+      if (east.eq.-1) jend = ny - 1
+
+      return
+      end
+
+
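Reviewer note on subdomain.f: the x and y branches above implement a block partition in which the first `mod(n, parts)` blocks receive one extra grid point. The sketch below is a small Python model of that arithmetic, not part of the benchmark; the function names `split_extent` and `interior_range` are hypothetical, and the 33-point, 4-block case is only an illustration.

```python
def split_extent(n_global, n_parts, index):
    """Return (local_size, offset) for the 1-based block `index`.

    Models the x/y logic in subdomain.f: the first `rem` blocks get one
    extra point, and the offset counts the points owned by earlier blocks.
    """
    base, rem = divmod(n_global, n_parts)
    if index <= rem:
        size = base + 1
        offset = (index - 1) * size
    else:
        size = base
        offset = (index - 1) * size + rem
    return size, offset


def interior_range(size, at_low_boundary, at_high_boundary):
    """Return (start, end), skipping the edge point on a physical boundary.

    Models ist/iend (and jst/jend): a block with no neighbour on one side
    (neighbour id of -1 in the Fortran) starts at 2 or ends at size - 1.
    """
    start = 2 if at_low_boundary else 1
    end = size - 1 if at_high_boundary else size
    return start, end


if __name__ == "__main__":
    # Illustrative case: 33 grid points split over 4 blocks.
    for row in range(1, 5):
        print(row, split_extent(33, 4, row))
```

The offsets are contiguous, so each block starts where the previous one ends and the sizes sum to the global extent.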
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/timing.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/timing.h
new file mode 100644
index 0000000..d156da6
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/timing.h
@@ -0,0 +1,11 @@
+c---------------------------------------------------------------------
+
+      integer t_total, t_rhs, t_blts, t_buts, t_jacld, t_jacu,
+     >        t_exch, t_lcomm, t_ucomm, t_rcomm, t_last
+      parameter (t_total=1, t_rhs=2, t_blts=3, t_buts=4, t_jacld=5, 
+     >        t_jacu=6, t_exch=7, t_lcomm=8, t_ucomm=9, t_rcomm=10, 
+     >        t_last=10)
+
+      double precision maxtime
+      logical timeron
+      common/timer/maxtime, timeron
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/verify.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/verify.f
new file mode 100644
index 0000000..2572441
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/LU/verify.f
@@ -0,0 +1,403 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+        subroutine verify(xcr, xce, xci, class, verified)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c  verification routine                         
+c---------------------------------------------------------------------
+
+        implicit none
+        include 'mpinpb.h'
+        include 'applu.incl'
+
+        double precision xcr(5), xce(5), xci
+        double precision xcrref(5),xceref(5),xciref, 
+     >                   xcrdif(5),xcedif(5),xcidif,
+     >                   epsilon, dtref
+        integer m
+        character class
+        logical verified
+
+c---------------------------------------------------------------------
+c   tolerance level
+c---------------------------------------------------------------------
+        epsilon = 1.0d-08
+
+        class = 'U'
+        verified = .true.
+
+        do m = 1,5
+           xcrref(m) = 1.0
+           xceref(m) = 1.0
+        end do
+        xciref = 1.0
+
+        if ( (nx0  .eq. 12     ) .and. 
+     >       (ny0  .eq. 12     ) .and.
+     >       (nz0  .eq. 12     ) .and.
+     >       (itmax   .eq. 50    ))  then
+
+           class = 'S'
+           dtref = 5.0d-1
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of residual, for the (12X12X12) grid,
+c   after 50 time steps, with  DT = 5.0d-01
+c---------------------------------------------------------------------
+         xcrref(1) = 1.6196343210976702d-02
+         xcrref(2) = 2.1976745164821318d-03
+         xcrref(3) = 1.5179927653399185d-03
+         xcrref(4) = 1.5029584435994323d-03
+         xcrref(5) = 3.4264073155896461d-02
+
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of solution error, for the (12X12X12) grid,
+c   after 50 time steps, with  DT = 5.0d-01
+c---------------------------------------------------------------------
+         xceref(1) = 6.4223319957960924d-04
+         xceref(2) = 8.4144342047347926d-05
+         xceref(3) = 5.8588269616485186d-05
+         xceref(4) = 5.8474222595157350d-05
+         xceref(5) = 1.3103347914111294d-03
+
+c---------------------------------------------------------------------
+c   Reference value of surface integral, for the (12X12X12) grid,
+c   after 50 time steps, with DT = 5.0d-01
+c---------------------------------------------------------------------
+         xciref = 7.8418928865937083d+00
+
+
+        elseif ( (nx0 .eq. 33) .and. 
+     >           (ny0 .eq. 33) .and.
+     >           (nz0 .eq. 33) .and.
+     >           (itmax .eq. 300) ) then
+
+           class = 'W'   !SPEC95fp size
+           dtref = 1.5d-3
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of residual, for the (33x33x33) grid,
+c   after 300 time steps, with  DT = 1.5d-3
+c---------------------------------------------------------------------
+           xcrref(1) =   0.1236511638192d+02
+           xcrref(2) =   0.1317228477799d+01
+           xcrref(3) =   0.2550120713095d+01
+           xcrref(4) =   0.2326187750252d+01
+           xcrref(5) =   0.2826799444189d+02
+
+
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of solution error, for the (33X33X33) grid,
+c---------------------------------------------------------------------
+           xceref(1) =   0.4867877144216d+00
+           xceref(2) =   0.5064652880982d-01
+           xceref(3) =   0.9281818101960d-01
+           xceref(4) =   0.8570126542733d-01
+           xceref(5) =   0.1084277417792d+01
+
+
+c---------------------------------------------------------------------
+c   Reference value of surface integral, for the (33X33X33) grid,
+c   after 300 time steps, with  DT = 1.5d-3
+c---------------------------------------------------------------------
+           xciref    =   0.1161399311023d+02
+
+        elseif ( (nx0 .eq. 64) .and. 
+     >           (ny0 .eq. 64) .and.
+     >           (nz0 .eq. 64) .and.
+     >           (itmax .eq. 250) ) then
+
+           class = 'A'
+           dtref = 2.0d+0
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of residual, for the (64X64X64) grid,
+c   after 250 time steps, with  DT = 2.0d+00
+c---------------------------------------------------------------------
+         xcrref(1) = 7.7902107606689367d+02
+         xcrref(2) = 6.3402765259692870d+01
+         xcrref(3) = 1.9499249727292479d+02
+         xcrref(4) = 1.7845301160418537d+02
+         xcrref(5) = 1.8384760349464247d+03
+
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of solution error, for the (64X64X64) grid,
+c   after 250 time steps, with  DT = 2.0d+00
+c---------------------------------------------------------------------
+         xceref(1) = 2.9964085685471943d+01
+         xceref(2) = 2.8194576365003349d+00
+         xceref(3) = 7.3473412698774742d+00
+         xceref(4) = 6.7139225687777051d+00
+         xceref(5) = 7.0715315688392578d+01
+
+c---------------------------------------------------------------------
+c   Reference value of surface integral, for the (64X64X64) grid,
+c   after 250 time steps, with DT = 2.0d+00
+c---------------------------------------------------------------------
+         xciref = 2.6030925604886277d+01
+
+
+        elseif ( (nx0 .eq. 102) .and. 
+     >           (ny0 .eq. 102) .and.
+     >           (nz0 .eq. 102) .and.
+     >           (itmax .eq. 250) ) then
+
+           class = 'B'
+           dtref = 2.0d+0
+
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of residual, for the (102X102X102) grid,
+c   after 250 time steps, with  DT = 2.0d+00
+c---------------------------------------------------------------------
+         xcrref(1) = 3.5532672969982736d+03
+         xcrref(2) = 2.6214750795310692d+02
+         xcrref(3) = 8.8333721850952190d+02
+         xcrref(4) = 7.7812774739425265d+02
+         xcrref(5) = 7.3087969592545314d+03
+
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of solution error, for the (102X102X102) 
+c   grid, after 250 time steps, with  DT = 2.0d+00
+c---------------------------------------------------------------------
+         xceref(1) = 1.1401176380212709d+02
+         xceref(2) = 8.1098963655421574d+00
+         xceref(3) = 2.8480597317698308d+01
+         xceref(4) = 2.5905394567832939d+01
+         xceref(5) = 2.6054907504857413d+02
+
+c---------------------------------------------------------------------
+c   Reference value of surface integral, for the (102X102X102) grid,
+c   after 250 time steps, with DT = 2.0d+00
+c---------------------------------------------------------------------
+         xciref = 4.7887162703308227d+01
+
+        elseif ( (nx0 .eq. 162) .and. 
+     >           (ny0 .eq. 162) .and.
+     >           (nz0 .eq. 162) .and.
+     >           (itmax .eq. 250) ) then
+
+           class = 'C'
+           dtref = 2.0d+0
+
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of residual, for the (162X162X162) grid,
+c   after 250 time steps, with  DT = 2.0d+00
+c---------------------------------------------------------------------
+         xcrref(1) = 1.03766980323537846d+04
+         xcrref(2) = 8.92212458801008552d+02
+         xcrref(3) = 2.56238814582660871d+03
+         xcrref(4) = 2.19194343857831427d+03
+         xcrref(5) = 1.78078057261061185d+04
+
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of solution error, for the (162X162X162) 
+c   grid, after 250 time steps, with  DT = 2.0d+00
+c---------------------------------------------------------------------
+         xceref(1) = 2.15986399716949279d+02
+         xceref(2) = 1.55789559239863600d+01
+         xceref(3) = 5.41318863077207766d+01
+         xceref(4) = 4.82262643154045421d+01
+         xceref(5) = 4.55902910043250358d+02
+
+c---------------------------------------------------------------------
+c   Reference value of surface integral, for the (162X162X162) grid,
+c   after 250 time steps, with DT = 2.0d+00
+c---------------------------------------------------------------------
+         xciref = 6.66404553572181300d+01
+
+        elseif ( (nx0 .eq. 408) .and. 
+     >           (ny0 .eq. 408) .and.
+     >           (nz0 .eq. 408) .and.
+     >           (itmax .eq. 300) ) then
+
+           class = 'D'
+           dtref = 1.0d+0
+
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of residual, for the (408X408X408) grid,
+c   after 300 time steps, with  DT = 1.0d+00
+c---------------------------------------------------------------------
+         xcrref(1) = 0.4868417937025d+05
+         xcrref(2) = 0.4696371050071d+04
+         xcrref(3) = 0.1218114549776d+05 
+         xcrref(4) = 0.1033801493461d+05
+         xcrref(5) = 0.7142398413817d+05
+
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of solution error, for the (408X408X408) 
+c   grid, after 300 time steps, with  DT = 1.0d+00
+c---------------------------------------------------------------------
+         xceref(1) = 0.3752393004482d+03
+         xceref(2) = 0.3084128893659d+02
+         xceref(3) = 0.9434276905469d+02
+         xceref(4) = 0.8230686681928d+02
+         xceref(5) = 0.7002620636210d+03
+
+c---------------------------------------------------------------------
+c   Reference value of surface integral, for the (408X408X408) grid,
+c   after 300 time steps, with DT = 1.0d+00
+c---------------------------------------------------------------------
+         xciref =    0.8334101392503d+02
+
+        elseif ( (nx0 .eq. 1020) .and. 
+     >           (ny0 .eq. 1020) .and.
+     >           (nz0 .eq. 1020) .and.
+     >           (itmax .eq. 300) ) then
+
+           class = 'E'
+           dtref = 0.5d+0
+
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of residual, for the (1020X1020X1020) grid,
+c   after 300 time steps, with  DT = 0.5d+00
+c---------------------------------------------------------------------
+         xcrref(1) = 0.2099641687874d+06
+         xcrref(2) = 0.2130403143165d+05
+         xcrref(3) = 0.5319228789371d+05 
+         xcrref(4) = 0.4509761639833d+05
+         xcrref(5) = 0.2932360006590d+06
+
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of solution error, for the (1020X1020X1020) 
+c   grid, after 300 time steps, with  DT = 0.5d+00
+c---------------------------------------------------------------------
+         xceref(1) = 0.4800572578333d+03
+         xceref(2) = 0.4221993400184d+02
+         xceref(3) = 0.1210851906824d+03
+         xceref(4) = 0.1047888986770d+03
+         xceref(5) = 0.8363028257389d+03
+
+c---------------------------------------------------------------------
+c   Reference value of surface integral, for the (1020X1020X1020) grid,
+c   after 300 time steps, with DT = 0.5d+00
+c---------------------------------------------------------------------
+         xciref =    0.9512163272273d+02
+
+        else
+           verified = .FALSE.
+        endif
+
+c---------------------------------------------------------------------
+c    verification test for residuals if gridsize is one of 
+c    the defined grid sizes above (class .ne. 'U')
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c    Compute the difference of solution values and the known reference values.
+c---------------------------------------------------------------------
+        do m = 1, 5
+           
+           xcrdif(m) = dabs((xcr(m)-xcrref(m))/xcrref(m)) 
+           xcedif(m) = dabs((xce(m)-xceref(m))/xceref(m))
+           
+        enddo
+        xcidif = dabs((xci - xciref)/xciref)
+
+
+c---------------------------------------------------------------------
+c    Output the comparison of computed results to known cases.
+c---------------------------------------------------------------------
+
+        if (class .ne. 'U') then
+           write(*, 1990) class
+ 1990      format(/, ' Verification being performed for class ', a)
+           write (*,2000) epsilon
+ 2000      format(' Accuracy setting for epsilon = ', E20.13)
+           verified = (dabs(dt-dtref) .le. epsilon)
+           if (.not.verified) then  
+              class = 'U'
+              write (*,1000) dtref
+ 1000         format(' DT does not match the reference value of ', 
+     >                 E15.8)
+           endif
+        else 
+           write(*, 1995)
+ 1995      format(' Unknown class')
+        endif
+
+
+        if (class .ne. 'U') then
+           write (*,2001) 
+        else
+           write (*, 2005)
+        endif
+
+ 2001   format(' Comparison of RMS-norms of residual')
+ 2005   format(' RMS-norms of residual')
+        do m = 1, 5
+           if (class .eq. 'U') then
+              write(*, 2015) m, xcr(m)
+           else if (xcrdif(m) .le. epsilon) then
+              write (*,2011) m,xcr(m),xcrref(m),xcrdif(m)
+           else 
+              verified = .false.
+              write (*,2010) m,xcr(m),xcrref(m),xcrdif(m)
+           endif
+        enddo
+
+        if (class .ne. 'U') then
+           write (*,2002)
+        else
+           write (*,2006)
+        endif
+ 2002   format(' Comparison of RMS-norms of solution error')
+ 2006   format(' RMS-norms of solution error')
+        
+        do m = 1, 5
+           if (class .eq. 'U') then
+              write(*, 2015) m, xce(m)
+           else if (xcedif(m) .le. epsilon) then
+              write (*,2011) m,xce(m),xceref(m),xcedif(m)
+           else
+              verified = .false.
+              write (*,2010) m,xce(m),xceref(m),xcedif(m)
+           endif
+        enddo
+        
+ 2010   format(' FAILURE: ', i2, 2x, E20.13, E20.13, E20.13)
+ 2011   format('          ', i2, 2x, E20.13, E20.13, E20.13)
+ 2015   format('          ', i2, 2x, E20.13)
+        
+        if (class .ne. 'U') then
+           write (*,2025)
+        else
+           write (*,2026)
+        endif
+ 2025   format(' Comparison of surface integral')
+ 2026   format(' Surface integral')
+
+
+        if (class .eq. 'U') then
+           write(*, 2030) xci
+        else if (xcidif .le. epsilon) then
+           write(*, 2032) xci, xciref, xcidif
+        else
+           verified = .false.
+           write(*, 2031) xci, xciref, xcidif
+        endif
+
+ 2030   format('          ', 4x, E20.13)
+ 2031   format(' FAILURE: ', 4x, E20.13, E20.13, E20.13)
+ 2032   format('          ', 4x, E20.13, E20.13, E20.13)
+
+
+
+        if (class .eq. 'U') then
+           write(*, 2022)
+           write(*, 2023)
+ 2022      format(' No reference values provided')
+ 2023      format(' No verification performed')
+        else if (verified) then
+           write(*, 2020)
+ 2020      format(' Verification Successful')
+        else
+           write(*, 2021)
+ 2021      format(' Verification failed')
+        endif
+
+        return
+
+
+        end
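Reviewer note on verify.f: each computed norm is accepted when its relative difference from the reference value is at most `epsilon` (1.0d-08 in the source). The sketch below models that test in Python, assuming the comparison is exactly `abs((x - ref) / ref) <= epsilon`; the helper names are hypothetical, and the reference values are the Class S residual norms copied from the file above.

```python
EPSILON = 1.0e-8  # tolerance level used in verify.f

# Class S reference RMS-norms of the residual, copied from verify.f.
XCRREF_CLASS_S = [
    1.6196343210976702e-02,
    2.1976745164821318e-03,
    1.5179927653399185e-03,
    1.5029584435994323e-03,
    3.4264073155896461e-02,
]


def relative_diff(computed, reference):
    """Relative difference, as in xcrdif(m) = dabs((xcr - xcrref)/xcrref)."""
    return abs((computed - reference) / reference)


def verify_values(computed, reference, epsilon=EPSILON):
    """Return (verified, diffs); verified is False if any diff > epsilon."""
    diffs = [relative_diff(c, r) for c, r in zip(computed, reference)]
    return all(d <= epsilon for d in diffs), diffs


if __name__ == "__main__":
    ok, _ = verify_values(XCRREF_CLASS_S, XCRREF_CLASS_S)
    print("exact match verified:", ok)

    # Perturb one value by a relative 1e-6, well above the tolerance.
    perturbed = list(XCRREF_CLASS_S)
    perturbed[0] *= 1.0 + 1.0e-6
    ok, diffs = verify_values(perturbed, XCRREF_CLASS_S)
    print("perturbed verified:", ok)
```

In the Fortran routine the same test is repeated for the solution-error norms and the surface integral, and a mismatch in `dt` against `dtref` resets the class to 'U' before any comparison is reported.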
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MG/Makefile b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MG/Makefile
new file mode 100644
index 0000000..1554bed
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MG/Makefile
@@ -0,0 +1,23 @@
+SHELL=/bin/sh
+BENCHMARK=mg
+BENCHMARKU=MG
+
+include ../config/make.def
+
+OBJS = mg.o ${COMMON}/print_results.o  \
+       ${COMMON}/${RAND}.o ${COMMON}/timers.o
+
+include ../sys/make.common
+
+${PROGRAM}: config ${OBJS}
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} ${FMPI_LIB}
+
+mg.o:		mg.f  globals.h mpinpb.h npbparams.h
+	${FCOMPILE} mg.f
+
+clean:
+	- rm -f *.o *~ 
+	- rm -f npbparams.h core
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MG/README b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MG/README
new file mode 100644
index 0000000..6c03f78
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MG/README
@@ -0,0 +1,138 @@
+Some info about the MG benchmark
+================================
+    
+'mg_demo' demonstrates the capabilities of a very simple multigrid
+solver in computing a three dimensional potential field.  This is
+a simplified multigrid solver in two important respects:
+
+  (1) it solves only a constant coefficient equation,
+  and that only on a uniform cubical grid,
+    
+  (2) it solves only a single equation, representing
+  a scalar field rather than a vector field.
+
+We chose it for its portability and simplicity, and expect that a
+supercomputer which can run it effectively will also be able to
+run more complex multigrid programs at least as well.
+     
+     Eric Barszcz                         Paul Frederickson
+     RIACS
+     NASA Ames Research Center            NASA Ames Research Center
+
+========================================================================
+Running the program:  (Note: also see parameter lm information in the
+                       two sections immediately below this section)
+
+The program may be run with or without an input deck (called "mg.input"). 
+The following describes a few things about the input deck if you want to 
+use one. 
+
+The four lines below are the "mg.input" file required to run a
+problem of total size 256x256x256, for 4 iterations (Class "A"),
+and presumes the use of 8 processors:
+
+   8 = top level
+   256 256 256 = nx ny nz
+   4 = nit
+   0 0 0 0 0 0 0 0 = debug_vec
+
+The first line of input indicates how many levels of multi-grid
+cycle will be applied to a particular subpartition.  Presuming that
+8 processors are solving this problem (recall that the number of 
+processors is specified to MPI as a run parameter, and MPI subsequently
+determines this for the code via an MPI subroutine call), a 2x2x2 
+processor grid is formed, and thus each partition on a processor is 
+of size 128x128x128.  Therefore, a maximum of 8 multi-grid levels may 
+be used.  These are of size 128,64,32,16,8,4,2,1, with the coarsest 
+level being a single point on a given processor.
+
+
+Next, consider the same size problem but running on 1 processor.  The
+following "mg.input" file is appropriate:
+
+    9 = top level
+    256 256 256 = nx ny nz
+    4 = nit
+    0 0 0 0 0 0 0 0 = debug_vec
+
+Since this processor must solve the full 256x256x256 problem, this
+permits 9 multi-grid levels (256,128,64,32,16,8,4,2,1), resulting in 
+a coarsest multi-grid level of a single point on the processor.
+
+
+Next, consider the same size problem but running on 2 processors.  The
+following "mg.input" file is required:
+
+    8 = top level
+    256 256 256 = nx ny nz
+    4 = nit
+    0 0 0 0 0 0 0 0 = debug_vec
+
+The algorithm for partitioning the full grid onto some power of 2 number 
+of processors is to start by splitting the last dimension of the grid
+(z dimension) in 2: the problem is now partitioned onto 2 processors.
+Next the middle dimension (y dimension) is split in 2: the problem is now
+partitioned onto 4 processors.  Next, the first dimension (x dimension) is
+split in 2: the problem is now partitioned onto 8 processors.  Next, the
+last dimension (z dimension) is split again in 2: the problem is now
+partitioned onto 16 processors.  This partitioning is repeated until all 
+of the power of 2 processors have been allocated.
+
+Thus to run the above problem on 2 processors, the grid partitioning 
+algorithm will allocate the two processors across the last dimension, 
+creating two partitions each of size 256x256x128. The coarsest level of 
+multi-grid must be a single point surrounded by a cubic number of grid 
+points.  Therefore, each of the two processor partitions will contain 4 
+coarsest multi-grid level points, each surrounded by a cube of grid points 
+of size 128x128x128, indicated by a top level of 8.
+
+
+Next, consider the same size problem but running on 4 processors.  The
+following "mg.input" file is required:
+
+    8 = top level
+    256 256 256 = nx ny nz
+    4 = nit
+    0 0 0 0 0 0 0 0 = debug_vec
+
+The partitioning algorithm will create 4 partitions, each of size
+256x128x128.  Each partition will contain 2 coarsest multi-grid level
+points each surrounded by a cube of grid points of size 128x128x128, 
+indicated by a top level of 8.
+
+
+Next, consider the same size problem but running on 16 processors.  The
+following "mg.input" file is required:
+
+    7 = top level
+    256 256 256 = nx ny nz
+    4 = nit
+    0 0 0 0 0 0 0 0 = debug_vec
+
+On each node a partition of size 128x128x64 will be created.  A maximum
+of 7 multi-grid levels (64,32,16,8,4,2,1) may be used, resulting in each 
+partition containing 4 coarsest multi-grid level points, each surrounded 
+by a cube of grid points of size 64x64x64, indicated by a top level of 7.
+
+
+
+
+Note that non-cubic problem sizes may also be considered:
+
+The four lines below are the "mg.input" file appropriate for running a
+problem of total size 256x512x512, for 20 iterations and presumes the 
+use of 32 processors (note: this is NOT a class C problem):
+
+    8 = top level
+    256 512 512 = nx ny nz
+    20 = nit
+    0 0 0 0 0 0 0 0 = debug_vec
+
+The first line of input indicates how many levels of multi-grid
+cycle will be applied to a particular subpartition.  Presuming that
+32 processors are solving this problem, a 2x4x4 processor grid is
+formed, and thus each partition on a processor is of size 128x128x128.
+Therefore, a maximum of 8 multi-grid levels may be used.  These are of
+size 128,64,32,16,8,4,2,1, with the coarsest level being a single 
+point on a given processor.
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MG/globals.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MG/globals.h
new file mode 100644
index 0000000..b0a56b4
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MG/globals.h
@@ -0,0 +1,64 @@
+c---------------------------------------------------------------------
+c  Parameter lm (declared and set in "npbparams.h") is the log-base-2 of
+c  the maximum edge size of the partition on a given node. It must be
+c  reduced to save space (when running a small case) or increased for
+c  larger cases, for example, 512^3. Thus lm=7 means that the largest
+c  dimension of a partition that can be solved on a node is 2^7 = 128.
+c  lm is set automatically in npbparams.h.
+c  Parameters ndim1, ndim2, ndim3 are log-base-2 local dimensions (see nv).
+c---------------------------------------------------------------------
+
+      include 'npbparams.h'
+
+      integer nm      ! actual dimension including ghost cells for communications
+     >      , nv      ! size of rhs array
+     >      , nr      ! size of residual array
+     >      , nm2     ! size of communication buffer
+     >      , maxlevel! maximum number of levels
+
+      parameter( nm=2+2**lm, nv=(2+2**ndim1)*(2+2**ndim2)*(2+2**ndim3) )
+      parameter( nm2=2*nm*nm, maxlevel=(lt_default+1) )
+      parameter( nr = (8*(nv+nm**2+5*nm+14*lt_default-7*lm))/7 )
+      integer maxprocs
+      parameter( maxprocs = 131072 )  ! this is the upper proc limit that 
+                                      ! the current "nr" parameter can handle
+c---------------------------------------------------------------------
+      integer nbr(3,-1:1,maxlevel), msg_type(3,-1:1)
+      integer  msg_id(3,-1:1,2),nx(maxlevel),ny(maxlevel),nz(maxlevel)
+      common /mg3/ nbr,msg_type,msg_id,nx,ny,nz
+
+      character class
+      common /ClassType/class
+
+      integer debug_vec(0:7)
+      common /my_debug/ debug_vec
+
+      integer ir(maxlevel), m1(maxlevel), m2(maxlevel), m3(maxlevel)
+      integer lt, lb
+      common /fap/ ir,m1,m2,m3,lt,lb
+
+      logical dead(maxlevel), give_ex(3,maxlevel), take_ex(3,maxlevel)
+      common /comm_ex/ dead, give_ex, take_ex
+
+c---------------------------------------------------------------------
+c  A setting of m=1024 can handle cases up to 1024^3
+c---------------------------------------------------------------------
+      integer m
+c      parameter( m=1037 )
+      parameter( m=nm+1 )
+
+      double precision buff(nm2,4)
+      common /buffer/ buff
+
+c---------------------------------------------------------------------
+      integer t_bench, t_init, t_psinv, t_resid, t_rprj3, t_interp, 
+     >        t_norm2u3, t_comm3, t_rcomm, t_last
+      parameter (t_bench=1, t_init=2, t_psinv=3, t_resid=4, t_rprj3=5,  
+     >        t_interp=6, t_norm2u3=7, t_comm3=8, 
+     >        t_rcomm=9, t_last=9)
+
+      logical timeron
+      common /timers/ timeron
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MG/mg.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MG/mg.f
new file mode 100644
index 0000000..89fefba
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MG/mg.f
@@ -0,0 +1,2568 @@
+!-------------------------------------------------------------------------!
+!                                                                         !
+!        N  A  S     P A R A L L E L     B E N C H M A R K S  3.3         !
+!                                                                         !
+!                                   M G                                   !
+!                                                                         !
+!-------------------------------------------------------------------------!
+!                                                                         !
+!    This benchmark is part of the NAS Parallel Benchmark 3.3 suite.      !
+!    It is described in NAS Technical Reports 95-020 and 02-007           !
+!                                                                         !
+!    Permission to use, copy, distribute and modify this software         !
+!    for any purpose with or without fee is hereby granted.  We           !
+!    request, however, that all derived work reference the NAS            !
+!    Parallel Benchmarks 3.3. This software is provided "as is"           !
+!    without express or implied warranty.                                 !
+!                                                                         !
+!    Information on NPB 3.3, including the technical report, the          !
+!    original specifications, source code, results and information        !
+!    on how to submit new results, is available at:                       !
+!                                                                         !
+!           http://www.nas.nasa.gov/Software/NPB/                         !
+!                                                                         !
+!    Send comments or suggestions to  npb@nas.nasa.gov                    !
+!                                                                         !
+!          NAS Parallel Benchmarks Group                                  !
+!          NASA Ames Research Center                                      !
+!          Mail Stop: T27A-1                                              !
+!          Moffett Field, CA   94035-1000                                 !
+!                                                                         !
+!          E-mail:  npb@nas.nasa.gov                                      !
+!          Fax:     (650) 604-3957                                        !
+!                                                                         !
+!-------------------------------------------------------------------------!
+
+
+c---------------------------------------------------------------------
+c
+c Authors: E. Barszcz
+c          P. Frederickson
+c          A. Woo
+c          M. Yarrow
+c          R. F. Van der Wijngaart
+c
+c---------------------------------------------------------------------
+
+
+c---------------------------------------------------------------------
+      program mg_mpi
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'mpinpb.h'
+      include 'globals.h'
+
+c---------------------------------------------------------------------------c
+c k is the current level. It is passed down through subroutine args
+c and is NOT global. The variable "it" is the current iteration.
+c---------------------------------------------------------------------------c
+
+      integer k, it
+      
+      external timer_read
+      double precision t, t0, tinit, mflops, timer_read
+
+c---------------------------------------------------------------------------c
+c These arrays are in common because they are quite large
+c and probably shouldn't be allocated on the stack. They
+c are always passed as subroutine args. 
+c---------------------------------------------------------------------------c
+
+      double precision u(nr),v(nv),r(nr),a(0:3),c(0:3)
+      common /noautom/ u,v,r   
+
+      double precision rnm2, rnmu, old2, oldu, epsilon
+      integer n1, n2, n3, nit
+      double precision nn, verify_value, err
+      logical verified
+
+      integer ierr,i, fstatus
+
+      double precision tsum(t_last+2), t1(t_last+2),
+     >                 tming(t_last+2), tmaxg(t_last+2)
+      character        t_recs(t_last+2)*8
+
+      data t_recs/'total', 'init', 'psinv', 'resid', 'rprj3', 
+     >            'interp', 'norm2u3', 'comm3', 'rcomm',
+     >            ' totcomp', ' totcomm'/
+
+
+      call mpi_init(ierr)
+      call mpi_comm_rank(mpi_comm_world, me, ierr)
+      call mpi_comm_size(mpi_comm_world, nprocs, ierr)
+
+      root = 0
+      if (nprocs_compiled .gt. maxprocs) then
+         if (me .eq. root) write(*,20) nprocs_compiled, maxprocs
+ 20      format(' ERROR: compiled for ',i8,' processes'//
+     &          ' The maximum size allowed for this benchmark is ',i6)
+         call mpi_abort(MPI_COMM_WORLD, 1, ierr)
+         stop
+      endif
+
+      if (.not. convertdouble) then
+         dp_type = MPI_DOUBLE_PRECISION
+      else
+         dp_type = MPI_REAL
+      endif
+
+
+      do i = 1, t_last
+         call timer_clear(i)
+      end do
+
+      call mpi_barrier(MPI_COMM_WORLD, ierr)
+
+      call timer_start(T_init)
+      
+
+c---------------------------------------------------------------------
+c Read in and broadcast input data
+c---------------------------------------------------------------------
+
+      if( me .eq. root )then
+         write (*, 1000) 
+
+         open (unit=2,file='timer.flag',status='old',iostat=fstatus)
+         timeron = .false.
+         if (fstatus .eq. 0) then
+            timeron = .true.
+            close(2)
+         endif
+
+         open(unit=7,file="mg.input", status="old", iostat=fstatus)
+         if (fstatus .eq. 0) then
+            write(*,50) 
+ 50         format(' Reading from input file mg.input')
+            read(7,*) lt
+            read(7,*) nx(lt), ny(lt), nz(lt)
+            read(7,*) nit
+            read(7,*) (debug_vec(i),i=0,7)
+         else
+            write(*,51) 
+ 51         format(' No input file. Using compiled defaults ')
+            lt = lt_default
+            nit = nit_default
+            nx(lt) = nx_default
+            ny(lt) = ny_default
+            nz(lt) = nz_default
+            do i = 0,7
+               debug_vec(i) = debug_default
+            end do
+         endif
+      endif
+
+      call mpi_bcast(lt, 1, MPI_INTEGER, 0, mpi_comm_world, ierr)
+      call mpi_bcast(nit, 1, MPI_INTEGER, 0, mpi_comm_world, ierr)
+      call mpi_bcast(nx(lt), 1, MPI_INTEGER, 0, mpi_comm_world, ierr)
+      call mpi_bcast(ny(lt), 1, MPI_INTEGER, 0, mpi_comm_world, ierr)
+      call mpi_bcast(nz(lt), 1, MPI_INTEGER, 0, mpi_comm_world, ierr)
+      call mpi_bcast(debug_vec(0), 8, MPI_INTEGER, 0, 
+     >               mpi_comm_world, ierr)
+      call mpi_bcast(timeron, 1, MPI_LOGICAL, 0, mpi_comm_world, ierr)
+
+      if ( (nx(lt) .ne. ny(lt)) .or. (nx(lt) .ne. nz(lt)) ) then
+         Class = 'U' 
+      else if( nx(lt) .eq. 32 .and. nit .eq. 4 ) then
+         Class = 'S'
+      else if( nx(lt) .eq. 128 .and. nit .eq. 4 ) then
+         Class = 'W'
+      else if( nx(lt) .eq. 256 .and. nit .eq. 4 ) then  
+         Class = 'A'
+      else if( nx(lt) .eq. 256 .and. nit .eq. 20 ) then
+         Class = 'B'
+      else if( nx(lt) .eq. 512 .and. nit .eq. 20 ) then  
+         Class = 'C'
+      else if( nx(lt) .eq. 1024 .and. nit .eq. 50 ) then  
+         Class = 'D'
+      else if( nx(lt) .eq. 2048 .and. nit .eq. 50 ) then  
+         Class = 'E'
+      else
+         Class = 'U'
+      endif
+
+c---------------------------------------------------------------------
+c  Use these for debug info:
+c---------------------------------------------------------------------
+c     debug_vec(0) = 1 !=> report all norms
+c     debug_vec(1) = 1 !=> some setup information
+c     debug_vec(1) = 2 !=> more setup information
+c     debug_vec(2) = k => at level k or below, show result of resid
+c     debug_vec(3) = k => at level k or below, show result of psinv
+c     debug_vec(4) = k => at level k or below, show result of rprj
+c     debug_vec(5) = k => at level k or below, show result of interp
+c     debug_vec(6) = 1 => (unused)
+c     debug_vec(7) = 1 => (unused)
+c---------------------------------------------------------------------
+      a(0) = -8.0D0/3.0D0 
+      a(1) =  0.0D0 
+      a(2) =  1.0D0/6.0D0 
+      a(3) =  1.0D0/12.0D0
+      
+      if(Class .eq. 'A' .or. Class .eq. 'S'.or. Class .eq.'W') then
+c---------------------------------------------------------------------
+c     Coefficients for the S(a) smoother
+c---------------------------------------------------------------------
+         c(0) =  -3.0D0/8.0D0
+         c(1) =  +1.0D0/32.0D0
+         c(2) =  -1.0D0/64.0D0
+         c(3) =   0.0D0
+      else
+c---------------------------------------------------------------------
+c     Coefficients for the S(b) smoother
+c---------------------------------------------------------------------
+         c(0) =  -3.0D0/17.0D0
+         c(1) =  +1.0D0/33.0D0
+         c(2) =  -1.0D0/61.0D0
+         c(3) =   0.0D0
+      endif
+      lb = 1
+      k  = lt
+
+      call setup(n1,n2,n3,k)
+      call zero3(u,n1,n2,n3)
+      call zran3(v,n1,n2,n3,nx(lt),ny(lt),k)
+
+      call norm2u3(v,n1,n2,n3,rnm2,rnmu,nx(lt),ny(lt),nz(lt))
+
+      if( me .eq. root )then
+         write (*, 1001) nx(lt),ny(lt),nz(lt), Class
+         write (*, 1002) nit
+
+ 1000 format(//,' NAS Parallel Benchmarks 3.3 -- MG Benchmark', /)
+ 1001    format(' Size: ', i4, 'x', i4, 'x', i4, '  (class ', A, ')' )
+ 1002    format(' Iterations: ', i4)
+ 1003    format(' Number of processes: ', i6)
+         if (nprocs .ne. nprocs_compiled) then
+           write (*, 1004) nprocs_compiled
+           write (*, 1005) nprocs
+ 1004      format(' WARNING: compiled for ', i6, ' processes ')
+ 1005      format(' Number of active processes: ', i6, /)
+         else
+           write (*, 1003) nprocs
+         endif
+      endif
+
+      call resid(u,v,r,n1,n2,n3,a,k)
+      call norm2u3(r,n1,n2,n3,rnm2,rnmu,nx(lt),ny(lt),nz(lt))
+      old2 = rnm2
+      oldu = rnmu
+
+c---------------------------------------------------------------------
+c     One iteration for startup
+c---------------------------------------------------------------------
+      call mg3P(u,v,r,a,c,n1,n2,n3,k)
+      call resid(u,v,r,n1,n2,n3,a,k)
+      call setup(n1,n2,n3,k)
+      call zero3(u,n1,n2,n3)
+      call zran3(v,n1,n2,n3,nx(lt),ny(lt),k)
+
+      call timer_stop(T_init)
+      if( me .eq. root )then
+         tinit = timer_read(T_init)
+         write( *,'(/A,F15.3,A/)' ) 
+     >        ' Initialization time: ',tinit, ' seconds'
+      endif
+
+      do i = 1, t_last
+         if (i .ne. t_init) call timer_clear(i)
+      end do
+      call mpi_barrier(mpi_comm_world,ierr)
+
+      call timer_start(T_bench)
+
+      call resid(u,v,r,n1,n2,n3,a,k)
+      call norm2u3(r,n1,n2,n3,rnm2,rnmu,nx(lt),ny(lt),nz(lt))
+      old2 = rnm2
+      oldu = rnmu
+
+      do  it=1,nit
+         if (it.eq.1 .or. it.eq.nit .or. mod(it,5).eq.0) then
+            if (me .eq. root) write(*,80) it
+   80       format('  iter ',i4)
+         endif
+         call mg3P(u,v,r,a,c,n1,n2,n3,k)
+         call resid(u,v,r,n1,n2,n3,a,k)
+      enddo
+
+
+      call norm2u3(r,n1,n2,n3,rnm2,rnmu,nx(lt),ny(lt),nz(lt))
+
+      call timer_stop(T_bench)
+
+      t0 = timer_read(T_bench)
+
+      call mpi_reduce(t0,t,1,dp_type,
+     >     mpi_max,root,mpi_comm_world,ierr)
+      verified = .FALSE.
+      verify_value = 0.0
+      if( me .eq. root )then
+         write(*,100)
+ 100     format(/' Benchmark completed ')
+
+         epsilon = 1.d-8
+         if (Class .ne. 'U') then
+            if(Class.eq.'S') then
+               verify_value = 0.5307707005734d-04
+            elseif(Class.eq.'W') then
+               verify_value = 0.6467329375339d-05
+            elseif(Class.eq.'A') then
+               verify_value = 0.2433365309069d-05
+            elseif(Class.eq.'B') then
+               verify_value = 0.1800564401355d-05
+            elseif(Class.eq.'C') then
+               verify_value = 0.5706732285740d-06
+            elseif(Class.eq.'D') then
+               verify_value = 0.1583275060440d-09
+            elseif(Class.eq.'E') then
+               verify_value = 0.5630442584711d-10
+            endif
+
+            err = abs( rnm2 - verify_value ) / verify_value
+            if( err .le. epsilon ) then
+               verified = .TRUE.
+               write(*, 200)
+               write(*, 201) rnm2
+               write(*, 202) err
+ 200           format(' VERIFICATION SUCCESSFUL ')
+ 201           format(' L2 Norm is ', E20.13)
+ 202           format(' Error is   ', E20.13)
+            else
+               verified = .FALSE.
+               write(*, 300) 
+               write(*, 301) rnm2
+               write(*, 302) verify_value
+ 300           format(' VERIFICATION FAILED')
+ 301           format(' L2 Norm is             ', E20.13)
+ 302           format(' The correct L2 Norm is ', E20.13)
+            endif
+         else
+            verified = .FALSE.
+            write (*, 400)
+            write (*, 401)
+            write (*, 201) rnm2
+ 400        format(' Problem size unknown')
+ 401        format(' NO VERIFICATION PERFORMED')
+         endif
+
+         nn = 1.0d0*nx(lt)*ny(lt)*nz(lt)
+
+         if( t .ne. 0. ) then
+            mflops = 58.*1.0D-6*nit*nn / t
+         else
+            mflops = 0.0
+         endif
+
+         call print_results('MG', class, nx(lt), ny(lt), nz(lt), 
+     >                      nit, nprocs_compiled, nprocs, t,
+     >                      mflops, '          floating point', 
+     >                      verified, npbversion, compiletime,
+     >                      cs1, cs2, cs3, cs4, cs5, cs6, cs7)
+
+
+      endif
+
+
+      if (.not.timeron) goto 999
+
+      do i = 1, t_last
+         t1(i) = timer_read(i)
+      end do
+      t1(t_last+2) = t1(t_rcomm) + t1(t_comm3)
+      t1(t_last+1) = t1(t_bench) - t1(t_last+2)
+
+      call MPI_Reduce(t1, tsum,  t_last+2, dp_type, MPI_SUM, 
+     >                0, MPI_COMM_WORLD, ierr)
+      call MPI_Reduce(t1, tming, t_last+2, dp_type, MPI_MIN, 
+     >                0, MPI_COMM_WORLD, ierr)
+      call MPI_Reduce(t1, tmaxg, t_last+2, dp_type, MPI_MAX, 
+     >                0, MPI_COMM_WORLD, ierr)
+
+      if (me .eq. 0) then
+         write(*, 800) nprocs
+         do i = 1, t_last+2
+            tsum(i) = tsum(i) / nprocs
+            write(*, 810) i, t_recs(i), tming(i), tmaxg(i), tsum(i)
+         end do
+      endif
+ 800  format(' nprocs =', i6, 11x, 'minimum', 5x, 'maximum', 
+     >       5x, 'average')
+ 810  format(' timer ', i2, '(', A8, ') :', 3(2x,f10.4))
+
+ 999  continue
+      call mpi_finalize(ierr)
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine setup(n1,n2,n3,k)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'mpinpb.h'
+      include 'globals.h'
+
+      integer  is1, is2, is3, ie1, ie2, ie3
+      common /grid/ is1,is2,is3,ie1,ie2,ie3
+
+      integer n1,n2,n3,k
+      integer dx, dy, log_p, d, i, j
+
+      integer ax, next(3),mi(3,maxlevel),mip(3,maxlevel)
+      integer ng(3,maxlevel)
+      integer idi(3), pi(3), idin(3,-1:1)
+      integer s, dir,ierr
+
+      do  j=-1,1,1
+         do  d=1,3
+            msg_type(d,j) = 100*(j+2+10*d)
+         enddo
+      enddo
+
+      ng(1,lt) = nx(lt)
+      ng(2,lt) = ny(lt)
+      ng(3,lt) = nz(lt)
+      do  ax=1,3
+         next(ax) = 1
+         do  k=lt-1,1,-1
+            ng(ax,k) = ng(ax,k+1)/2
+         enddo
+      enddo
+ 61   format(10i4)
+      do  k=lt,1,-1
+         nx(k) = ng(1,k)
+         ny(k) = ng(2,k)
+         nz(k) = ng(3,k)
+      enddo
+
+      log_p  = log(float(nprocs)+0.0001)/log(2.0)
+      dx     = log_p/3
+      pi(1)  = 2**dx
+      idi(1) = mod(me,pi(1))
+
+      dy     = (log_p-dx)/2
+      pi(2)  = 2**dy
+      idi(2) = mod((me/pi(1)),pi(2))
+
+      pi(3)  = nprocs/(pi(1)*pi(2))
+      idi(3) = me/(pi(1)*pi(2))
+
+      do  k = lt,1,-1
+         dead(k) = .false.
+         do  ax = 1,3
+            take_ex(ax,k) = .false.
+            give_ex(ax,k) = .false.
+
+            mi(ax,k) = 2 + 
+     >           ((idi(ax)+1)*ng(ax,k))/pi(ax) -
+     >           ((idi(ax)+0)*ng(ax,k))/pi(ax)
+            mip(ax,k) = 2 + 
+     >           ((next(ax)+idi(ax)+1)*ng(ax,k))/pi(ax) -
+     >           ((next(ax)+idi(ax)+0)*ng(ax,k))/pi(ax) 
+
+            if(mip(ax,k).eq.2.or.mi(ax,k).eq.2)then
+               next(ax) = 2*next(ax)
+            endif
+
+            if( k+1 .le. lt )then
+               if((mip(ax,k).eq.2).and.(mi(ax,k).eq.3))then
+                  give_ex(ax,k+1) = .true.
+               endif
+               if((mip(ax,k).eq.3).and.(mi(ax,k).eq.2))then
+                  take_ex(ax,k+1) = .true.
+               endif
+            endif
+         enddo
+
+         if( mi(1,k).eq.2 .or. 
+     >        mi(2,k).eq.2 .or. 
+     >        mi(3,k).eq.2      )then
+            dead(k) = .true.
+         endif
+         m1(k) = mi(1,k)
+         m2(k) = mi(2,k)
+         m3(k) = mi(3,k)
+
+         do  ax=1,3
+            idin(ax,+1) = mod( idi(ax) + next(ax) + pi(ax) , pi(ax) )
+            idin(ax,-1) = mod( idi(ax) - next(ax) + pi(ax) , pi(ax) )
+         enddo
+         do  dir = 1,-1,-2
+            nbr(1,dir,k) = idin(1,dir) + pi(1)
+     >           *(idi(2)      + pi(2)
+     >           * idi(3))
+            nbr(2,dir,k) = idi(1)      + pi(1)
+     >           *(idin(2,dir) + pi(2)
+     >           * idi(3))
+            nbr(3,dir,k) = idi(1)      + pi(1)
+     >           *(idi(2)      + pi(2)
+     >           * idin(3,dir))
+         enddo
+      enddo
+
+      k = lt
+      is1 = 2 + ng(1,k) - ((pi(1)  -idi(1))*ng(1,lt))/pi(1)
+      ie1 = 1 + ng(1,k) - ((pi(1)-1-idi(1))*ng(1,lt))/pi(1)
+      n1 = 3 + ie1 - is1
+      is2 = 2 + ng(2,k) - ((pi(2)  -idi(2))*ng(2,lt))/pi(2)
+      ie2 = 1 + ng(2,k) - ((pi(2)-1-idi(2))*ng(2,lt))/pi(2)
+      n2 = 3 + ie2 - is2
+      is3 = 2 + ng(3,k) - ((pi(3)  -idi(3))*ng(3,lt))/pi(3)
+      ie3 = 1 + ng(3,k) - ((pi(3)-1-idi(3))*ng(3,lt))/pi(3)
+      n3 = 3 + ie3 - is3
+
+
+      ir(lt)=1
+      do  j = lt-1, 1, -1
+         ir(j)=ir(j+1)+m1(j+1)*m2(j+1)*m3(j+1)
+      enddo
+
+
+      if( debug_vec(1) .ge. 1 )then
+         if( me .eq. root )write(*,*)' in setup, '
+         if( me .eq. root )write(*,*)' me   k  lt  nx  ny  nz ',
+     >        ' n1  n2  n3 is1 is2 is3 ie1 ie2 ie3'
+         do  i=0,nprocs-1
+            if( me .eq. i )then
+               write(*,9) me,k,lt,ng(1,k),ng(2,k),ng(3,k),
+     >              n1,n2,n3,is1,is2,is3,ie1,ie2,ie3
+ 9             format(15i4)
+            endif
+            call mpi_barrier(mpi_comm_world,ierr)
+         enddo
+      endif
+      if( debug_vec(1) .ge. 2 )then
+         do  i=0,nprocs-1
+            if( me .eq. i )then
+               write(*,*)' '
+               write(*,*)' processor =',me
+               do  k=lt,1,-1
+                  write(*,7)k,idi(1),idi(2),idi(3),
+     >                 ((nbr(d,j,k),j=-1,1,2),d=1,3),
+     >                 (mi(d,k),d=1,3)
+               enddo
+ 7             format(i4,'idi=',3i4,'nbr=',3(2i4,'  '),'mi=',3i4,' ')
+               write(*,*)'idi(s) = ',(idi(s),s=1,3)
+               write(*,*)'dead(2), dead(1) = ',dead(2),dead(1)
+               do  ax=1,3
+                  write(*,*)'give_ex(ax,2)= ',give_ex(ax,2)
+                  write(*,*)'take_ex(ax,2)= ',take_ex(ax,2)
+               enddo
+            endif
+            call mpi_barrier(mpi_comm_world,ierr)
+         enddo
+      endif
+
+      k = lt
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine mg3P(u,v,r,a,c,n1,n2,n3,k)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     multigrid V-cycle routine
+c---------------------------------------------------------------------
+      implicit none
+
+      include 'mpinpb.h'
+      include 'globals.h'
+
+      integer n1, n2, n3, k
+      double precision u(nr),v(nv),r(nr)
+      double precision a(0:3),c(0:3)
+
+      integer j
+
+c---------------------------------------------------------------------
+c     down cycle.
+c     restrict the residual from the fine grid to the coarse
+c---------------------------------------------------------------------
+
+      do  k= lt, lb+1 , -1
+         j = k-1
+         call rprj3(r(ir(k)),m1(k),m2(k),m3(k),
+     >        r(ir(j)),m1(j),m2(j),m3(j),k)
+      enddo
+
+      k = lb
+c---------------------------------------------------------------------
+c     compute an approximate solution on the coarsest grid
+c---------------------------------------------------------------------
+      call zero3(u(ir(k)),m1(k),m2(k),m3(k))
+      call psinv(r(ir(k)),u(ir(k)),m1(k),m2(k),m3(k),c,k)
+
+      do  k = lb+1, lt-1     
+          j = k-1
+c---------------------------------------------------------------------
+c        prolongate from level k-1  to k
+c---------------------------------------------------------------------
+         call zero3(u(ir(k)),m1(k),m2(k),m3(k))
+         call interp(u(ir(j)),m1(j),m2(j),m3(j),
+     >               u(ir(k)),m1(k),m2(k),m3(k),k)
+c---------------------------------------------------------------------
+c        compute residual for level k
+c---------------------------------------------------------------------
+         call resid(u(ir(k)),r(ir(k)),r(ir(k)),m1(k),m2(k),m3(k),a,k)
+c---------------------------------------------------------------------
+c        apply smoother
+c---------------------------------------------------------------------
+         call psinv(r(ir(k)),u(ir(k)),m1(k),m2(k),m3(k),c,k)
+      enddo
+ 200  continue
+      j = lt - 1
+      k = lt
+      call interp(u(ir(j)),m1(j),m2(j),m3(j),u,n1,n2,n3,k)
+      call resid(u,v,r,n1,n2,n3,a,k)
+      call psinv(r,u,n1,n2,n3,c,k)
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine psinv( r,u,n1,n2,n3,c,k)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     psinv applies an approximate inverse as smoother:  u = u + Cr
+c
+c     This  implementation costs  15A + 4M per result, where
+c     A and M denote the costs of Addition and Multiplication.  
+c     Presuming coefficient c(3) is zero (the NPB assumes this,
+c     but it is thus not a general case), 2A + 1M may be eliminated,
+c     resulting in 13A + 3M.
+c     Note that this vectorizes, and is also fine for cache 
+c     based machines.  
+c---------------------------------------------------------------------
+      implicit none
+
+      include 'globals.h'
+
+      integer n1,n2,n3,k
+      double precision u(n1,n2,n3),r(n1,n2,n3),c(0:3)
+      integer i3, i2, i1
+
+      double precision r1(m), r2(m)
+      
+      if (timeron) call timer_start(t_psinv)
+      do i3=2,n3-1
+         do i2=2,n2-1
+            do i1=1,n1
+               r1(i1) = r(i1,i2-1,i3) + r(i1,i2+1,i3)
+     >                + r(i1,i2,i3-1) + r(i1,i2,i3+1)
+               r2(i1) = r(i1,i2-1,i3-1) + r(i1,i2+1,i3-1)
+     >                + r(i1,i2-1,i3+1) + r(i1,i2+1,i3+1)
+            enddo
+            do i1=2,n1-1
+               u(i1,i2,i3) = u(i1,i2,i3)
+     >                     + c(0) * r(i1,i2,i3)
+     >                     + c(1) * ( r(i1-1,i2,i3) + r(i1+1,i2,i3)
+     >                              + r1(i1) )
+     >                     + c(2) * ( r2(i1) + r1(i1-1) + r1(i1+1) )
+c---------------------------------------------------------------------
+c  Assume c(3) = 0    (Enable line below if c(3) not= 0)
+c---------------------------------------------------------------------
+c    >                     + c(3) * ( r2(i1-1) + r2(i1+1) )
+c---------------------------------------------------------------------
+            enddo
+         enddo
+      enddo
+      if (timeron) call timer_stop(t_psinv)
+
+c---------------------------------------------------------------------
+c     exchange boundary points
+c---------------------------------------------------------------------
+      call comm3(u,n1,n2,n3,k)
+
+      if( debug_vec(0) .ge. 1 )then
+         call rep_nrm(u,n1,n2,n3,'   psinv',k)
+      endif
+
+      if( debug_vec(3) .ge. k )then
+         call showall(u,n1,n2,n3)
+      endif
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine resid( u,v,r,n1,n2,n3,a,k )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     resid computes the residual:  r = v - Au
+c
+c     This  implementation costs  15A + 4M per result, where
+c     A and M denote the costs of Addition (or Subtraction) and 
+c     Multiplication, respectively. 
+c     Presuming coefficient a(1) is zero (the NPB assumes this,
+c     but it is thus not a general case), 3A + 1M may be eliminated,
+c     resulting in 12A + 3M.
+c     Note that this vectorizes, and is also fine for cache 
+c     based machines.  
+c---------------------------------------------------------------------
+      implicit none
+
+      include 'globals.h'
+
+      integer n1,n2,n3,k
+      double precision u(n1,n2,n3),v(n1,n2,n3),r(n1,n2,n3),a(0:3)
+      integer i3, i2, i1
+      double precision u1(m), u2(m)
+
+      if (timeron) call timer_start(t_resid)
+      do i3=2,n3-1
+         do i2=2,n2-1
+            do i1=1,n1
+               u1(i1) = u(i1,i2-1,i3) + u(i1,i2+1,i3)
+     >                + u(i1,i2,i3-1) + u(i1,i2,i3+1)
+               u2(i1) = u(i1,i2-1,i3-1) + u(i1,i2+1,i3-1)
+     >                + u(i1,i2-1,i3+1) + u(i1,i2+1,i3+1)
+            enddo
+            do i1=2,n1-1
+               r(i1,i2,i3) = v(i1,i2,i3)
+     >                     - a(0) * u(i1,i2,i3)
+c---------------------------------------------------------------------
+c  Assume a(1) = 0      (Enable 2 lines below if a(1) not= 0)
+c---------------------------------------------------------------------
+c    >                     - a(1) * ( u(i1-1,i2,i3) + u(i1+1,i2,i3)
+c    >                              + u1(i1) )
+c---------------------------------------------------------------------
+     >                     - a(2) * ( u2(i1) + u1(i1-1) + u1(i1+1) )
+     >                     - a(3) * ( u2(i1-1) + u2(i1+1) )
+            enddo
+         enddo
+      enddo
+      if (timeron) call timer_stop(t_resid)
+
+c---------------------------------------------------------------------
+c     exchange boundary data
+c---------------------------------------------------------------------
+      call comm3(r,n1,n2,n3,k)
+
+      if( debug_vec(0) .ge. 1 )then
+         call rep_nrm(r,n1,n2,n3,'   resid',k)
+      endif
+
+      if( debug_vec(2) .ge. k )then
+         call showall(r,n1,n2,n3)
+      endif
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine rprj3( r,m1k,m2k,m3k,s,m1j,m2j,m3j,k )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     rprj3 projects onto the next coarser grid, 
+c     using a trilinear Finite Element projection:  s = r' = P r
+c     
+c     This implementation costs 20A + 4M per result, where
+c     A and M denote the costs of Addition and Multiplication.  
+c     Note that this vectorizes, and is also fine for cache 
+c     based machines.  
+c---------------------------------------------------------------------
+      implicit none
+
+      include 'mpinpb.h'
+      include 'globals.h'
+
+      integer m1k, m2k, m3k, m1j, m2j, m3j,k
+      double precision r(m1k,m2k,m3k), s(m1j,m2j,m3j)
+      integer j3, j2, j1, i3, i2, i1, d1, d2, d3, j
+
+      double precision x1(m), y1(m), x2,y2
+
+
+      if (timeron) call timer_start(t_rprj3)
+      if(m1k.eq.3)then
+        d1 = 2
+      else
+        d1 = 1
+      endif
+
+      if(m2k.eq.3)then
+        d2 = 2
+      else
+        d2 = 1
+      endif
+
+      if(m3k.eq.3)then
+        d3 = 2
+      else
+        d3 = 1
+      endif
+
+      do  j3=2,m3j-1
+         i3 = 2*j3-d3
+C        i3 = 2*j3-1
+         do  j2=2,m2j-1
+            i2 = 2*j2-d2
+C           i2 = 2*j2-1
+
+            do j1=2,m1j
+              i1 = 2*j1-d1
+C             i1 = 2*j1-1
+              x1(i1-1) = r(i1-1,i2-1,i3  ) + r(i1-1,i2+1,i3  )
+     >                 + r(i1-1,i2,  i3-1) + r(i1-1,i2,  i3+1)
+              y1(i1-1) = r(i1-1,i2-1,i3-1) + r(i1-1,i2-1,i3+1)
+     >                 + r(i1-1,i2+1,i3-1) + r(i1-1,i2+1,i3+1)
+            enddo
+
+            do  j1=2,m1j-1
+              i1 = 2*j1-d1
+C             i1 = 2*j1-1
+              y2 = r(i1,  i2-1,i3-1) + r(i1,  i2-1,i3+1)
+     >           + r(i1,  i2+1,i3-1) + r(i1,  i2+1,i3+1)
+              x2 = r(i1,  i2-1,i3  ) + r(i1,  i2+1,i3  )
+     >           + r(i1,  i2,  i3-1) + r(i1,  i2,  i3+1)
+              s(j1,j2,j3) =
+     >               0.5D0 * r(i1,i2,i3)
+     >             + 0.25D0 * ( r(i1-1,i2,i3) + r(i1+1,i2,i3) + x2)
+     >             + 0.125D0 * ( x1(i1-1) + x1(i1+1) + y2)
+     >             + 0.0625D0 * ( y1(i1-1) + y1(i1+1) )
+            enddo
+
+         enddo
+      enddo
+      if (timeron) call timer_stop(t_rprj3)
+
+
+      j = k-1
+      call comm3(s,m1j,m2j,m3j,j)
+
+      if( debug_vec(0) .ge. 1 )then
+         call rep_nrm(s,m1j,m2j,m3j,'   rprj3',k-1)
+      endif
+
+      if( debug_vec(4) .ge. k )then
+         call showall(s,m1j,m2j,m3j)
+      endif
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine interp( z,mm1,mm2,mm3,u,n1,n2,n3,k )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     interp adds the trilinear interpolation of the correction
+c     from the coarser grid to the current approximation:  u = u + Qu'
+c     
+c     Observe that this implementation costs 16A + 4M, where
+c     A and M denote the costs of Addition and Multiplication.  
+c     Note that this vectorizes, and is also fine for cache 
+c     based machines.  Vector machines may get slightly better 
+c     performance however, with 8 separate "do i1" loops, rather than 4.
+c---------------------------------------------------------------------
+      implicit none
+
+      include 'mpinpb.h'
+      include 'globals.h'
+
+      integer mm1, mm2, mm3, n1, n2, n3,k
+      double precision z(mm1,mm2,mm3),u(n1,n2,n3)
+      integer i3, i2, i1, d1, d2, d3, t1, t2, t3
+
+c note that m = 1037 in globals.h, but here it only needs to be
+c 535 to handle up to 1024^3
+c      integer m
+c      parameter( m=535 )
+      double precision z1(m),z2(m),z3(m)
+
+
+      if (timeron) call timer_start(t_interp)
+      if( n1 .ne. 3 .and. n2 .ne. 3 .and. n3 .ne. 3 ) then
+
+         do  i3=1,mm3-1
+            do  i2=1,mm2-1
+
+               do i1=1,mm1
+                  z1(i1) = z(i1,i2+1,i3) + z(i1,i2,i3)
+                  z2(i1) = z(i1,i2,i3+1) + z(i1,i2,i3)
+                  z3(i1) = z(i1,i2+1,i3+1) + z(i1,i2,i3+1) + z1(i1)
+               enddo
+
+               do  i1=1,mm1-1
+                  u(2*i1-1,2*i2-1,2*i3-1)=u(2*i1-1,2*i2-1,2*i3-1)
+     >                 +z(i1,i2,i3)
+                  u(2*i1,2*i2-1,2*i3-1)=u(2*i1,2*i2-1,2*i3-1)
+     >                 +0.5d0*(z(i1+1,i2,i3)+z(i1,i2,i3))
+               enddo
+               do i1=1,mm1-1
+                  u(2*i1-1,2*i2,2*i3-1)=u(2*i1-1,2*i2,2*i3-1)
+     >                 +0.5d0 * z1(i1)
+                  u(2*i1,2*i2,2*i3-1)=u(2*i1,2*i2,2*i3-1)
+     >                 +0.25d0*( z1(i1) + z1(i1+1) )
+               enddo
+               do i1=1,mm1-1
+                  u(2*i1-1,2*i2-1,2*i3)=u(2*i1-1,2*i2-1,2*i3)
+     >                 +0.5d0 * z2(i1)
+                  u(2*i1,2*i2-1,2*i3)=u(2*i1,2*i2-1,2*i3)
+     >                 +0.25d0*( z2(i1) + z2(i1+1) )
+               enddo
+               do i1=1,mm1-1
+                  u(2*i1-1,2*i2,2*i3)=u(2*i1-1,2*i2,2*i3)
+     >                 +0.25d0* z3(i1)
+                  u(2*i1,2*i2,2*i3)=u(2*i1,2*i2,2*i3)
+     >                 +0.125d0*( z3(i1) + z3(i1+1) )
+               enddo
+            enddo
+         enddo
+
+      else
+
+         if(n1.eq.3)then
+            d1 = 2
+            t1 = 1
+         else
+            d1 = 1
+            t1 = 0
+         endif
+         
+         if(n2.eq.3)then
+            d2 = 2
+            t2 = 1
+         else
+            d2 = 1
+            t2 = 0
+         endif
+         
+         if(n3.eq.3)then
+            d3 = 2
+            t3 = 1
+         else
+            d3 = 1
+            t3 = 0
+         endif
+         
+         do  i3=d3,mm3-1
+            do  i2=d2,mm2-1
+               do  i1=d1,mm1-1
+                  u(2*i1-d1,2*i2-d2,2*i3-d3)=u(2*i1-d1,2*i2-d2,2*i3-d3)
+     >                 +z(i1,i2,i3)
+               enddo
+               do  i1=1,mm1-1
+                  u(2*i1-t1,2*i2-d2,2*i3-d3)=u(2*i1-t1,2*i2-d2,2*i3-d3)
+     >                 +0.5D0*(z(i1+1,i2,i3)+z(i1,i2,i3))
+               enddo
+            enddo
+            do  i2=1,mm2-1
+               do  i1=d1,mm1-1
+                  u(2*i1-d1,2*i2-t2,2*i3-d3)=u(2*i1-d1,2*i2-t2,2*i3-d3)
+     >                 +0.5D0*(z(i1,i2+1,i3)+z(i1,i2,i3))
+               enddo
+               do  i1=1,mm1-1
+                  u(2*i1-t1,2*i2-t2,2*i3-d3)=u(2*i1-t1,2*i2-t2,2*i3-d3)
+     >                 +0.25D0*(z(i1+1,i2+1,i3)+z(i1+1,i2,i3)
+     >                 +z(i1,  i2+1,i3)+z(i1,  i2,i3))
+               enddo
+            enddo
+         enddo
+
+         do  i3=1,mm3-1
+            do  i2=d2,mm2-1
+               do  i1=d1,mm1-1
+                  u(2*i1-d1,2*i2-d2,2*i3-t3)=u(2*i1-d1,2*i2-d2,2*i3-t3)
+     >                 +0.5D0*(z(i1,i2,i3+1)+z(i1,i2,i3))
+               enddo
+               do  i1=1,mm1-1
+                  u(2*i1-t1,2*i2-d2,2*i3-t3)=u(2*i1-t1,2*i2-d2,2*i3-t3)
+     >                 +0.25D0*(z(i1+1,i2,i3+1)+z(i1,i2,i3+1)
+     >                 +z(i1+1,i2,i3  )+z(i1,i2,i3  ))
+               enddo
+            enddo
+            do  i2=1,mm2-1
+               do  i1=d1,mm1-1
+                  u(2*i1-d1,2*i2-t2,2*i3-t3)=u(2*i1-d1,2*i2-t2,2*i3-t3)
+     >                 +0.25D0*(z(i1,i2+1,i3+1)+z(i1,i2,i3+1)
+     >                 +z(i1,i2+1,i3  )+z(i1,i2,i3  ))
+               enddo
+               do  i1=1,mm1-1
+                  u(2*i1-t1,2*i2-t2,2*i3-t3)=u(2*i1-t1,2*i2-t2,2*i3-t3)
+     >                 +0.125D0*(z(i1+1,i2+1,i3+1)+z(i1+1,i2,i3+1)
+     >                 +z(i1  ,i2+1,i3+1)+z(i1  ,i2,i3+1)
+     >                 +z(i1+1,i2+1,i3  )+z(i1+1,i2,i3  )
+     >                 +z(i1  ,i2+1,i3  )+z(i1  ,i2,i3  ))
+               enddo
+            enddo
+         enddo
+
+      endif
+      if (timeron) call timer_stop(t_interp)
+
+      call comm3_ex(u,n1,n2,n3,k)
+
+      if( debug_vec(0) .ge. 1 )then
+         call rep_nrm(z,mm1,mm2,mm3,'z: inter',k-1)
+         call rep_nrm(u,n1,n2,n3,'u: inter',k)
+      endif
+
+      if( debug_vec(5) .ge. k )then
+         call showall(z,mm1,mm2,mm3)
+         call showall(u,n1,n2,n3)
+      endif
+
+      return 
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine norm2u3(r,n1,n2,n3,rnm2,rnmu,nx0,ny0,nz0)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     norm2u3 evaluates approximations to the L2 norm and the
+c     uniform (or L-infinity or Chebyshev) norm, under the
+c     assumption that the boundaries are periodic or zero.  Add the
+c     boundaries in with half weight (quarter weight on the edges
+c     and eighth weight at the corners) for inhomogeneous boundaries.
+c---------------------------------------------------------------------
+      implicit none
+
+      include 'mpinpb.h'
+      include 'globals.h'
+
+      integer n1, n2, n3, nx0, ny0, nz0
+      double precision rnm2, rnmu, r(n1,n2,n3)
+      double precision s, a, ss
+      integer i3, i2, i1, ierr
+
+      double precision dn
+
+      if (timeron) call timer_start(t_norm2u3)
+      dn = 1.0d0*nx0*ny0*nz0
+
+      s=0.0D0
+      rnmu = 0.0D0
+      do  i3=2,n3-1
+         do  i2=2,n2-1
+            do  i1=2,n1-1
+               s=s+r(i1,i2,i3)**2
+               a=abs(r(i1,i2,i3))
+               if(a.gt.rnmu)rnmu=a
+            enddo
+         enddo
+      enddo
+      if (timeron) call timer_stop(t_norm2u3)
+
+      if (timeron) call timer_start(t_rcomm)
+      call mpi_allreduce(rnmu,ss,1,dp_type,
+     >     mpi_max,mpi_comm_world,ierr)
+      rnmu = ss
+      call mpi_allreduce(s, ss, 1, dp_type,
+     >     mpi_sum,mpi_comm_world,ierr)
+      s = ss
+      if (timeron) call timer_stop(t_rcomm)
+      rnm2=sqrt( s / dn )
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine rep_nrm(u,n1,n2,n3,title,kk)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     report on norm
+c---------------------------------------------------------------------
+      implicit none
+
+      include 'mpinpb.h'
+      include 'globals.h'
+
+      integer n1, n2, n3, kk
+      double precision u(n1,n2,n3)
+      character*8 title
+
+      double precision rnm2, rnmu
+
+
+      call norm2u3(u,n1,n2,n3,rnm2,rnmu,nx(kk),ny(kk),nz(kk))
+      if( me .eq. root )then
+         write(*,7)kk,title,rnm2,rnmu
+ 7       format(' Level',i2,' in ',a8,': norms =',D21.14,D21.14)
+      endif
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine comm3(u,n1,n2,n3,kk)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     comm3 organizes the communication on all borders 
+c---------------------------------------------------------------------
+      implicit none
+
+      include 'mpinpb.h'
+      include 'globals.h'
+
+      integer n1, n2, n3, kk
+      double precision u(n1,n2,n3)
+      integer axis
+
+      if( .not. dead(kk) )then
+         do  axis = 1, 3
+            if( nprocs .ne. 1) then
+   
+               call ready( axis, -1, kk )
+               call ready( axis, +1, kk )
+   
+               call give3( axis, +1, u, n1, n2, n3, kk )
+               call give3( axis, -1, u, n1, n2, n3, kk )
+   
+               call take3( axis, -1, u, n1, n2, n3 )
+               call take3( axis, +1, u, n1, n2, n3 )
+   
+            else
+               call comm1p( axis, u, n1, n2, n3, kk )
+            endif
+         enddo
+      else
+         call zero3(u,n1,n2,n3)
+      endif
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine comm3_ex(u,n1,n2,n3,kk)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     comm3_ex communicates to expand the number of processors
+c---------------------------------------------------------------------
+      implicit none
+
+      include 'mpinpb.h'
+      include 'globals.h'
+
+      integer n1, n2, n3, kk
+      double precision u(n1,n2,n3)
+      integer axis
+
+      do  axis = 1, 3
+         if( nprocs .ne. 1 ) then
+            if( take_ex( axis, kk ) )then
+               call ready( axis, -1, kk )
+               call ready( axis, +1, kk )
+               call take3_ex( axis, -1, u, n1, n2, n3 )
+               call take3_ex( axis, +1, u, n1, n2, n3 )
+            endif
+   
+            if( give_ex( axis, kk ) )then
+               call give3_ex( axis, +1, u, n1, n2, n3, kk )
+               call give3_ex( axis, -1, u, n1, n2, n3, kk )
+            endif
+         else
+            call comm1p_ex( axis, u, n1, n2, n3, kk )
+         endif
+      enddo
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine ready( axis, dir, k )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     ready allocates a buffer to take in a message
+c---------------------------------------------------------------------
+      implicit none
+
+      include 'mpinpb.h'
+      include 'globals.h'
+
+      integer axis, dir, k
+      integer buff_id,buff_len,i,ierr
+
+      buff_id = 3 + dir
+      buff_len = nm2
+
+      do  i=1,nm2
+         buff(i,buff_id) = 0.0D0
+      enddo
+
+
+c---------------------------------------------------------------------
+c     fake message request type
+c---------------------------------------------------------------------
+      if (timeron) call timer_start(t_comm3)
+      msg_id(axis,dir,1) = msg_type(axis,dir) +1000*me
+
+      call mpi_irecv( buff(1,buff_id), buff_len,
+     >     dp_type, nbr(axis,-dir,k), msg_type(axis,dir), 
+     >     mpi_comm_world, msg_id(axis,dir,1), ierr)
+      if (timeron) call timer_stop(t_comm3)
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine give3( axis, dir, u, n1, n2, n3, k )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     give3 sends border data out in the requested direction
+c---------------------------------------------------------------------
+      implicit none
+
+      include 'mpinpb.h'
+      include 'globals.h'
+
+      integer axis, dir, n1, n2, n3, k, ierr
+      double precision u( n1, n2, n3 )
+
+      integer i3, i2, i1, buff_len,buff_id
+
+      buff_id = 2 + dir 
+      buff_len = 0
+
+      if( axis .eq.  1 )then
+         if( dir .eq. -1 )then
+
+            do  i3=2,n3-1
+               do  i2=2,n2-1
+                  buff_len = buff_len + 1
+                  buff(buff_len,buff_id ) = u( 2,  i2,i3)
+               enddo
+            enddo
+
+            if (timeron) call timer_start(t_comm3)
+            call mpi_send( 
+     >           buff(1, buff_id ), buff_len,dp_type,
+     >           nbr( axis, dir, k ), msg_type(axis,dir), 
+     >           mpi_comm_world, ierr)
+            if (timeron) call timer_stop(t_comm3)
+
+         else if( dir .eq. +1 ) then
+
+            do  i3=2,n3-1
+               do  i2=2,n2-1
+                  buff_len = buff_len + 1
+                  buff(buff_len, buff_id ) = u( n1-1, i2,i3)
+               enddo
+            enddo
+
+            if (timeron) call timer_start(t_comm3)
+            call mpi_send( 
+     >           buff(1, buff_id ), buff_len,dp_type,
+     >           nbr( axis, dir, k ), msg_type(axis,dir), 
+     >           mpi_comm_world, ierr)
+            if (timeron) call timer_stop(t_comm3)
+
+         endif
+      endif
+
+      if( axis .eq.  2 )then
+         if( dir .eq. -1 )then
+
+            do  i3=2,n3-1
+               do  i1=1,n1
+                  buff_len = buff_len + 1
+                  buff(buff_len, buff_id ) = u( i1,  2,i3)
+               enddo
+            enddo
+
+            if (timeron) call timer_start(t_comm3)
+            call mpi_send( 
+     >           buff(1, buff_id ), buff_len,dp_type,
+     >           nbr( axis, dir, k ), msg_type(axis,dir), 
+     >           mpi_comm_world, ierr)
+            if (timeron) call timer_stop(t_comm3)
+
+         else if( dir .eq. +1 ) then
+
+            do  i3=2,n3-1
+               do  i1=1,n1
+                  buff_len = buff_len + 1
+                  buff(buff_len,  buff_id )= u( i1,n2-1,i3)
+               enddo
+            enddo
+
+            if (timeron) call timer_start(t_comm3)
+            call mpi_send( 
+     >           buff(1, buff_id ), buff_len,dp_type,
+     >           nbr( axis, dir, k ), msg_type(axis,dir), 
+     >           mpi_comm_world, ierr)
+            if (timeron) call timer_stop(t_comm3)
+
+         endif
+      endif
+
+      if( axis .eq.  3 )then
+         if( dir .eq. -1 )then
+
+            do  i2=1,n2
+               do  i1=1,n1
+                  buff_len = buff_len + 1
+                  buff(buff_len, buff_id ) = u( i1,i2,2)
+               enddo
+            enddo
+
+            if (timeron) call timer_start(t_comm3)
+            call mpi_send( 
+     >           buff(1, buff_id ), buff_len,dp_type,
+     >           nbr( axis, dir, k ), msg_type(axis,dir), 
+     >           mpi_comm_world, ierr)
+            if (timeron) call timer_stop(t_comm3)
+
+         else if( dir .eq. +1 ) then
+
+            do  i2=1,n2
+               do  i1=1,n1
+                  buff_len = buff_len + 1
+                  buff(buff_len, buff_id ) = u( i1,i2,n3-1)
+               enddo
+            enddo
+
+            if (timeron) call timer_start(t_comm3)
+            call mpi_send( 
+     >           buff(1, buff_id ), buff_len,dp_type,
+     >           nbr( axis, dir, k ), msg_type(axis,dir), 
+     >           mpi_comm_world, ierr)
+            if (timeron) call timer_stop(t_comm3)
+
+         endif
+      endif
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine take3( axis, dir, u, n1, n2, n3 )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     take3 copies in border data from the requested direction
+c---------------------------------------------------------------------
+      implicit none
+
+      include 'mpinpb.h'
+      include 'globals.h'
+
+      integer axis, dir, n1, n2, n3
+      double precision u( n1, n2, n3 )
+
+      integer buff_id, indx
+
+      integer status(mpi_status_size), ierr
+
+      integer i3, i2, i1
+
+      if (timeron) call timer_start(t_comm3)
+      call mpi_wait( msg_id( axis, dir, 1 ),status,ierr)
+      if (timeron) call timer_stop(t_comm3)
+      buff_id = 3 + dir
+      indx = 0
+
+      if( axis .eq.  1 )then
+         if( dir .eq. -1 )then
+
+            do  i3=2,n3-1
+               do  i2=2,n2-1
+                  indx = indx + 1
+                  u(n1,i2,i3) = buff(indx, buff_id )
+               enddo
+            enddo
+
+         else if( dir .eq. +1 ) then
+
+            do  i3=2,n3-1
+               do  i2=2,n2-1
+                  indx = indx + 1
+                  u(1,i2,i3) = buff(indx, buff_id )
+               enddo
+            enddo
+
+         endif
+      endif
+
+      if( axis .eq.  2 )then
+         if( dir .eq. -1 )then
+
+            do  i3=2,n3-1
+               do  i1=1,n1
+                  indx = indx + 1
+                  u(i1,n2,i3) = buff(indx, buff_id )
+               enddo
+            enddo
+
+         else if( dir .eq. +1 ) then
+
+            do  i3=2,n3-1
+               do  i1=1,n1
+                  indx = indx + 1
+                  u(i1,1,i3) = buff(indx, buff_id )
+               enddo
+            enddo
+
+         endif
+      endif
+
+      if( axis .eq.  3 )then
+         if( dir .eq. -1 )then
+
+            do  i2=1,n2
+               do  i1=1,n1
+                  indx = indx + 1
+                  u(i1,i2,n3) = buff(indx, buff_id )
+               enddo
+            enddo
+
+         else if( dir .eq. +1 ) then
+
+            do  i2=1,n2
+               do  i1=1,n1
+                  indx = indx + 1
+                  u(i1,i2,1) = buff(indx, buff_id )
+               enddo
+            enddo
+
+         endif
+      endif
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine give3_ex( axis, dir, u, n1, n2, n3, k )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     give3_ex sends border data out to expand number of processors
+c---------------------------------------------------------------------
+      implicit none
+
+      include 'mpinpb.h'
+      include 'globals.h'
+
+      integer axis, dir, n1, n2, n3, k, ierr
+      double precision u( n1, n2, n3 )
+
+      integer i3, i2, i1, buff_len, buff_id
+
+      buff_id = 2 + dir 
+      buff_len = 0
+
+      if( axis .eq.  1 )then
+         if( dir .eq. -1 )then
+
+            do  i3=1,n3
+               do  i2=1,n2
+                  buff_len = buff_len + 1
+                  buff(buff_len,buff_id ) = u( 2,  i2,i3)
+               enddo
+            enddo
+
+            if (timeron) call timer_start(t_comm3)
+            call mpi_send( 
+     >           buff(1, buff_id ), buff_len,dp_type,
+     >           nbr( axis, dir, k ), msg_type(axis,dir), 
+     >           mpi_comm_world, ierr)
+            if (timeron) call timer_stop(t_comm3)
+
+         else if( dir .eq. +1 ) then
+
+            do  i3=1,n3
+               do  i2=1,n2
+                  do  i1=n1-1,n1
+                     buff_len = buff_len + 1
+                     buff(buff_len,buff_id)= u(i1,i2,i3)
+                  enddo
+               enddo
+            enddo
+
+            if (timeron) call timer_start(t_comm3)
+            call mpi_send( 
+     >           buff(1, buff_id ), buff_len,dp_type,
+     >           nbr( axis, dir, k ), msg_type(axis,dir), 
+     >           mpi_comm_world, ierr)
+            if (timeron) call timer_stop(t_comm3)
+
+         endif
+      endif
+
+      if( axis .eq.  2 )then
+         if( dir .eq. -1 )then
+
+            do  i3=1,n3
+               do  i1=1,n1
+                  buff_len = buff_len + 1
+                  buff(buff_len, buff_id ) = u( i1,  2,i3)
+               enddo
+            enddo
+
+            if (timeron) call timer_start(t_comm3)
+            call mpi_send( 
+     >           buff(1, buff_id ), buff_len,dp_type,
+     >           nbr( axis, dir, k ), msg_type(axis,dir), 
+     >           mpi_comm_world, ierr)
+            if (timeron) call timer_stop(t_comm3)
+
+         else if( dir .eq. +1 ) then
+
+            do  i3=1,n3
+               do  i2=n2-1,n2
+                  do  i1=1,n1
+                     buff_len = buff_len + 1
+                     buff(buff_len,buff_id )= u(i1,i2,i3)
+                  enddo
+               enddo
+            enddo
+
+            if (timeron) call timer_start(t_comm3)
+            call mpi_send( 
+     >           buff(1, buff_id ), buff_len,dp_type,
+     >           nbr( axis, dir, k ), msg_type(axis,dir), 
+     >           mpi_comm_world, ierr)
+            if (timeron) call timer_stop(t_comm3)
+
+         endif
+      endif
+
+      if( axis .eq.  3 )then
+         if( dir .eq. -1 )then
+
+            do  i2=1,n2
+               do  i1=1,n1
+                  buff_len = buff_len + 1
+                  buff(buff_len, buff_id ) = u( i1,i2,2)
+               enddo
+            enddo
+
+            if (timeron) call timer_start(t_comm3)
+            call mpi_send( 
+     >           buff(1, buff_id ), buff_len,dp_type,
+     >           nbr( axis, dir, k ), msg_type(axis,dir), 
+     >           mpi_comm_world, ierr)
+            if (timeron) call timer_stop(t_comm3)
+
+         else if( dir .eq. +1 ) then
+
+            do  i3=n3-1,n3
+               do  i2=1,n2
+                  do  i1=1,n1
+                     buff_len = buff_len + 1
+                     buff(buff_len, buff_id ) = u( i1,i2,i3)
+                  enddo
+               enddo
+            enddo
+
+            if (timeron) call timer_start(t_comm3)
+            call mpi_send( 
+     >           buff(1, buff_id ), buff_len,dp_type,
+     >           nbr( axis, dir, k ), msg_type(axis,dir), 
+     >           mpi_comm_world, ierr)
+            if (timeron) call timer_stop(t_comm3)
+
+         endif
+      endif
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine take3_ex( axis, dir, u, n1, n2, n3 )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     take3_ex copies in border data to expand number of processors
+c---------------------------------------------------------------------
+      implicit none
+
+      include 'mpinpb.h'
+      include 'globals.h'
+
+      integer axis, dir, n1, n2, n3
+      double precision u( n1, n2, n3 )
+
+      integer buff_id, indx
+
+      integer status(mpi_status_size) , ierr
+
+      integer i3, i2, i1
+
+      if (timeron) call timer_start(t_comm3)
+      call mpi_wait( msg_id( axis, dir, 1 ),status,ierr)
+      if (timeron) call timer_stop(t_comm3)
+      buff_id = 3 + dir
+      indx = 0
+
+      if( axis .eq.  1 )then
+         if( dir .eq. -1 )then
+
+            do  i3=1,n3
+               do  i2=1,n2
+                  indx = indx + 1
+                  u(n1,i2,i3) = buff(indx, buff_id )
+               enddo
+            enddo
+
+         else if( dir .eq. +1 ) then
+
+            do  i3=1,n3
+               do  i2=1,n2
+                  do  i1=1,2
+                     indx = indx + 1
+                     u(i1,i2,i3) = buff(indx,buff_id)
+                  enddo
+               enddo
+            enddo
+
+         endif
+      endif
+
+      if( axis .eq.  2 )then
+         if( dir .eq. -1 )then
+
+            do  i3=1,n3
+               do  i1=1,n1
+                  indx = indx + 1
+                  u(i1,n2,i3) = buff(indx, buff_id )
+               enddo
+            enddo
+
+         else if( dir .eq. +1 ) then
+
+            do  i3=1,n3
+               do  i2=1,2
+                  do  i1=1,n1
+                     indx = indx + 1
+                     u(i1,i2,i3) = buff(indx,buff_id)
+                  enddo
+               enddo
+            enddo
+
+         endif
+      endif
+
+      if( axis .eq.  3 )then
+         if( dir .eq. -1 )then
+
+            do  i2=1,n2
+               do  i1=1,n1
+                  indx = indx + 1
+                  u(i1,i2,n3) = buff(indx, buff_id )
+               enddo
+            enddo
+
+         else if( dir .eq. +1 ) then
+
+            do  i3=1,2
+               do  i2=1,n2
+                  do  i1=1,n1
+                     indx = indx + 1
+                     u(i1,i2,i3) = buff(indx,buff_id)
+                  enddo
+               enddo
+            enddo
+
+         endif
+      endif
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine comm1p( axis, u, n1, n2, n3, kk )
+
+c---------------------------------------------------------------------
+c     comm1p does the border exchange for one axis when nprocs = 1
+c---------------------------------------------------------------------
+      implicit none
+
+      include 'mpinpb.h'
+      include 'globals.h'
+
+      integer axis, dir, n1, n2, n3
+      double precision u( n1, n2, n3 )
+
+      integer i3, i2, i1, buff_len,buff_id
+      integer i, kk, indx
+
+      dir = -1
+
+      buff_id = 3 + dir
+      buff_len = nm2
+
+      do  i=1,nm2
+         buff(i,buff_id) = 0.0D0
+      enddo
+
+
+      dir = +1
+
+      buff_id = 3 + dir
+      buff_len = nm2
+
+      do  i=1,nm2
+         buff(i,buff_id) = 0.0D0
+      enddo
+
+      dir = +1
+
+      buff_id = 2 + dir 
+      buff_len = 0
+
+      if( axis .eq.  1 )then
+         do  i3=2,n3-1
+            do  i2=2,n2-1
+               buff_len = buff_len + 1
+               buff(buff_len, buff_id ) = u( n1-1, i2,i3)
+            enddo
+         enddo
+      endif
+
+      if( axis .eq.  2 )then
+         do  i3=2,n3-1
+            do  i1=1,n1
+               buff_len = buff_len + 1
+               buff(buff_len,  buff_id )= u( i1,n2-1,i3)
+            enddo
+         enddo
+      endif
+
+      if( axis .eq.  3 )then
+         do  i2=1,n2
+            do  i1=1,n1
+               buff_len = buff_len + 1
+               buff(buff_len, buff_id ) = u( i1,i2,n3-1)
+            enddo
+         enddo
+      endif
+
+      dir = -1
+
+      buff_id = 2 + dir 
+      buff_len = 0
+
+      if( axis .eq.  1 )then
+         do  i3=2,n3-1
+            do  i2=2,n2-1
+               buff_len = buff_len + 1
+               buff(buff_len,buff_id ) = u( 2,  i2,i3)
+            enddo
+         enddo
+      endif
+
+      if( axis .eq.  2 )then
+         do  i3=2,n3-1
+            do  i1=1,n1
+               buff_len = buff_len + 1
+               buff(buff_len, buff_id ) = u( i1,  2,i3)
+            enddo
+         enddo
+      endif
+
+      if( axis .eq.  3 )then
+         do  i2=1,n2
+            do  i1=1,n1
+               buff_len = buff_len + 1
+               buff(buff_len, buff_id ) = u( i1,i2,2)
+            enddo
+         enddo
+      endif
+
+      do  i=1,nm2
+         buff(i,4) = buff(i,3)
+         buff(i,2) = buff(i,1)
+      enddo
+
+      dir = -1
+
+      buff_id = 3 + dir
+      indx = 0
+
+      if( axis .eq.  1 )then
+         do  i3=2,n3-1
+            do  i2=2,n2-1
+               indx = indx + 1
+               u(n1,i2,i3) = buff(indx, buff_id )
+            enddo
+         enddo
+      endif
+
+      if( axis .eq.  2 )then
+         do  i3=2,n3-1
+            do  i1=1,n1
+               indx = indx + 1
+               u(i1,n2,i3) = buff(indx, buff_id )
+            enddo
+         enddo
+      endif
+
+      if( axis .eq.  3 )then
+         do  i2=1,n2
+            do  i1=1,n1
+               indx = indx + 1
+               u(i1,i2,n3) = buff(indx, buff_id )
+            enddo
+         enddo
+      endif
+
+
+      dir = +1
+
+      buff_id = 3 + dir
+      indx = 0
+
+      if( axis .eq.  1 )then
+         do  i3=2,n3-1
+            do  i2=2,n2-1
+               indx = indx + 1
+               u(1,i2,i3) = buff(indx, buff_id )
+            enddo
+         enddo
+      endif
+
+      if( axis .eq.  2 )then
+         do  i3=2,n3-1
+            do  i1=1,n1
+               indx = indx + 1
+               u(i1,1,i3) = buff(indx, buff_id )
+            enddo
+         enddo
+      endif
+
+      if( axis .eq.  3 )then
+         do  i2=1,n2
+            do  i1=1,n1
+               indx = indx + 1
+               u(i1,i2,1) = buff(indx, buff_id )
+            enddo
+         enddo
+      endif
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine comm1p_ex( axis, u, n1, n2, n3, kk )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'mpinpb.h'
+      include 'globals.h'
+
+      integer axis, dir, n1, n2, n3
+      double precision u( n1, n2, n3 )
+
+      integer i3, i2, i1, buff_len,buff_id
+      integer i, kk, indx
+
+      if( take_ex( axis, kk ) ) then
+
+         dir = -1
+
+         buff_id = 3 + dir
+         buff_len = nm2
+
+         do  i=1,nm2
+            buff(i,buff_id) = 0.0D0
+         enddo
+
+
+         dir = +1
+
+         buff_id = 3 + dir
+         buff_len = nm2
+
+         do  i=1,nm2
+            buff(i,buff_id) = 0.0D0
+         enddo
+
+
+         dir = -1
+
+         buff_id = 3 + dir
+         indx = 0
+
+         if( axis .eq.  1 )then
+            do  i3=1,n3
+               do  i2=1,n2
+                  indx = indx + 1
+                  u(n1,i2,i3) = buff(indx, buff_id )
+               enddo
+            enddo
+         endif
+
+         if( axis .eq.  2 )then
+            do  i3=1,n3
+               do  i1=1,n1
+                  indx = indx + 1
+                  u(i1,n2,i3) = buff(indx, buff_id )
+               enddo
+            enddo
+         endif
+
+         if( axis .eq.  3 )then
+            do  i2=1,n2
+               do  i1=1,n1
+                  indx = indx + 1
+                  u(i1,i2,n3) = buff(indx, buff_id )
+               enddo
+            enddo
+         endif
+
+         dir = +1
+
+         buff_id = 3 + dir
+         indx = 0
+
+         if( axis .eq.  1 )then
+            do  i3=1,n3
+               do  i2=1,n2
+                  do  i1=1,2
+                     indx = indx + 1
+                     u(i1,i2,i3) = buff(indx,buff_id)
+                  enddo
+               enddo
+            enddo
+         endif
+
+         if( axis .eq.  2 )then
+            do  i3=1,n3
+               do  i2=1,2
+                  do  i1=1,n1
+                     indx = indx + 1
+                     u(i1,i2,i3) = buff(indx,buff_id)
+                  enddo
+               enddo
+            enddo
+         endif
+
+         if( axis .eq.  3 )then
+            do  i3=1,2
+               do  i2=1,n2
+                  do  i1=1,n1
+                     indx = indx + 1
+                     u(i1,i2,i3) = buff(indx,buff_id)
+                  enddo
+               enddo
+            enddo
+         endif
+
+      endif
+
+      if( give_ex( axis, kk ) )then
+
+         dir = +1
+
+         buff_id = 2 + dir 
+         buff_len = 0
+
+         if( axis .eq.  1 )then
+            do  i3=1,n3
+               do  i2=1,n2
+                  do  i1=n1-1,n1
+                     buff_len = buff_len + 1
+                     buff(buff_len,buff_id)= u(i1,i2,i3)
+                  enddo
+               enddo
+            enddo
+         endif
+
+         if( axis .eq.  2 )then
+            do  i3=1,n3
+               do  i2=n2-1,n2
+                  do  i1=1,n1
+                     buff_len = buff_len + 1
+                     buff(buff_len,buff_id )= u(i1,i2,i3)
+                  enddo
+               enddo
+            enddo
+         endif
+
+         if( axis .eq.  3 )then
+            do  i3=n3-1,n3
+               do  i2=1,n2
+                  do  i1=1,n1
+                     buff_len = buff_len + 1
+                     buff(buff_len, buff_id ) = u( i1,i2,i3)
+                  enddo
+               enddo
+            enddo
+         endif
+
+         dir = -1
+
+         buff_id = 2 + dir 
+         buff_len = 0
+
+         if( axis .eq.  1 )then
+            do  i3=1,n3
+               do  i2=1,n2
+                  buff_len = buff_len + 1
+                  buff(buff_len,buff_id ) = u( 2,  i2,i3)
+               enddo
+            enddo
+         endif
+
+         if( axis .eq.  2 )then
+            do  i3=1,n3
+               do  i1=1,n1
+                  buff_len = buff_len + 1
+                  buff(buff_len, buff_id ) = u( i1,  2,i3)
+               enddo
+            enddo
+         endif
+
+         if( axis .eq.  3 )then
+            do  i2=1,n2
+               do  i1=1,n1
+                  buff_len = buff_len + 1
+                  buff(buff_len, buff_id ) = u( i1,i2,2)
+               enddo
+            enddo
+         endif
+
+      endif
+
+      do  i=1,nm2
+         buff(i,4) = buff(i,3)
+         buff(i,2) = buff(i,1)
+      enddo
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine zran3(z,n1,n2,n3,nx,ny,k)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     zran3  loads +1 at ten randomly chosen points,
+c     loads -1 at a different ten random points,
+c     and zero elsewhere.
+c---------------------------------------------------------------------
+      implicit none
+
+      include 'mpinpb.h'
+
+      integer  is1, is2, is3, ie1, ie2, ie3
+      common /grid/ is1,is2,is3,ie1,ie2,ie3
+
+      integer n1, n2, n3, k, nx, ny, ierr, i0, m0, m1
+      double precision z(n1,n2,n3)
+
+      integer mm, i1, i2, i3, d1, e1, e2, e3
+      double precision x, a
+      double precision xx, x0, x1, a1, a2, ai, power
+      parameter( mm = 10,  a = 5.D0 ** 13, x = 314159265.D0)
+      double precision ten( mm, 0:1 ), temp, best
+      integer i, j1( mm, 0:1 ), j2( mm, 0:1 ), j3( mm, 0:1 )
+      integer jg( 0:3, mm, 0:1 ), jg_temp(4)
+
+      external randlc
+      double precision randlc, rdummy
+
+      a1 = power( a, nx, 1, 0 )
+      a2 = power( a, nx, ny, 0 )
+
+      call zero3(z,n1,n2,n3)
+
+c      i = is1-2+nx*(is2-2+ny*(is3-2))
+
+      ai = power( a, nx, is2-2+ny*(is3-2), is1-2 )
+      d1 = ie1 - is1 + 1
+      e1 = ie1 - is1 + 2
+      e2 = ie2 - is2 + 2
+      e3 = ie3 - is3 + 2
+      x0 = x
+      rdummy = randlc( x0, ai )
+      do  i3 = 2, e3
+         x1 = x0
+         do  i2 = 2, e2
+            xx = x1
+            call vranlc( d1, xx, a, z( 2, i2, i3 ))
+            rdummy = randlc( x1, a1 )
+         enddo
+         rdummy = randlc( x0, a2 )
+      enddo
+
+c---------------------------------------------------------------------
+c       call comm3(z,n1,n2,n3)
+c       call showall(z,n1,n2,n3)
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     each processor looks for twenty candidates
+c---------------------------------------------------------------------
+      do  i=1,mm
+         ten( i, 1 ) = 0.0D0
+         j1( i, 1 ) = 0
+         j2( i, 1 ) = 0
+         j3( i, 1 ) = 0
+         ten( i, 0 ) = 1.0D0
+         j1( i, 0 ) = 0
+         j2( i, 0 ) = 0
+         j3( i, 0 ) = 0
+      enddo
+
+      do  i3=2,n3-1
+         do  i2=2,n2-1
+            do  i1=2,n1-1
+               if( z(i1,i2,i3) .gt. ten( 1, 1 ) )then
+                  ten(1,1) = z(i1,i2,i3) 
+                  j1(1,1) = i1
+                  j2(1,1) = i2
+                  j3(1,1) = i3
+                  call bubble( ten, j1, j2, j3, mm, 1 )
+               endif
+               if( z(i1,i2,i3) .lt. ten( 1, 0 ) )then
+                  ten(1,0) = z(i1,i2,i3) 
+                  j1(1,0) = i1
+                  j2(1,0) = i2
+                  j3(1,0) = i3
+                  call bubble( ten, j1, j2, j3, mm, 0 )
+               endif
+            enddo
+         enddo
+      enddo
+
+      call mpi_barrier(mpi_comm_world,ierr)
+
+c---------------------------------------------------------------------
+c     Now which of these are globally best?
+c---------------------------------------------------------------------
+      i1 = mm
+      i0 = mm
+      do  i=mm,1,-1
+
+         best = z( j1(i1,1), j2(i1,1), j3(i1,1) )
+         call mpi_allreduce(best,temp,1,dp_type,
+     >        mpi_max,mpi_comm_world,ierr)
+         best = temp
+         if(best.eq.z(j1(i1,1),j2(i1,1),j3(i1,1)))then
+            jg( 0, i, 1) = me
+            jg( 1, i, 1) = is1 - 2 + j1( i1, 1 ) 
+            jg( 2, i, 1) = is2 - 2 + j2( i1, 1 ) 
+            jg( 3, i, 1) = is3 - 2 + j3( i1, 1 ) 
+            i1 = i1-1
+         else
+            jg( 0, i, 1) = 0
+            jg( 1, i, 1) = 0
+            jg( 2, i, 1) = 0
+            jg( 3, i, 1) = 0
+         endif
+         ten( i, 1 ) = best
+         call mpi_allreduce(jg(0,i,1), jg_temp,4,MPI_INTEGER,
+     >        mpi_max,mpi_comm_world,ierr)
+         jg( 0, i, 1) =  jg_temp(1)
+         jg( 1, i, 1) =  jg_temp(2)
+         jg( 2, i, 1) =  jg_temp(3)
+         jg( 3, i, 1) =  jg_temp(4)
+
+         best = z( j1(i0,0), j2(i0,0), j3(i0,0) )
+         call mpi_allreduce(best,temp,1,dp_type,
+     >        mpi_min,mpi_comm_world,ierr)
+         best = temp
+         if(best.eq.z(j1(i0,0),j2(i0,0),j3(i0,0)))then
+            jg( 0, i, 0) = me
+            jg( 1, i, 0) = is1 - 2 + j1( i0, 0 ) 
+            jg( 2, i, 0) = is2 - 2 + j2( i0, 0 ) 
+            jg( 3, i, 0) = is3 - 2 + j3( i0, 0 ) 
+            i0 = i0-1
+         else
+            jg( 0, i, 0) = 0
+            jg( 1, i, 0) = 0
+            jg( 2, i, 0) = 0
+            jg( 3, i, 0) = 0
+         endif
+         ten( i, 0 ) = best
+         call mpi_allreduce(jg(0,i,0), jg_temp,4,MPI_INTEGER,
+     >        mpi_max,mpi_comm_world,ierr)
+         jg( 0, i, 0) =  jg_temp(1)
+         jg( 1, i, 0) =  jg_temp(2)
+         jg( 2, i, 0) =  jg_temp(3)
+         jg( 3, i, 0) =  jg_temp(4)
+
+      enddo
+      m1 = i1+1
+      m0 = i0+1
+
+c      if( me .eq. root) then
+c         write(*,*)' '
+c         write(*,*)' negative charges at'
+c         write(*,9)(jg(1,i,0),jg(2,i,0),jg(3,i,0),i=1,mm)
+c         write(*,*)' positive charges at'
+c         write(*,9)(jg(1,i,1),jg(2,i,1),jg(3,i,1),i=1,mm)
+c         write(*,*)' small random numbers were'
+c         write(*,8)(ten( i,0),i=mm,1,-1)
+c         write(*,*)' and they were found on processor number'
+c         write(*,7)(jg(0,i,0),i=mm,1,-1)
+c         write(*,*)' large random numbers were'
+c         write(*,8)(ten( i,1),i=mm,1,-1)
+c         write(*,*)' and they were found on processor number'
+c         write(*,7)(jg(0,i,1),i=mm,1,-1)
+c      endif
+c 9    format(5(' (',i3,2(',',i3),')'))
+c 8    format(5D15.8)
+c 7    format(10i4)
+      call mpi_barrier(mpi_comm_world,ierr)
+      do  i3=1,n3
+         do  i2=1,n2
+            do  i1=1,n1
+               z(i1,i2,i3) = 0.0D0
+            enddo
+         enddo
+      enddo
+      do  i=mm,m0,-1
+         z( j1(i,0), j2(i,0), j3(i,0) ) = -1.0D0
+      enddo
+      do  i=mm,m1,-1
+         z( j1(i,1), j2(i,1), j3(i,1) ) = +1.0D0
+      enddo
+      call comm3(z,n1,n2,n3,k)
+
+c---------------------------------------------------------------------
+c          call showall(z,n1,n2,n3)
+c---------------------------------------------------------------------
+
+      return 
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine show_l(z,n1,n2,n3)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'mpinpb.h'
+
+      integer n1,n2,n3,i1,i2,i3,ierr
+      double precision z(n1,n2,n3)
+      integer m1, m2, m3,i
+
+      m1 = min(n1,18)
+      m2 = min(n2,14)
+      m3 = min(n3,18)
+
+      write(*,*)'  '
+      do  i=0,nprocs-1
+         if( me .eq. i )then
+            write(*,*)' id = ', me
+            do  i3=1,m3
+               do  i1=1,m1
+                  write(*,6)(z(i1,i2,i3),i2=1,m2)
+               enddo
+               write(*,*)' - - - - - - - '
+            enddo
+            write(*,*)'  '
+ 6          format(6f15.11)
+         endif
+         call mpi_barrier(mpi_comm_world,ierr)
+      enddo
+
+      return 
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine showall(z,n1,n2,n3)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'mpinpb.h'
+
+      integer n1,n2,n3,i1,i2,i3,i,ierr
+      double precision z(n1,n2,n3)
+      integer m1, m2, m3
+
+      m1 = min(n1,18)
+      m2 = min(n2,14)
+      m3 = min(n3,18)
+
+      write(*,*)'  '
+      do  i=0,nprocs-1
+         if( me .eq. i )then
+            write(*,*)' id = ', me
+            do  i3=1,m3
+               do  i1=1,m1
+                  write(*,6)(z(i1,i2,i3),i2=1,m2)
+               enddo
+               write(*,*)' - - - - - - - '
+            enddo
+            write(*,*)'  '
+ 6          format(15f6.3)
+         endif
+         call mpi_barrier(mpi_comm_world,ierr)
+      enddo
+
+      return 
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine show(z,n1,n2,n3)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'mpinpb.h'
+      integer n1,n2,n3,i1,i2,i3,ierr,i
+      double precision z(n1,n2,n3)
+
+      write(*,*)'  '
+      do  i=0,nprocs-1
+         if( me .eq. i )then
+            write(*,*)' id = ', me
+            do  i3=2,n3-1
+               do  i1=2,n1-1
+                  write(*,6)(z(i1,i2,i3),i2=2,n2-1)
+               enddo
+               write(*,*)' - - - - - - - '
+            enddo
+            write(*,*)'  '
+ 6          format(8D10.3)
+         endif
+         call mpi_barrier(mpi_comm_world,ierr)
+      enddo
+
+c     call comm3(z,n1,n2,n3)
+
+      return 
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      double precision function power( a, n1, n2, n3 )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     power  raises an integer, disguised as a double
+c     precision real, to an integer power.
+c     This version tries to avoid integer overflow by treating
+c     it as expressed in a form of "n1*n2+n3".
+c---------------------------------------------------------------------
+      implicit none
+
+      double precision a, aj
+      integer n1, n2, n3
+
+      integer n1j, n2j, nj
+      external randlc
+      double precision randlc, rdummy
+
+      power = 1.0d0
+      aj = a
+      nj = n3
+      n1j = n1
+      n2j = n2
+ 100  continue
+
+      if( n2j .gt. 0 ) then
+         if( mod(n2j,2) .eq. 1 ) nj = nj + n1j
+         n2j = n2j/2
+      else if( nj .eq. 0 ) then
+         go to 200
+      endif
+      if( mod(nj,2) .eq. 1 ) rdummy =  randlc( power, aj )
+      rdummy = randlc( aj, aj )
+      nj = nj/2
+      go to 100
+
+ 200  continue
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine bubble( ten, j1, j2, j3, m, ind )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     bubble        does a bubble sort in the direction given by ind
+c---------------------------------------------------------------------
+      implicit none
+
+      include 'mpinpb.h'
+
+      integer m, ind, j1( m, 0:1 ), j2( m, 0:1 ), j3( m, 0:1 )
+      double precision ten( m, 0:1 )
+      double precision temp
+      integer i, j_temp
+
+      if( ind .eq. 1 )then
+
+         do  i=1,m-1
+            if( ten(i,ind) .gt. ten(i+1,ind) )then
+
+               temp = ten( i+1, ind )
+               ten( i+1, ind ) = ten( i, ind )
+               ten( i, ind ) = temp
+
+               j_temp           = j1( i+1, ind )
+               j1( i+1, ind ) = j1( i,   ind )
+               j1( i,   ind ) = j_temp
+
+               j_temp           = j2( i+1, ind )
+               j2( i+1, ind ) = j2( i,   ind )
+               j2( i,   ind ) = j_temp
+
+               j_temp           = j3( i+1, ind )
+               j3( i+1, ind ) = j3( i,   ind )
+               j3( i,   ind ) = j_temp
+
+            else 
+               return
+            endif
+         enddo
+
+      else
+
+         do  i=1,m-1
+            if( ten(i,ind) .lt. ten(i+1,ind) )then
+
+               temp = ten( i+1, ind )
+               ten( i+1, ind ) = ten( i, ind )
+               ten( i, ind ) = temp
+
+               j_temp           = j1( i+1, ind )
+               j1( i+1, ind ) = j1( i,   ind )
+               j1( i,   ind ) = j_temp
+
+               j_temp           = j2( i+1, ind )
+               j2( i+1, ind ) = j2( i,   ind )
+               j2( i,   ind ) = j_temp
+
+               j_temp           = j3( i+1, ind )
+               j3( i+1, ind ) = j3( i,   ind )
+               j3( i,   ind ) = j_temp
+
+            else 
+               return
+            endif
+         enddo
+
+      endif
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine zero3(z,n1,n2,n3)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'mpinpb.h'
+
+      integer n1, n2, n3
+      double precision z(n1,n2,n3)
+      integer i1, i2, i3
+
+      do  i3=1,n3
+         do  i2=1,n2
+            do  i1=1,n1
+               z(i1,i2,i3)=0.0D0
+            enddo
+         enddo
+      enddo
+
+      return
+      end
+
+
+c----- end of program ------------------------------------------------
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MG/mg.input.sample b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MG/mg.input.sample
new file mode 100644
index 0000000..a4dcf81
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MG/mg.input.sample
@@ -0,0 +1,4 @@
+ 8 = top level
+ 256 256 256 = nx ny nz
+ 20 = nit
+ 0 0 0 0 0 0 0 0 = debug_vec
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MG/mpinpb.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MG/mpinpb.h
new file mode 100644
index 0000000..1f0368c
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MG/mpinpb.h
@@ -0,0 +1,9 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      include           'mpif.h'
+
+      integer           me, nprocs, root, dp_type
+      common /mpistuff/ me, nprocs, root, dp_type
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MPI_dummy/Makefile b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MPI_dummy/Makefile
new file mode 100644
index 0000000..86288d7
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MPI_dummy/Makefile
@@ -0,0 +1,38 @@
+# Makefile for MPI dummy library. 
+# Must be edited for a specific machine. Does NOT read in 
+# the make.def file of NPB 2.3
+F77 = f77
+CC = cc
+AR = ar
+
+# Enable if either Cray or IBM: (no such flag for most machines: see wtime.h)
+# MACHINE	=	-DCRAY
+# MACHINE	=	-DIBM
+
+libmpi.a: mpi_dummy.o mpi_dummy_c.o wtime.o
+	$(AR) r libmpi.a mpi_dummy.o mpi_dummy_c.o wtime.o
+
+mpi_dummy.o: mpi_dummy.f mpif.h
+	$(F77) -c mpi_dummy.f
+# For a Cray C90, try:
+#	cf77 -dp -c mpi_dummy.f
+# For an IBM 590, try:
+#	xlf -c mpi_dummy.f
+
+mpi_dummy_c.o: mpi_dummy.c mpi.h
+	$(CC) -c ${MACHINE} -o mpi_dummy_c.o mpi_dummy.c
+
+wtime.o: wtime.c
+# For most machines or CRAY or IBM
+	$(CC) -c ${MACHINE} wtime.c
+# For a precise timer on an SGI Power Challenge, try:
+#	$(CC) -o wtime.o -c wtime_sgi64.c
+
+test: test.f
+	$(F77) -o test -I. test.f -L. -lmpi
+
+
+
+clean: 
+	- rm -f *~ *.o
+	- rm -f test libmpi.a
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MPI_dummy/README b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MPI_dummy/README
new file mode 100644
index 0000000..9096a0b
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MPI_dummy/README
@@ -0,0 +1,52 @@
+###########################################
+# NAS Parallel Benchmarks 2&3             #
+# MPI/F77/C                               #
+# Revision 3.3                            #
+# NASA Ames Research Center               #
+# npb@nas.nasa.gov                        #
+# http://www.nas.nasa.gov/Software/NPB/   #
+###########################################
+
+MPI Dummy Library
+
+
+The MPI dummy library is supplied as a convenience for people who do
+not have an MPI library but would like to try running on one processor
+anyway. The NPB 2.x/3.x benchmarks are designed so that they do not
+actually try to do any message passing when run on one node. The MPI
+dummy library is just that - a set of dummy MPI routines which don't
+do anything, but allow you to link the benchmarks. Actually they do a
+few things, but nothing important. Note that the dummy library is 
+sufficient only for the NPB 2.x/3.x benchmarks. It probably won't be
+useful for anything else because it implements only a handful of
+functions. 
+
+Because the dummy library is just an extra goody, and since we don't
+have an infinite amount of time, it may be a bit trickier to configure
+than the rest of the benchmarks. You need to:
+
+1. Find out how C and Fortran interact on your machine. On most machines, 
+the Fortran function foo(x) is declared in C as foo_(xp) where xp is
+a pointer, not a value. On IBMs, it's just foo(xp). On Cray C90s, it's
+FOO(xp). You can define CRAY or IBM to get these, or you need to
+edit wtime.c if you've got something else. 
+
+2. Edit the Makefile to compile mpi_dummy.f and wtime.c correctly
+for your machine (including -DCRAY or -DIBM if necessary). 
+
+3. The substitute MPI timer gives wall clock time, not CPU time. 
+If you're running on a timeshared machine, you may want to 
+use a CPU timer. Edit the function mpi_wtime() in mpi_dummy.f
+to change this timer. (NOTE: for official benchmark results, 
+ONLY wall clock times are valid. Using a CPU timer is ok 
+if you want to get things running, but don't report any results
+measured with a CPU timer. )
+
+TROUBLESHOOTING
+
+o Compiling or linking of the benchmark aborts because the dummy MPI
+  header file or the dummy MPI library cannot be found.
+  - the file make.dummy in subdirectory config relies on the use
+    of the -I"path" and -L"path" -l"library" constructs to pass
+    information to the compilers and linkers. Edit this file to conform
+    to your system.
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MPI_dummy/mpi.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MPI_dummy/mpi.h
new file mode 100644
index 0000000..a109b4b
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MPI_dummy/mpi.h
@@ -0,0 +1,118 @@
+#define MPI_DOUBLE          1
+#define MPI_INT             2
+#define MPI_BYTE            3
+#define MPI_FLOAT           4
+#define MPI_LONG            5
+
+#define MPI_COMM_WORLD      0
+
+#define MPI_MAX             1
+#define MPI_SUM             2
+#define MPI_MIN             3
+
+#define MPI_SUCCESS         0
+#define MPI_ANY_SOURCE     -1
+#define MPI_ERR_OTHER      -1
+#define MPI_STATUS_SIZE     3
+
+
+/* 
+   Status object.  It is the only user-visible MPI data-structure.
+   The "count" field is PRIVATE; use MPI_Get_count to access it. 
+ */
+typedef struct { 
+    int count;
+    int MPI_SOURCE;
+    int MPI_TAG;
+    int MPI_ERROR;
+} MPI_Status;
+
+
+/* MPI request objects */
+typedef int MPI_Request;
+
+/* MPI datatype */
+typedef int MPI_Datatype;
+
+/* MPI comm */
+typedef int MPI_Comm;
+
+/* MPI operation */
+typedef int MPI_Op;
+
+
+
+/* Prototypes: */
+void  mpi_error( void );
+
+int   MPI_Irecv( void         *buf,
+                 int          count,
+                 MPI_Datatype datatype,
+                 int          source,
+                 int          tag,
+                 MPI_Comm     comm,
+                 MPI_Request  *request );
+
+int   MPI_Send( void         *buf,
+                int          count,
+                MPI_Datatype datatype,
+                int          dest,
+                int          tag,
+                MPI_Comm     comm );
+
+int   MPI_Wait( MPI_Request *request,
+                MPI_Status  *status );
+
+int   MPI_Init( int  *argc,
+                char ***argv );
+
+int   MPI_Comm_rank( MPI_Comm comm, 
+                     int      *rank );
+
+int   MPI_Comm_size( MPI_Comm comm, 
+                     int      *size );
+
+double MPI_Wtime( void );
+
+int  MPI_Barrier( MPI_Comm comm );
+
+int  MPI_Bcast( void         *buf,
+                int          count,
+                MPI_Datatype datatype,
+                int          root,
+                MPI_Comm     comm );
+
+int  MPI_Finalize( void );
+
+int  MPI_Allreduce( void         *sendbuf,
+                    void         *recvbuf,
+                    int          nitems,
+                    MPI_Datatype type,
+                    MPI_Op       op,
+                    MPI_Comm     comm );
+
+int  MPI_Reduce( void         *sendbuf,
+                 void         *recvbuf,
+                 int          nitems,
+                 MPI_Datatype type,
+                 MPI_Op       op,
+                 int          root,
+                 MPI_Comm     comm );
+
+int  MPI_Alltoall( void         *sendbuf,
+                   int          sendcount,
+                   MPI_Datatype sendtype,
+                   void         *recvbuf,
+                   int          recvcount,
+                   MPI_Datatype recvtype,
+                   MPI_Comm     comm );
+
+int  MPI_Alltoallv( void         *sendbuf,
+                    int          *sendcounts,
+                    int          *senddispl,
+                    MPI_Datatype sendtype,
+                    void         *recvbuf,
+                    int          *recvcounts,
+                    int          *recvdispl,
+                    MPI_Datatype recvtype,
+                    MPI_Comm     comm );
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MPI_dummy/mpi_dummy.c b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MPI_dummy/mpi_dummy.c
new file mode 100644
index 0000000..bec8749
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MPI_dummy/mpi_dummy.c
@@ -0,0 +1,279 @@
+#include "mpi.h"
+#include "wtime.h"
+#include <stdlib.h>
+#include <stdio.h>
+
+
+void  mpi_error( void )
+{
+    printf( "mpi_error called\n" );
+    abort();
+}
+
+
+
+
+int   MPI_Irecv( void         *buf,
+                 int          count,
+                 MPI_Datatype datatype,
+                 int          source,
+                 int          tag,
+                 MPI_Comm     comm,
+                 MPI_Request  *request )
+{
+    mpi_error();
+    return( MPI_ERR_OTHER );
+}
+
+
+
+
+int   MPI_Recv( void         *buf,
+                int          count,
+                MPI_Datatype datatype,
+                int          source,
+                int          tag,
+                MPI_Comm     comm,
+                MPI_Status   *status )
+{
+    mpi_error();
+    return( MPI_ERR_OTHER );
+}
+
+
+
+
+int   MPI_Send( void         *buf,
+                int          count,
+                MPI_Datatype datatype,
+                int          dest,
+                int          tag,
+                MPI_Comm     comm )
+{
+    mpi_error();
+    return( MPI_ERR_OTHER );
+}
+
+
+
+
+int   MPI_Wait( MPI_Request *request,
+                MPI_Status  *status )
+{
+    mpi_error();
+    return( MPI_ERR_OTHER );
+}
+
+
+
+
+int   MPI_Init( int  *argc,
+                char ***argv )
+{
+    return( MPI_SUCCESS );
+}
+
+
+
+
+int   MPI_Comm_rank( MPI_Comm comm, 
+                     int      *rank )
+{
+    *rank = 0;
+    return( MPI_SUCCESS );
+}
+
+
+
+
+int   MPI_Comm_size( MPI_Comm comm, 
+                     int      *size )
+{
+    *size = 1;
+    return( MPI_SUCCESS );
+}
+
+
+
+
+double MPI_Wtime( void )
+{
+    void wtime();
+
+    double t;
+    wtime( &t );
+    return( t );
+}
+
+
+
+
+int  MPI_Barrier( MPI_Comm comm )
+{
+    return( MPI_SUCCESS );
+}
+
+
+
+
+int  MPI_Bcast( void         *buf,
+                int          count,
+                MPI_Datatype datatype,
+                int          root,
+                MPI_Comm     comm )
+{
+    return( MPI_SUCCESS );
+}
+
+
+
+
+int  MPI_Finalize( void )
+{
+    return( MPI_SUCCESS );
+}
+
+
+
+
+int  MPI_Allreduce( void         *sendbuf,
+                    void         *recvbuf,
+                    int          nitems,
+                    MPI_Datatype type,
+                    MPI_Op       op,
+                    MPI_Comm     comm )
+{
+    int i;
+    if( type == MPI_INT )
+    {
+        int *pd_sendbuf, *pd_recvbuf;
+        pd_sendbuf = (int *) sendbuf;    
+        pd_recvbuf = (int *) recvbuf;    
+        for( i=0; i<nitems; i++ )
+            *(pd_recvbuf+i) = *(pd_sendbuf+i);
+    }
+    if( type == MPI_LONG )
+    {
+        long *pd_sendbuf, *pd_recvbuf;
+        pd_sendbuf = (long *) sendbuf;    
+        pd_recvbuf = (long *) recvbuf;    
+        for( i=0; i<nitems; i++ )
+            *(pd_recvbuf+i) = *(pd_sendbuf+i);
+    }
+    if( type == MPI_DOUBLE )
+    {
+        double *pd_sendbuf, *pd_recvbuf;
+        pd_sendbuf = (double *) sendbuf;    
+        pd_recvbuf = (double *) recvbuf;    
+        for( i=0; i<nitems; i++ )
+            *(pd_recvbuf+i) = *(pd_sendbuf+i);
+    }
+    return( MPI_SUCCESS );
+}
+  
+
+
+
+int  MPI_Reduce( void         *sendbuf,
+                 void         *recvbuf,
+                 int          nitems,
+                 MPI_Datatype type,
+                 MPI_Op       op,
+                 int          root,
+                 MPI_Comm     comm )
+{
+    int i;
+    if( type == MPI_INT )
+    {
+        int *pi_sendbuf, *pi_recvbuf;
+        pi_sendbuf = (int *) sendbuf;    
+        pi_recvbuf = (int *) recvbuf;    
+        for( i=0; i<nitems; i++ )
+            *(pi_recvbuf+i) = *(pi_sendbuf+i);
+    }
+    if( type == MPI_LONG )
+    {
+        long *pi_sendbuf, *pi_recvbuf;
+        pi_sendbuf = (long *) sendbuf;    
+        pi_recvbuf = (long *) recvbuf;    
+        for( i=0; i<nitems; i++ )
+            *(pi_recvbuf+i) = *(pi_sendbuf+i);
+    }
+    if( type == MPI_DOUBLE )
+    {
+        double *pd_sendbuf, *pd_recvbuf;
+        pd_sendbuf = (double *) sendbuf;    
+        pd_recvbuf = (double *) recvbuf;    
+        for( i=0; i<nitems; i++ )
+            *(pd_recvbuf+i) = *(pd_sendbuf+i);
+    }
+    return( MPI_SUCCESS );
+}
+  
+
+
+
+int  MPI_Alltoall( void         *sendbuf,
+                   int          sendcount,
+                   MPI_Datatype sendtype,
+                   void         *recvbuf,
+                   int          recvcount,
+                   MPI_Datatype recvtype,
+                   MPI_Comm     comm )
+{
+    int i;
+    if( recvtype == MPI_INT )
+    {
+        int *pd_sendbuf, *pd_recvbuf;
+        pd_sendbuf = (int *) sendbuf;    
+        pd_recvbuf = (int *) recvbuf;    
+        for( i=0; i<sendcount; i++ )
+            *(pd_recvbuf+i) = *(pd_sendbuf+i);
+    }
+    if( recvtype == MPI_LONG )
+    {
+        long *pd_sendbuf, *pd_recvbuf;
+        pd_sendbuf = (long *) sendbuf;    
+        pd_recvbuf = (long *) recvbuf;    
+        for( i=0; i<sendcount; i++ )
+            *(pd_recvbuf+i) = *(pd_sendbuf+i);
+    }
+    return( MPI_SUCCESS );
+}
+  
+
+
+
+int  MPI_Alltoallv( void         *sendbuf,
+                    int          *sendcounts,
+                    int          *senddispl,
+                    MPI_Datatype sendtype,
+                    void         *recvbuf,
+                    int          *recvcounts,
+                    int          *recvdispl,
+                    MPI_Datatype recvtype,
+                    MPI_Comm     comm )
+{
+    int i;
+    if( recvtype == MPI_INT )
+    {
+        int *pd_sendbuf, *pd_recvbuf;
+        pd_sendbuf = (int *) sendbuf;    
+        pd_recvbuf = (int *) recvbuf;    
+        for( i=0; i<sendcounts[0]; i++ )
+            *(pd_recvbuf+i+recvdispl[0]) = *(pd_sendbuf+i+senddispl[0]);
+    }
+    if( recvtype == MPI_LONG )
+    {
+        long *pd_sendbuf, *pd_recvbuf;
+        pd_sendbuf = (long *) sendbuf;    
+        pd_recvbuf = (long *) recvbuf;    
+        for( i=0; i<sendcounts[0]; i++ )
+            *(pd_recvbuf+i+recvdispl[0]) = *(pd_sendbuf+i+senddispl[0]);
+    }
+    return( MPI_SUCCESS );
+}
+  
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MPI_dummy/mpi_dummy.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MPI_dummy/mpi_dummy.f
new file mode 100644
index 0000000..2550aa3
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MPI_dummy/mpi_dummy.f
@@ -0,0 +1,309 @@
+      subroutine mpi_isend(buf,count,datatype,source,
+     & tag,comm,request,ierror)
+      integer buf(*), count,datatype,source,tag,comm,
+     & request,ierror
+      call mpi_error()
+      return
+      end  
+
+      subroutine mpi_irecv(buf,count,datatype,source,
+     & tag,comm,request,ierror)
+      integer buf(*), count,datatype,source,tag,comm,
+     & request,ierror
+      call mpi_error()
+      return
+      end
+
+      subroutine mpi_send(buf,count,datatype,dest,tag,comm,ierror)
+      integer buf(*), count,datatype,dest,tag,comm,ierror
+      call mpi_error()
+      return
+      end
+      
+      subroutine mpi_recv(buf,count,datatype,source,
+     & tag,comm,status,ierror)
+      integer buf(*), count,datatype,source,tag,comm,
+     & status(*),ierror
+      call mpi_error()
+      return
+      end
+
+      subroutine mpi_comm_split(comm,color,key,newcomm,ierror)
+      integer comm,color,key,newcomm,ierror
+      return
+      end
+
+      subroutine mpi_comm_rank(comm, rank,ierr)
+      implicit none
+      integer comm, rank,ierr
+      rank = 0
+      return
+      end
+
+      subroutine mpi_comm_size(comm, size, ierr)
+      implicit none
+      integer comm, size, ierr
+      size = 1
+      return
+      end
+
+      double precision function mpi_wtime()
+      implicit none
+      double precision t
+c This function must measure wall clock time, not CPU time. 
+c Since there is no portable timer in Fortran (77)
+c we call a routine compiled in C (though the C source may have
+c to be tweaked). 
+      call wtime(t)
+c The following is not ok for "official" results because it reports
+c CPU time not wall clock time. It may be useful for developing/testing
+c on timeshared Crays, though. 
+c     call second(t)
+
+      mpi_wtime = t
+
+      return
+      end
+
+
+c may be valid to call this in single processor case
+      subroutine mpi_barrier(comm,ierror)
+      return
+      end
+
+c may be valid to call this in single processor case
+      subroutine mpi_bcast(buf, nitems, type, root, comm, ierr)
+      implicit none
+      integer buf(*), nitems, type, root, comm, ierr
+      return
+      end
+
+      subroutine mpi_comm_dup(oldcomm, newcomm,ierror)
+      integer oldcomm, newcomm,ierror
+      newcomm= oldcomm
+      return
+      end
+
+      subroutine mpi_error()
+      print *, 'mpi_error called'
+      stop
+      end 
+
+      subroutine mpi_abort(comm, errcode, ierr)
+      implicit none
+      integer comm, errcode, ierr
+      print *, 'mpi_abort called'
+      stop
+      end
+
+      subroutine mpi_finalize(ierr)
+      return
+      end
+
+      subroutine mpi_init(ierr)
+      return
+      end
+
+
+c assume double precision, which is all SP uses 
+      subroutine mpi_reduce(inbuf, outbuf, nitems, 
+     $                      type, op, root, comm, ierr)
+      implicit none
+      include 'mpif.h'
+      integer nitems, type, op, root, comm, ierr
+      double precision inbuf(*), outbuf(*)
+
+      if (type .eq. mpi_double_precision) then
+         call mpi_reduce_dp(inbuf, outbuf, nitems, 
+     $                      type, op, root, comm, ierr)
+      else if (type .eq.  mpi_double_complex) then
+         call mpi_reduce_dc(inbuf, outbuf, nitems, 
+     $                      type, op, root, comm, ierr)
+      else if (type .eq.  mpi_complex) then
+         call mpi_reduce_complex(inbuf, outbuf, nitems, 
+     $                      type, op, root, comm, ierr)
+      else if (type .eq.  mpi_real) then
+         call mpi_reduce_real(inbuf, outbuf, nitems, 
+     $                      type, op, root, comm, ierr)
+      else if (type .eq.  mpi_integer) then
+         call mpi_reduce_int(inbuf, outbuf, nitems, 
+     $                      type, op, root, comm, ierr)
+      else 
+         print *, 'mpi_reduce: unknown type ', type
+      end if
+      return
+      end
+
+
+      subroutine mpi_reduce_real(inbuf, outbuf, nitems, 
+     $                      type, op, root, comm, ierr)
+      implicit none
+      integer nitems, type, op, root, comm, ierr, i
+      real inbuf(*), outbuf(*)
+      do i = 1, nitems
+         outbuf(i) = inbuf(i)
+      end do
+      
+      return
+      end
+
+      subroutine mpi_reduce_dp(inbuf, outbuf, nitems, 
+     $                      type, op, root, comm, ierr)
+      implicit none
+      integer nitems, type, op, root, comm, ierr, i
+      double precision inbuf(*), outbuf(*)
+      do i = 1, nitems
+         outbuf(i) = inbuf(i)
+      end do
+      
+      return
+      end
+
+      subroutine mpi_reduce_dc(inbuf, outbuf, nitems, 
+     $                      type, op, root, comm, ierr)
+      implicit none
+      integer nitems, type, op, root, comm, ierr, i
+      double complex inbuf(*), outbuf(*)
+      do i = 1, nitems
+         outbuf(i) = inbuf(i)
+      end do
+      
+      return
+      end
+
+
+      subroutine mpi_reduce_complex(inbuf, outbuf, nitems, 
+     $                      type, op, root, comm, ierr)
+      implicit none
+      integer nitems, type, op, root, comm, ierr, i
+      complex inbuf(*), outbuf(*)
+      do i = 1, nitems
+         outbuf(i) = inbuf(i)
+      end do
+      
+      return
+      end
+
+      subroutine mpi_reduce_int(inbuf, outbuf, nitems, 
+     $                      type, op, root, comm, ierr)
+      implicit none
+      integer nitems, type, op, root, comm, ierr, i
+      integer inbuf(*), outbuf(*)
+      do i = 1, nitems
+         outbuf(i) = inbuf(i)
+      end do
+      
+      return
+      end
+
+      subroutine mpi_allreduce(inbuf, outbuf, nitems, 
+     $                      type, op, comm, ierr)
+      implicit none
+      integer nitems, type, op, comm, ierr
+      double precision inbuf(*), outbuf(*)
+
+      call mpi_reduce(inbuf, outbuf, nitems, 
+     $                      type, op, 0, comm, ierr)
+      return
+      end
+
+      subroutine mpi_alltoall(inbuf, nitems, type, outbuf, nitems_dum, 
+     $                        type_dum, comm, ierr)
+      implicit none
+      include 'mpif.h'
+      integer nitems, type, comm, ierr, nitems_dum, type_dum
+      double precision inbuf(*), outbuf(*)
+      if (type .eq. mpi_double_precision) then
+         call mpi_alltoall_dp(inbuf, outbuf, nitems, 
+     $                      type, comm, ierr)
+      else if (type .eq.  mpi_double_complex) then
+         call mpi_alltoall_dc(inbuf, outbuf, nitems, 
+     $                      type, comm, ierr)
+      else if (type .eq.  mpi_complex) then
+         call mpi_alltoall_complex(inbuf, outbuf, nitems, 
+     $                      type, comm, ierr)
+      else if (type .eq.  mpi_real) then
+         call mpi_alltoall_real(inbuf, outbuf, nitems, 
+     $                      type, comm, ierr)
+      else if (type .eq.  mpi_integer) then
+         call mpi_alltoall_int(inbuf, outbuf, nitems, 
+     $                      type, comm, ierr)
+      else 
+         print *, 'mpi_alltoall: unknown type ', type
+      end if
+      return
+      end
+
+      subroutine mpi_alltoall_dc(inbuf, outbuf, nitems, 
+     $                           type, comm, ierr)
+      implicit none
+      integer nitems, type, comm, ierr, i
+      double complex inbuf(*), outbuf(*)
+      do i = 1, nitems
+         outbuf(i) = inbuf(i)
+      end do
+      
+      return
+      end
+
+
+      subroutine mpi_alltoall_complex(inbuf, outbuf, nitems, 
+     $                           type, comm, ierr)
+      implicit none
+      integer nitems, type, comm, ierr, i
+      complex inbuf(*), outbuf(*)
+      do i = 1, nitems
+         outbuf(i) = inbuf(i)
+      end do
+      
+      return
+      end
+
+      subroutine mpi_alltoall_dp(inbuf, outbuf, nitems, 
+     $                           type, comm, ierr)
+      implicit none
+      integer nitems, type, comm, ierr, i
+      double precision inbuf(*), outbuf(*)
+      do i = 1, nitems
+         outbuf(i) = inbuf(i)
+      end do
+      
+      return
+      end
+
+      subroutine mpi_alltoall_real(inbuf, outbuf, nitems, 
+     $                             type, comm, ierr)
+      implicit none
+      integer nitems, type, comm, ierr, i
+      real inbuf(*), outbuf(*)
+      do i = 1, nitems
+         outbuf(i) = inbuf(i)
+      end do
+      
+      return
+      end
+
+      subroutine mpi_alltoall_int(inbuf, outbuf, nitems, 
+     $                            type, comm, ierr)
+      implicit none
+      integer nitems, type, comm, ierr, i
+      integer inbuf(*), outbuf(*)
+      do i = 1, nitems
+         outbuf(i) = inbuf(i)
+      end do
+      
+      return
+      end
+
+      subroutine mpi_wait(request,status,ierror)
+      integer request,status,ierror
+      call mpi_error()
+      return
+      end
+
+      subroutine mpi_waitall(count,requests,status,ierror)
+      integer count,requests(*),status(*),ierror
+      call mpi_error()
+      return
+      end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MPI_dummy/mpif.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MPI_dummy/mpif.h
new file mode 100644
index 0000000..f56c1ca
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MPI_dummy/mpif.h
@@ -0,0 +1,28 @@
+      integer mpi_comm_world
+      parameter (mpi_comm_world = 0)
+
+      integer mpi_max, mpi_min, mpi_sum
+      parameter (mpi_max = 1, mpi_sum = 2, mpi_min = 3)
+
+      integer mpi_byte, mpi_integer, mpi_real, mpi_logical,
+     >                  mpi_double_precision,  mpi_complex,
+     >                  mpi_double_complex
+      parameter (mpi_double_precision = 1,
+     $           mpi_integer = 2, 
+     $           mpi_byte = 3, 
+     $           mpi_real= 4, 
+     $           mpi_logical = 5, 
+     $           mpi_complex = 6,
+     $           mpi_double_complex = 7)
+
+      integer mpi_any_source
+      parameter (mpi_any_source = -1)
+
+      integer mpi_err_other
+      parameter (mpi_err_other = -1)
+
+      double precision mpi_wtime
+      external mpi_wtime
+
+      integer mpi_status_size
+      parameter (mpi_status_size=3)
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MPI_dummy/test.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MPI_dummy/test.f
new file mode 100644
index 0000000..081c73c
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MPI_dummy/test.f
@@ -0,0 +1,10 @@
+      program test
+      implicit none
+      double precision t, mpi_wtime
+      external mpi_wtime
+      t = 0.0
+      t = mpi_wtime()
+      print *, t
+      t = mpi_wtime()
+      print *, t
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MPI_dummy/wtime.c b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MPI_dummy/wtime.c
new file mode 100644
index 0000000..221d222
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MPI_dummy/wtime.c
@@ -0,0 +1,13 @@
+#include "wtime.h"
+#include <sys/time.h>
+
+void wtime(double *t)
+{
+  static int sec = -1;
+  struct timeval tv;
+  gettimeofday(&tv, (void *)0);
+  if (sec < 0) sec = tv.tv_sec;
+  *t = (tv.tv_sec - sec) + 1.0e-6*tv.tv_usec;
+}
+
+    
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MPI_dummy/wtime.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MPI_dummy/wtime.f
new file mode 100644
index 0000000..a1cfde9
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MPI_dummy/wtime.f
@@ -0,0 +1,12 @@
+      subroutine wtime(tim)
+      real*8 tim
+      dimension tarray(2)
+      call etime(tarray)
+      tim = tarray(1)
+      return
+      end
+
+
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MPI_dummy/wtime.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MPI_dummy/wtime.h
new file mode 100644
index 0000000..12eb0cb
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MPI_dummy/wtime.h
@@ -0,0 +1,12 @@
+/* C/Fortran interface is different on different machines. 
+ * You may need to tweak this.
+ */
+
+
+#if defined(IBM)
+#define wtime wtime
+#elif defined(CRAY)
+#define wtime WTIME
+#else
+#define wtime wtime_
+#endif
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MPI_dummy/wtime_sgi64.c b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MPI_dummy/wtime_sgi64.c
new file mode 100644
index 0000000..d08d50c
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/MPI_dummy/wtime_sgi64.c
@@ -0,0 +1,74 @@
+#include <sys/types.h>
+#include <fcntl.h>
+#include <sys/mman.h>
+#include <sys/syssgi.h>
+#include <sys/immu.h>
+#include <errno.h>
+#include <stdio.h>
+
+/* The following works on SGI Power Challenge systems */
+
+typedef unsigned long iotimer_t;
+
+unsigned int cycleval;
+volatile iotimer_t *iotimer_addr, base_counter;
+double resolution;
+
+/* address_t is an integer type big enough to hold an address */
+typedef unsigned long address_t;
+
+
+
+void timer_init() 
+{
+  
+  int fd;
+  char *virt_addr;
+  address_t phys_addr, page_offset, pagemask, pagebase_addr;
+  
+  pagemask = getpagesize() - 1;
+  errno = 0;
+  phys_addr = syssgi(SGI_QUERY_CYCLECNTR, &cycleval);
+  if (errno != 0) {
+    perror("SGI_QUERY_CYCLECNTR");
+    exit(1);
+  }
+  /* page_offset = page offset of physical address */
+  page_offset = phys_addr & pagemask;
+  pagebase_addr = phys_addr - page_offset;
+  fd = open("/dev/mmem", O_RDONLY);
+
+  virt_addr = mmap(0, pagemask, PROT_READ, MAP_PRIVATE, fd, pagebase_addr);
+  virt_addr = virt_addr + page_offset;
+  iotimer_addr = (iotimer_t *)virt_addr;
+  /* cycleval is in picoseconds; scaling it gives resolution in seconds */
+  resolution = 1.0e-12*cycleval; 
+  base_counter = *iotimer_addr;
+}
+
+void wtime_(double *time) 
+{
+  static int initialized = 0;
+  volatile iotimer_t counter_value;
+  if (!initialized) { 
+    timer_init();
+    initialized = 1;
+  }
+  counter_value = *iotimer_addr - base_counter;
+  *time = (double)counter_value * resolution;
+}
+
+
+void wtime(double *time) 
+{
+  static int initialized = 0;
+  volatile iotimer_t counter_value;
+  if (!initialized) { 
+    timer_init();
+    initialized = 1;
+  }
+  counter_value = *iotimer_addr - base_counter;
+  *time = (double)counter_value * resolution;
+}
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/Makefile b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/Makefile
new file mode 100644
index 0000000..8f356aa
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/Makefile
@@ -0,0 +1,69 @@
+SHELL=/bin/sh
+CLASS=U
+NPROCS=1
+SUBTYPE=
+VERSION=
+SFILE=config/suite.def
+
+default: header
+	@ sys/print_instructions
+
+BT: bt
+bt: header
+	cd BT; $(MAKE) NPROCS=$(NPROCS) CLASS=$(CLASS) SUBTYPE=$(SUBTYPE) VERSION=$(VERSION)
+
+SP: sp
+sp: header
+	cd SP; $(MAKE) NPROCS=$(NPROCS) CLASS=$(CLASS)
+
+LU: lu
+lu: header
+	cd LU; $(MAKE) NPROCS=$(NPROCS) CLASS=$(CLASS) VERSION=$(VERSION)
+
+MG: mg
+mg: header
+	cd MG; $(MAKE) NPROCS=$(NPROCS) CLASS=$(CLASS)
+
+FT: ft
+ft: header
+	cd FT; $(MAKE) NPROCS=$(NPROCS) CLASS=$(CLASS)
+
+IS: is
+is: header
+	cd IS; $(MAKE) NPROCS=$(NPROCS) CLASS=$(CLASS)
+
+CG: cg
+cg: header
+	cd CG; $(MAKE) NPROCS=$(NPROCS) CLASS=$(CLASS)
+
+EP: ep
+ep: header
+	cd EP; $(MAKE) NPROCS=$(NPROCS) CLASS=$(CLASS)
+
+DT: dt
+dt: header
+	cd DT; $(MAKE) CLASS=$(CLASS)
+
+# Awk script courtesy cmg@cray.com, modified by Haoqiang Jin
+suite:
+	@ awk -f sys/suite.awk SMAKE=$(MAKE) $(SFILE) | $(SHELL)
+
+
+# It would be nice to make clean in each subdirectory (the targets
+# are defined) but on a really clean system this won't work
+# because those makefiles need config/make.def
+clean:
+	- rm -f core 
+	- rm -f *~ */core */*~ */*.o */npbparams.h */*.obj */*.exe
+	- rm -f MPI_dummy/test MPI_dummy/libmpi.a
+	- rm -f sys/setparams sys/makesuite sys/setparams.h
+	- rm -f btio.*.out*
+
+veryclean: clean
+	- rm -f config/make.def config/suite.def 
+	- rm -f bin/sp.* bin/lu.* bin/mg.* bin/ft.* bin/bt.* bin/is.* 
+	- rm -f bin/ep.* bin/cg.* bin/dt.*
+
+header:
+	@ sys/print_header
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/README b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/README
new file mode 100644
index 0000000..c5a1e7b
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/README
@@ -0,0 +1,46 @@
+The MPI implementation of NPB 3.3 (NPB3.3-MPI)
+--------------------------------------------------
+
+For problem reports and suggestions on the implementation, 
+please contact:
+
+   NAS Parallel Benchmark Team
+   npb@nas.nasa.gov
+
+   http://www.nas.nasa.gov/Software/NPB
+
+
+This directory contains the MPI implementation of the NAS
+Parallel Benchmarks, Version 3.3 (NPB3.3-MPI).  A brief
+summary of the new features introduced in this version is
+given below.
+
+For changes from different versions, see the Changes.log file
+included in the upper directory of this distribution.
+
+For explanation of compilation and running of the benchmarks,
+please refer to README.install.  For a special note on DT, please
+see the README file in the DT subdirectory.
+This version includes additional timers in several benchmarks.
+
+
+New features in NPB3.3-MPI:
+  * NPB3.3-MPI introduces a new problem size (class E) to seven of  
+    the benchmarks (BT, SP, LU, CG, MG, FT, and EP).  The version 
+    also includes a new problem size (class D) for the IS benchmark, 
+    which was not present in the previous releases.
+
+  * The release is merged with the vector codes for the BT and LU
+    benchmarks, which can be selected with the VERSION=VEC option
+    during compilation.  However, it should be noted that successful
+    vectorization highly depends on the compiler used.  Some changes
+    to compiler directives for vectorization in the current codes
+    (see *_vec.f files) may be required.
+
+  * New improvements to BTIO (BT with IO subtypes):
+    - added I/O stats (I/O timing, data size written, I/O data rate)
+    - added an option for interleaving reads between writes through
+      the inputbt.data file.  Although the data file size would be
+      smaller as a result, the total amount of data written is still
+      the same.
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/README.install b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/README.install
new file mode 100644
index 0000000..f81c297
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/README.install
@@ -0,0 +1,160 @@
+Some explanations on the MPI implementation of NPB 3.3 (NPB3.3-MPI)
+----------------------------------------------------------------------
+
+NPB-MPI is a sample MPI implementation based on NPB2.4 and NPB3.0-SER.
+This implementation contains all eight original benchmarks:
+Seven in Fortran: BT, SP, LU, FT, CG, MG, and EP; one in C: IS,
+as well as the DT benchmark, written in C, introduced in NPB3.2-MPI.
+
+For changes from different versions, see the Changes.log file
+included in the upper directory of this distribution.
+
+This version has been tested, among others, on an SGI Origin3000 and
+an SGI Altix.  For problem reports and suggestions on the implementation, 
+please contact
+
+   NAS Parallel Benchmark Team
+   npb@nas.nasa.gov
+
+
+CAUTION *********************************
+When running the I/O benchmark, one or more data files will be written
+in the directory from which the executable is invoked. They are not
+deleted at the end of the program. A new run will overwrite the old
+file(s). If not enough space is available in the user partition, the
+program will fail. For classes C and D the disk space required is
+3 GB and 135 GB, respectively.
+*****************************************
+
+
+1. Compilation
+
+   NPB3-MPI uses the same directory tree as NPB3-SER (and NPB2.x) does.
+   Before compilation, one needs to check the configuration file
+   'make.def' in the config directory and modify the file if necessary.  
+   If it does not (yet) exist, copy 'make.def.template' or one of the
+   sample files in the NAS.samples subdirectory to 'make.def' and
+   edit the content for site- and machine-specific data.  Then
+
+       make <benchmark-name> NPROCS=<number> CLASS=<class> \
+         [SUBTYPE=<type>] [VERSION=VEC]
+
+   where <benchmark-name>  is "bt", "cg", "dt", "ep", "ft", "is", 
+                              "lu", "mg", or "sp"
+         <number>          is the number of processes
+         <class>           is "S", "W", "A", "B", "C", "D", or "E"
+
+   Classes C, D and E are not available for DT.
+   Class E is not available for IS.
+
+   The "VERSION=VEC" option is used for selecting the vectorized 
+   versions of BT and LU.
+
+   Only when making the I/O benchmark:
+         <benchmark-name>  is "bt"
+         <number>, <class> as above
+         <type>            is "full", "simple", "fortran", or "epio"
+
+   Three parameters not used in the original BT benchmark are present in
+   the I/O benchmark. Two are set by default in the file BT/bt.f. 
+   Changing them is optional.
+   One is set in make.def. It must be specified.
+
+   bt.f: collbuf_nodes: number of processes used to buffer data before
+                        writing to file in the collective buffering mode
+                        (<type> is "full").
+         collbuf_size:  size of buffer (in bytes) per process used in
+                        collective buffering
+
+   make.def: -DFORTRAN_REC_SIZE: Fortran I/O record length in bytes. This
+                        is a system-specific value. It is part of the
+                        definition string of variable CONVERTFLAG. Syntax:
+                        "CONVERTFLAG = -DFORTRAN_REC_SIZE=n", where n is
+                        the record length.
+
+   When <type> is "full" or "simple", the code must be linked with an
+   MPI library that contains the subset of IO routines defined in MPI 2.
+
+
+   Class D for IS (Integer Sort) requires a compiler/system that 
+   supports the "long" type in C to be 64-bit.  As examples, the SGI 
+   MIPS compiler for the SGI Origin using the "-64" compilation flag and
+   the Intel compiler for IA64 are known to work.
+
+
+   The above procedure allows you to build one benchmark
+   at a time. To build a whole suite, you can type "make suite".
+   Make will look in file "config/suite.def" for a list of 
+   executables to build. The file contains one line per specification, 
+   with comments preceded by "#". Each line contains the name
+   of a benchmark, the class, and the number of processors, separated
+   by spaces or tabs. config/suite.def.template contains an example
+   of such a file.
+
+
+   The benchmarks have been designed so that they can be run
+   on a single processor without an MPI library. A few "dummy" 
+   MPI routines are still required for linking. For convenience
+   such a library is supplied in the "MPI_dummy" subdirectory of
+   the distribution. It contains the mpif.h and mpi.h include files
+   which must be used as well. The dummy library is built and
+   linked automatically and paths to the include files are defined
+   by inserting the line "include ../config/make.dummy" into the
+   make.def file (see example in make.def.template). Make sure to 
+   read the warnings in the README file in "MPI_dummy". The use of
+   the library is fragile and can produce unexpected errors.
+
+
+   ================================
+   
+   The "RAND" variable in make.def
+   --------------------------------
+   
+   Most of the NPBs use a random number generator. In two of the NPBs (FT
+   and EP) the computation of random numbers is included in the timed
+   part of the calculation, and it is important that the random number
+   generator be efficient.  The default random number generator package
+   provided is called "randi8" and should be used where possible. It has 
+   the following requirements:
+   
+   randi8:
+     1. Uses integer*8 arithmetic. Compiler must support integer*8
+     2. Uses the Fortran 90 IAND intrinsic. Compiler must support IAND.
+     3. Assumes overflow bits are discarded by the hardware. In particular, 
+        that the lowest 46 bits of a*b are always correct, even if the 
+        result a*b is larger than 2^64. 
+   
+   Since randi8 may not work on all machines, we supply the following
+   alternatives:
+   
+   randi8_safe
+     1. Uses integer*8 arithmetic
+     2. Uses the Fortran 90 IBITS intrinsic. 
+     3. Does not make any assumptions about overflow. Should always
+        work correctly if compiler supports integer*8 and IBITS. 
+   
+   randdp
+     1. Uses double precision arithmetic (to simulate integer*8 operations). 
+        Should work with any system with support for 64-bit floating
+        point arithmetic.      
+   
+   randdpvec
+     1. Similar to randdp but written to be easier to vectorize. 
+   
+   
+2. Execution
+
+   The executable is named <benchmark-name>.<class>.<nprocs>[.<suffix>],
+   where <suffix> is "fortran_io", "mpi_io_simple",  "ep_io", or 
+                     "mpi_io_full".
+   The executable is placed in the bin subdirectory (or in the directory 
+   BINDIR specified in make.def, if you've defined it). The method for 
+   running the MPI program depends on your local system.
+   When any of the I/O benchmarks is run (non-empty subtype), one or 
+   more output files are created, and placed in the directory from which
+   the program was started. These are not removed automatically, and 
+   will be overwritten the next time an IO benchmark is run.
+
+   To enable additional timers in several benchmarks at runtime, create
+   a dummy file 'timer.flag' in the working directory before executing
+   a benchmark.
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/Makefile b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/Makefile
new file mode 100644
index 0000000..01508aa
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/Makefile
@@ -0,0 +1,60 @@
+SHELL=/bin/sh
+BENCHMARK=sp
+BENCHMARKU=SP
+
+include ../config/make.def
+
+
+OBJS = sp.o make_set.o initialize.o exact_solution.o exact_rhs.o \
+       set_constants.o adi.o define.o copy_faces.o rhs.o      \
+       lhsx.o lhsy.o lhsz.o x_solve.o ninvr.o y_solve.o pinvr.o    \
+       z_solve.o tzetar.o add.o txinvr.o error.o verify.o setup_mpi.o \
+       ${COMMON}/print_results.o ${COMMON}/timers.o
+
+include ../sys/make.common
+
+# npbparams.h is included by header.h
+# The following rule should do the trick but many make programs (not gmake)
+# will do the wrong thing and rebuild the world every time (because the
+# mod time on header.h is not changed). One solution would be to
+# touch header.h but this might cause confusion if someone has
+# accidentally deleted it. Instead, make the dependency on npbparams.h
+# explicit in all the lines below (even though dependence is indirect). 
+
+# header.h: npbparams.h
+
+${PROGRAM}: config ${OBJS}
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} ${FMPI_LIB}
+
+.f.o:
+	${FCOMPILE} $<
+
+sp.o:             sp.f  header.h npbparams.h  mpinpb.h
+make_set.o:       make_set.f  header.h npbparams.h  mpinpb.h
+initialize.o:     initialize.f  header.h npbparams.h
+exact_solution.o: exact_solution.f  header.h npbparams.h
+exact_rhs.o:      exact_rhs.f  header.h npbparams.h
+set_constants.o:  set_constants.f  header.h npbparams.h
+adi.o:            adi.f  header.h npbparams.h
+define.o:         define.f  header.h npbparams.h
+copy_faces.o:     copy_faces.f  header.h npbparams.h  mpinpb.h
+rhs.o:            rhs.f  header.h npbparams.h
+lhsx.o:           lhsx.f  header.h npbparams.h
+lhsy.o:           lhsy.f  header.h npbparams.h
+lhsz.o:           lhsz.f  header.h npbparams.h
+x_solve.o:        x_solve.f  header.h npbparams.h  mpinpb.h
+ninvr.o:          ninvr.f  header.h npbparams.h
+y_solve.o:        y_solve.f  header.h npbparams.h  mpinpb.h
+pinvr.o:          pinvr.f  header.h npbparams.h
+z_solve.o:        z_solve.f  header.h npbparams.h  mpinpb.h
+tzetar.o:         tzetar.f  header.h npbparams.h
+add.o:            add.f  header.h npbparams.h
+txinvr.o:         txinvr.f  header.h npbparams.h
+error.o:          error.f  header.h npbparams.h  mpinpb.h
+verify.o:         verify.f  header.h npbparams.h  mpinpb.h
+setup_mpi.o:      setup_mpi.f mpinpb.h npbparams.h 
+
+
+clean:
+	- rm -f *.o *~ mputil*
+	- rm -f npbparams.h core
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/README b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/README
new file mode 100644
index 0000000..fe423db
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/README
@@ -0,0 +1,17 @@
+
+This code implements a 3D Multi-partition algorithm for the solution 
+of the uncoupled systems of linear equations resulting from 
+Beam-Warming approximate factorization.  Consequently, the program 
+must be run on a square number of processors.  The included file 
+"npbparams.h" contains a parameter statement which sets "maxcells" 
+and "problem_size".  The parameter maxcells must be set to the 
+square root of the number of processors.  For example, if running 
+on 25 processors, then set maxcells=5.  The standard problem sizes 
+are problem_size=64 for class A, 102 for class B, and 162 for class C.
+
+The number of time steps and the time step size dt are set in the 
+npbparams.h but may be overridden in the input deck "inputsp.data".  
+The number of time steps is 400 for all three 
+standard problems, and the appropriate time step sizes "dt" are 
+0.0015d0 for class A, 0.001d0 for class B, and 0.00067d0 for class C.  
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/add.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/add.f
new file mode 100644
index 0000000..cdc4765
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/add.f
@@ -0,0 +1,31 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine  add
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c addition of update to the vector u
+c---------------------------------------------------------------------
+
+       include 'header.h'
+
+       integer  c, i, j, k, m
+
+       do  c = 1, ncells
+          do m = 1, 5
+             do  k = start(3,c), cell_size(3,c)-end(3,c)-1
+                do  j = start(2,c), cell_size(2,c)-end(2,c)-1
+                   do  i = start(1,c), cell_size(1,c)-end(1,c)-1
+                      u(i,j,k,m,c) = u(i,j,k,m,c) + rhs(i,j,k,m,c)
+                   end do
+                end do
+             end do
+          end do
+       end do
+
+       return
+       end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/adi.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/adi.f
new file mode 100644
index 0000000..e55cfd6
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/adi.f
@@ -0,0 +1,24 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine  adi
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       call copy_faces
+
+       call txinvr
+
+       call x_solve
+
+       call y_solve
+
+       call z_solve
+
+       call add
+
+       return
+       end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/copy_faces.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/copy_faces.f
new file mode 100644
index 0000000..342a598
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/copy_faces.f
@@ -0,0 +1,312 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine copy_faces
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c this function copies the face values of a variable defined on a set 
+c of cells to the overlap locations of the adjacent sets of cells. 
+c Because a set of cells interfaces in each direction with exactly one 
+c other set, we only need to fill six different buffers. We could try to 
+c overlap communication with computation, by computing
+c some internal values while communicating boundary values, but this
+c adds so much overhead that it's not clearly useful. 
+c---------------------------------------------------------------------
+
+       include 'header.h'
+       include 'mpinpb.h'
+
+       integer i, j, k, c, m, requests(0:11), p0, p1, 
+     >         p2, p3, p4, p5, b_size(0:5), ss(0:5), 
+     >         sr(0:5), error, statuses(MPI_STATUS_SIZE, 0:11)
+
+c---------------------------------------------------------------------
+c      exit immediately if there are no faces to be copied           
+c---------------------------------------------------------------------
+       if (no_nodes .eq. 1) then
+          call compute_rhs
+          return
+       endif
+
+
+       ss(0) = start_send_east
+       ss(1) = start_send_west
+       ss(2) = start_send_north
+       ss(3) = start_send_south
+       ss(4) = start_send_top
+       ss(5) = start_send_bottom
+
+       sr(0) = start_recv_east
+       sr(1) = start_recv_west
+       sr(2) = start_recv_north
+       sr(3) = start_recv_south
+       sr(4) = start_recv_top
+       sr(5) = start_recv_bottom
+
+       b_size(0) = east_size   
+       b_size(1) = west_size   
+       b_size(2) = north_size  
+       b_size(3) = south_size  
+       b_size(4) = top_size    
+       b_size(5) = bottom_size 
+
+c---------------------------------------------------------------------
+c because the difference stencil for the diagonalized scheme is 
+c orthogonal, we do not have to perform the staged copying of faces, 
+c but can send all face information simultaneously to the neighboring 
+c cells in all directions          
+c---------------------------------------------------------------------
+       if (timeron) call timer_start(t_bpack)
+       p0 = 0
+       p1 = 0
+       p2 = 0
+       p3 = 0
+       p4 = 0
+       p5 = 0
+
+       do  c = 1, ncells
+          do   m = 1, 5
+
+c---------------------------------------------------------------------
+c            fill the buffer to be sent to eastern neighbors (i-dir)
+c---------------------------------------------------------------------
+             if (cell_coord(1,c) .ne. ncells) then
+                do   k = 0, cell_size(3,c)-1
+                   do   j = 0, cell_size(2,c)-1
+                      do   i = cell_size(1,c)-2, cell_size(1,c)-1
+                         out_buffer(ss(0)+p0) = u(i,j,k,m,c)
+                         p0 = p0 + 1
+                      end do
+                   end do
+                end do
+             endif
+
+c---------------------------------------------------------------------
+c            fill the buffer to be sent to western neighbors 
+c---------------------------------------------------------------------
+             if (cell_coord(1,c) .ne. 1) then
+                do   k = 0, cell_size(3,c)-1
+                   do   j = 0, cell_size(2,c)-1
+                      do   i = 0, 1
+                         out_buffer(ss(1)+p1) = u(i,j,k,m,c)
+                         p1 = p1 + 1
+                      end do
+                   end do
+                end do
+
+
+             endif
+
+c---------------------------------------------------------------------
+c            fill the buffer to be sent to northern neighbors (j-dir)
+c---------------------------------------------------------------------
+             if (cell_coord(2,c) .ne. ncells) then
+                do   k = 0, cell_size(3,c)-1
+                   do   j = cell_size(2,c)-2, cell_size(2,c)-1
+                      do   i = 0, cell_size(1,c)-1
+                         out_buffer(ss(2)+p2) = u(i,j,k,m,c)
+                         p2 = p2 + 1
+                      end do
+                   end do
+                end do
+             endif
+
+c---------------------------------------------------------------------
+c            fill the buffer to be sent to southern neighbors 
+c---------------------------------------------------------------------
+             if (cell_coord(2,c).ne. 1) then
+                do   k = 0, cell_size(3,c)-1
+                   do   j = 0, 1
+                      do   i = 0, cell_size(1,c)-1   
+                         out_buffer(ss(3)+p3) = u(i,j,k,m,c)
+                         p3 = p3 + 1
+                      end do
+                   end do
+                end do
+             endif
+
+c---------------------------------------------------------------------
+c            fill the buffer to be sent to top neighbors (k-dir)
+c---------------------------------------------------------------------
+             if (cell_coord(3,c) .ne. ncells) then
+                do   k = cell_size(3,c)-2, cell_size(3,c)-1
+                   do   j = 0, cell_size(2,c)-1
+                      do   i = 0, cell_size(1,c)-1
+                         out_buffer(ss(4)+p4) = u(i,j,k,m,c)
+                         p4 = p4 + 1
+                      end do
+                   end do
+                end do
+             endif
+
+c---------------------------------------------------------------------
+c            fill the buffer to be sent to bottom neighbors
+c---------------------------------------------------------------------
+             if (cell_coord(3,c).ne. 1) then
+                 do    k=0, 1
+                    do   j = 0, cell_size(2,c)-1
+                       do   i = 0, cell_size(1,c)-1
+                          out_buffer(ss(5)+p5) = u(i,j,k,m,c)
+                          p5 = p5 + 1
+                       end do
+                    end do
+                 end do
+              endif
+
+c---------------------------------------------------------------------
+c          m loop
+c---------------------------------------------------------------------
+           end do
+
+c---------------------------------------------------------------------
+c       cell loop
+c---------------------------------------------------------------------
+        end do
+       if (timeron) call timer_stop(t_bpack)
+
+       if (timeron) call timer_start(t_exch)
+       call mpi_irecv(in_buffer(sr(0)), b_size(0), 
+     >                dp_type, successor(1), WEST,  
+     >                comm_rhs, requests(0), error)
+       call mpi_irecv(in_buffer(sr(1)), b_size(1), 
+     >                dp_type, predecessor(1), EAST,  
+     >                comm_rhs, requests(1), error)
+       call mpi_irecv(in_buffer(sr(2)), b_size(2), 
+     >                dp_type, successor(2), SOUTH, 
+     >                comm_rhs, requests(2), error)
+       call mpi_irecv(in_buffer(sr(3)), b_size(3), 
+     >                dp_type, predecessor(2), NORTH, 
+     >                comm_rhs, requests(3), error)
+       call mpi_irecv(in_buffer(sr(4)), b_size(4), 
+     >                dp_type, successor(3), BOTTOM,
+     >                comm_rhs, requests(4), error)
+       call mpi_irecv(in_buffer(sr(5)), b_size(5), 
+     >                dp_type, predecessor(3), TOP,   
+     >                comm_rhs, requests(5), error)
+
+       call mpi_isend(out_buffer(ss(0)), b_size(0), 
+     >                dp_type, successor(1),   EAST, 
+     >                comm_rhs, requests(6), error)
+       call mpi_isend(out_buffer(ss(1)), b_size(1), 
+     >                dp_type, predecessor(1), WEST, 
+     >                comm_rhs, requests(7), error)
+       call mpi_isend(out_buffer(ss(2)), b_size(2), 
+     >                dp_type,successor(2),   NORTH, 
+     >                comm_rhs, requests(8), error)
+       call mpi_isend(out_buffer(ss(3)), b_size(3), 
+     >                dp_type,predecessor(2), SOUTH, 
+     >                comm_rhs, requests(9), error)
+       call mpi_isend(out_buffer(ss(4)), b_size(4), 
+     >                dp_type,successor(3),   TOP, 
+     >                comm_rhs,   requests(10), error)
+       call mpi_isend(out_buffer(ss(5)), b_size(5), 
+     >                dp_type,predecessor(3), BOTTOM, 
+     >                comm_rhs,requests(11), error)
+
+
+       call mpi_waitall(12, requests, statuses, error)
+       if (timeron) call timer_stop(t_exch)
+
+c---------------------------------------------------------------------
+c unpack the data that has just been received
+c---------------------------------------------------------------------
+       if (timeron) call timer_start(t_bpack)
+       p0 = 0
+       p1 = 0
+       p2 = 0
+       p3 = 0
+       p4 = 0
+       p5 = 0
+
+       do   c = 1, ncells
+          do    m = 1, 5
+
+             if (cell_coord(1,c) .ne. 1) then
+                do   k = 0, cell_size(3,c)-1
+                   do   j = 0, cell_size(2,c)-1
+                      do   i = -2, -1
+                         u(i,j,k,m,c) = in_buffer(sr(1)+p0)
+                         p0 = p0 + 1
+                      end do
+                   end do
+                end do
+             endif
+
+             if (cell_coord(1,c) .ne. ncells) then
+                do  k = 0, cell_size(3,c)-1
+                   do  j = 0, cell_size(2,c)-1
+                      do  i = cell_size(1,c), cell_size(1,c)+1
+                         u(i,j,k,m,c) = in_buffer(sr(0)+p1)
+                         p1 = p1 + 1
+                      end do
+                   end do
+                end do
+             end if
+ 
+             if (cell_coord(2,c) .ne. 1) then
+                do  k = 0, cell_size(3,c)-1
+                   do   j = -2, -1
+                      do  i = 0, cell_size(1,c)-1
+                         u(i,j,k,m,c) = in_buffer(sr(3)+p2)
+                         p2 = p2 + 1
+                      end do
+                   end do
+                end do
+
+             endif
+ 
+             if (cell_coord(2,c) .ne. ncells) then
+                do  k = 0, cell_size(3,c)-1
+                   do   j = cell_size(2,c), cell_size(2,c)+1
+                      do  i = 0, cell_size(1,c)-1
+                         u(i,j,k,m,c) = in_buffer(sr(2)+p3)
+                         p3 = p3 + 1
+                      end do
+                   end do
+                end do
+             endif
+
+             if (cell_coord(3,c) .ne. 1) then
+                do  k = -2, -1
+                   do  j = 0, cell_size(2,c)-1
+                      do  i = 0, cell_size(1,c)-1
+                         u(i,j,k,m,c) = in_buffer(sr(5)+p4)
+                         p4 = p4 + 1
+                      end do
+                   end do
+                end do
+             endif
+
+             if (cell_coord(3,c) .ne. ncells) then
+                do  k = cell_size(3,c), cell_size(3,c)+1
+                   do  j = 0, cell_size(2,c)-1
+                      do  i = 0, cell_size(1,c)-1
+                         u(i,j,k,m,c) = in_buffer(sr(4)+p5)
+                         p5 = p5 + 1
+                      end do
+                   end do
+                end do
+             endif
+
+c---------------------------------------------------------------------
+c         m loop            
+c---------------------------------------------------------------------
+          end do
+
+c---------------------------------------------------------------------
+c      cells loop
+c---------------------------------------------------------------------
+       end do
+       if (timeron) call timer_stop(t_bpack)
+
+c---------------------------------------------------------------------
+c now that we have all the data, compute the rhs
+c---------------------------------------------------------------------
+       call compute_rhs
+
+       return
+       end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/define.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/define.f
new file mode 100644
index 0000000..c465533
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/define.f
@@ -0,0 +1,66 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine compute_buffer_size(dim)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       include 'header.h'
+
+       integer  c, dim, face_size
+
+       if (ncells .eq. 1) return
+
+c---------------------------------------------------------------------
+c      compute the actual sizes of the buffers; note that there is 
+c      always one cell face that doesn't need buffer space, because it 
+c      is at the boundary of the grid
+c---------------------------------------------------------------------
+
+       west_size = 0
+       east_size = 0
+
+       do   c = 1, ncells
+          face_size = cell_size(2,c) * cell_size(3,c) * dim * 2
+          if (cell_coord(1,c).ne.1) west_size = west_size + face_size
+          if (cell_coord(1,c).ne.ncells) east_size = east_size + 
+     >                                                 face_size 
+       end do
+
+       north_size = 0
+       south_size = 0
+       do   c = 1, ncells
+          face_size = cell_size(1,c)*cell_size(3,c) * dim * 2
+          if (cell_coord(2,c).ne.1) south_size = south_size + face_size
+          if (cell_coord(2,c).ne.ncells) north_size = north_size + 
+     >                                                  face_size 
+       end do
+
+       top_size = 0
+       bottom_size = 0
+       do   c = 1, ncells
+          face_size = cell_size(1,c) * cell_size(2,c) * dim * 2
+          if (cell_coord(3,c).ne.1) bottom_size = bottom_size + 
+     >                                            face_size
+          if (cell_coord(3,c).ne.ncells) top_size = top_size +
+     >                                                face_size     
+       end do
+
+       start_send_west   = 1
+       start_send_east   = start_send_west   + west_size
+       start_send_south  = start_send_east   + east_size
+       start_send_north  = start_send_south  + south_size
+       start_send_bottom = start_send_north  + north_size
+       start_send_top    = start_send_bottom + bottom_size
+       start_recv_west   = 1
+       start_recv_east   = start_recv_west   + west_size
+       start_recv_south  = start_recv_east   + east_size
+       start_recv_north  = start_recv_south  + south_size
+       start_recv_bottom = start_recv_north  + north_size
+       start_recv_top    = start_recv_bottom + bottom_size
+
+       return
+       end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/error.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/error.f
new file mode 100644
index 0000000..fd9aab3
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/error.f
@@ -0,0 +1,105 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine error_norm(rms)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c this function computes the norm of the difference between the
+c computed solution and the exact solution
+c---------------------------------------------------------------------
+
+       include 'header.h'
+       include 'mpinpb.h'
+
+       integer c, i, j, k, m, ii, jj, kk, d, error
+       double precision xi, eta, zeta, u_exact(5), rms(5), rms_work(5),
+     >                  add
+
+       do   m = 1, 5 
+          rms_work(m) = 0.0d0
+       end do
+
+       do   c = 1, ncells
+          kk = 0
+          do   k = cell_low(3,c), cell_high(3,c)
+             zeta = dble(k) * dnzm1
+             jj = 0
+             do   j = cell_low(2,c), cell_high(2,c)
+                eta = dble(j) * dnym1
+                ii = 0
+                do   i = cell_low(1,c), cell_high(1,c)
+                   xi = dble(i) * dnxm1
+                   call exact_solution(xi, eta, zeta, u_exact)
+
+                   do   m = 1, 5
+                      add = u(ii,jj,kk,m,c)-u_exact(m)
+                      rms_work(m) = rms_work(m) + add*add
+                   end do
+                   ii = ii + 1
+                end do
+                jj = jj + 1
+             end do
+             kk = kk + 1
+          end do
+       end do
+
+       call mpi_allreduce(rms_work, rms, 5, dp_type, 
+     >                 MPI_SUM, comm_setup, error)
+
+       do    m = 1, 5
+          do    d = 1, 3
+             rms(m) = rms(m) / dble(grid_points(d)-2)
+          end do
+          rms(m) = dsqrt(rms(m))
+       end do
+
+       return
+       end
+
+
+
+       subroutine rhs_norm(rms)
+
+       include 'header.h'
+       include 'mpinpb.h'
+
+       integer c, i, j, k, d, m, error
+       double precision rms(5), rms_work(5), add
+
+       do    m = 1, 5
+          rms_work(m) = 0.0d0
+       end do
+
+       do   c = 1, ncells
+          do   k = start(3,c), cell_size(3,c)-end(3,c)-1
+             do   j = start(2,c), cell_size(2,c)-end(2,c)-1
+                do   i = start(1,c), cell_size(1,c)-end(1,c)-1
+                   do   m = 1, 5
+                      add = rhs(i,j,k,m,c)
+                      rms_work(m) = rms_work(m) + add*add
+                   end do
+                end do
+             end do
+          end do
+       end do
+
+
+
+       call mpi_allreduce(rms_work, rms, 5, dp_type, 
+     >                 MPI_SUM, comm_setup, error)
+
+       do   m = 1, 5
+          do   d = 1, 3
+             rms(m) = rms(m) / dble(grid_points(d)-2)
+          end do
+          rms(m) = dsqrt(rms(m))
+       end do
+
+       return
+       end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/exact_rhs.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/exact_rhs.f
new file mode 100644
index 0000000..b589668
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/exact_rhs.f
@@ -0,0 +1,363 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine exact_rhs
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c compute the right hand side based on exact solution
+c---------------------------------------------------------------------
+
+       include 'header.h'
+
+       double precision dtemp(5), xi, eta, zeta, dtpp
+       integer          c, m, i, j, k, ip1, im1, jp1, 
+     >                  jm1, km1, kp1
+
+c---------------------------------------------------------------------
+c loop over all cells owned by this node                   
+c---------------------------------------------------------------------
+       do   c = 1, ncells
+
+c---------------------------------------------------------------------
+c         initialize                                  
+c---------------------------------------------------------------------
+          do   m = 1, 5
+             do   k= 0, cell_size(3,c)-1
+                do   j = 0, cell_size(2,c)-1
+                   do   i = 0, cell_size(1,c)-1
+                      forcing(i,j,k,m,c) = 0.0d0
+                   end do
+                end do
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c xi-direction flux differences                      
+c---------------------------------------------------------------------
+          do   k = start(3,c), cell_size(3,c)-end(3,c)-1
+             zeta = dble(k+cell_low(3,c)) * dnzm1
+             do   j = start(2,c), cell_size(2,c)-end(2,c)-1
+                eta = dble(j+cell_low(2,c)) * dnym1
+
+                do  i=-2*(1-start(1,c)), cell_size(1,c)+1-2*end(1,c)
+                   xi = dble(i+cell_low(1,c)) * dnxm1
+
+                   call exact_solution(xi, eta, zeta, dtemp)
+                   do  m = 1, 5
+                      ue(i,m) = dtemp(m)
+                   end do
+
+                   dtpp = 1.0d0 / dtemp(1)
+
+                   do  m = 2, 5
+                      buf(i,m) = dtpp * dtemp(m)
+                   end do
+
+                   cuf(i)   = buf(i,2) * buf(i,2)
+                   buf(i,1) = cuf(i) + buf(i,3) * buf(i,3) + 
+     >                        buf(i,4) * buf(i,4) 
+                   q(i) = 0.5d0*(buf(i,2)*ue(i,2) + buf(i,3)*ue(i,3) +
+     >                           buf(i,4)*ue(i,4))
+
+                end do
+ 
+                do  i = start(1,c), cell_size(1,c)-end(1,c)-1
+                   im1 = i-1
+                   ip1 = i+1
+
+                   forcing(i,j,k,1,c) = forcing(i,j,k,1,c) -
+     >                 tx2*( ue(ip1,2)-ue(im1,2) )+
+     >                 dx1tx1*(ue(ip1,1)-2.0d0*ue(i,1)+ue(im1,1))
+
+                   forcing(i,j,k,2,c) = forcing(i,j,k,2,c) - tx2 * (
+     >                (ue(ip1,2)*buf(ip1,2)+c2*(ue(ip1,5)-q(ip1)))-
+     >                (ue(im1,2)*buf(im1,2)+c2*(ue(im1,5)-q(im1))))+
+     >                 xxcon1*(buf(ip1,2)-2.0d0*buf(i,2)+buf(im1,2))+
+     >                 dx2tx1*( ue(ip1,2)-2.0d0* ue(i,2)+ue(im1,2))
+
+                   forcing(i,j,k,3,c) = forcing(i,j,k,3,c) - tx2 * (
+     >                 ue(ip1,3)*buf(ip1,2)-ue(im1,3)*buf(im1,2))+
+     >                 xxcon2*(buf(ip1,3)-2.0d0*buf(i,3)+buf(im1,3))+
+     >                 dx3tx1*( ue(ip1,3)-2.0d0*ue(i,3) +ue(im1,3))
+                  
+                   forcing(i,j,k,4,c) = forcing(i,j,k,4,c) - tx2*(
+     >                 ue(ip1,4)*buf(ip1,2)-ue(im1,4)*buf(im1,2))+
+     >                 xxcon2*(buf(ip1,4)-2.0d0*buf(i,4)+buf(im1,4))+
+     >                 dx4tx1*( ue(ip1,4)-2.0d0* ue(i,4)+ ue(im1,4))
+
+                   forcing(i,j,k,5,c) = forcing(i,j,k,5,c) - tx2*(
+     >                 buf(ip1,2)*(c1*ue(ip1,5)-c2*q(ip1))-
+     >                 buf(im1,2)*(c1*ue(im1,5)-c2*q(im1)))+
+     >                 0.5d0*xxcon3*(buf(ip1,1)-2.0d0*buf(i,1)+
+     >                               buf(im1,1))+
+     >                 xxcon4*(cuf(ip1)-2.0d0*cuf(i)+cuf(im1))+
+     >                 xxcon5*(buf(ip1,5)-2.0d0*buf(i,5)+buf(im1,5))+
+     >                 dx5tx1*( ue(ip1,5)-2.0d0* ue(i,5)+ ue(im1,5))
+                end do
+
+c---------------------------------------------------------------------
+c Fourth-order dissipation                         
+c---------------------------------------------------------------------
+                if (start(1,c) .gt. 0) then
+                   do   m = 1, 5
+                      i = 1
+                      forcing(i,j,k,m,c) = forcing(i,j,k,m,c) - dssp *
+     >                    (5.0d0*ue(i,m) - 4.0d0*ue(i+1,m) +ue(i+2,m))
+                      i = 2
+                      forcing(i,j,k,m,c) = forcing(i,j,k,m,c) - dssp *
+     >                   (-4.0d0*ue(i-1,m) + 6.0d0*ue(i,m) -
+     >                     4.0d0*ue(i+1,m) +       ue(i+2,m))
+                   end do
+                endif
+
+                do   m = 1, 5
+                   do  i = start(1,c)*3, cell_size(1,c)-3*end(1,c)-1
+                      forcing(i,j,k,m,c) = forcing(i,j,k,m,c) - dssp*
+     >                   (ue(i-2,m) - 4.0d0*ue(i-1,m) +
+     >                    6.0d0*ue(i,m) - 4.0d0*ue(i+1,m) + ue(i+2,m))
+                   end do
+                end do
+
+                if (end(1,c) .gt. 0) then
+                   do   m = 1, 5
+                      i = cell_size(1,c)-3
+                      forcing(i,j,k,m,c) = forcing(i,j,k,m,c) - dssp *
+     >                   (ue(i-2,m) - 4.0d0*ue(i-1,m) +
+     >                    6.0d0*ue(i,m) - 4.0d0*ue(i+1,m))
+                      i = cell_size(1,c)-2
+                      forcing(i,j,k,m,c) = forcing(i,j,k,m,c) - dssp *
+     >                   (ue(i-2,m) - 4.0d0*ue(i-1,m) + 5.0d0*ue(i,m))
+                   end do
+                endif
+
+             end do
+          end do
+c---------------------------------------------------------------------
+c  eta-direction flux differences             
+c---------------------------------------------------------------------
+          do   k = start(3,c), cell_size(3,c)-end(3,c)-1          
+             zeta = dble(k+cell_low(3,c)) * dnzm1
+             do   i=start(1,c), cell_size(1,c)-end(1,c)-1
+                xi = dble(i+cell_low(1,c)) * dnxm1
+
+                do  j=-2*(1-start(2,c)), cell_size(2,c)+1-2*end(2,c)
+                   eta = dble(j+cell_low(2,c)) * dnym1
+
+                   call exact_solution(xi, eta, zeta, dtemp)
+                   do   m = 1, 5 
+                      ue(j,m) = dtemp(m)
+                   end do
+                   dtpp = 1.0d0/dtemp(1)
+
+                   do  m = 2, 5
+                      buf(j,m) = dtpp * dtemp(m)
+                   end do
+
+                   cuf(j)   = buf(j,3) * buf(j,3)
+                   buf(j,1) = cuf(j) + buf(j,2) * buf(j,2) + 
+     >                        buf(j,4) * buf(j,4)
+                   q(j) = 0.5d0*(buf(j,2)*ue(j,2) + buf(j,3)*ue(j,3) +
+     >                           buf(j,4)*ue(j,4))
+                end do
+
+                do  j = start(2,c), cell_size(2,c)-end(2,c)-1
+                   jm1 = j-1
+                   jp1 = j+1
+                  
+                   forcing(i,j,k,1,c) = forcing(i,j,k,1,c) -
+     >                ty2*( ue(jp1,3)-ue(jm1,3) )+
+     >                dy1ty1*(ue(jp1,1)-2.0d0*ue(j,1)+ue(jm1,1))
+
+                   forcing(i,j,k,2,c) = forcing(i,j,k,2,c) - ty2*(
+     >                ue(jp1,2)*buf(jp1,3)-ue(jm1,2)*buf(jm1,3))+
+     >                yycon2*(buf(jp1,2)-2.0d0*buf(j,2)+buf(jm1,2))+
+     >                dy2ty1*( ue(jp1,2)-2.0d0*ue(j,2)+ ue(jm1,2))
+
+                   forcing(i,j,k,3,c) = forcing(i,j,k,3,c) - ty2*(
+     >                (ue(jp1,3)*buf(jp1,3)+c2*(ue(jp1,5)-q(jp1)))-
+     >                (ue(jm1,3)*buf(jm1,3)+c2*(ue(jm1,5)-q(jm1))))+
+     >                yycon1*(buf(jp1,3)-2.0d0*buf(j,3)+buf(jm1,3))+
+     >                dy3ty1*( ue(jp1,3)-2.0d0*ue(j,3) +ue(jm1,3))
+
+                   forcing(i,j,k,4,c) = forcing(i,j,k,4,c) - ty2*(
+     >                ue(jp1,4)*buf(jp1,3)-ue(jm1,4)*buf(jm1,3))+
+     >                yycon2*(buf(jp1,4)-2.0d0*buf(j,4)+buf(jm1,4))+
+     >                dy4ty1*( ue(jp1,4)-2.0d0*ue(j,4)+ ue(jm1,4))
+
+                   forcing(i,j,k,5,c) = forcing(i,j,k,5,c) - ty2*(
+     >                buf(jp1,3)*(c1*ue(jp1,5)-c2*q(jp1))-
+     >                buf(jm1,3)*(c1*ue(jm1,5)-c2*q(jm1)))+
+     >                0.5d0*yycon3*(buf(jp1,1)-2.0d0*buf(j,1)+
+     >                              buf(jm1,1))+
+     >                yycon4*(cuf(jp1)-2.0d0*cuf(j)+cuf(jm1))+
+     >                yycon5*(buf(jp1,5)-2.0d0*buf(j,5)+buf(jm1,5))+
+     >                dy5ty1*(ue(jp1,5)-2.0d0*ue(j,5)+ue(jm1,5))
+                end do
+
+c---------------------------------------------------------------------
+c Fourth-order dissipation                      
+c---------------------------------------------------------------------
+                if (start(2,c) .gt. 0) then
+                   do   m = 1, 5
+                      j = 1
+                      forcing(i,j,k,m,c) = forcing(i,j,k,m,c) - dssp *
+     >                    (5.0d0*ue(j,m) - 4.0d0*ue(j+1,m) +ue(j+2,m))
+                      j = 2
+                      forcing(i,j,k,m,c) = forcing(i,j,k,m,c) - dssp *
+     >                   (-4.0d0*ue(j-1,m) + 6.0d0*ue(j,m) -
+     >                     4.0d0*ue(j+1,m) +       ue(j+2,m))
+                   end do
+                endif
+
+                do   m = 1, 5
+                   do  j = start(2,c)*3, cell_size(2,c)-3*end(2,c)-1
+                      forcing(i,j,k,m,c) = forcing(i,j,k,m,c) - dssp*
+     >                   (ue(j-2,m) - 4.0d0*ue(j-1,m) +
+     >                    6.0d0*ue(j,m) - 4.0d0*ue(j+1,m) + ue(j+2,m))
+                   end do
+                end do
+                if (end(2,c) .gt. 0) then
+                   do   m = 1, 5
+                      j = cell_size(2,c)-3
+                      forcing(i,j,k,m,c) = forcing(i,j,k,m,c) - dssp *
+     >                   (ue(j-2,m) - 4.0d0*ue(j-1,m) +
+     >                    6.0d0*ue(j,m) - 4.0d0*ue(j+1,m))
+                      j = cell_size(2,c)-2
+                      forcing(i,j,k,m,c) = forcing(i,j,k,m,c) - dssp *
+     >                   (ue(j-2,m) - 4.0d0*ue(j-1,m) + 5.0d0*ue(j,m))
+
+                   end do
+                endif
+
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c zeta-direction flux differences                      
+c---------------------------------------------------------------------
+          do  j=start(2,c), cell_size(2,c)-end(2,c)-1
+             eta = dble(j+cell_low(2,c)) * dnym1
+             do   i = start(1,c), cell_size(1,c)-end(1,c)-1
+                xi = dble(i+cell_low(1,c)) * dnxm1
+
+                do k=-2*(1-start(3,c)), cell_size(3,c)+1-2*end(3,c)
+                   zeta = dble(k+cell_low(3,c)) * dnzm1
+
+                   call exact_solution(xi, eta, zeta, dtemp)
+                   do   m = 1, 5
+                      ue(k,m) = dtemp(m)
+                   end do
+
+                   dtpp = 1.0d0/dtemp(1)
+
+                   do   m = 2, 5
+                      buf(k,m) = dtpp * dtemp(m)
+                   end do
+
+                   cuf(k)   = buf(k,4) * buf(k,4)
+                   buf(k,1) = cuf(k) + buf(k,2) * buf(k,2) + 
+     >                        buf(k,3) * buf(k,3)
+                   q(k) = 0.5d0*(buf(k,2)*ue(k,2) + buf(k,3)*ue(k,3) +
+     >                           buf(k,4)*ue(k,4))
+                end do
+
+                do    k=start(3,c), cell_size(3,c)-end(3,c)-1
+                   km1 = k-1
+                   kp1 = k+1
+                  
+                   forcing(i,j,k,1,c) = forcing(i,j,k,1,c) -
+     >                 tz2*( ue(kp1,4)-ue(km1,4) )+
+     >                 dz1tz1*(ue(kp1,1)-2.0d0*ue(k,1)+ue(km1,1))
+
+                   forcing(i,j,k,2,c) = forcing(i,j,k,2,c) - tz2 * (
+     >                 ue(kp1,2)*buf(kp1,4)-ue(km1,2)*buf(km1,4))+
+     >                 zzcon2*(buf(kp1,2)-2.0d0*buf(k,2)+buf(km1,2))+
+     >                 dz2tz1*( ue(kp1,2)-2.0d0* ue(k,2)+ ue(km1,2))
+
+                   forcing(i,j,k,3,c) = forcing(i,j,k,3,c) - tz2 * (
+     >                 ue(kp1,3)*buf(kp1,4)-ue(km1,3)*buf(km1,4))+
+     >                 zzcon2*(buf(kp1,3)-2.0d0*buf(k,3)+buf(km1,3))+
+     >                 dz3tz1*(ue(kp1,3)-2.0d0*ue(k,3)+ue(km1,3))
+
+                   forcing(i,j,k,4,c) = forcing(i,j,k,4,c) - tz2 * (
+     >                (ue(kp1,4)*buf(kp1,4)+c2*(ue(kp1,5)-q(kp1)))-
+     >                (ue(km1,4)*buf(km1,4)+c2*(ue(km1,5)-q(km1))))+
+     >                zzcon1*(buf(kp1,4)-2.0d0*buf(k,4)+buf(km1,4))+
+     >                dz4tz1*( ue(kp1,4)-2.0d0*ue(k,4) +ue(km1,4))
+
+                   forcing(i,j,k,5,c) = forcing(i,j,k,5,c) - tz2 * (
+     >                 buf(kp1,4)*(c1*ue(kp1,5)-c2*q(kp1))-
+     >                 buf(km1,4)*(c1*ue(km1,5)-c2*q(km1)))+
+     >                 0.5d0*zzcon3*(buf(kp1,1)-2.0d0*buf(k,1)
+     >                              +buf(km1,1))+
+     >                 zzcon4*(cuf(kp1)-2.0d0*cuf(k)+cuf(km1))+
+     >                 zzcon5*(buf(kp1,5)-2.0d0*buf(k,5)+buf(km1,5))+
+     >                 dz5tz1*( ue(kp1,5)-2.0d0*ue(k,5)+ ue(km1,5))
+                end do
+
+c---------------------------------------------------------------------
+c Fourth-order dissipation                        
+c---------------------------------------------------------------------
+                if (start(3,c) .gt. 0) then
+                   do   m = 1, 5
+                      k = 1
+                      forcing(i,j,k,m,c) = forcing(i,j,k,m,c) - dssp *
+     >                    (5.0d0*ue(k,m) - 4.0d0*ue(k+1,m) +ue(k+2,m))
+                      k = 2
+                      forcing(i,j,k,m,c) = forcing(i,j,k,m,c) - dssp *
+     >                   (-4.0d0*ue(k-1,m) + 6.0d0*ue(k,m) -
+     >                     4.0d0*ue(k+1,m) +       ue(k+2,m))
+                   end do
+                endif
+
+                do   m = 1, 5
+                   do  k = start(3,c)*3, cell_size(3,c)-3*end(3,c)-1
+                      forcing(i,j,k,m,c) = forcing(i,j,k,m,c) - dssp*
+     >                   (ue(k-2,m) - 4.0d0*ue(k-1,m) +
+     >                    6.0d0*ue(k,m) - 4.0d0*ue(k+1,m) + ue(k+2,m))
+                   end do
+                end do
+
+                if (end(3,c) .gt. 0) then
+                   do    m = 1, 5
+                      k = cell_size(3,c)-3
+                      forcing(i,j,k,m,c) = forcing(i,j,k,m,c) - dssp *
+     >                   (ue(k-2,m) - 4.0d0*ue(k-1,m) +
+     >                    6.0d0*ue(k,m) - 4.0d0*ue(k+1,m))
+                      k = cell_size(3,c)-2
+                      forcing(i,j,k,m,c) = forcing(i,j,k,m,c) - dssp *
+     >                   (ue(k-2,m) - 4.0d0*ue(k-1,m) + 5.0d0*ue(k,m))
+                   end do
+                endif
+
+             end do
+          end do
+c---------------------------------------------------------------------
+c now change the sign of the forcing function
+c---------------------------------------------------------------------
+          do   m = 1, 5
+             do   k = start(3,c), cell_size(3,c)-end(3,c)-1
+                do   j = start(2,c), cell_size(2,c)-end(2,c)-1
+                   do   i = start(1,c), cell_size(1,c)-end(1,c)-1
+                      forcing(i,j,k,m,c) = -1.d0 * forcing(i,j,k,m,c)
+                   end do
+                end do
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c      cell loop
+c---------------------------------------------------------------------
+       end do
+
+       return
+       end
+
+
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/exact_solution.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/exact_solution.f
new file mode 100644
index 0000000..2644f0b
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/exact_solution.f
@@ -0,0 +1,30 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine exact_solution(xi,eta,zeta,dtemp)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c this function returns the exact solution at point xi, eta, zeta  
+c---------------------------------------------------------------------
+
+       include 'header.h'
+
+       double precision  xi, eta, zeta, dtemp(5)
+       integer m
+
+       do  m = 1, 5
+          dtemp(m) =  ce(m,1) +
+     >    xi*(ce(m,2) + xi*(ce(m,5) + xi*(ce(m,8) + xi*ce(m,11)))) +
+     >    eta*(ce(m,3) + eta*(ce(m,6) + eta*(ce(m,9) + eta*ce(m,12))))+
+     >    zeta*(ce(m,4) + zeta*(ce(m,7) + zeta*(ce(m,10) + 
+     >    zeta*ce(m,13))))
+       end do
+
+       return
+       end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/header.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/header.h
new file mode 100644
index 0000000..afbb22c
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/header.h
@@ -0,0 +1,121 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+c---------------------------------------------------------------------
+c The following include file is generated automatically by the
+c "setparams" utility. It defines 
+c      maxcells:      the square root of the maximum number of processors
+c      problem_size:  12, 64, 102, 162 (for class T, A, B, C)
+c      dt_default:    default time step for this problem size if no
+c                     config file
+c      niter_default: default number of iterations for this problem size
+c---------------------------------------------------------------------
+
+      include 'npbparams.h'
+
+      integer           ncells, grid_points(3)
+      common /global/   ncells, grid_points
+
+      double precision  tx1, tx2, tx3, ty1, ty2, ty3, tz1, tz2, tz3, 
+     >                  dx1, dx2, dx3, dx4, dx5, dy1, dy2, dy3, dy4, 
+     >                  dy5, dz1, dz2, dz3, dz4, dz5, dssp, dt, 
+     >                  ce(5,13), dxmax, dymax, dzmax, xxcon1, xxcon2, 
+     >                  xxcon3, xxcon4, xxcon5, dx1tx1, dx2tx1, dx3tx1,
+     >                  dx4tx1, dx5tx1, yycon1, yycon2, yycon3, yycon4,
+     >                  yycon5, dy1ty1, dy2ty1, dy3ty1, dy4ty1, dy5ty1,
+     >                  zzcon1, zzcon2, zzcon3, zzcon4, zzcon5, dz1tz1, 
+     >                  dz2tz1, dz3tz1, dz4tz1, dz5tz1, dnxm1, dnym1, 
+     >                  dnzm1, c1c2, c1c5, c3c4, c1345, conz1, c1, c2, 
+     >                  c3, c4, c5, c4dssp, c5dssp, dtdssp, dttx1, bt,
+     >                  dttx2, dtty1, dtty2, dttz1, dttz2, c2dttx1, 
+     >                  c2dtty1, c2dttz1, comz1, comz4, comz5, comz6, 
+     >                  c3c4tx3, c3c4ty3, c3c4tz3, c2iv, con43, con16
+
+      common /constants/ tx1, tx2, tx3, ty1, ty2, ty3, tz1, tz2, tz3,
+     >                  dx1, dx2, dx3, dx4, dx5, dy1, dy2, dy3, dy4, 
+     >                  dy5, dz1, dz2, dz3, dz4, dz5, dssp, dt, 
+     >                  ce, dxmax, dymax, dzmax, xxcon1, xxcon2, 
+     >                  xxcon3, xxcon4, xxcon5, dx1tx1, dx2tx1, dx3tx1,
+     >                  dx4tx1, dx5tx1, yycon1, yycon2, yycon3, yycon4,
+     >                  yycon5, dy1ty1, dy2ty1, dy3ty1, dy4ty1, dy5ty1,
+     >                  zzcon1, zzcon2, zzcon3, zzcon4, zzcon5, dz1tz1, 
+     >                  dz2tz1, dz3tz1, dz4tz1, dz5tz1, dnxm1, dnym1, 
+     >                  dnzm1, c1c2, c1c5, c3c4, c1345, conz1, c1, c2, 
+     >                  c3, c4, c5, c4dssp, c5dssp, dtdssp, dttx1, bt,
+     >                  dttx2, dtty1, dtty2, dttz1, dttz2, c2dttx1, 
+     >                  c2dtty1, c2dttz1, comz1, comz4, comz5, comz6, 
+     >                  c3c4tx3, c3c4ty3, c3c4tz3, c2iv, con43, con16
+
+      integer           EAST, WEST, NORTH, SOUTH, 
+     >                  BOTTOM, TOP
+
+      parameter (EAST=2000, WEST=3000,      NORTH=4000, SOUTH=5000,
+     >           BOTTOM=6000, TOP=7000)
+
+      integer cell_coord (3,maxcells), cell_low (3,maxcells), 
+     >        cell_high  (3,maxcells), cell_size(3,maxcells),
+     >        predecessor(3),          slice    (3,maxcells),
+     >        grid_size  (3),          successor(3),
+     >        start      (3,maxcells), end      (3,maxcells)
+      common /partition/ cell_coord, cell_low, cell_high, cell_size,
+     >                   grid_size, successor, predecessor, slice,
+     >                   start, end
+
+      integer IMAX, JMAX, KMAX, MAX_CELL_DIM, BUF_SIZE, IMAXP, JMAXP
+
+      parameter (MAX_CELL_DIM = (problem_size/maxcells)+1)
+
+      parameter (IMAX=MAX_CELL_DIM,JMAX=MAX_CELL_DIM,KMAX=MAX_CELL_DIM)
+      parameter (IMAXP=IMAX/2*2+1,JMAXP=JMAX/2*2+1)
+
+c---------------------------------------------------------------------
+c +1 at end to avoid zero length arrays for 1 node
+c---------------------------------------------------------------------
+      parameter (BUF_SIZE=MAX_CELL_DIM*MAX_CELL_DIM*(maxcells-1)*60*2+1)
+
+      double precision 
+     >   u       (-2:IMAXP+1,-2:JMAXP+1,-2:KMAX+1, 5,maxcells),
+     >   us      (-1:IMAX,   -1:JMAX,   -1:KMAX,     maxcells),
+     >   vs      (-1:IMAX,   -1:JMAX,   -1:KMAX,     maxcells),
+     >   ws      (-1:IMAX,   -1:JMAX,   -1:KMAX,     maxcells),
+     >   qs      (-1:IMAX,   -1:JMAX,   -1:KMAX,     maxcells),
+     >   ainv    (-1:IMAX,   -1:JMAX,   -1:KMAX,     maxcells),
+     >   rho_i   (-1:IMAX,   -1:JMAX,   -1:KMAX,     maxcells),
+     >   speed   (-1:IMAX,   -1:JMAX,   -1:KMAX,     maxcells),
+     >   square  (-1:IMAX,   -1:JMAX,   -1:KMAX,     maxcells),
+     >   rhs     ( 0:IMAXP-1, 0:JMAXP-1, 0:KMAX-1, 5,maxcells),
+     >   forcing ( 0:IMAXP-1, 0:JMAXP-1, 0:KMAX-1, 5,maxcells),
+     >   lhs     ( 0:IMAXP-1, 0:JMAXP-1, 0:KMAX-1,15,maxcells),
+     >   in_buffer(BUF_SIZE), out_buffer(BUF_SIZE)
+      common /fields/  u, us, vs, ws, qs, ainv, rho_i, speed, square, 
+     >                 rhs, forcing, lhs, in_buffer, out_buffer
+
+      double precision cv(-2:MAX_CELL_DIM+1),   rhon(-2:MAX_CELL_DIM+1),
+     >                 rhos(-2:MAX_CELL_DIM+1), rhoq(-2:MAX_CELL_DIM+1),
+     >                 cuf(-2:MAX_CELL_DIM+1),  q(-2:MAX_CELL_DIM+1),
+     >                 ue(-2:MAX_CELL_DIM+1,5), buf(-2:MAX_CELL_DIM+1,5)
+      common /work_1d/ cv, rhon, rhos, rhoq, cuf, q, ue, buf
+
+      integer  west_size, east_size, bottom_size, top_size,
+     >         north_size, south_size, start_send_west, 
+     >         start_send_east, start_send_south, start_send_north,
+     >         start_send_bottom, start_send_top, start_recv_west,
+     >         start_recv_east, start_recv_south, start_recv_north,
+     >         start_recv_bottom, start_recv_top
+      common /box/ west_size, east_size, bottom_size,
+     >             top_size, north_size, south_size, 
+     >             start_send_west, start_send_east, start_send_south,
+     >             start_send_north, start_send_bottom, start_send_top,
+     >             start_recv_west, start_recv_east, start_recv_south,
+     >             start_recv_north, start_recv_bottom, start_recv_top
+
+      integer t_total, t_rhs, t_xsolve, t_ysolve, t_zsolve, t_bpack, 
+     >        t_exch, t_xcomm, t_ycomm, t_zcomm, t_last
+      parameter (t_total=1, t_rhs=2, t_xsolve=3, t_ysolve=4, 
+     >        t_zsolve=5, t_bpack=6, t_exch=7, t_xcomm=8, 
+     >        t_ycomm=9, t_zcomm=10, t_last=10)
+      logical timeron
+      common /tflags/ timeron
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/initialize.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/initialize.f
new file mode 100644
index 0000000..655c8d9
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/initialize.f
@@ -0,0 +1,286 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine  initialize
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c This subroutine initializes the field variable u using 
+c tri-linear transfinite interpolation of the boundary values     
+c---------------------------------------------------------------------
+
+       include 'header.h'
+  
+       integer c, i, j, k, m, ii, jj, kk, ix, iy, iz
+       double precision  xi, eta, zeta, Pface(5,3,2), Pxi, Peta, 
+     >                   Pzeta, temp(5)
+
+
+c---------------------------------------------------------------------
+c  Later (in compute_rhs) we compute 1/u for every element. A few of 
+c  the corner elements are not used, but it is convenient (and faster)
+c  to compute the whole thing with a simple loop. Make sure those 
+c  values are nonzero by initializing the whole thing here. 
+c---------------------------------------------------------------------
+      do c = 1, ncells
+         do kk = -1, IMAX
+            do jj = -1, IMAX
+               do ii = -1, IMAX
+                  u(ii, jj, kk, 1, c) = 1.0d0
+                  u(ii, jj, kk, 2, c) = 0.0d0
+                  u(ii, jj, kk, 3, c) = 0.0d0
+                  u(ii, jj, kk, 4, c) = 0.0d0
+                  u(ii, jj, kk, 5, c) = 1.0d0
+               end do
+            end do
+         end do
+      end do
+
+c---------------------------------------------------------------------
+c first store the "interpolated" values everywhere on the grid    
+c---------------------------------------------------------------------
+       do  c=1, ncells
+          kk = 0
+          do  k = cell_low(3,c), cell_high(3,c)
+             zeta = dble(k) * dnzm1
+             jj = 0
+             do  j = cell_low(2,c), cell_high(2,c)
+                eta = dble(j) * dnym1
+                ii = 0
+                do   i = cell_low(1,c), cell_high(1,c)
+                   xi = dble(i) * dnxm1
+                  
+                   do ix = 1, 2
+                      call exact_solution(dble(ix-1), eta, zeta, 
+     >                                    Pface(1,1,ix))
+                   end do
+
+                   do    iy = 1, 2
+                      call exact_solution(xi, dble(iy-1) , zeta, 
+     >                                    Pface(1,2,iy))
+                   end do
+
+                   do    iz = 1, 2
+                      call exact_solution(xi, eta, dble(iz-1),   
+     >                                    Pface(1,3,iz))
+                   end do
+
+                   do   m = 1, 5
+                      Pxi   = xi   * Pface(m,1,2) + 
+     >                        (1.0d0-xi)   * Pface(m,1,1)
+                      Peta  = eta  * Pface(m,2,2) + 
+     >                        (1.0d0-eta)  * Pface(m,2,1)
+                      Pzeta = zeta * Pface(m,3,2) + 
+     >                        (1.0d0-zeta) * Pface(m,3,1)
+ 
+                      u(ii,jj,kk,m,c) = Pxi + Peta + Pzeta - 
+     >                          Pxi*Peta - Pxi*Pzeta - Peta*Pzeta + 
+     >                          Pxi*Peta*Pzeta
+
+                   end do
+                   ii = ii + 1
+                end do
+                jj = jj + 1
+             end do
+             kk = kk+1
+          end do
+       end do
+
+c---------------------------------------------------------------------
+c now store the exact values on the boundaries        
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c west face                                                  
+c---------------------------------------------------------------------
+       c = slice(1,1)
+       ii = 0
+       xi = 0.0d0
+       kk = 0
+       do  k = cell_low(3,c), cell_high(3,c)
+          zeta = dble(k) * dnzm1
+          jj = 0
+          do   j = cell_low(2,c), cell_high(2,c)
+             eta = dble(j) * dnym1
+             call exact_solution(xi, eta, zeta, temp)
+             do   m = 1, 5
+                u(ii,jj,kk,m,c) = temp(m)
+             end do
+             jj = jj + 1
+          end do
+          kk = kk + 1
+       end do
+
+c---------------------------------------------------------------------
+c east face                                                      
+c---------------------------------------------------------------------
+       c  = slice(1,ncells)
+       ii = cell_size(1,c)-1
+       xi = 1.0d0
+       kk = 0
+       do   k = cell_low(3,c), cell_high(3,c)
+          zeta = dble(k) * dnzm1
+          jj = 0
+          do   j = cell_low(2,c), cell_high(2,c)
+             eta = dble(j) * dnym1
+             call exact_solution(xi, eta, zeta, temp)
+             do   m = 1, 5
+                u(ii,jj,kk,m,c) = temp(m)
+             end do
+             jj = jj + 1
+          end do
+          kk = kk + 1
+       end do
+
+c---------------------------------------------------------------------
+c south face                                                 
+c---------------------------------------------------------------------
+       c = slice(2,1)
+       jj = 0
+       eta = 0.0d0
+       kk = 0
+       do  k = cell_low(3,c), cell_high(3,c)
+          zeta = dble(k) * dnzm1
+          ii = 0
+          do   i = cell_low(1,c), cell_high(1,c)
+             xi = dble(i) * dnxm1
+             call exact_solution(xi, eta, zeta, temp)
+             do   m = 1, 5
+                u(ii,jj,kk,m,c) = temp(m)
+             end do
+             ii = ii + 1
+          end do
+          kk = kk + 1
+       end do
+
+
+c---------------------------------------------------------------------
+c north face                                    
+c---------------------------------------------------------------------
+       c = slice(2,ncells)
+       jj = cell_size(2,c)-1
+       eta = 1.0d0
+       kk = 0
+       do   k = cell_low(3,c), cell_high(3,c)
+          zeta = dble(k) * dnzm1
+          ii = 0
+          do   i = cell_low(1,c), cell_high(1,c)
+             xi = dble(i) * dnxm1
+             call exact_solution(xi, eta, zeta, temp)
+             do   m = 1, 5
+                u(ii,jj,kk,m,c) = temp(m)
+             end do
+             ii = ii + 1
+          end do
+          kk = kk + 1
+       end do
+
+c---------------------------------------------------------------------
+c bottom face                                       
+c---------------------------------------------------------------------
+       c = slice(3,1)
+       kk = 0
+       zeta = 0.0d0
+       jj = 0
+       do   j = cell_low(2,c), cell_high(2,c)
+          eta = dble(j) * dnym1
+          ii = 0
+          do   i =cell_low(1,c), cell_high(1,c)
+             xi = dble(i) *dnxm1
+             call exact_solution(xi, eta, zeta, temp)
+             do   m = 1, 5
+                u(ii,jj,kk,m,c) = temp(m)
+             end do
+             ii = ii + 1
+          end do
+          jj = jj + 1
+       end do
+
+c---------------------------------------------------------------------
+c top face     
+c---------------------------------------------------------------------
+       c = slice(3,ncells)
+       kk = cell_size(3,c)-1
+       zeta = 1.0d0
+       jj = 0
+       do   j = cell_low(2,c), cell_high(2,c)
+          eta = dble(j) * dnym1
+          ii = 0
+          do   i =cell_low(1,c), cell_high(1,c)
+             xi = dble(i) * dnxm1
+             call exact_solution(xi, eta, zeta, temp)
+             do   m = 1, 5
+                u(ii,jj,kk,m,c) = temp(m)
+             end do
+             ii = ii + 1
+          end do
+          jj = jj + 1
+       end do
+
+       return
+       end
+
+
+       subroutine lhsinit
+
+       include 'header.h'
+       
+       integer i, j, k, d, c, n
+
+c---------------------------------------------------------------------
+c loop over all cells                                       
+c---------------------------------------------------------------------
+       do  c = 1, ncells
+
+c---------------------------------------------------------------------
+c         first, initialize the start and end arrays
+c---------------------------------------------------------------------
+          do  d = 1, 3
+             if (cell_coord(d,c) .eq. 1) then
+                start(d,c) = 1
+             else 
+                start(d,c) = 0
+             endif
+             if (cell_coord(d,c) .eq. ncells) then
+                end(d,c) = 1
+             else
+                end(d,c) = 0
+             endif
+          end do
+
+c---------------------------------------------------------------------
+c     zap the whole left hand side for starters
+c---------------------------------------------------------------------
+          do  n = 1, 15
+             do  k = 0, cell_size(3,c)-1
+                do  j = 0, cell_size(2,c)-1
+                   do  i = 0, cell_size(1,c)-1
+                      lhs(i,j,k,n,c) = 0.0d0
+                   end do
+                end do
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c next, set all diagonal values to 1. This is overkill, but convenient
+c---------------------------------------------------------------------
+          do   n = 1, 3
+             do   k = 0, cell_size(3,c)-1
+                do   j = 0, cell_size(2,c)-1
+                   do   i = 0, cell_size(1,c)-1
+                      lhs(i,j,k,5*n-2,c) = 1.0d0
+                   end do
+                end do
+             end do
+          end do
+
+       end do
+
+      return
+      end
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/inputsp.data.sample b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/inputsp.data.sample
new file mode 100644
index 0000000..ae3801f
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/inputsp.data.sample
@@ -0,0 +1,3 @@
+400       number of time steps
+0.0015d0  dt for class A = 0.0015d0. class B = 0.001d0  class C = 0.00067d0
+64 64 64
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/lhsx.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/lhsx.f
new file mode 100644
index 0000000..cae7779
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/lhsx.f
@@ -0,0 +1,124 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine lhsx(c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c This function computes the left hand side for the three x-factors  
+c---------------------------------------------------------------------
+
+       include 'header.h'
+
+       double precision ru1
+       integer          i, j, k, c
+
+
+c---------------------------------------------------------------------
+c      treat only cell c             
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c      first fill the lhs for the u-eigenvalue                   
+c---------------------------------------------------------------------
+       do  k = start(3,c), cell_size(3,c)-end(3,c)-1
+          do  j = start(2,c), cell_size(2,c)-end(2,c)-1
+             do  i = start(1,c)-1, cell_size(1,c)-end(1,c)
+                ru1 = c3c4*rho_i(i,j,k,c)
+                cv(i) = us(i,j,k,c)
+                rhon(i) = dmax1(dx2+con43*ru1, 
+     >                          dx5+c1c5*ru1,
+     >                          dxmax+ru1,
+     >                          dx1)
+             end do
+
+             do  i = start(1,c), cell_size(1,c)-end(1,c)-1
+                lhs(i,j,k,1,c) =   0.0d0
+                lhs(i,j,k,2,c) = - dttx2 * cv(i-1) - dttx1 * rhon(i-1)
+                lhs(i,j,k,3,c) =   1.0d0 + c2dttx1 * rhon(i)
+                lhs(i,j,k,4,c) =   dttx2 * cv(i+1) - dttx1 * rhon(i+1)
+                lhs(i,j,k,5,c) =   0.0d0
+             end do
+          end do
+       end do
+
+c---------------------------------------------------------------------
+c      add fourth order dissipation                             
+c---------------------------------------------------------------------
+       if (start(1,c) .gt. 0) then
+          i = 1
+          do   k = start(3,c), cell_size(3,c)-end(3,c)-1
+             do   j = start(2,c), cell_size(2,c)-end(2,c)-1
+                lhs(i,j,k,3,c) = lhs(i,j,k,3,c) + comz5
+                lhs(i,j,k,4,c) = lhs(i,j,k,4,c) - comz4
+                lhs(i,j,k,5,c) = lhs(i,j,k,5,c) + comz1
+  
+                lhs(i+1,j,k,2,c) = lhs(i+1,j,k,2,c) - comz4
+                lhs(i+1,j,k,3,c) = lhs(i+1,j,k,3,c) + comz6
+                lhs(i+1,j,k,4,c) = lhs(i+1,j,k,4,c) - comz4
+                lhs(i+1,j,k,5,c) = lhs(i+1,j,k,5,c) + comz1
+             end do
+          end do
+       endif
+
+       do   k = start(3,c), cell_size(3,c)-end(3,c)-1
+          do   j = start(2,c), cell_size(2,c)-end(2,c)-1
+             do   i=3*start(1,c), cell_size(1,c)-3*end(1,c)-1
+                lhs(i,j,k,1,c) = lhs(i,j,k,1,c) + comz1
+                lhs(i,j,k,2,c) = lhs(i,j,k,2,c) - comz4
+                lhs(i,j,k,3,c) = lhs(i,j,k,3,c) + comz6
+                lhs(i,j,k,4,c) = lhs(i,j,k,4,c) - comz4
+                lhs(i,j,k,5,c) = lhs(i,j,k,5,c) + comz1
+             end do
+          end do
+       end do
+
+       if (end(1,c) .gt. 0) then
+          i = cell_size(1,c)-3
+          do   k = start(3,c), cell_size(3,c)-end(3,c)-1
+             do   j = start(2,c), cell_size(2,c)-end(2,c)-1
+                lhs(i,j,k,1,c) = lhs(i,j,k,1,c) + comz1
+                lhs(i,j,k,2,c) = lhs(i,j,k,2,c) - comz4
+                lhs(i,j,k,3,c) = lhs(i,j,k,3,c) + comz6
+                lhs(i,j,k,4,c) = lhs(i,j,k,4,c) - comz4
+
+                lhs(i+1,j,k,1,c) = lhs(i+1,j,k,1,c) + comz1
+                lhs(i+1,j,k,2,c) = lhs(i+1,j,k,2,c) - comz4
+                lhs(i+1,j,k,3,c) = lhs(i+1,j,k,3,c) + comz5
+             end do
+          end do
+       endif
+
+c---------------------------------------------------------------------
+c      subsequently, fill the other factors (u+c), (u-c) by adding to 
+c      the first  
+c---------------------------------------------------------------------
+       do   k = start(3,c), cell_size(3,c)-end(3,c)-1
+          do   j = start(2,c), cell_size(2,c)-end(2,c)-1
+             do   i = start(1,c), cell_size(1,c)-end(1,c)-1
+                lhs(i,j,k,1+5,c)  = lhs(i,j,k,1,c)
+                lhs(i,j,k,2+5,c)  = lhs(i,j,k,2,c) - 
+     >                            dttx2 * speed(i-1,j,k,c)
+                lhs(i,j,k,3+5,c)  = lhs(i,j,k,3,c)
+                lhs(i,j,k,4+5,c)  = lhs(i,j,k,4,c) + 
+     >                            dttx2 * speed(i+1,j,k,c)
+                lhs(i,j,k,5+5,c) = lhs(i,j,k,5,c)
+                lhs(i,j,k,1+10,c) = lhs(i,j,k,1,c)
+                lhs(i,j,k,2+10,c) = lhs(i,j,k,2,c) + 
+     >                            dttx2 * speed(i-1,j,k,c)
+                lhs(i,j,k,3+10,c) = lhs(i,j,k,3,c)
+                lhs(i,j,k,4+10,c) = lhs(i,j,k,4,c) - 
+     >                            dttx2 * speed(i+1,j,k,c)
+                lhs(i,j,k,5+10,c) = lhs(i,j,k,5,c)
+             end do
+          end do
+       end do
+
+       return
+       end
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/lhsy.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/lhsy.f
new file mode 100644
index 0000000..9c07a35
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/lhsy.f
@@ -0,0 +1,125 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine lhsy(c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c This function computes the left hand side for the three y-factors   
+c---------------------------------------------------------------------
+
+       include 'header.h'
+
+       double precision ru1
+       integer          i, j, k, c
+
+c---------------------------------------------------------------------
+c      treat only cell c
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c      first fill the lhs for the u-eigenvalue         
+c---------------------------------------------------------------------
+       do  k = start(3,c), cell_size(3,c)-end(3,c)-1
+          do  i = start(1,c), cell_size(1,c)-end(1,c)-1
+
+             do  j = start(2,c)-1, cell_size(2,c)-end(2,c)
+                ru1 = c3c4*rho_i(i,j,k,c)
+                cv(j) = vs(i,j,k,c)
+                rhoq(j) = dmax1( dy3 + con43 * ru1,
+     >                           dy5 + c1c5*ru1,
+     >                           dymax + ru1,
+     >                           dy1)
+             end do
+            
+             do  j = start(2,c), cell_size(2,c)-end(2,c)-1
+                lhs(i,j,k,1,c) =  0.0d0
+                lhs(i,j,k,2,c) = -dtty2 * cv(j-1) - dtty1 * rhoq(j-1)
+                lhs(i,j,k,3,c) =  1.0d0 + c2dtty1 * rhoq(j)
+                lhs(i,j,k,4,c) =  dtty2 * cv(j+1) - dtty1 * rhoq(j+1)
+                lhs(i,j,k,5,c) =  0.0d0
+             end do
+          end do
+       end do
+
+c---------------------------------------------------------------------
+c      add fourth order dissipation                             
+c---------------------------------------------------------------------
+       if (start(2,c) .gt. 0) then
+          j = 1
+          do   k = start(3,c), cell_size(3,c)-end(3,c)-1
+             do   i = start(1,c), cell_size(1,c)-end(1,c)-1
+
+                lhs(i,j,k,3,c) = lhs(i,j,k,3,c) + comz5
+                lhs(i,j,k,4,c) = lhs(i,j,k,4,c) - comz4
+                lhs(i,j,k,5,c) = lhs(i,j,k,5,c) + comz1
+       
+                lhs(i,j+1,k,2,c) = lhs(i,j+1,k,2,c) - comz4
+                lhs(i,j+1,k,3,c) = lhs(i,j+1,k,3,c) + comz6
+                lhs(i,j+1,k,4,c) = lhs(i,j+1,k,4,c) - comz4
+                lhs(i,j+1,k,5,c) = lhs(i,j+1,k,5,c) + comz1
+             end do
+          end do
+       endif
+
+       do   k = start(3,c), cell_size(3,c)-end(3,c)-1
+          do   j=3*start(2,c), cell_size(2,c)-3*end(2,c)-1
+             do   i = start(1,c), cell_size(1,c)-end(1,c)-1
+
+                lhs(i,j,k,1,c) = lhs(i,j,k,1,c) + comz1
+                lhs(i,j,k,2,c) = lhs(i,j,k,2,c) - comz4
+                lhs(i,j,k,3,c) = lhs(i,j,k,3,c) + comz6
+                lhs(i,j,k,4,c) = lhs(i,j,k,4,c) - comz4
+                lhs(i,j,k,5,c) = lhs(i,j,k,5,c) + comz1
+             end do
+          end do
+       end do
+
+       if (end(2,c) .gt. 0) then
+          j = cell_size(2,c)-3
+          do   k = start(3,c), cell_size(3,c)-end(3,c)-1
+             do   i = start(1,c), cell_size(1,c)-end(1,c)-1
+                lhs(i,j,k,1,c) = lhs(i,j,k,1,c) + comz1
+                lhs(i,j,k,2,c) = lhs(i,j,k,2,c) - comz4
+                lhs(i,j,k,3,c) = lhs(i,j,k,3,c) + comz6
+                lhs(i,j,k,4,c) = lhs(i,j,k,4,c) - comz4
+
+                lhs(i,j+1,k,1,c) = lhs(i,j+1,k,1,c) + comz1
+                lhs(i,j+1,k,2,c) = lhs(i,j+1,k,2,c) - comz4
+                lhs(i,j+1,k,3,c) = lhs(i,j+1,k,3,c) + comz5
+             end do
+          end do
+       endif
+
+c---------------------------------------------------------------------
+c      subsequently, do the other two factors                    
+c---------------------------------------------------------------------
+       do    k = start(3,c), cell_size(3,c)-end(3,c)-1
+          do    j = start(2,c), cell_size(2,c)-end(2,c)-1
+             do    i = start(1,c), cell_size(1,c)-end(1,c)-1
+                lhs(i,j,k,1+5,c)  = lhs(i,j,k,1,c)
+                lhs(i,j,k,2+5,c)  = lhs(i,j,k,2,c) - 
+     >                            dtty2 * speed(i,j-1,k,c)
+                lhs(i,j,k,3+5,c)  = lhs(i,j,k,3,c)
+                lhs(i,j,k,4+5,c)  = lhs(i,j,k,4,c) + 
+     >                            dtty2 * speed(i,j+1,k,c)
+                lhs(i,j,k,5+5,c) = lhs(i,j,k,5,c)
+                lhs(i,j,k,1+10,c) = lhs(i,j,k,1,c)
+                lhs(i,j,k,2+10,c) = lhs(i,j,k,2,c) + 
+     >                            dtty2 * speed(i,j-1,k,c)
+                lhs(i,j,k,3+10,c) = lhs(i,j,k,3,c)
+                lhs(i,j,k,4+10,c) = lhs(i,j,k,4,c) - 
+     >                            dtty2 * speed(i,j+1,k,c)
+                lhs(i,j,k,5+10,c) = lhs(i,j,k,5,c)
+             end do
+          end do
+       end do
+
+       return
+       end
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/lhsz.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/lhsz.f
new file mode 100644
index 0000000..08ea0bc
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/lhsz.f
@@ -0,0 +1,123 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine lhsz(c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c This function computes the left hand side for the three z-factors   
+c---------------------------------------------------------------------
+
+       include 'header.h'
+
+       double precision ru1
+       integer i, j, k, c
+
+c---------------------------------------------------------------------
+c      treat only cell c                                         
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c first fill the lhs for the u-eigenvalue                          
+c---------------------------------------------------------------------
+       do   j = start(2,c), cell_size(2,c)-end(2,c)-1
+          do   i = start(1,c), cell_size(1,c)-end(1,c)-1
+
+             do   k = start(3,c)-1, cell_size(3,c)-end(3,c)
+                ru1 = c3c4*rho_i(i,j,k,c)
+                cv(k) = ws(i,j,k,c)
+                rhos(k) = dmax1(dz4 + con43 * ru1,
+     >                          dz5 + c1c5 * ru1,
+     >                          dzmax + ru1,
+     >                          dz1)
+             end do
+
+             do   k =  start(3,c), cell_size(3,c)-end(3,c)-1
+                lhs(i,j,k,1,c) =  0.0d0
+                lhs(i,j,k,2,c) = -dttz2 * cv(k-1) - dttz1 * rhos(k-1)
+                lhs(i,j,k,3,c) =  1.0d0 + c2dttz1 * rhos(k)
+                lhs(i,j,k,4,c) =  dttz2 * cv(k+1) - dttz1 * rhos(k+1)
+                lhs(i,j,k,5,c) =  0.0d0
+             end do
+          end do
+       end do
+
+c---------------------------------------------------------------------
+c      add fourth order dissipation                                  
+c---------------------------------------------------------------------
+       if (start(3,c) .gt. 0) then
+          k = 1
+          do    j = start(2,c), cell_size(2,c)-end(2,c)-1
+             do    i = start(1,c), cell_size(1,c)-end(1,c)-1
+                lhs(i,j,k,3,c) = lhs(i,j,k,3,c) + comz5
+                lhs(i,j,k,4,c) = lhs(i,j,k,4,c) - comz4
+                lhs(i,j,k,5,c) = lhs(i,j,k,5,c) + comz1
+
+                lhs(i,j,k+1,2,c) = lhs(i,j,k+1,2,c) - comz4
+                lhs(i,j,k+1,3,c) = lhs(i,j,k+1,3,c) + comz6
+                lhs(i,j,k+1,4,c) = lhs(i,j,k+1,4,c) - comz4
+                lhs(i,j,k+1,5,c) = lhs(i,j,k+1,5,c) + comz1
+             end do
+          end do
+       endif
+
+       do    k = 3*start(3,c), cell_size(3,c)-3*end(3,c)-1
+          do    j = start(2,c), cell_size(2,c)-end(2,c)-1
+             do    i = start(1,c), cell_size(1,c)-end(1,c)-1
+                lhs(i,j,k,1,c) = lhs(i,j,k,1,c) + comz1
+                lhs(i,j,k,2,c) = lhs(i,j,k,2,c) - comz4
+                lhs(i,j,k,3,c) = lhs(i,j,k,3,c) + comz6
+                lhs(i,j,k,4,c) = lhs(i,j,k,4,c) - comz4
+                lhs(i,j,k,5,c) = lhs(i,j,k,5,c) + comz1
+             end do
+          end do
+       end do
+
+       if (end(3,c) .gt. 0) then
+          k = cell_size(3,c)-3 
+          do    j = start(2,c), cell_size(2,c)-end(2,c)-1
+             do    i = start(1,c), cell_size(1,c)-end(1,c)-1
+                lhs(i,j,k,1,c) = lhs(i,j,k,1,c) + comz1
+                lhs(i,j,k,2,c) = lhs(i,j,k,2,c) - comz4
+                lhs(i,j,k,3,c) = lhs(i,j,k,3,c) + comz6
+                lhs(i,j,k,4,c) = lhs(i,j,k,4,c) - comz4
+
+                lhs(i,j,k+1,1,c) = lhs(i,j,k+1,1,c) + comz1
+                lhs(i,j,k+1,2,c) = lhs(i,j,k+1,2,c) - comz4
+                lhs(i,j,k+1,3,c) = lhs(i,j,k+1,3,c) + comz5
+             end do
+          end do
+       endif
+
+
+c---------------------------------------------------------------------
+c      subsequently, fill the other factors (u+c), (u-c) 
+c---------------------------------------------------------------------
+       do    k = start(3,c), cell_size(3,c)-end(3,c)-1
+          do    j = start(2,c), cell_size(2,c)-end(2,c)-1
+             do    i = start(1,c), cell_size(1,c)-end(1,c)-1
+                lhs(i,j,k,1+5,c)  = lhs(i,j,k,1,c)
+                lhs(i,j,k,2+5,c)  = lhs(i,j,k,2,c) - 
+     >                            dttz2 * speed(i,j,k-1,c)
+                lhs(i,j,k,3+5,c)  = lhs(i,j,k,3,c)
+                lhs(i,j,k,4+5,c)  = lhs(i,j,k,4,c) + 
+     >                            dttz2 * speed(i,j,k+1,c)
+                lhs(i,j,k,5+5,c) = lhs(i,j,k,5,c)
+                lhs(i,j,k,1+10,c) = lhs(i,j,k,1,c)
+                lhs(i,j,k,2+10,c) = lhs(i,j,k,2,c) + 
+     >                            dttz2 * speed(i,j,k-1,c)
+                lhs(i,j,k,3+10,c) = lhs(i,j,k,3,c)
+                lhs(i,j,k,4+10,c) = lhs(i,j,k,4,c) - 
+     >                            dttz2 * speed(i,j,k+1,c)
+                lhs(i,j,k,5+10,c) = lhs(i,j,k,5,c)
+             end do
+          end do
+       end do
+
+       return
+       end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/make_set.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/make_set.f
new file mode 100644
index 0000000..d3f3b36
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/make_set.f
@@ -0,0 +1,121 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine make_set
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c This function allocates space for a set of cells and fills the set     
+c such that communication between cells on different nodes is only
+c nearest neighbor                                                   
+c---------------------------------------------------------------------
+
+       include 'header.h'
+       include 'mpinpb.h'
+
+       integer p, i, j, c, dir, size, excess, ierr,ierrcode
+
+c---------------------------------------------------------------------
+c     compute square root; add small number to allow for roundoff
+c     (note: this is computed in setup_mpi.f also, but prefer to do
+c     it twice because of some include file problems).
+c---------------------------------------------------------------------
+      ncells = dint(dsqrt(dble(no_nodes) + 0.00001d0))
+
+c---------------------------------------------------------------------
+c      this makes coding easier
+c---------------------------------------------------------------------
+       p = ncells
+   
+c---------------------------------------------------------------------
+c      determine the location of the cell at the bottom of the 3D 
+c      array of cells
+c---------------------------------------------------------------------
+       cell_coord(1,1) = mod(node,p) 
+       cell_coord(2,1) = node/p 
+       cell_coord(3,1) = 0
+
+c---------------------------------------------------------------------
+c      set the cell_coords for cells in the rest of the z-layers; 
+c      this comes down to a simple linear numbering in the
+c      z-direction, and to the doubly-cyclic numbering in the other dirs
+c---------------------------------------------------------------------
+       do    c=2, p
+          cell_coord(1,c) = mod(cell_coord(1,c-1)+1,p) 
+          cell_coord(2,c) = mod(cell_coord(2,c-1)-1+p,p) 
+          cell_coord(3,c) = c-1
+       end do
+
+c---------------------------------------------------------------------
+c      offset all the coordinates by 1 to adjust for Fortran arrays
+c---------------------------------------------------------------------
+       do    dir = 1, 3
+          do    c = 1, p
+             cell_coord(dir,c) = cell_coord(dir,c) + 1
+          end do
+       end do
+   
+c---------------------------------------------------------------------
+c      slice(dir,n) contains the sequence number of the cell that is in
+c      coordinate plane n in the dir direction
+c---------------------------------------------------------------------
+       do   dir = 1, 3
+          do   c = 1, p
+             slice(dir,cell_coord(dir,c)) = c
+          end do
+       end do
+
+
+c---------------------------------------------------------------------
+c      fill the predecessor and successor entries, using the indices 
+c      of the bottom cells (they are the same at each level of k 
+c      anyway) acting as if full periodicity pertains; note that p is
+c      added to those arguments of the mod function that could
+c      otherwise be negative, for which mod returns a negative value
+c---------------------------------------------------------------------
+       i = cell_coord(1,1)-1
+       j = cell_coord(2,1)-1
+
+       predecessor(1) = mod(i-1+p,p) + p*j
+       predecessor(2) = i + p*mod(j-1+p,p)
+       predecessor(3) = mod(i+1,p) + p*mod(j-1+p,p)
+       successor(1)   = mod(i+1,p) + p*j
+       successor(2)   = i + p*mod(j+1,p)
+       successor(3)   = mod(i-1+p,p) + p*mod(j+1,p)
+
+c---------------------------------------------------------------------
+c now compute the sizes of the cells                                    
+c---------------------------------------------------------------------
+       do    dir= 1, 3
+c---------------------------------------------------------------------
+c         set cell_coord range for each direction                            
+c---------------------------------------------------------------------
+          size   = grid_points(dir)/p
+          excess = mod(grid_points(dir),p)
+          do    c=1, ncells
+             if (cell_coord(dir,c) .le. excess) then
+                cell_size(dir,c) = size+1
+                cell_low(dir,c) = (cell_coord(dir,c)-1)*(size+1)
+                cell_high(dir,c) = cell_low(dir,c)+size
+             else 
+                cell_size(dir,c) = size
+                cell_low(dir,c)  = excess*(size+1)+
+     >                   (cell_coord(dir,c)-excess-1)*size
+                cell_high(dir,c) = cell_low(dir,c)+size-1
+             endif
+             if (cell_size(dir, c) .le. 2) then
+                write(*,50)
+ 50             format(' Error: Cell size too small. Min size is 3')
+                ierrcode = 1
+                call MPI_Abort(mpi_comm_world,ierrcode,ierr)
+                stop
+             endif
+          end do
+       end do
+
+       return
+       end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/mpinpb.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/mpinpb.h
new file mode 100644
index 0000000..439db34
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/mpinpb.h
@@ -0,0 +1,13 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      include 'mpif.h'
+
+      integer           node, no_nodes, total_nodes, root, comm_setup, 
+     >                  comm_solve, comm_rhs, dp_type
+      logical           active
+      common /mpistuff/ node, no_nodes, total_nodes, root, comm_setup, 
+     >                  comm_solve, comm_rhs, dp_type, active
+      integer           DEFAULT_TAG
+      parameter         (DEFAULT_TAG = 0)
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/ninvr.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/ninvr.f
new file mode 100644
index 0000000..146d046
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/ninvr.f
@@ -0,0 +1,45 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine  ninvr(c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   block-diagonal matrix-vector multiplication              
+c---------------------------------------------------------------------
+
+       include 'header.h'
+
+       integer  c,  i, j, k
+       double precision r1, r2, r3, r4, r5, t1, t2
+
+c---------------------------------------------------------------------
+c      treat only one cell                           
+c---------------------------------------------------------------------
+       do k = start(3,c), cell_size(3,c)-end(3,c)-1
+          do j = start(2,c), cell_size(2,c)-end(2,c)-1
+             do i = start(1,c), cell_size(1,c)-end(1,c)-1
+
+                r1 = rhs(i,j,k,1,c)
+                r2 = rhs(i,j,k,2,c)
+                r3 = rhs(i,j,k,3,c)
+                r4 = rhs(i,j,k,4,c)
+                r5 = rhs(i,j,k,5,c)
+               
+                t1 = bt * r3
+                t2 = 0.5d0 * ( r4 + r5 )
+
+                rhs(i,j,k,1,c) = -r2
+                rhs(i,j,k,2,c) =  r1
+                rhs(i,j,k,3,c) = bt * ( r4 - r5 )
+                rhs(i,j,k,4,c) = -t1 + t2
+                rhs(i,j,k,5,c) =  t1 + t2
+             enddo    
+          enddo
+       enddo
+
+       return
+       end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/pinvr.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/pinvr.f
new file mode 100644
index 0000000..060f0a5
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/pinvr.f
@@ -0,0 +1,48 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine pinvr(c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   block-diagonal matrix-vector multiplication                       
+c---------------------------------------------------------------------
+
+       include 'header.h'
+
+       integer i, j, k, c
+       double precision r1, r2, r3, r4, r5, t1, t2
+
+c---------------------------------------------------------------------
+c      treat only one cell                                   
+c---------------------------------------------------------------------
+       do   k = start(3,c), cell_size(3,c)-end(3,c)-1
+          do   j = start(2,c), cell_size(2,c)-end(2,c)-1
+             do   i = start(1,c), cell_size(1,c)-end(1,c)-1
+
+                r1 = rhs(i,j,k,1,c)
+                r2 = rhs(i,j,k,2,c)
+                r3 = rhs(i,j,k,3,c)
+                r4 = rhs(i,j,k,4,c)
+                r5 = rhs(i,j,k,5,c)
+
+                t1 = bt * r1
+                t2 = 0.5d0 * ( r4 + r5 )
+
+                rhs(i,j,k,1,c) =  bt * ( r4 - r5 )
+                rhs(i,j,k,2,c) = -r3
+                rhs(i,j,k,3,c) =  r2
+                rhs(i,j,k,4,c) = -t1 + t2
+                rhs(i,j,k,5,c) =  t1 + t2
+             end do
+          end do
+       end do
+
+       return
+       end
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/rhs.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/rhs.f
new file mode 100644
index 0000000..b687f12
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/rhs.f
@@ -0,0 +1,449 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine compute_rhs
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       include 'header.h'
+
+       integer c, i, j, k, m
+       double precision aux, rho_inv, uijk, up1, um1, vijk, vp1, vm1,
+     >                  wijk, wp1, wm1
+
+
+       if (timeron) call timer_start(t_rhs)
+c---------------------------------------------------------------------
+c loop over all cells owned by this node                           
+c---------------------------------------------------------------------
+       do    c = 1, ncells
+
+c---------------------------------------------------------------------
+c         compute the reciprocal of density, and the kinetic energy, 
+c         and the speed of sound. 
+c---------------------------------------------------------------------
+
+          do    k = -1, cell_size(3,c)
+             do    j = -1, cell_size(2,c)
+                do    i = -1, cell_size(1,c)
+                   rho_inv = 1.0d0/u(i,j,k,1,c)
+                   rho_i(i,j,k,c) = rho_inv
+                   us(i,j,k,c) = u(i,j,k,2,c) * rho_inv
+                   vs(i,j,k,c) = u(i,j,k,3,c) * rho_inv
+                   ws(i,j,k,c) = u(i,j,k,4,c) * rho_inv
+                   square(i,j,k,c)     = 0.5d0* (
+     >                        u(i,j,k,2,c)*u(i,j,k,2,c) + 
+     >                        u(i,j,k,3,c)*u(i,j,k,3,c) +
+     >                        u(i,j,k,4,c)*u(i,j,k,4,c) ) * rho_inv
+                   qs(i,j,k,c) = square(i,j,k,c) * rho_inv
+c---------------------------------------------------------------------
+c                  (don't need speed and ainv until the lhs computation)
+c---------------------------------------------------------------------
+                   aux = c1c2*rho_inv* (u(i,j,k,5,c) - square(i,j,k,c))
+                   aux = dsqrt(aux)
+                   speed(i,j,k,c) = aux
+                   ainv(i,j,k,c)  = 1.0d0/aux
+                end do
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c copy the exact forcing term to the right hand side;  because 
+c this forcing term is known, we can store it on the whole of every 
+c cell,  including the boundary                   
+c---------------------------------------------------------------------
+
+          do   m = 1, 5
+             do   k = 0, cell_size(3,c)-1
+                do   j = 0, cell_size(2,c)-1
+                   do   i = 0, cell_size(1,c)-1
+                      rhs(i,j,k,m,c) = forcing(i,j,k,m,c)
+                   end do
+                end do
+             end do
+          end do
+
+
+c---------------------------------------------------------------------
+c         compute xi-direction fluxes 
+c---------------------------------------------------------------------
+          do    k = start(3,c), cell_size(3,c)-end(3,c)-1
+             do    j = start(2,c), cell_size(2,c)-end(2,c)-1
+                do    i = start(1,c), cell_size(1,c)-end(1,c)-1
+                   uijk = us(i,j,k,c)
+                   up1  = us(i+1,j,k,c)
+                   um1  = us(i-1,j,k,c)
+
+                   rhs(i,j,k,1,c) = rhs(i,j,k,1,c) + dx1tx1 * 
+     >                    (u(i+1,j,k,1,c) - 2.0d0*u(i,j,k,1,c) + 
+     >                     u(i-1,j,k,1,c)) -
+     >                    tx2 * (u(i+1,j,k,2,c) - u(i-1,j,k,2,c))
+
+                   rhs(i,j,k,2,c) = rhs(i,j,k,2,c) + dx2tx1 * 
+     >                    (u(i+1,j,k,2,c) - 2.0d0*u(i,j,k,2,c) + 
+     >                     u(i-1,j,k,2,c)) +
+     >                    xxcon2*con43 * (up1 - 2.0d0*uijk + um1) -
+     >                    tx2 * (u(i+1,j,k,2,c)*up1 - 
+     >                           u(i-1,j,k,2,c)*um1 +
+     >                           (u(i+1,j,k,5,c)- square(i+1,j,k,c)-
+     >                            u(i-1,j,k,5,c)+ square(i-1,j,k,c))*
+     >                            c2)
+
+                   rhs(i,j,k,3,c) = rhs(i,j,k,3,c) + dx3tx1 * 
+     >                    (u(i+1,j,k,3,c) - 2.0d0*u(i,j,k,3,c) +
+     >                     u(i-1,j,k,3,c)) +
+     >                    xxcon2 * (vs(i+1,j,k,c) - 2.0d0*vs(i,j,k,c) +
+     >                              vs(i-1,j,k,c)) -
+     >                    tx2 * (u(i+1,j,k,3,c)*up1 - 
+     >                           u(i-1,j,k,3,c)*um1)
+
+                   rhs(i,j,k,4,c) = rhs(i,j,k,4,c) + dx4tx1 * 
+     >                    (u(i+1,j,k,4,c) - 2.0d0*u(i,j,k,4,c) +
+     >                     u(i-1,j,k,4,c)) +
+     >                    xxcon2 * (ws(i+1,j,k,c) - 2.0d0*ws(i,j,k,c) +
+     >                              ws(i-1,j,k,c)) -
+     >                    tx2 * (u(i+1,j,k,4,c)*up1 - 
+     >                           u(i-1,j,k,4,c)*um1)
+
+                   rhs(i,j,k,5,c) = rhs(i,j,k,5,c) + dx5tx1 * 
+     >                    (u(i+1,j,k,5,c) - 2.0d0*u(i,j,k,5,c) +
+     >                     u(i-1,j,k,5,c)) +
+     >                    xxcon3 * (qs(i+1,j,k,c) - 2.0d0*qs(i,j,k,c) +
+     >                              qs(i-1,j,k,c)) +
+     >                    xxcon4 * (up1*up1 -       2.0d0*uijk*uijk + 
+     >                              um1*um1) +
+     >                    xxcon5 * (u(i+1,j,k,5,c)*rho_i(i+1,j,k,c) - 
+     >                              2.0d0*u(i,j,k,5,c)*rho_i(i,j,k,c) +
+     >                              u(i-1,j,k,5,c)*rho_i(i-1,j,k,c)) -
+     >                    tx2 * ( (c1*u(i+1,j,k,5,c) - 
+     >                             c2*square(i+1,j,k,c))*up1 -
+     >                            (c1*u(i-1,j,k,5,c) - 
+     >                             c2*square(i-1,j,k,c))*um1 )
+                end do
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c         add fourth order xi-direction dissipation               
+c---------------------------------------------------------------------
+          if (start(1,c) .gt. 0) then
+             i = 1
+             do    m = 1, 5
+                do    k = start(3,c), cell_size(3,c)-end(3,c)-1
+                   do    j = start(2,c), cell_size(2,c)-end(2,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c)- dssp * 
+     >                    ( 5.0d0*u(i,j,k,m,c) - 4.0d0*u(i+1,j,k,m,c) +
+     >                            u(i+2,j,k,m,c))
+                   end do
+                end do
+             end do
+
+             i = 2
+             do    m = 1, 5
+                do    k = start(3,c), cell_size(3,c)-end(3,c)-1
+                   do    j = start(2,c), cell_size(2,c)-end(2,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - dssp * 
+     >                    (-4.0d0*u(i-1,j,k,m,c) + 6.0d0*u(i,j,k,m,c) -
+     >                      4.0d0*u(i+1,j,k,m,c) + u(i+2,j,k,m,c))
+                   end do
+                end do
+             end do
+          endif
+
+          do     m = 1, 5
+             do     k = start(3,c), cell_size(3,c)-end(3,c)-1
+                do     j = start(2,c), cell_size(2,c)-end(2,c)-1
+                   do  i = 3*start(1,c),cell_size(1,c)-3*end(1,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - dssp * 
+     >                    (  u(i-2,j,k,m,c) - 4.0d0*u(i-1,j,k,m,c) + 
+     >                     6.0*u(i,j,k,m,c) - 4.0d0*u(i+1,j,k,m,c) + 
+     >                         u(i+2,j,k,m,c) )
+                   end do
+                end do
+             end do
+          end do
+ 
+
+          if (end(1,c) .gt. 0) then
+             i = cell_size(1,c)-3
+             do     m = 1, 5
+                do     k = start(3,c), cell_size(3,c)-end(3,c)-1
+                   do     j = start(2,c), cell_size(2,c)-end(2,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - dssp *
+     >                    ( u(i-2,j,k,m,c) - 4.0d0*u(i-1,j,k,m,c) + 
+     >                      6.0d0*u(i,j,k,m,c) - 4.0d0*u(i+1,j,k,m,c) )
+                   end do
+                end do
+             end do
+
+             i = cell_size(1,c)-2
+             do     m = 1, 5
+                do     k = start(3,c), cell_size(3,c)-end(3,c)-1
+                   do     j = start(2,c), cell_size(2,c)-end(2,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - dssp *
+     >                    ( u(i-2,j,k,m,c) - 4.d0*u(i-1,j,k,m,c) +
+     >                      5.d0*u(i,j,k,m,c) )
+                   end do
+                end do
+             end do
+          endif
+
+c---------------------------------------------------------------------
+c         compute eta-direction fluxes 
+c---------------------------------------------------------------------
+          do     k = start(3,c), cell_size(3,c)-end(3,c)-1
+             do     j = start(2,c), cell_size(2,c)-end(2,c)-1
+                do     i = start(1,c), cell_size(1,c)-end(1,c)-1
+                   vijk = vs(i,j,k,c)
+                   vp1  = vs(i,j+1,k,c)
+                   vm1  = vs(i,j-1,k,c)
+                   rhs(i,j,k,1,c) = rhs(i,j,k,1,c) + dy1ty1 * 
+     >                   (u(i,j+1,k,1,c) - 2.0d0*u(i,j,k,1,c) + 
+     >                    u(i,j-1,k,1,c)) -
+     >                   ty2 * (u(i,j+1,k,3,c) - u(i,j-1,k,3,c))
+                   rhs(i,j,k,2,c) = rhs(i,j,k,2,c) + dy2ty1 * 
+     >                   (u(i,j+1,k,2,c) - 2.0d0*u(i,j,k,2,c) + 
+     >                    u(i,j-1,k,2,c)) +
+     >                   yycon2 * (us(i,j+1,k,c) - 2.0d0*us(i,j,k,c) + 
+     >                             us(i,j-1,k,c)) -
+     >                   ty2 * (u(i,j+1,k,2,c)*vp1 - 
+     >                          u(i,j-1,k,2,c)*vm1)
+                   rhs(i,j,k,3,c) = rhs(i,j,k,3,c) + dy3ty1 * 
+     >                   (u(i,j+1,k,3,c) - 2.0d0*u(i,j,k,3,c) + 
+     >                    u(i,j-1,k,3,c)) +
+     >                   yycon2*con43 * (vp1 - 2.0d0*vijk + vm1) -
+     >                   ty2 * (u(i,j+1,k,3,c)*vp1 - 
+     >                          u(i,j-1,k,3,c)*vm1 +
+     >                          (u(i,j+1,k,5,c) - square(i,j+1,k,c) - 
+     >                           u(i,j-1,k,5,c) + square(i,j-1,k,c))
+     >                          *c2)
+                   rhs(i,j,k,4,c) = rhs(i,j,k,4,c) + dy4ty1 * 
+     >                   (u(i,j+1,k,4,c) - 2.0d0*u(i,j,k,4,c) + 
+     >                    u(i,j-1,k,4,c)) +
+     >                   yycon2 * (ws(i,j+1,k,c) - 2.0d0*ws(i,j,k,c) + 
+     >                             ws(i,j-1,k,c)) -
+     >                   ty2 * (u(i,j+1,k,4,c)*vp1 - 
+     >                          u(i,j-1,k,4,c)*vm1)
+                   rhs(i,j,k,5,c) = rhs(i,j,k,5,c) + dy5ty1 * 
+     >                   (u(i,j+1,k,5,c) - 2.0d0*u(i,j,k,5,c) + 
+     >                    u(i,j-1,k,5,c)) +
+     >                   yycon3 * (qs(i,j+1,k,c) - 2.0d0*qs(i,j,k,c) + 
+     >                             qs(i,j-1,k,c)) +
+     >                   yycon4 * (vp1*vp1       - 2.0d0*vijk*vijk + 
+     >                             vm1*vm1) +
+     >                   yycon5 * (u(i,j+1,k,5,c)*rho_i(i,j+1,k,c) - 
+     >                             2.0d0*u(i,j,k,5,c)*rho_i(i,j,k,c) +
+     >                             u(i,j-1,k,5,c)*rho_i(i,j-1,k,c)) -
+     >                   ty2 * ((c1*u(i,j+1,k,5,c) - 
+     >                           c2*square(i,j+1,k,c)) * vp1 -
+     >                          (c1*u(i,j-1,k,5,c) - 
+     >                           c2*square(i,j-1,k,c)) * vm1)
+                end do
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c         add fourth order eta-direction dissipation         
+c---------------------------------------------------------------------
+          if (start(2,c) .gt. 0) then
+             j = 1
+             do     m = 1, 5
+                do     k = start(3,c), cell_size(3,c)-end(3,c)-1
+                   do     i = start(1,c), cell_size(1,c)-end(1,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c)- dssp * 
+     >                    ( 5.0d0*u(i,j,k,m,c) - 4.0d0*u(i,j+1,k,m,c) +
+     >                            u(i,j+2,k,m,c))
+                   end do
+                end do
+             end do
+
+             j = 2
+             do     m = 1, 5
+                do     k = start(3,c), cell_size(3,c)-end(3,c)-1
+                   do     i = start(1,c), cell_size(1,c)-end(1,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - dssp * 
+     >                    (-4.0d0*u(i,j-1,k,m,c) + 6.0d0*u(i,j,k,m,c) -
+     >                      4.0d0*u(i,j+1,k,m,c) + u(i,j+2,k,m,c))
+                   end do
+                end do
+             end do
+          endif
+
+          do     m = 1, 5
+             do     k = start(3,c), cell_size(3,c)-end(3,c)-1
+                do    j = 3*start(2,c), cell_size(2,c)-3*end(2,c)-1
+                   do  i = start(1,c),cell_size(1,c)-end(1,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - dssp * 
+     >                    (  u(i,j-2,k,m,c) - 4.0d0*u(i,j-1,k,m,c) + 
+     >                     6.0d0*u(i,j,k,m,c) - 4.0d0*u(i,j+1,k,m,c) +
+     >                         u(i,j+2,k,m,c) )
+                   end do
+                end do
+             end do
+          end do
+ 
+          if (end(2,c) .gt. 0) then
+             j = cell_size(2,c)-3
+             do     m = 1, 5
+                do     k = start(3,c), cell_size(3,c)-end(3,c)-1
+                   do     i = start(1,c), cell_size(1,c)-end(1,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - dssp *
+     >                    ( u(i,j-2,k,m,c) - 4.0d0*u(i,j-1,k,m,c) + 
+     >                      6.0d0*u(i,j,k,m,c) - 4.0d0*u(i,j+1,k,m,c) )
+                   end do
+                end do
+             end do
+
+             j = cell_size(2,c)-2
+             do     m = 1, 5
+                do     k = start(3,c), cell_size(3,c)-end(3,c)-1
+                   do     i = start(1,c), cell_size(1,c)-end(1,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - dssp *
+     >                    ( u(i,j-2,k,m,c) - 4.d0*u(i,j-1,k,m,c) +
+     >                      5.d0*u(i,j,k,m,c) )
+                   end do
+                end do
+             end do
+          endif
+
+
+c---------------------------------------------------------------------
+c         compute zeta-direction fluxes 
+c---------------------------------------------------------------------
+          do    k = start(3,c), cell_size(3,c)-end(3,c)-1
+             do     j = start(2,c), cell_size(2,c)-end(2,c)-1
+                do     i = start(1,c), cell_size(1,c)-end(1,c)-1
+                   wijk = ws(i,j,k,c)
+                   wp1  = ws(i,j,k+1,c)
+                   wm1  = ws(i,j,k-1,c)
+
+                   rhs(i,j,k,1,c) = rhs(i,j,k,1,c) + dz1tz1 * 
+     >                   (u(i,j,k+1,1,c) - 2.0d0*u(i,j,k,1,c) + 
+     >                    u(i,j,k-1,1,c)) -
+     >                   tz2 * (u(i,j,k+1,4,c) - u(i,j,k-1,4,c))
+                   rhs(i,j,k,2,c) = rhs(i,j,k,2,c) + dz2tz1 * 
+     >                   (u(i,j,k+1,2,c) - 2.0d0*u(i,j,k,2,c) + 
+     >                    u(i,j,k-1,2,c)) +
+     >                   zzcon2 * (us(i,j,k+1,c) - 2.0d0*us(i,j,k,c) + 
+     >                             us(i,j,k-1,c)) -
+     >                   tz2 * (u(i,j,k+1,2,c)*wp1 - 
+     >                          u(i,j,k-1,2,c)*wm1)
+                   rhs(i,j,k,3,c) = rhs(i,j,k,3,c) + dz3tz1 * 
+     >                   (u(i,j,k+1,3,c) - 2.0d0*u(i,j,k,3,c) + 
+     >                    u(i,j,k-1,3,c)) +
+     >                   zzcon2 * (vs(i,j,k+1,c) - 2.0d0*vs(i,j,k,c) + 
+     >                             vs(i,j,k-1,c)) -
+     >                   tz2 * (u(i,j,k+1,3,c)*wp1 - 
+     >                          u(i,j,k-1,3,c)*wm1)
+                   rhs(i,j,k,4,c) = rhs(i,j,k,4,c) + dz4tz1 * 
+     >                   (u(i,j,k+1,4,c) - 2.0d0*u(i,j,k,4,c) + 
+     >                    u(i,j,k-1,4,c)) +
+     >                   zzcon2*con43 * (wp1 - 2.0d0*wijk + wm1) -
+     >                   tz2 * (u(i,j,k+1,4,c)*wp1 - 
+     >                          u(i,j,k-1,4,c)*wm1 +
+     >                          (u(i,j,k+1,5,c) - square(i,j,k+1,c) - 
+     >                           u(i,j,k-1,5,c) + square(i,j,k-1,c))
+     >                          *c2)
+                   rhs(i,j,k,5,c) = rhs(i,j,k,5,c) + dz5tz1 * 
+     >                   (u(i,j,k+1,5,c) - 2.0d0*u(i,j,k,5,c) + 
+     >                    u(i,j,k-1,5,c)) +
+     >                   zzcon3 * (qs(i,j,k+1,c) - 2.0d0*qs(i,j,k,c) + 
+     >                             qs(i,j,k-1,c)) +
+     >                   zzcon4 * (wp1*wp1 - 2.0d0*wijk*wijk + 
+     >                             wm1*wm1) +
+     >                   zzcon5 * (u(i,j,k+1,5,c)*rho_i(i,j,k+1,c) - 
+     >                             2.0d0*u(i,j,k,5,c)*rho_i(i,j,k,c) +
+     >                             u(i,j,k-1,5,c)*rho_i(i,j,k-1,c)) -
+     >                   tz2 * ( (c1*u(i,j,k+1,5,c) - 
+     >                            c2*square(i,j,k+1,c))*wp1 -
+     >                           (c1*u(i,j,k-1,5,c) - 
+     >                            c2*square(i,j,k-1,c))*wm1)
+                end do
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c         add fourth order zeta-direction dissipation                
+c---------------------------------------------------------------------
+          if (start(3,c) .gt. 0) then
+             k = 1
+             do     m = 1, 5
+                do     j = start(2,c), cell_size(2,c)-end(2,c)-1
+                   do     i = start(1,c), cell_size(1,c)-end(1,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c)- dssp * 
+     >                    ( 5.0d0*u(i,j,k,m,c) - 4.0d0*u(i,j,k+1,m,c) +
+     >                            u(i,j,k+2,m,c))
+                   end do
+                end do
+             end do
+
+             k = 2
+             do     m = 1, 5
+                do     j = start(2,c), cell_size(2,c)-end(2,c)-1
+                   do     i = start(1,c), cell_size(1,c)-end(1,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - dssp * 
+     >                    (-4.0d0*u(i,j,k-1,m,c) + 6.0d0*u(i,j,k,m,c) -
+     >                      4.0d0*u(i,j,k+1,m,c) + u(i,j,k+2,m,c))
+                   end do
+                end do
+             end do
+          endif
+
+          do     m = 1, 5
+             do     k = 3*start(3,c), cell_size(3,c)-3*end(3,c)-1
+                do     j = start(2,c), cell_size(2,c)-end(2,c)-1
+                   do     i = start(1,c),cell_size(1,c)-end(1,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - dssp * 
+     >                    (  u(i,j,k-2,m,c) - 4.0d0*u(i,j,k-1,m,c) + 
+     >                     6.0d0*u(i,j,k,m,c) - 4.0d0*u(i,j,k+1,m,c) +
+     >                         u(i,j,k+2,m,c) )
+                   end do
+                end do
+             end do
+          end do
+ 
+          if (end(3,c) .gt. 0) then
+             k = cell_size(3,c)-3
+             do     m = 1, 5
+                do     j = start(2,c), cell_size(2,c)-end(2,c)-1
+                   do     i = start(1,c), cell_size(1,c)-end(1,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - dssp *
+     >                    ( u(i,j,k-2,m,c) - 4.0d0*u(i,j,k-1,m,c) + 
+     >                      6.0d0*u(i,j,k,m,c) - 4.0d0*u(i,j,k+1,m,c) )
+                   end do
+                end do
+             end do
+
+             k = cell_size(3,c)-2
+             do     m = 1, 5
+                do     j = start(2,c), cell_size(2,c)-end(2,c)-1
+                   do     i = start(1,c), cell_size(1,c)-end(1,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - dssp *
+     >                    ( u(i,j,k-2,m,c) - 4.d0*u(i,j,k-1,m,c) +
+     >                      5.d0*u(i,j,k,m,c) )
+                   end do
+                end do
+             end do
+          endif
+
+          do     m = 1, 5
+             do     k = start(3,c), cell_size(3,c)-end(3,c)-1
+                do     j = start(2,c), cell_size(2,c)-end(2,c)-1
+                   do    i = start(1,c), cell_size(1,c)-end(1,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) * dt
+                   end do
+                end do
+             end do
+          end do
+
+       end do
+    
+       if (timeron) call timer_stop(t_rhs)
+
+       return
+       end
+
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/set_constants.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/set_constants.f
new file mode 100644
index 0000000..63ce72b
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/set_constants.f
@@ -0,0 +1,203 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine  set_constants
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       include 'header.h'
+  
+       ce(1,1)  = 2.0d0
+       ce(1,2)  = 0.0d0
+       ce(1,3)  = 0.0d0
+       ce(1,4)  = 4.0d0
+       ce(1,5)  = 5.0d0
+       ce(1,6)  = 3.0d0
+       ce(1,7)  = 0.5d0
+       ce(1,8)  = 0.02d0
+       ce(1,9)  = 0.01d0
+       ce(1,10) = 0.03d0
+       ce(1,11) = 0.5d0
+       ce(1,12) = 0.4d0
+       ce(1,13) = 0.3d0
+ 
+       ce(2,1)  = 1.0d0
+       ce(2,2)  = 0.0d0
+       ce(2,3)  = 0.0d0
+       ce(2,4)  = 0.0d0
+       ce(2,5)  = 1.0d0
+       ce(2,6)  = 2.0d0
+       ce(2,7)  = 3.0d0
+       ce(2,8)  = 0.01d0
+       ce(2,9)  = 0.03d0
+       ce(2,10) = 0.02d0
+       ce(2,11) = 0.4d0
+       ce(2,12) = 0.3d0
+       ce(2,13) = 0.5d0
+
+       ce(3,1)  = 2.0d0
+       ce(3,2)  = 2.0d0
+       ce(3,3)  = 0.0d0
+       ce(3,4)  = 0.0d0
+       ce(3,5)  = 0.0d0
+       ce(3,6)  = 2.0d0
+       ce(3,7)  = 3.0d0
+       ce(3,8)  = 0.04d0
+       ce(3,9)  = 0.03d0
+       ce(3,10) = 0.05d0
+       ce(3,11) = 0.3d0
+       ce(3,12) = 0.5d0
+       ce(3,13) = 0.4d0
+
+       ce(4,1)  = 2.0d0
+       ce(4,2)  = 2.0d0
+       ce(4,3)  = 0.0d0
+       ce(4,4)  = 0.0d0
+       ce(4,5)  = 0.0d0
+       ce(4,6)  = 2.0d0
+       ce(4,7)  = 3.0d0
+       ce(4,8)  = 0.03d0
+       ce(4,9)  = 0.05d0
+       ce(4,10) = 0.04d0
+       ce(4,11) = 0.2d0
+       ce(4,12) = 0.1d0
+       ce(4,13) = 0.3d0
+
+       ce(5,1)  = 5.0d0
+       ce(5,2)  = 4.0d0
+       ce(5,3)  = 3.0d0
+       ce(5,4)  = 2.0d0
+       ce(5,5)  = 0.1d0
+       ce(5,6)  = 0.4d0
+       ce(5,7)  = 0.3d0
+       ce(5,8)  = 0.05d0
+       ce(5,9)  = 0.04d0
+       ce(5,10) = 0.03d0
+       ce(5,11) = 0.1d0
+       ce(5,12) = 0.3d0
+       ce(5,13) = 0.2d0
+
+       c1 = 1.4d0
+       c2 = 0.4d0
+       c3 = 0.1d0
+       c4 = 1.0d0
+       c5 = 1.4d0
+
+       bt = dsqrt(0.5d0)
+
+       dnxm1 = 1.0d0 / dble(grid_points(1)-1)
+       dnym1 = 1.0d0 / dble(grid_points(2)-1)
+       dnzm1 = 1.0d0 / dble(grid_points(3)-1)
+
+       c1c2 = c1 * c2
+       c1c5 = c1 * c5
+       c3c4 = c3 * c4
+       c1345 = c1c5 * c3c4
+
+       conz1 = (1.0d0-c1c5)
+
+       tx1 = 1.0d0 / (dnxm1 * dnxm1)
+       tx2 = 1.0d0 / (2.0d0 * dnxm1)
+       tx3 = 1.0d0 / dnxm1
+
+       ty1 = 1.0d0 / (dnym1 * dnym1)
+       ty2 = 1.0d0 / (2.0d0 * dnym1)
+       ty3 = 1.0d0 / dnym1
+ 
+       tz1 = 1.0d0 / (dnzm1 * dnzm1)
+       tz2 = 1.0d0 / (2.0d0 * dnzm1)
+       tz3 = 1.0d0 / dnzm1
+
+       dx1 = 0.75d0
+       dx2 = 0.75d0
+       dx3 = 0.75d0
+       dx4 = 0.75d0
+       dx5 = 0.75d0
+
+       dy1 = 0.75d0
+       dy2 = 0.75d0
+       dy3 = 0.75d0
+       dy4 = 0.75d0
+       dy5 = 0.75d0
+
+       dz1 = 1.0d0
+       dz2 = 1.0d0
+       dz3 = 1.0d0
+       dz4 = 1.0d0
+       dz5 = 1.0d0
+
+       dxmax = dmax1(dx3, dx4)
+       dymax = dmax1(dy2, dy4)
+       dzmax = dmax1(dz2, dz3)
+
+       dssp = 0.25d0 * dmax1(dx1, dmax1(dy1, dz1) )
+
+       c4dssp = 4.0d0 * dssp
+       c5dssp = 5.0d0 * dssp
+
+       dttx1 = dt*tx1
+       dttx2 = dt*tx2
+       dtty1 = dt*ty1
+       dtty2 = dt*ty2
+       dttz1 = dt*tz1
+       dttz2 = dt*tz2
+
+       c2dttx1 = 2.0d0*dttx1
+       c2dtty1 = 2.0d0*dtty1
+       c2dttz1 = 2.0d0*dttz1
+
+       dtdssp = dt*dssp
+
+       comz1  = dtdssp
+       comz4  = 4.0d0*dtdssp
+       comz5  = 5.0d0*dtdssp
+       comz6  = 6.0d0*dtdssp
+
+       c3c4tx3 = c3c4*tx3
+       c3c4ty3 = c3c4*ty3
+       c3c4tz3 = c3c4*tz3
+
+       dx1tx1 = dx1*tx1
+       dx2tx1 = dx2*tx1
+       dx3tx1 = dx3*tx1
+       dx4tx1 = dx4*tx1
+       dx5tx1 = dx5*tx1
+        
+       dy1ty1 = dy1*ty1
+       dy2ty1 = dy2*ty1
+       dy3ty1 = dy3*ty1
+       dy4ty1 = dy4*ty1
+       dy5ty1 = dy5*ty1
+        
+       dz1tz1 = dz1*tz1
+       dz2tz1 = dz2*tz1
+       dz3tz1 = dz3*tz1
+       dz4tz1 = dz4*tz1
+       dz5tz1 = dz5*tz1
+
+       c2iv  = 2.5d0
+       con43 = 4.0d0/3.0d0
+       con16 = 1.0d0/6.0d0
+        
+       xxcon1 = c3c4tx3*con43*tx3
+       xxcon2 = c3c4tx3*tx3
+       xxcon3 = c3c4tx3*conz1*tx3
+       xxcon4 = c3c4tx3*con16*tx3
+       xxcon5 = c3c4tx3*c1c5*tx3
+
+       yycon1 = c3c4ty3*con43*ty3
+       yycon2 = c3c4ty3*ty3
+       yycon3 = c3c4ty3*conz1*ty3
+       yycon4 = c3c4ty3*con16*ty3
+       yycon5 = c3c4ty3*c1c5*ty3
+
+       zzcon1 = c3c4tz3*con43*tz3
+       zzcon2 = c3c4tz3*tz3
+       zzcon3 = c3c4tz3*conz1*tz3
+       zzcon4 = c3c4tz3*con16*tz3
+       zzcon5 = c3c4tz3*c1c5*tz3
+
+       return
+       end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/setup_mpi.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/setup_mpi.f
new file mode 100644
index 0000000..2d98f7d
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/setup_mpi.f
@@ -0,0 +1,65 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine setup_mpi
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c set up MPI stuff
+c---------------------------------------------------------------------
+
+      implicit none
+      include 'mpinpb.h'
+      include 'npbparams.h'
+      integer error, nc, color
+
+      call mpi_init(error)
+      
+      call mpi_comm_size(MPI_COMM_WORLD, total_nodes, error)
+      call mpi_comm_rank(MPI_COMM_WORLD, node, error)
+
+      if (.not. convertdouble) then
+         dp_type = MPI_DOUBLE_PRECISION
+      else
+         dp_type = MPI_REAL
+      endif
+
+c---------------------------------------------------------------------
+c     compute square root; add small number to allow for roundoff
+c---------------------------------------------------------------------
+      nc = dint(dsqrt(dble(total_nodes) + 0.00001d0))
+
+c---------------------------------------------------------------------
+c We handle a non-square number of nodes by making the excess nodes
+c inactive. However, we can never handle more cells than were compiled
+c in. 
+c---------------------------------------------------------------------
+
+      if (nc .gt. maxcells) nc = maxcells
+
+      if (node .ge. nc*nc) then
+         active = .false.
+         color = 1
+      else
+         active = .true.
+         color = 0
+      end if
+      
+      call mpi_comm_split(MPI_COMM_WORLD,color,node,comm_setup,error)
+      if (.not. active) return
+
+      call mpi_comm_size(comm_setup, no_nodes, error)
+      call mpi_comm_dup(comm_setup, comm_solve, error)
+      call mpi_comm_dup(comm_setup, comm_rhs, error)
+      
+c---------------------------------------------------------------------
+c     let node 0 be the root for the group (there is only one)
+c---------------------------------------------------------------------
+      root = 0
+
+      return
+      end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/sp.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/sp.f
new file mode 100644
index 0000000..29ea282
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/sp.f
@@ -0,0 +1,245 @@
+!-------------------------------------------------------------------------!
+!                                                                         !
+!        N  A  S     P A R A L L E L     B E N C H M A R K S  3.3         !
+!                                                                         !
+!                                   S P                                   !
+!                                                                         !
+!-------------------------------------------------------------------------!
+!                                                                         !
+!    This benchmark is part of the NAS Parallel Benchmark 3.3 suite.      !
+!    It is described in NAS Technical Reports 95-020 and 02-007           !
+!                                                                         !
+!    Permission to use, copy, distribute and modify this software         !
+!    for any purpose with or without fee is hereby granted.  We           !
+!    request, however, that all derived work reference the NAS            !
+!    Parallel Benchmarks 3.3. This software is provided "as is"           !
+!    without express or implied warranty.                                 !
+!                                                                         !
+!    Information on NPB 3.3, including the technical report, the          !
+!    original specifications, source code, results and information        !
+!    on how to submit new results, is available at:                       !
+!                                                                         !
+!           http://www.nas.nasa.gov/Software/NPB/                         !
+!                                                                         !
+!    Send comments or suggestions to  npb@nas.nasa.gov                    !
+!                                                                         !
+!          NAS Parallel Benchmarks Group                                  !
+!          NASA Ames Research Center                                      !
+!          Mail Stop: T27A-1                                              !
+!          Moffett Field, CA   94035-1000                                 !
+!                                                                         !
+!          E-mail:  npb@nas.nasa.gov                                      !
+!          Fax:     (650) 604-3957                                        !
+!                                                                         !
+!-------------------------------------------------------------------------!
+
+
+c---------------------------------------------------------------------
+c
+c Authors: R. F. Van der Wijngaart
+c          W. Saphir
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+       program MPSP
+c---------------------------------------------------------------------
+
+       include  'header.h'
+       include  'mpinpb.h'
+      
+       integer          i, niter, step, c, error, fstatus
+       external timer_read
+       double precision mflops, t, tmax, timer_read
+       logical          verified
+       character        class
+       double precision tsum(t_last+2), t1(t_last+2),
+     >                  tming(t_last+2), tmaxg(t_last+2)
+       character        t_recs(t_last+2)*8
+
+       data t_recs/'total', 'rhs', 'xsolve', 'ysolve', 'zsolve', 
+     >             'bpack', 'exch', 'xcomm', 'ycomm', 'zcomm',
+     >             ' totcomp', ' totcomm'/
+
+       call setup_mpi
+       if (.not. active) goto 999
+
+c---------------------------------------------------------------------
+c      Root node reads input file (if it exists) else takes
+c      defaults from parameters
+c---------------------------------------------------------------------
+       if (node .eq. root) then
+          
+          write(*, 1000)
+
+          open (unit=2,file='timer.flag',status='old',iostat=fstatus)
+          timeron = .false.
+          if (fstatus .eq. 0) then
+             timeron = .true.
+             close(2)
+          endif
+
+          open (unit=2,file='inputsp.data',status='old', iostat=fstatus)
+c
+          if (fstatus .eq. 0) then
+            write(*,233) 
+ 233        format(' Reading from input file inputsp.data')
+            read (2,*) niter
+            read (2,*) dt
+            read (2,*) grid_points(1), grid_points(2), grid_points(3)
+            close(2)
+          else
+            write(*,234) 
+            niter = niter_default
+            dt    = dt_default
+            grid_points(1) = problem_size
+            grid_points(2) = problem_size
+            grid_points(3) = problem_size
+          endif
+ 234      format(' No input file inputsp.data. Using compiled defaults')
+
+          write(*, 1001) grid_points(1), grid_points(2), grid_points(3)
+          write(*, 1002) niter, dt
+          if (no_nodes .ne. total_nodes) write(*, 1004) total_nodes
+          if (no_nodes .ne. maxcells*maxcells) 
+     >        write(*, 1005) maxcells*maxcells
+          write(*, 1003) no_nodes
+
+ 1000 format(//,' NAS Parallel Benchmarks 3.3 -- SP Benchmark',/)
+ 1001     format(' Size: ', i4, 'x', i4, 'x', i4)
+ 1002     format(' Iterations: ', i4, '    dt: ', F11.7)
+ 1004     format(' Total number of processes: ', i5)
+ 1005     format(' WARNING: compiled for ', i5, ' processes ')
+ 1003     format(' Number of active processes: ', i5, /)
+
+       endif
+
+       call mpi_bcast(niter, 1, MPI_INTEGER, 
+     >                root, comm_setup, error)
+
+       call mpi_bcast(dt, 1, dp_type, 
+     >                root, comm_setup, error)
+
+       call mpi_bcast(grid_points(1), 3, MPI_INTEGER, 
+     >                root, comm_setup, error)
+
+       call mpi_bcast(timeron, 1, MPI_LOGICAL, 
+     >                root, comm_setup, error)
+
+
+       call make_set
+
+       do  c = 1, ncells
+          if ( (cell_size(1,c) .gt. IMAX) .or.
+     >         (cell_size(2,c) .gt. JMAX) .or.
+     >         (cell_size(3,c) .gt. KMAX) ) then
+             print *,node, c, (cell_size(i,c),i=1,3)
+             print *,' Problem size too big for compiled array sizes'
+             goto 999
+          endif
+       end do
+
+       do  i = 1, t_last
+          call timer_clear(i)
+       end do
+
+       call set_constants
+
+       call initialize
+
+       call lhsinit
+
+       call exact_rhs
+
+       call compute_buffer_size(5)
+
+c---------------------------------------------------------------------
+c      do one time step to touch all code, and reinitialize
+c---------------------------------------------------------------------
+       call adi
+       call initialize
+
+c---------------------------------------------------------------------
+c      Synchronize before placing time stamp
+c---------------------------------------------------------------------
+       do  i = 1, t_last
+          call timer_clear(i)
+       end do
+       call mpi_barrier(comm_setup, error)
+
+       call timer_clear(1)
+       call timer_start(1)
+
+       do  step = 1, niter
+
+          if (node .eq. root) then
+             if (mod(step, 20) .eq. 0 .or. 
+     >           step .eq. 1) then
+                write(*, 200) step
+ 200            format(' Time step ', i4)
+              endif
+          endif
+
+          call adi
+
+       end do
+
+       call timer_stop(1)
+       t = timer_read(1)
+       
+       call verify(niter, class, verified)
+
+       call mpi_reduce(t, tmax, 1, 
+     >                 dp_type, MPI_MAX, 
+     >                 root, comm_setup, error)
+
+       if( node .eq. root ) then
+          if( tmax .ne. 0. ) then
+             mflops = (881.174*float( problem_size )**3
+     >                -4683.91*float( problem_size )**2
+     >                +11484.5*float( problem_size )
+     >                -19272.4) * float( niter ) / (tmax*1000000.0d0)
+          else
+             mflops = 0.0
+          endif
+
+         call print_results('SP', class, grid_points(1), 
+     >     grid_points(2), grid_points(3), niter, maxcells*maxcells, 
+     >     total_nodes, tmax, mflops, '          floating point', 
+     >     verified, npbversion,compiletime, cs1, cs2, cs3, cs4, cs5, 
+     >     cs6, '(none)')
+       endif
+
+       if (.not.timeron) goto 999
+
+       do i = 1, t_last
+          t1(i) = timer_read(i)
+       end do
+       t1(t_xsolve) = t1(t_xsolve) - t1(t_xcomm)
+       t1(t_ysolve) = t1(t_ysolve) - t1(t_ycomm)
+       t1(t_zsolve) = t1(t_zsolve) - t1(t_zcomm)
+       t1(t_last+2) = t1(t_xcomm)+t1(t_ycomm)+t1(t_zcomm)+t1(t_exch)
+       t1(t_last+1) = t1(t_total)  - t1(t_last+2)
+
+       call MPI_Reduce(t1, tsum,  t_last+2, dp_type, MPI_SUM, 
+     >                 0, comm_setup, error)
+       call MPI_Reduce(t1, tming, t_last+2, dp_type, MPI_MIN, 
+     >                 0, comm_setup, error)
+       call MPI_Reduce(t1, tmaxg, t_last+2, dp_type, MPI_MAX, 
+     >                 0, comm_setup, error)
+
+       if (node .eq. 0) then
+          write(*, 800) total_nodes
+          do i = 1, t_last+2
+             tsum(i) = tsum(i) / total_nodes
+             write(*, 810) i, t_recs(i), tming(i), tmaxg(i), tsum(i)
+          end do
+       endif
+ 800   format(' nprocs =', i6, 11x, 'minimum', 5x, 'maximum', 
+     >        5x, 'average')
+ 810   format(' timer ', i2, '(', A8, ') :', 3(2x,f10.4))
+
+ 999   continue
+       call mpi_barrier(MPI_COMM_WORLD, error)
+       call mpi_finalize(error)
+
+       end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/txinvr.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/txinvr.f
new file mode 100644
index 0000000..b5ca461
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/txinvr.f
@@ -0,0 +1,59 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine  txinvr
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c block-diagonal matrix-vector multiplication                  
+c---------------------------------------------------------------------
+
+       include 'header.h'
+
+       integer c, i, j, k
+       double precision t1, t2, t3, ac, ru1, uu, vv, ww, r1, r2, r3, 
+     >                  r4, r5, ac2inv
+
+c---------------------------------------------------------------------
+c      loop over all cells owned by this node          
+c---------------------------------------------------------------------
+       do   c = 1, ncells
+          do    k = start(3,c), cell_size(3,c)-end(3,c)-1
+             do    j = start(2,c), cell_size(2,c)-end(2,c)-1
+                do    i = start(1,c), cell_size(1,c)-end(1,c)-1
+
+                   ru1 = rho_i(i,j,k,c)
+                   uu = us(i,j,k,c)
+                   vv = vs(i,j,k,c)
+                   ww = ws(i,j,k,c)
+                   ac = speed(i,j,k,c)
+                   ac2inv = ainv(i,j,k,c)*ainv(i,j,k,c)
+
+                   r1 = rhs(i,j,k,1,c)
+                   r2 = rhs(i,j,k,2,c)
+                   r3 = rhs(i,j,k,3,c)
+                   r4 = rhs(i,j,k,4,c)
+                   r5 = rhs(i,j,k,5,c)
+
+                   t1 = c2 * ac2inv * ( qs(i,j,k,c)*r1 - uu*r2  - 
+     >                  vv*r3 - ww*r4 + r5 )
+                   t2 = bt * ru1 * ( uu * r1 - r2 )
+                   t3 = ( bt * ru1 * ac ) * t1
+
+                   rhs(i,j,k,1,c) = r1 - t1
+                   rhs(i,j,k,2,c) = - ru1 * ( ww*r1 - r4 )
+                   rhs(i,j,k,3,c) =   ru1 * ( vv*r1 - r3 )
+                   rhs(i,j,k,4,c) = - t2 + t3
+                   rhs(i,j,k,5,c) =   t2 + t3
+                end do
+             end do
+          end do
+       end do
+
+       return
+       end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/tzetar.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/tzetar.f
new file mode 100644
index 0000000..554066d
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/tzetar.f
@@ -0,0 +1,60 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine  tzetar(c)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   block-diagonal matrix-vector multiplication                       
+c---------------------------------------------------------------------
+
+       include 'header.h'
+
+       integer i, j, k, c
+       double precision  t1, t2, t3, ac, xvel, yvel, zvel, r1, r2, r3, 
+     >                   r4, r5, btuz, acinv, ac2u, uzik1
+
+c---------------------------------------------------------------------
+c      treat only one cell                                             
+c---------------------------------------------------------------------
+       do    k = start(3,c), cell_size(3,c)-end(3,c)-1
+          do    j = start(2,c), cell_size(2,c)-end(2,c)-1
+             do    i = start(1,c), cell_size(1,c)-end(1,c)-1
+
+                xvel = us(i,j,k,c)
+                yvel = vs(i,j,k,c)
+                zvel = ws(i,j,k,c)
+                ac   = speed(i,j,k,c)
+                acinv = ainv(i,j,k,c)
+
+                ac2u = ac*ac
+
+                r1 = rhs(i,j,k,1,c)
+                r2 = rhs(i,j,k,2,c)
+                r3 = rhs(i,j,k,3,c)
+                r4 = rhs(i,j,k,4,c)
+                r5 = rhs(i,j,k,5,c)      
+
+                uzik1 = u(i,j,k,1,c)
+                btuz  = bt * uzik1
+
+                t1 = btuz*acinv * (r4 + r5)
+                t2 = r3 + t1
+                t3 = btuz * (r4 - r5)
+
+                rhs(i,j,k,1,c) = t2
+                rhs(i,j,k,2,c) = -uzik1*r2 + xvel*t2
+                rhs(i,j,k,3,c) =  uzik1*r1 + yvel*t2
+                rhs(i,j,k,4,c) =  zvel*t2  + t3
+                rhs(i,j,k,5,c) =  uzik1*(-xvel*r2 + yvel*r1) + 
+     >                    qs(i,j,k,c)*t2 + c2iv*ac2u*t1 + zvel*t3
+
+             end do
+          end do
+       end do
+
+       return
+       end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/verify.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/verify.f
new file mode 100644
index 0000000..08be79c
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/verify.f
@@ -0,0 +1,358 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+        subroutine verify(no_time_steps, class, verified)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c  verification routine                         
+c---------------------------------------------------------------------
+
+        include 'header.h'
+        include 'mpinpb.h'
+
+        double precision xcrref(5),xceref(5),xcrdif(5),xcedif(5), 
+     >                   epsilon, xce(5), xcr(5), dtref
+        integer m, no_time_steps
+        character class
+        logical verified
+
+c---------------------------------------------------------------------
+c   tolerance level
+c---------------------------------------------------------------------
+        epsilon = 1.0d-08
+
+
+c---------------------------------------------------------------------
+c   compute the error norm and the residual norm, and exit if not printing
+c---------------------------------------------------------------------
+        call error_norm(xce)
+        call copy_faces
+
+        call rhs_norm(xcr)
+
+        do m = 1, 5
+           xcr(m) = xcr(m) / dt
+        enddo
+
+        if (node .ne. 0) return
+
+        class = 'U'
+        verified = .true.
+
+        do m = 1,5
+           xcrref(m) = 1.0
+           xceref(m) = 1.0
+        end do
+
+c---------------------------------------------------------------------
+c    reference data for 12X12X12 grids after 100 time steps, with DT = 1.50d-02
+c---------------------------------------------------------------------
+        if ( (grid_points(1)  .eq. 12     ) .and. 
+     >       (grid_points(2)  .eq. 12     ) .and.
+     >       (grid_points(3)  .eq. 12     ) .and.
+     >       (no_time_steps   .eq. 100    ))  then
+
+           class = 'S'
+           dtref = 1.5d-2
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+           xcrref(1) = 2.7470315451339479d-02
+           xcrref(2) = 1.0360746705285417d-02
+           xcrref(3) = 1.6235745065095532d-02
+           xcrref(4) = 1.5840557224455615d-02
+           xcrref(5) = 3.4849040609362460d-02
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+           xceref(1) = 2.7289258557377227d-05
+           xceref(2) = 1.0364446640837285d-05
+           xceref(3) = 1.6154798287166471d-05
+           xceref(4) = 1.5750704994480102d-05
+           xceref(5) = 3.4177666183390531d-05
+
+
+c---------------------------------------------------------------------
+c    reference data for 36X36X36 grids after 400 time steps, with DT = 1.5d-03
+c---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 36) .and. 
+     >           (grid_points(2) .eq. 36) .and.
+     >           (grid_points(3) .eq. 36) .and.
+     >           (no_time_steps .eq. 400) ) then
+
+           class = 'W'
+           dtref = 1.5d-3
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+           xcrref(1) = 0.1893253733584d-02
+           xcrref(2) = 0.1717075447775d-03
+           xcrref(3) = 0.2778153350936d-03
+           xcrref(4) = 0.2887475409984d-03
+           xcrref(5) = 0.3143611161242d-02
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+           xceref(1) = 0.7542088599534d-04
+           xceref(2) = 0.6512852253086d-05
+           xceref(3) = 0.1049092285688d-04
+           xceref(4) = 0.1128838671535d-04
+           xceref(5) = 0.1212845639773d-03
+
+c---------------------------------------------------------------------
+c    reference data for 64X64X64 grids after 400 time steps, with DT = 1.5d-03
+c---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 64) .and. 
+     >           (grid_points(2) .eq. 64) .and.
+     >           (grid_points(3) .eq. 64) .and.
+     >           (no_time_steps .eq. 400) ) then
+
+           class = 'A'
+           dtref = 1.5d-3
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+           xcrref(1) = 2.4799822399300195d0
+           xcrref(2) = 1.1276337964368832d0
+           xcrref(3) = 1.5028977888770491d0
+           xcrref(4) = 1.4217816211695179d0
+           xcrref(5) = 2.1292113035138280d0
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+           xceref(1) = 1.0900140297820550d-04
+           xceref(2) = 3.7343951769282091d-05
+           xceref(3) = 5.0092785406541633d-05
+           xceref(4) = 4.7671093939528255d-05
+           xceref(5) = 1.3621613399213001d-04
+
+c---------------------------------------------------------------------
+c    reference data for 102X102X102 grids after 400 time steps,
+c    with DT = 1.0d-03
+c---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 102) .and. 
+     >           (grid_points(2) .eq. 102) .and.
+     >           (grid_points(3) .eq. 102) .and.
+     >           (no_time_steps .eq. 400) ) then
+
+           class = 'B'
+           dtref = 1.0d-3
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+           xcrref(1) = 0.6903293579998d+02
+           xcrref(2) = 0.3095134488084d+02
+           xcrref(3) = 0.4103336647017d+02
+           xcrref(4) = 0.3864769009604d+02
+           xcrref(5) = 0.5643482272596d+02
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+           xceref(1) = 0.9810006190188d-02
+           xceref(2) = 0.1022827905670d-02
+           xceref(3) = 0.1720597911692d-02
+           xceref(4) = 0.1694479428231d-02
+           xceref(5) = 0.1847456263981d-01
+
+c---------------------------------------------------------------------
+c    reference data for 162X162X162 grids after 400 time steps,
+c    with DT = 0.67d-03
+c---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 162) .and. 
+     >           (grid_points(2) .eq. 162) .and.
+     >           (grid_points(3) .eq. 162) .and.
+     >           (no_time_steps .eq. 400) ) then
+
+           class = 'C'
+           dtref = 0.67d-3
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+           xcrref(1) = 0.5881691581829d+03
+           xcrref(2) = 0.2454417603569d+03
+           xcrref(3) = 0.3293829191851d+03
+           xcrref(4) = 0.3081924971891d+03
+           xcrref(5) = 0.4597223799176d+03
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+           xceref(1) = 0.2598120500183d+00
+           xceref(2) = 0.2590888922315d-01
+           xceref(3) = 0.5132886416320d-01
+           xceref(4) = 0.4806073419454d-01
+           xceref(5) = 0.5483377491301d+00
+
+c---------------------------------------------------------------------
+c    reference data for 408X408X408 grids after 500 time steps,
+c    with DT = 0.3d-03
+c---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 408) .and. 
+     >           (grid_points(2) .eq. 408) .and.
+     >           (grid_points(3) .eq. 408) .and.
+     >           (no_time_steps .eq. 500) ) then
+
+           class = 'D'
+           dtref = 0.30d-3
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+           xcrref(1) = 0.1044696216887d+05
+           xcrref(2) = 0.3204427762578d+04
+           xcrref(3) = 0.4648680733032d+04
+           xcrref(4) = 0.4238923283697d+04
+           xcrref(5) = 0.7588412036136d+04
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+           xceref(1) = 0.5089471423669d+01
+           xceref(2) = 0.5323514855894d+00
+           xceref(3) = 0.1187051008971d+01
+           xceref(4) = 0.1083734951938d+01
+           xceref(5) = 0.1164108338568d+02
+
+c---------------------------------------------------------------------
+c    reference data for 1020X1020X1020 grids after 500 time steps,
+c    with DT = 0.1d-03
+c---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 1020) .and. 
+     >           (grid_points(2) .eq. 1020) .and.
+     >           (grid_points(3) .eq. 1020) .and.
+     >           (no_time_steps .eq. 500) ) then
+
+           class = 'E'
+           dtref = 0.10d-3
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+           xcrref(1) = 0.6255387422609d+05
+           xcrref(2) = 0.1495317020012d+05
+           xcrref(3) = 0.2347595750586d+05
+           xcrref(4) = 0.2091099783534d+05
+           xcrref(5) = 0.4770412841218d+05
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+           xceref(1) = 0.6742735164909d+02
+           xceref(2) = 0.5390656036938d+01
+           xceref(3) = 0.1680647196477d+02
+           xceref(4) = 0.1536963126457d+02
+           xceref(5) = 0.1575330146156d+03
+
+        else
+           verified = .false.
+        endif
+
+c---------------------------------------------------------------------
+c    verification test for residuals if gridsize is one of 
+c    the defined grid sizes above (class .ne. 'U')
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c    Compute the relative difference between the computed norms and
+c    the known reference values.
+c---------------------------------------------------------------------
+        do m = 1, 5
+           
+           xcrdif(m) = dabs((xcr(m)-xcrref(m))/xcrref(m)) 
+           xcedif(m) = dabs((xce(m)-xceref(m))/xceref(m))
+           
+        enddo
+
+c---------------------------------------------------------------------
+c    Output the comparison of computed results to known cases.
+c---------------------------------------------------------------------
+
+        if (class .ne. 'U') then
+           write(*, 1990) class
+ 1990      format(' Verification being performed for class ', a)
+           write (*,2000) epsilon
+ 2000      format(' accuracy setting for epsilon = ', E20.13)
+           verified = (dabs(dt-dtref) .le. epsilon)
+           if (.not.verified) then  
+              class = 'U'
+              write (*,1000) dtref
+ 1000         format(' DT does not match the reference value of ', 
+     >                 E15.8)
+           endif
+        else 
+           write(*, 1995)
+ 1995      format(' Unknown class')
+        endif
+
+
+        if (class .ne. 'U') then
+           write (*,2001) 
+        else
+           write (*, 2005)
+        endif
+
+ 2001   format(' Comparison of RMS-norms of residual')
+ 2005   format(' RMS-norms of residual')
+        do m = 1, 5
+           if (class .eq. 'U') then
+              write(*, 2015) m, xcr(m)
+           else if (xcrdif(m) .le. epsilon) then
+              write (*,2011) m,xcr(m),xcrref(m),xcrdif(m)
+           else 
+              verified = .false.
+              write (*,2010) m,xcr(m),xcrref(m),xcrdif(m)
+           endif
+        enddo
+
+        if (class .ne. 'U') then
+           write (*,2002)
+        else
+           write (*,2006)
+        endif
+ 2002   format(' Comparison of RMS-norms of solution error')
+ 2006   format(' RMS-norms of solution error')
+        
+        do m = 1, 5
+           if (class .eq. 'U') then
+              write(*, 2015) m, xce(m)
+           else if (xcedif(m) .le. epsilon) then
+              write (*,2011) m,xce(m),xceref(m),xcedif(m)
+           else
+              verified = .false.
+              write (*,2010) m,xce(m),xceref(m),xcedif(m)
+           endif
+        enddo
+        
+ 2010   format(' FAILURE: ', i2, E20.13, E20.13, E20.13)
+ 2011   format('          ', i2, E20.13, E20.13, E20.13)
+ 2015   format('          ', i2, E20.13)
+        
+        if (class .eq. 'U') then
+           write(*, 2022)
+           write(*, 2023)
+ 2022      format(' No reference values provided')
+ 2023      format(' No verification performed')
+        else if (verified) then
+           write(*, 2020)
+ 2020      format(' Verification Successful')
+        else
+           write(*, 2021)
+ 2021      format(' Verification failed')
+        endif
+
+        return
+
+
+        end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/x_solve.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/x_solve.f
new file mode 100644
index 0000000..67e0a30
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/x_solve.f
@@ -0,0 +1,560 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine x_solve
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c this subroutine performs the solution of the approximate factorization
+c step in the x-direction for all five matrix components
+c simultaneously. The Thomas algorithm is employed to solve the
+c systems for the x-lines. Boundary conditions are non-periodic
+c---------------------------------------------------------------------
+
+       include 'header.h'
+       include 'mpinpb.h'
+
+
+       integer i, j, k, jp, kp, n, iend, jsize, ksize, i1, i2,
+     >         buffer_size, c, m, p, istart, stage, error,
+     >         requests(2), statuses(MPI_STATUS_SIZE, 2)
+       double precision  r1, r2, d, e, s(5), sm1, sm2,
+     >                   fac1, fac2
+
+
+
+c---------------------------------------------------------------------
+c      OK, now we know that there are multiple processors
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c now do a sweep on a layer-by-layer basis, i.e. sweeping through cells
+c on this node in the direction of increasing i for the forward sweep,
+c and after that reversing the direction for the backsubstitution.
+c---------------------------------------------------------------------
+
+       if (timeron) call timer_start(t_xsolve)
+c---------------------------------------------------------------------
+c                          FORWARD ELIMINATION  
+c---------------------------------------------------------------------
+       do    stage = 1, ncells
+          c         = slice(1,stage)
+
+          istart = 0
+          iend   = cell_size(1,c)-1
+
+          jsize     = cell_size(2,c)
+          ksize     = cell_size(3,c)
+          jp        = cell_coord(2,c)-1
+          kp        = cell_coord(3,c)-1
+
+          buffer_size = (jsize-start(2,c)-end(2,c)) * 
+     >                  (ksize-start(3,c)-end(3,c))
+
+          if ( stage .ne. 1) then
+
+c---------------------------------------------------------------------
+c            if this is not the first processor in this row of cells, 
+c            receive data from predecessor containing the right hand
+c            sides and the upper diagonal elements of the previous two rows
+c---------------------------------------------------------------------
+             if (timeron) call timer_start(t_xcomm)
+             call mpi_irecv(in_buffer, 22*buffer_size, 
+     >                      dp_type, predecessor(1), 
+     >                      DEFAULT_TAG,  comm_solve, 
+     >                      requests(1), error)
+             if (timeron) call timer_stop(t_xcomm)
+
+
+c---------------------------------------------------------------------
+c            communication has already been started. 
+c            compute the left hand side while waiting for the msg
+c---------------------------------------------------------------------
+             call lhsx(c)
+
+c---------------------------------------------------------------------
+c            wait for pending communication to complete
+c            This waits on the current receive and on the send
+c            from the previous stage. They always come in pairs. 
+c---------------------------------------------------------------------
+
+             if (timeron) call timer_start(t_xcomm)
+             call mpi_waitall(2, requests, statuses, error)
+             if (timeron) call timer_stop(t_xcomm)
+
+c---------------------------------------------------------------------
+c            unpack the buffer                                 
+c---------------------------------------------------------------------
+             i  = istart
+             i1 = istart + 1
+             n = 0
+
+c---------------------------------------------------------------------
+c            create a running pointer
+c---------------------------------------------------------------------
+             p = 0
+             do    k = start(3,c), ksize-end(3,c)-1
+                do    j = start(2,c), jsize-end(2,c)-1
+                   lhs(i,j,k,n+2,c) = lhs(i,j,k,n+2,c) -
+     >                       in_buffer(p+1) * lhs(i,j,k,n+1,c)
+                   lhs(i,j,k,n+3,c) = lhs(i,j,k,n+3,c) -
+     >                       in_buffer(p+2) * lhs(i,j,k,n+1,c)
+                   do    m = 1, 3
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) -
+     >                       in_buffer(p+2+m) * lhs(i,j,k,n+1,c)
+                   end do
+                   d            = in_buffer(p+6)
+                   e            = in_buffer(p+7)
+                   do    m = 1, 3
+                      s(m) = in_buffer(p+7+m)
+                   end do
+                   r1 = lhs(i,j,k,n+2,c)
+                   lhs(i,j,k,n+3,c) = lhs(i,j,k,n+3,c) - d * r1
+                   lhs(i,j,k,n+4,c) = lhs(i,j,k,n+4,c) - e * r1
+                   do    m = 1, 3
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - s(m) * r1
+                   end do
+                   r2 = lhs(i1,j,k,n+1,c)
+                   lhs(i1,j,k,n+2,c) = lhs(i1,j,k,n+2,c) - d * r2
+                   lhs(i1,j,k,n+3,c) = lhs(i1,j,k,n+3,c) - e * r2
+                   do    m = 1, 3
+                      rhs(i1,j,k,m,c) = rhs(i1,j,k,m,c) - s(m) * r2
+                   end do
+                   p = p + 10
+                end do
+             end do
+
+             do    m = 4, 5
+                n = (m-3)*5
+                do    k = start(3,c), ksize-end(3,c)-1
+                   do    j = start(2,c), jsize-end(2,c)-1
+                      lhs(i,j,k,n+2,c) = lhs(i,j,k,n+2,c) -
+     >                          in_buffer(p+1) * lhs(i,j,k,n+1,c)
+                      lhs(i,j,k,n+3,c) = lhs(i,j,k,n+3,c) -
+     >                          in_buffer(p+2) * lhs(i,j,k,n+1,c)
+                      rhs(i,j,k,m,c)   = rhs(i,j,k,m,c) -
+     >                          in_buffer(p+3) * lhs(i,j,k,n+1,c)
+                      d                = in_buffer(p+4)
+                      e                = in_buffer(p+5)
+                      s(m)             = in_buffer(p+6)
+                      r1 = lhs(i,j,k,n+2,c)
+                      lhs(i,j,k,n+3,c) = lhs(i,j,k,n+3,c) - d * r1
+                      lhs(i,j,k,n+4,c) = lhs(i,j,k,n+4,c) - e * r1
+                      rhs(i,j,k,m,c)   = rhs(i,j,k,m,c) - s(m) * r1
+                      r2 = lhs(i1,j,k,n+1,c)
+                      lhs(i1,j,k,n+2,c) = lhs(i1,j,k,n+2,c) - d * r2
+                      lhs(i1,j,k,n+3,c) = lhs(i1,j,k,n+3,c) - e * r2
+                      rhs(i1,j,k,m,c)   = rhs(i1,j,k,m,c) - s(m) * r2
+                      p = p + 6
+                   end do
+                end do
+             end do
+
+          else            
+
+c---------------------------------------------------------------------
+c            if this IS the first cell, we still compute the lhs
+c---------------------------------------------------------------------
+             call lhsx(c)
+          endif
+
+c---------------------------------------------------------------------
+c         perform the Thomas algorithm; first, FORWARD ELIMINATION     
+c---------------------------------------------------------------------
+          n = 0
+
+          do    k = start(3,c), ksize-end(3,c)-1
+             do    j = start(2,c), jsize-end(2,c)-1
+                do    i = istart, iend-2
+                   i1 = i  + 1
+                   i2 = i  + 2
+                   fac1               = 1.d0/lhs(i,j,k,n+3,c)
+                   lhs(i,j,k,n+4,c)   = fac1*lhs(i,j,k,n+4,c)
+                   lhs(i,j,k,n+5,c)   = fac1*lhs(i,j,k,n+5,c)
+                   do    m = 1, 3
+                      rhs(i,j,k,m,c) = fac1*rhs(i,j,k,m,c)
+                   end do
+                   lhs(i1,j,k,n+3,c) = lhs(i1,j,k,n+3,c) -
+     >                         lhs(i1,j,k,n+2,c)*lhs(i,j,k,n+4,c)
+                   lhs(i1,j,k,n+4,c) = lhs(i1,j,k,n+4,c) -
+     >                         lhs(i1,j,k,n+2,c)*lhs(i,j,k,n+5,c)
+                   do    m = 1, 3
+                      rhs(i1,j,k,m,c) = rhs(i1,j,k,m,c) -
+     >                         lhs(i1,j,k,n+2,c)*rhs(i,j,k,m,c)
+                   end do
+                   lhs(i2,j,k,n+2,c) = lhs(i2,j,k,n+2,c) -
+     >                         lhs(i2,j,k,n+1,c)*lhs(i,j,k,n+4,c)
+                   lhs(i2,j,k,n+3,c) = lhs(i2,j,k,n+3,c) -
+     >                         lhs(i2,j,k,n+1,c)*lhs(i,j,k,n+5,c)
+                   do    m = 1, 3
+                      rhs(i2,j,k,m,c) = rhs(i2,j,k,m,c) -
+     >                         lhs(i2,j,k,n+1,c)*rhs(i,j,k,m,c)
+                   end do
+                end do
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c         The last two rows in this grid block are a bit different, 
+c         since they do not have two more rows available for the
+c         elimination of off-diagonal entries
+c---------------------------------------------------------------------
+
+          i  = iend - 1
+          i1 = iend
+          do    k = start(3,c), ksize-end(3,c)-1
+             do    j = start(2,c), jsize-end(2,c)-1
+                fac1               = 1.d0/lhs(i,j,k,n+3,c)
+                lhs(i,j,k,n+4,c)   = fac1*lhs(i,j,k,n+4,c)
+                lhs(i,j,k,n+5,c)   = fac1*lhs(i,j,k,n+5,c)
+                do    m = 1, 3
+                   rhs(i,j,k,m,c) = fac1*rhs(i,j,k,m,c)
+                end do
+                lhs(i1,j,k,n+3,c) = lhs(i1,j,k,n+3,c) -
+     >                      lhs(i1,j,k,n+2,c)*lhs(i,j,k,n+4,c)
+                lhs(i1,j,k,n+4,c) = lhs(i1,j,k,n+4,c) -
+     >                      lhs(i1,j,k,n+2,c)*lhs(i,j,k,n+5,c)
+                do    m = 1, 3
+                   rhs(i1,j,k,m,c) = rhs(i1,j,k,m,c) -
+     >                      lhs(i1,j,k,n+2,c)*rhs(i,j,k,m,c)
+                end do
+c---------------------------------------------------------------------
+c               scale the last row immediately (some of this is
+c               overkill in case this is the last cell)
+c---------------------------------------------------------------------
+                fac2               = 1.d0/lhs(i1,j,k,n+3,c)
+                lhs(i1,j,k,n+4,c) = fac2*lhs(i1,j,k,n+4,c)
+                lhs(i1,j,k,n+5,c) = fac2*lhs(i1,j,k,n+5,c)  
+                do    m = 1, 3
+                   rhs(i1,j,k,m,c) = fac2*rhs(i1,j,k,m,c)
+                end do
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c         do the u+c and the u-c factors                 
+c---------------------------------------------------------------------
+
+          do    m = 4, 5
+             n = (m-3)*5
+             do    k = start(3,c), ksize-end(3,c)-1
+                do    j = start(2,c), jsize-end(2,c)-1
+                   do    i = istart, iend-2
+                   i1 = i  + 1
+                   i2 = i  + 2
+                   fac1               = 1.d0/lhs(i,j,k,n+3,c)
+                   lhs(i,j,k,n+4,c)   = fac1*lhs(i,j,k,n+4,c)
+                   lhs(i,j,k,n+5,c)   = fac1*lhs(i,j,k,n+5,c)
+                   rhs(i,j,k,m,c) = fac1*rhs(i,j,k,m,c)
+                   lhs(i1,j,k,n+3,c) = lhs(i1,j,k,n+3,c) -
+     >                         lhs(i1,j,k,n+2,c)*lhs(i,j,k,n+4,c)
+                   lhs(i1,j,k,n+4,c) = lhs(i1,j,k,n+4,c) -
+     >                         lhs(i1,j,k,n+2,c)*lhs(i,j,k,n+5,c)
+                   rhs(i1,j,k,m,c) = rhs(i1,j,k,m,c) -
+     >                         lhs(i1,j,k,n+2,c)*rhs(i,j,k,m,c)
+                   lhs(i2,j,k,n+2,c) = lhs(i2,j,k,n+2,c) -
+     >                         lhs(i2,j,k,n+1,c)*lhs(i,j,k,n+4,c)
+                   lhs(i2,j,k,n+3,c) = lhs(i2,j,k,n+3,c) -
+     >                         lhs(i2,j,k,n+1,c)*lhs(i,j,k,n+5,c)
+                   rhs(i2,j,k,m,c) = rhs(i2,j,k,m,c) -
+     >                         lhs(i2,j,k,n+1,c)*rhs(i,j,k,m,c)
+                end do
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c            And again the last two rows separately
+c---------------------------------------------------------------------
+             i  = iend - 1
+             i1 = iend
+             do    k = start(3,c), ksize-end(3,c)-1
+                do    j = start(2,c), jsize-end(2,c)-1
+                fac1               = 1.d0/lhs(i,j,k,n+3,c)
+                lhs(i,j,k,n+4,c)   = fac1*lhs(i,j,k,n+4,c)
+                lhs(i,j,k,n+5,c)   = fac1*lhs(i,j,k,n+5,c)
+                rhs(i,j,k,m,c)     = fac1*rhs(i,j,k,m,c)
+                lhs(i1,j,k,n+3,c) = lhs(i1,j,k,n+3,c) -
+     >                      lhs(i1,j,k,n+2,c)*lhs(i,j,k,n+4,c)
+                lhs(i1,j,k,n+4,c) = lhs(i1,j,k,n+4,c) -
+     >                      lhs(i1,j,k,n+2,c)*lhs(i,j,k,n+5,c)
+                rhs(i1,j,k,m,c)   = rhs(i1,j,k,m,c) -
+     >                      lhs(i1,j,k,n+2,c)*rhs(i,j,k,m,c)
+c---------------------------------------------------------------------
+c               Scale the last row immediately (some of this is overkill
+c               if this is the last cell)
+c---------------------------------------------------------------------
+                fac2               = 1.d0/lhs(i1,j,k,n+3,c)
+                lhs(i1,j,k,n+4,c) = fac2*lhs(i1,j,k,n+4,c)
+                lhs(i1,j,k,n+5,c) = fac2*lhs(i1,j,k,n+5,c)
+                rhs(i1,j,k,m,c)   = fac2*rhs(i1,j,k,m,c)
+
+             end do
+          end do
+       end do
+
+c---------------------------------------------------------------------
+c         send information to the next processor, except when this
+c         is the last grid block
+c---------------------------------------------------------------------
+          if (stage .ne. ncells) then
+
+c---------------------------------------------------------------------
+c            create a running pointer for the send buffer  
+c---------------------------------------------------------------------
+             p = 0
+             n = 0
+             do    k = start(3,c), ksize-end(3,c)-1
+                do    j = start(2,c), jsize-end(2,c)-1
+                   do    i = iend-1, iend
+                      out_buffer(p+1) = lhs(i,j,k,n+4,c)
+                      out_buffer(p+2) = lhs(i,j,k,n+5,c)
+                      do    m = 1, 3
+                         out_buffer(p+2+m) = rhs(i,j,k,m,c)
+                      end do
+                      p = p+5
+                   end do
+                end do
+             end do
+
+             do    m = 4, 5
+                n = (m-3)*5
+                do    k = start(3,c), ksize-end(3,c)-1
+                   do    j = start(2,c), jsize-end(2,c)-1
+                      do    i = iend-1, iend
+                         out_buffer(p+1) = lhs(i,j,k,n+4,c)
+                         out_buffer(p+2) = lhs(i,j,k,n+5,c)
+                         out_buffer(p+3) = rhs(i,j,k,m,c)
+                         p = p + 3
+                      end do
+                   end do
+                end do
+             end do
+
+c---------------------------------------------------------------------
+c            send data to next phase
+c            can't receive data yet because buffer size will be wrong 
+c---------------------------------------------------------------------
+             if (timeron) call timer_start(t_xcomm)
+             call mpi_isend(out_buffer, 22*buffer_size, 
+     >                     dp_type, successor(1), 
+     >                     DEFAULT_TAG, comm_solve, 
+     >                     requests(2), error)
+             if (timeron) call timer_stop(t_xcomm)
+
+          endif
+       end do
+
+c---------------------------------------------------------------------
+c      now go in the reverse direction                      
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c                         BACKSUBSTITUTION 
+c---------------------------------------------------------------------
+       do    stage = ncells, 1, -1
+          c = slice(1,stage)
+
+          istart = 0
+          iend   = cell_size(1,c)-1
+
+          jsize = cell_size(2,c)
+          ksize = cell_size(3,c)
+          jp    = cell_coord(2,c)-1
+          kp    = cell_coord(3,c)-1
+
+          buffer_size = (jsize-start(2,c)-end(2,c)) * 
+     >                  (ksize-start(3,c)-end(3,c))
+
+          if (stage .ne. ncells) then
+
+c---------------------------------------------------------------------
+c            if this is not the starting cell in this row of cells, 
+c            wait for a message to be received, containing the 
+c            solution of the previous two stations     
+c---------------------------------------------------------------------
+             if (timeron) call timer_start(t_xcomm)
+             call mpi_irecv(in_buffer, 10*buffer_size, 
+     >                      dp_type, successor(1), 
+     >                      DEFAULT_TAG, comm_solve, 
+     >                      requests(1), error)
+             if (timeron) call timer_stop(t_xcomm)
+
+
+c---------------------------------------------------------------------
+c            communication has already been started
+c            while waiting, do the block-diagonal inversion for the 
+c            cell that was just finished                
+c---------------------------------------------------------------------
+
+             call ninvr(slice(1,stage+1))
+
+c---------------------------------------------------------------------
+c            wait for pending communication to complete
+c---------------------------------------------------------------------
+             if (timeron) call timer_start(t_xcomm)
+             call mpi_waitall(2, requests, statuses, error)
+             if (timeron) call timer_stop(t_xcomm)
+
+c---------------------------------------------------------------------
+c            unpack the buffer for the first three factors         
+c---------------------------------------------------------------------
+             n = 0
+             p = 0
+             i  = iend
+             i1 = i - 1
+             do    m = 1, 3
+                do   k = start(3,c), ksize-end(3,c)-1
+                   do   j = start(2,c), jsize-end(2,c)-1
+                      sm1 = in_buffer(p+1)
+                      sm2 = in_buffer(p+2)
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - 
+     >                        lhs(i,j,k,n+4,c)*sm1 -
+     >                        lhs(i,j,k,n+5,c)*sm2
+                      rhs(i1,j,k,m,c) = rhs(i1,j,k,m,c) -
+     >                        lhs(i1,j,k,n+4,c) * rhs(i,j,k,m,c) - 
+     >                        lhs(i1,j,k,n+5,c) * sm1
+                      p = p + 2
+                   end do
+                end do
+             end do
+
+c---------------------------------------------------------------------
+c            now unpack the buffer for the remaining two factors
+c---------------------------------------------------------------------
+             do    m = 4, 5
+                n = (m-3)*5
+                do   k = start(3,c), ksize-end(3,c)-1
+                   do   j = start(2,c), jsize-end(2,c)-1
+                      sm1 = in_buffer(p+1)
+                      sm2 = in_buffer(p+2)
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - 
+     >                        lhs(i,j,k,n+4,c)*sm1 -
+     >                        lhs(i,j,k,n+5,c)*sm2
+                      rhs(i1,j,k,m,c) = rhs(i1,j,k,m,c) -
+     >                        lhs(i1,j,k,n+4,c) * rhs(i,j,k,m,c) - 
+     >                        lhs(i1,j,k,n+5,c) * sm1
+                      p = p + 2
+                   end do
+                end do
+             end do
+
+          else
+
+c---------------------------------------------------------------------
+c            now we know this is the first grid block on the back sweep,
+c            so we don't need a message to start the substitution. 
+c---------------------------------------------------------------------
+             i  = iend-1
+             i1 = iend
+             n = 0
+             do   m = 1, 3
+                do   k = start(3,c), ksize-end(3,c)-1
+                   do   j = start(2,c), jsize-end(2,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) -
+     >                             lhs(i,j,k,n+4,c)*rhs(i1,j,k,m,c)
+                   end do
+                end do
+             end do
+
+             do    m = 4, 5
+                n = (m-3)*5
+                do   k = start(3,c), ksize-end(3,c)-1
+                   do   j = start(2,c), jsize-end(2,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) -
+     >                             lhs(i,j,k,n+4,c)*rhs(i1,j,k,m,c)
+                   end do
+                end do
+             end do
+          endif
+
+c---------------------------------------------------------------------
+c         Whether or not this is the last processor, we always have
+c         to complete the back-substitution 
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c         The first three factors
+c---------------------------------------------------------------------
+          n = 0
+          do   m = 1, 3
+             do   k = start(3,c), ksize-end(3,c)-1
+                do   j = start(2,c), jsize-end(2,c)-1
+                   do    i = iend-2, istart, -1
+                      i1 = i  + 1
+                      i2 = i  + 2
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - 
+     >                          lhs(i,j,k,n+4,c)*rhs(i1,j,k,m,c) -
+     >                          lhs(i,j,k,n+5,c)*rhs(i2,j,k,m,c)
+                   end do
+                end do
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c         And the remaining two
+c---------------------------------------------------------------------
+          do    m = 4, 5
+             n = (m-3)*5
+             do   k = start(3,c), ksize-end(3,c)-1
+                do   j = start(2,c), jsize-end(2,c)-1
+                   do    i = iend-2, istart, -1
+                      i1 = i  + 1
+                      i2 = i  + 2
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - 
+     >                          lhs(i,j,k,n+4,c)*rhs(i1,j,k,m,c) -
+     >                          lhs(i,j,k,n+5,c)*rhs(i2,j,k,m,c)
+                   end do
+                end do
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c         send on information to the previous processor, if needed
+c---------------------------------------------------------------------
+          if (stage .ne.  1) then
+             i  = istart
+             i1 = istart+1
+             p = 0
+             do    m = 1, 5
+                do    k = start(3,c), ksize-end(3,c)-1
+                   do    j = start(2,c), jsize-end(2,c)-1
+                      out_buffer(p+1) = rhs(i,j,k,m,c)
+                      out_buffer(p+2) = rhs(i1,j,k,m,c)
+                      p = p + 2
+                   end do
+                end do
+             end do
+
+c---------------------------------------------------------------------
+c            send the packed buffer
+c---------------------------------------------------------------------
+             if (timeron) call timer_start(t_xcomm)
+             call mpi_isend(out_buffer, 10*buffer_size, 
+     >                     dp_type, predecessor(1), 
+     >                     DEFAULT_TAG, comm_solve, 
+     >                     requests(2), error)
+             if (timeron) call timer_stop(t_xcomm)
+
+          endif
+
+c---------------------------------------------------------------------
+c         If this was the last stage, do the block-diagonal inversion          
+c---------------------------------------------------------------------
+          if (stage .eq. 1) call ninvr(c)
+
+       end do
+
+       if (timeron) call timer_stop(t_xsolve)
+
+       return
+       end
+    
+
+
+
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/y_solve.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/y_solve.f
new file mode 100644
index 0000000..2f17c17
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/y_solve.f
@@ -0,0 +1,553 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine y_solve
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c this function performs the solution of the approximate factorization
+c step in the y-direction for all five matrix components
+c simultaneously. The Thomas algorithm is employed to solve the
+c systems for the y-lines. Boundary conditions are non-periodic
+c---------------------------------------------------------------------
+
+       include 'header.h'
+       include 'mpinpb.h'
+
+       integer i, j, k, stage, ip, kp, n, isize, jend, ksize, j1, j2,
+     >         buffer_size, c, m, p, jstart, error,
+     >         requests(2), statuses(MPI_STATUS_SIZE, 2)
+       double precision  r1, r2, d, e, s(5), sm1, sm2,
+     >                   fac1, fac2
+
+
+c---------------------------------------------------------------------
+c now do a sweep on a layer-by-layer basis, i.e. sweeping through cells
+c on this node in the direction of increasing j for the forward sweep,
+c and after that reversing the direction for the backsubstitution  
+c---------------------------------------------------------------------
+
+       if (timeron) call timer_start(t_ysolve)
+c---------------------------------------------------------------------
+c                          FORWARD ELIMINATION  
+c---------------------------------------------------------------------
+       do    stage = 1, ncells
+          c      = slice(2,stage)
+
+          jstart = 0
+          jend   = cell_size(2,c)-1
+
+          isize     = cell_size(1,c)
+          ksize     = cell_size(3,c)
+          ip        = cell_coord(1,c)-1
+          kp        = cell_coord(3,c)-1
+
+          buffer_size = (isize-start(1,c)-end(1,c)) * 
+     >                  (ksize-start(3,c)-end(3,c))
+
+          if ( stage .ne. 1) then
+
+c---------------------------------------------------------------------
+c            if this is not the first processor in this row of cells, 
+c            receive data from predecessor containing the right hand
+c            sides and the upper diagonal elements of the previous two rows
+c---------------------------------------------------------------------
+
+             if (timeron) call timer_start(t_ycomm)
+             call mpi_irecv(in_buffer, 22*buffer_size, 
+     >                      dp_type, predecessor(2), 
+     >                      DEFAULT_TAG, comm_solve, 
+     >                      requests(1), error)
+             if (timeron) call timer_stop(t_ycomm)
+
+c---------------------------------------------------------------------
+c            communication has already been started. 
+c            compute the left hand side while waiting for the msg
+c---------------------------------------------------------------------
+             call lhsy(c)
+
+c---------------------------------------------------------------------
+c            wait for pending communication to complete
+c            This waits on the current receive and on the send
+c            from the previous stage. They always come in pairs. 
+c---------------------------------------------------------------------
+             if (timeron) call timer_start(t_ycomm)
+             call mpi_waitall(2, requests, statuses, error)
+             if (timeron) call timer_stop(t_ycomm)
+
+c---------------------------------------------------------------------
+c            unpack the buffer                                 
+c---------------------------------------------------------------------
+             j  = jstart
+             j1 = jstart + 1
+             n = 0
+c---------------------------------------------------------------------
+c            create a running pointer
+c---------------------------------------------------------------------
+             p = 0
+             do    k = start(3,c), ksize-end(3,c)-1
+                do    i = start(1,c), isize-end(1,c)-1
+                   lhs(i,j,k,n+2,c) = lhs(i,j,k,n+2,c) -
+     >                       in_buffer(p+1) * lhs(i,j,k,n+1,c)
+                   lhs(i,j,k,n+3,c) = lhs(i,j,k,n+3,c) -
+     >                       in_buffer(p+2) * lhs(i,j,k,n+1,c)
+                   do    m = 1, 3
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) -
+     >                       in_buffer(p+2+m) * lhs(i,j,k,n+1,c)
+                   end do
+                   d            = in_buffer(p+6)
+                   e            = in_buffer(p+7)
+                   do    m = 1, 3
+                      s(m) = in_buffer(p+7+m)
+                   end do
+                   r1 = lhs(i,j,k,n+2,c)
+                   lhs(i,j,k,n+3,c) = lhs(i,j,k,n+3,c) - d * r1
+                   lhs(i,j,k,n+4,c) = lhs(i,j,k,n+4,c) - e * r1
+                   do    m = 1, 3
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - s(m) * r1
+                   end do
+                   r2 = lhs(i,j1,k,n+1,c)
+                   lhs(i,j1,k,n+2,c) = lhs(i,j1,k,n+2,c) - d * r2
+                   lhs(i,j1,k,n+3,c) = lhs(i,j1,k,n+3,c) - e * r2
+                   do    m = 1, 3
+                      rhs(i,j1,k,m,c) = rhs(i,j1,k,m,c) - s(m) * r2
+                   end do
+                   p = p + 10
+                end do
+             end do
+
+             do    m = 4, 5
+                n = (m-3)*5
+                do    k = start(3,c), ksize-end(3,c)-1
+                   do    i = start(1,c), isize-end(1,c)-1
+                      lhs(i,j,k,n+2,c) = lhs(i,j,k,n+2,c) -
+     >                          in_buffer(p+1) * lhs(i,j,k,n+1,c)
+                      lhs(i,j,k,n+3,c) = lhs(i,j,k,n+3,c) -
+     >                          in_buffer(p+2) * lhs(i,j,k,n+1,c)
+                      rhs(i,j,k,m,c)   = rhs(i,j,k,m,c) -
+     >                          in_buffer(p+3) * lhs(i,j,k,n+1,c)
+                      d                = in_buffer(p+4)
+                      e                = in_buffer(p+5)
+                      s(m)             = in_buffer(p+6)
+                      r1 = lhs(i,j,k,n+2,c)
+                      lhs(i,j,k,n+3,c) = lhs(i,j,k,n+3,c) - d * r1
+                      lhs(i,j,k,n+4,c) = lhs(i,j,k,n+4,c) - e * r1
+                      rhs(i,j,k,m,c)   = rhs(i,j,k,m,c) - s(m) * r1
+                      r2 = lhs(i,j1,k,n+1,c)
+                      lhs(i,j1,k,n+2,c) = lhs(i,j1,k,n+2,c) - d * r2
+                      lhs(i,j1,k,n+3,c) = lhs(i,j1,k,n+3,c) - e * r2
+                      rhs(i,j1,k,m,c)   = rhs(i,j1,k,m,c) - s(m) * r2
+                      p = p + 6
+                   end do
+                end do
+             end do
+
+          else            
+
+c---------------------------------------------------------------------
+c            if this IS the first cell, we still compute the lhs
+c---------------------------------------------------------------------
+             call lhsy(c)
+          endif
+
+c---------------------------------------------------------------------
+c         perform the Thomas algorithm; first, FORWARD ELIMINATION     
+c---------------------------------------------------------------------
+          n = 0
+
+          do    k = start(3,c), ksize-end(3,c)-1
+             do    j = jstart, jend-2
+                do    i = start(1,c), isize-end(1,c)-1
+                   j1 = j  + 1
+                   j2 = j  + 2
+                   fac1               = 1.d0/lhs(i,j,k,n+3,c)
+                   lhs(i,j,k,n+4,c)   = fac1*lhs(i,j,k,n+4,c)
+                   lhs(i,j,k,n+5,c)   = fac1*lhs(i,j,k,n+5,c)
+                   do    m = 1, 3
+                      rhs(i,j,k,m,c) = fac1*rhs(i,j,k,m,c)
+                   end do
+                   lhs(i,j1,k,n+3,c) = lhs(i,j1,k,n+3,c) -
+     >                         lhs(i,j1,k,n+2,c)*lhs(i,j,k,n+4,c)
+                   lhs(i,j1,k,n+4,c) = lhs(i,j1,k,n+4,c) -
+     >                         lhs(i,j1,k,n+2,c)*lhs(i,j,k,n+5,c)
+                   do    m = 1, 3
+                      rhs(i,j1,k,m,c) = rhs(i,j1,k,m,c) -
+     >                         lhs(i,j1,k,n+2,c)*rhs(i,j,k,m,c)
+                   end do
+                   lhs(i,j2,k,n+2,c) = lhs(i,j2,k,n+2,c) -
+     >                         lhs(i,j2,k,n+1,c)*lhs(i,j,k,n+4,c)
+                   lhs(i,j2,k,n+3,c) = lhs(i,j2,k,n+3,c) -
+     >                         lhs(i,j2,k,n+1,c)*lhs(i,j,k,n+5,c)
+                   do    m = 1, 3
+                      rhs(i,j2,k,m,c) = rhs(i,j2,k,m,c) -
+     >                         lhs(i,j2,k,n+1,c)*rhs(i,j,k,m,c)
+                   end do
+                end do
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c         The last two rows in this grid block are a bit different, 
+c         since they do not have two more rows available for the
+c         elimination of off-diagonal entries
+c---------------------------------------------------------------------
+
+          j  = jend - 1
+          j1 = jend
+          do    k = start(3,c), ksize-end(3,c)-1
+             do    i = start(1,c), isize-end(1,c)-1
+                fac1               = 1.d0/lhs(i,j,k,n+3,c)
+                lhs(i,j,k,n+4,c)   = fac1*lhs(i,j,k,n+4,c)
+                lhs(i,j,k,n+5,c)   = fac1*lhs(i,j,k,n+5,c)
+                do    m = 1, 3
+                   rhs(i,j,k,m,c) = fac1*rhs(i,j,k,m,c)
+                end do
+                lhs(i,j1,k,n+3,c) = lhs(i,j1,k,n+3,c) -
+     >                      lhs(i,j1,k,n+2,c)*lhs(i,j,k,n+4,c)
+                lhs(i,j1,k,n+4,c) = lhs(i,j1,k,n+4,c) -
+     >                      lhs(i,j1,k,n+2,c)*lhs(i,j,k,n+5,c)
+                do    m = 1, 3
+                   rhs(i,j1,k,m,c) = rhs(i,j1,k,m,c) -
+     >                      lhs(i,j1,k,n+2,c)*rhs(i,j,k,m,c)
+                end do
+c---------------------------------------------------------------------
+c               scale the last row immediately (some of this is
+c               overkill in case this is the last cell)
+c---------------------------------------------------------------------
+                fac2               = 1.d0/lhs(i,j1,k,n+3,c)
+                lhs(i,j1,k,n+4,c) = fac2*lhs(i,j1,k,n+4,c)
+                lhs(i,j1,k,n+5,c) = fac2*lhs(i,j1,k,n+5,c)  
+                do    m = 1, 3
+                   rhs(i,j1,k,m,c) = fac2*rhs(i,j1,k,m,c)
+                end do
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c         do the u+c and the u-c factors                 
+c---------------------------------------------------------------------
+          do    m = 4, 5
+             n = (m-3)*5
+             do    k = start(3,c), ksize-end(3,c)-1
+                do    j = jstart, jend-2
+                   do    i = start(1,c), isize-end(1,c)-1
+                      j1 = j  + 1
+                      j2 = j  + 2
+                      fac1               = 1.d0/lhs(i,j,k,n+3,c)
+                      lhs(i,j,k,n+4,c)   = fac1*lhs(i,j,k,n+4,c)
+                      lhs(i,j,k,n+5,c)   = fac1*lhs(i,j,k,n+5,c)
+                      rhs(i,j,k,m,c) = fac1*rhs(i,j,k,m,c)
+                      lhs(i,j1,k,n+3,c) = lhs(i,j1,k,n+3,c) -
+     >                            lhs(i,j1,k,n+2,c)*lhs(i,j,k,n+4,c)
+                      lhs(i,j1,k,n+4,c) = lhs(i,j1,k,n+4,c) -
+     >                            lhs(i,j1,k,n+2,c)*lhs(i,j,k,n+5,c)
+                      rhs(i,j1,k,m,c) = rhs(i,j1,k,m,c) -
+     >                            lhs(i,j1,k,n+2,c)*rhs(i,j,k,m,c)
+                      lhs(i,j2,k,n+2,c) = lhs(i,j2,k,n+2,c) -
+     >                            lhs(i,j2,k,n+1,c)*lhs(i,j,k,n+4,c)
+                      lhs(i,j2,k,n+3,c) = lhs(i,j2,k,n+3,c) -
+     >                            lhs(i,j2,k,n+1,c)*lhs(i,j,k,n+5,c)
+                      rhs(i,j2,k,m,c) = rhs(i,j2,k,m,c) -
+     >                            lhs(i,j2,k,n+1,c)*rhs(i,j,k,m,c)
+                   end do
+                end do
+             end do
+
+c---------------------------------------------------------------------
+c            And again the last two rows separately
+c---------------------------------------------------------------------
+             j  = jend - 1
+             j1 = jend
+             do    k = start(3,c), ksize-end(3,c)-1
+                do    i = start(1,c), isize-end(1,c)-1
+                   fac1               = 1.d0/lhs(i,j,k,n+3,c)
+                   lhs(i,j,k,n+4,c)   = fac1*lhs(i,j,k,n+4,c)
+                   lhs(i,j,k,n+5,c)   = fac1*lhs(i,j,k,n+5,c)
+                   rhs(i,j,k,m,c)     = fac1*rhs(i,j,k,m,c)
+                   lhs(i,j1,k,n+3,c) = lhs(i,j1,k,n+3,c) -
+     >                         lhs(i,j1,k,n+2,c)*lhs(i,j,k,n+4,c)
+                   lhs(i,j1,k,n+4,c) = lhs(i,j1,k,n+4,c) -
+     >                         lhs(i,j1,k,n+2,c)*lhs(i,j,k,n+5,c)
+                   rhs(i,j1,k,m,c)   = rhs(i,j1,k,m,c) -
+     >                         lhs(i,j1,k,n+2,c)*rhs(i,j,k,m,c)
+c---------------------------------------------------------------------
+c               Scale the last row immediately (some of this is overkill
+c               if this is the last cell)
+c---------------------------------------------------------------------
+                   fac2               = 1.d0/lhs(i,j1,k,n+3,c)
+                   lhs(i,j1,k,n+4,c) = fac2*lhs(i,j1,k,n+4,c)
+                   lhs(i,j1,k,n+5,c) = fac2*lhs(i,j1,k,n+5,c)
+                   rhs(i,j1,k,m,c)   = fac2*rhs(i,j1,k,m,c)
+
+                end do
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c         send information to the next processor, except when this
+c         is the last grid block
+c---------------------------------------------------------------------
+
+          if (stage .ne. ncells) then
+
+c---------------------------------------------------------------------
+c            create a running pointer for the send buffer  
+c---------------------------------------------------------------------
+             p = 0
+             n = 0
+             do    k = start(3,c), ksize-end(3,c)-1
+                do    i = start(1,c), isize-end(1,c)-1
+                   do    j = jend-1, jend
+                      out_buffer(p+1) = lhs(i,j,k,n+4,c)
+                      out_buffer(p+2) = lhs(i,j,k,n+5,c)
+                      do    m = 1, 3
+                         out_buffer(p+2+m) = rhs(i,j,k,m,c)
+                      end do
+                      p = p+5
+                   end do
+                end do
+             end do
+
+             do    m = 4, 5
+                n = (m-3)*5
+                do    k = start(3,c), ksize-end(3,c)-1
+                   do    i = start(1,c), isize-end(1,c)-1
+                      do    j = jend-1, jend
+                         out_buffer(p+1) = lhs(i,j,k,n+4,c)
+                         out_buffer(p+2) = lhs(i,j,k,n+5,c)
+                         out_buffer(p+3) = rhs(i,j,k,m,c)
+                         p = p + 3
+                      end do
+                   end do
+                end do
+             end do
+
+c---------------------------------------------------------------------
+c            send the packed buffer
+c---------------------------------------------------------------------
+             if (timeron) call timer_start(t_ycomm)
+             call mpi_isend(out_buffer, 22*buffer_size, 
+     >                     dp_type, successor(2), 
+     >                     DEFAULT_TAG, comm_solve, 
+     >                     requests(2), error)
+             if (timeron) call timer_stop(t_ycomm)
+
+          endif
+       end do
+
+c---------------------------------------------------------------------
+c      now go in the reverse direction                      
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c                         BACKSUBSTITUTION 
+c---------------------------------------------------------------------
+       do    stage = ncells, 1, -1
+          c = slice(2,stage)
+
+          jstart = 0
+          jend   = cell_size(2,c)-1
+
+          isize = cell_size(1,c)
+          ksize = cell_size(3,c)
+          ip    = cell_coord(1,c)-1
+          kp    = cell_coord(3,c)-1
+
+          buffer_size = (isize-start(1,c)-end(1,c)) * 
+     >                  (ksize-start(3,c)-end(3,c))
+
+          if (stage .ne. ncells) then
+
+c---------------------------------------------------------------------
+c            if this is not the starting cell in this row of cells, 
+c            wait for a message to be received, containing the 
+c            solution of the previous two stations     
+c---------------------------------------------------------------------
+
+             if (timeron) call timer_start(t_ycomm)
+             call mpi_irecv(in_buffer, 10*buffer_size, 
+     >                      dp_type, successor(2), 
+     >                      DEFAULT_TAG, comm_solve, 
+     >                      requests(1), error)
+             if (timeron) call timer_stop(t_ycomm)
+
+
+c---------------------------------------------------------------------
+c            communication has already been started
+c            while waiting, do the block-diagonal inversion for the 
+c            cell that was just finished                
+c---------------------------------------------------------------------
+
+             call pinvr(slice(2,stage+1))
+
+c---------------------------------------------------------------------
+c            wait for pending communication to complete
+c---------------------------------------------------------------------
+             if (timeron) call timer_start(t_ycomm)
+             call mpi_waitall(2, requests, statuses, error)
+             if (timeron) call timer_stop(t_ycomm)
+
+c---------------------------------------------------------------------
+c            unpack the buffer for the first three factors         
+c---------------------------------------------------------------------
+             n = 0
+             p = 0
+             j  = jend
+             j1 = j - 1
+             do    m = 1, 3
+                do   k = start(3,c), ksize-end(3,c)-1
+                   do   i = start(1,c), isize-end(1,c)-1
+                      sm1 = in_buffer(p+1)
+                      sm2 = in_buffer(p+2)
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - 
+     >                        lhs(i,j,k,n+4,c)*sm1 -
+     >                        lhs(i,j,k,n+5,c)*sm2
+                      rhs(i,j1,k,m,c) = rhs(i,j1,k,m,c) -
+     >                        lhs(i,j1,k,n+4,c) * rhs(i,j,k,m,c) - 
+     >                        lhs(i,j1,k,n+5,c) * sm1
+                      p = p + 2
+                   end do
+                end do
+             end do
+
+c---------------------------------------------------------------------
+c            now unpack the buffer for the remaining two factors
+c---------------------------------------------------------------------
+             do    m = 4, 5
+                n = (m-3)*5
+                do   k = start(3,c), ksize-end(3,c)-1
+                   do   i = start(1,c), isize-end(1,c)-1
+                      sm1 = in_buffer(p+1)
+                      sm2 = in_buffer(p+2)
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - 
+     >                        lhs(i,j,k,n+4,c)*sm1 -
+     >                        lhs(i,j,k,n+5,c)*sm2
+                      rhs(i,j1,k,m,c) = rhs(i,j1,k,m,c) -
+     >                        lhs(i,j1,k,n+4,c) * rhs(i,j,k,m,c) - 
+     >                        lhs(i,j1,k,n+5,c) * sm1
+                      p = p + 2
+                   end do
+                end do
+             end do
+
+          else
+c---------------------------------------------------------------------
+c            now we know this is the first grid block on the back sweep,
+c            so we don't need a message to start the substitution. 
+c---------------------------------------------------------------------
+
+             j  = jend - 1
+             j1 = jend
+             n = 0
+             do   m = 1, 3
+                do   k = start(3,c), ksize-end(3,c)-1
+                   do   i = start(1,c), isize-end(1,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) -
+     >                             lhs(i,j,k,n+4,c)*rhs(i,j1,k,m,c)
+                   end do
+                end do
+             end do
+
+             do    m = 4, 5
+                n = (m-3)*5
+                do   k = start(3,c), ksize-end(3,c)-1
+                   do   i = start(1,c), isize-end(1,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) -
+     >                             lhs(i,j,k,n+4,c)*rhs(i,j1,k,m,c)
+                   end do
+                end do
+             end do
+          endif
+
+c---------------------------------------------------------------------
+c         Whether or not this is the last processor, we always have
+c         to complete the back-substitution 
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c         The first three factors
+c---------------------------------------------------------------------
+          n = 0
+          do   m = 1, 3
+             do   k = start(3,c), ksize-end(3,c)-1
+                do   j = jend-2, jstart, -1
+                   do    i = start(1,c), isize-end(1,c)-1
+                      j1 = j  + 1
+                      j2 = j  + 2
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - 
+     >                          lhs(i,j,k,n+4,c)*rhs(i,j1,k,m,c) -
+     >                          lhs(i,j,k,n+5,c)*rhs(i,j2,k,m,c)
+                   end do
+                end do
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c         And the remaining two
+c---------------------------------------------------------------------
+          do    m = 4, 5
+             n = (m-3)*5
+             do   k = start(3,c), ksize-end(3,c)-1
+                do   j = jend-2, jstart, -1
+                   do    i = start(1,c), isize-end(1,c)-1
+                      j1 = j  + 1
+                      j2 = j1 + 1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - 
+     >                          lhs(i,j,k,n+4,c)*rhs(i,j1,k,m,c) -
+     >                          lhs(i,j,k,n+5,c)*rhs(i,j2,k,m,c)
+                   end do
+                end do
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c         send on information to the previous processor, if needed
+c---------------------------------------------------------------------
+          if (stage .ne.  1) then
+             j  = jstart
+             j1 = jstart + 1
+             p = 0
+             do    m = 1, 5
+                do    k = start(3,c), ksize-end(3,c)-1
+                   do    i = start(1,c), isize-end(1,c)-1
+                      out_buffer(p+1) = rhs(i,j,k,m,c)
+                      out_buffer(p+2) = rhs(i,j1,k,m,c)
+                      p = p + 2
+                   end do
+                end do
+             end do
+
+c---------------------------------------------------------------------
+c            send the packed buffer
+c---------------------------------------------------------------------
+
+             if (timeron) call timer_start(t_ycomm)
+             call mpi_isend(out_buffer, 10*buffer_size, 
+     >                     dp_type, predecessor(2), 
+     >                     DEFAULT_TAG, comm_solve, 
+     >                     requests(2), error)
+             if (timeron) call timer_stop(t_ycomm)
+
+          endif
+
+c---------------------------------------------------------------------
+c         If this was the last stage, do the block-diagonal inversion          
+c---------------------------------------------------------------------
+          if (stage .eq. 1) call pinvr(c)
+
+       end do
+
+       if (timeron) call timer_stop(t_ysolve)
+
+       return
+       end
+    
+
+
+
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/z_solve.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/z_solve.f
new file mode 100644
index 0000000..0d787f9
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/SP/z_solve.f
@@ -0,0 +1,547 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine z_solve
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c this function performs the solution of the approximate factorization
+c step in the z-direction for all five matrix components
+c simultaneously. The Thomas algorithm is employed to solve the
+c systems for the z-lines. Boundary conditions are non-periodic
+c---------------------------------------------------------------------
+
+       include 'header.h'
+       include 'mpinpb.h'
+
+       integer i, j, k, stage, ip, jp, n, isize, jsize, kend, k1, k2,
+     >         buffer_size, c, m, p, kstart, error,
+     >         requests(2), statuses(MPI_STATUS_SIZE, 2)
+       double precision  r1, r2, d, e, s(5), sm1, sm2,
+     >                   fac1, fac2
+
+c---------------------------------------------------------------------
+c now do a sweep on a layer-by-layer basis, i.e. sweeping through cells
+c on this node in the direction of increasing k for the forward sweep,
+c and after that reversing the direction for the backsubstitution  
+c---------------------------------------------------------------------
+
+       if (timeron) call timer_start(t_zsolve)
+c---------------------------------------------------------------------
+c                          FORWARD ELIMINATION  
+c---------------------------------------------------------------------
+       do    stage = 1, ncells
+          c         = slice(3,stage)
+
+          kstart = 0
+          kend   = cell_size(3,c)-1
+
+          isize     = cell_size(1,c)
+          jsize     = cell_size(2,c)
+          ip        = cell_coord(1,c)-1
+          jp        = cell_coord(2,c)-1
+
+          buffer_size = (isize-start(1,c)-end(1,c)) * 
+     >                  (jsize-start(2,c)-end(2,c))
+
+          if (stage .ne. 1) then
+
+
+c---------------------------------------------------------------------
+c            if this is not the first processor in this row of cells, 
+c            receive data from predecessor containing the right hand
+c            sides and the upper diagonal elements of the previous two rows
+c---------------------------------------------------------------------
+
+             if (timeron) call timer_start(t_zcomm)
+             call mpi_irecv(in_buffer, 22*buffer_size, 
+     >                      dp_type, predecessor(3), 
+     >                      DEFAULT_TAG, comm_solve, 
+     >                      requests(1), error)
+             if (timeron) call timer_stop(t_zcomm)
+
+
+c---------------------------------------------------------------------
+c            communication has already been started. 
+c            compute the left hand side while waiting for the msg
+c---------------------------------------------------------------------
+             call lhsz(c)
+
+c---------------------------------------------------------------------
+c            wait for pending communication to complete
+c---------------------------------------------------------------------
+             if (timeron) call timer_start(t_zcomm)
+             call mpi_waitall(2, requests, statuses, error)
+             if (timeron) call timer_stop(t_zcomm)
+            
+c---------------------------------------------------------------------
+c            unpack the buffer                                 
+c---------------------------------------------------------------------
+             k  = kstart
+             k1 = kstart + 1
+             n = 0
+
+c---------------------------------------------------------------------
+c            create a running pointer
+c---------------------------------------------------------------------
+             p = 0
+             do    j = start(2,c), jsize-end(2,c)-1
+                do    i = start(1,c), isize-end(1,c)-1
+                   lhs(i,j,k,n+2,c) = lhs(i,j,k,n+2,c) -
+     >                       in_buffer(p+1) * lhs(i,j,k,n+1,c)
+                   lhs(i,j,k,n+3,c) = lhs(i,j,k,n+3,c) -
+     >                       in_buffer(p+2) * lhs(i,j,k,n+1,c)
+                   do    m = 1, 3
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) -
+     >                       in_buffer(p+2+m) * lhs(i,j,k,n+1,c)
+                   end do
+                   d            = in_buffer(p+6)
+                   e            = in_buffer(p+7)
+                   do    m = 1, 3
+                      s(m) = in_buffer(p+7+m)
+                   end do
+                   r1 = lhs(i,j,k,n+2,c)
+                   lhs(i,j,k,n+3,c) = lhs(i,j,k,n+3,c) - d * r1
+                   lhs(i,j,k,n+4,c) = lhs(i,j,k,n+4,c) - e * r1
+                   do    m = 1, 3
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - s(m) * r1
+                   end do
+                   r2 = lhs(i,j,k1,n+1,c)
+                   lhs(i,j,k1,n+2,c) = lhs(i,j,k1,n+2,c) - d * r2
+                   lhs(i,j,k1,n+3,c) = lhs(i,j,k1,n+3,c) - e * r2
+                   do    m = 1, 3
+                      rhs(i,j,k1,m,c) = rhs(i,j,k1,m,c) - s(m) * r2
+                   end do
+                   p = p + 10
+                end do
+             end do
+
+             do    m = 4, 5
+                n = (m-3)*5
+                do    j = start(2,c), jsize-end(2,c)-1
+                   do    i = start(1,c), isize-end(1,c)-1
+                      lhs(i,j,k,n+2,c) = lhs(i,j,k,n+2,c) -
+     >                          in_buffer(p+1) * lhs(i,j,k,n+1,c)
+                      lhs(i,j,k,n+3,c) = lhs(i,j,k,n+3,c) -
+     >                          in_buffer(p+2) * lhs(i,j,k,n+1,c)
+                      rhs(i,j,k,m,c)   = rhs(i,j,k,m,c) -
+     >                          in_buffer(p+3) * lhs(i,j,k,n+1,c)
+                      d                = in_buffer(p+4)
+                      e                = in_buffer(p+5)
+                      s(m)             = in_buffer(p+6)
+                      r1 = lhs(i,j,k,n+2,c)
+                      lhs(i,j,k,n+3,c) = lhs(i,j,k,n+3,c) - d * r1
+                      lhs(i,j,k,n+4,c) = lhs(i,j,k,n+4,c) - e * r1
+                      rhs(i,j,k,m,c)   = rhs(i,j,k,m,c) - s(m) * r1
+                      r2 = lhs(i,j,k1,n+1,c)
+                      lhs(i,j,k1,n+2,c) = lhs(i,j,k1,n+2,c) - d * r2
+                      lhs(i,j,k1,n+3,c) = lhs(i,j,k1,n+3,c) - e * r2
+                      rhs(i,j,k1,m,c)   = rhs(i,j,k1,m,c) - s(m) * r2
+                      p = p + 6
+                   end do
+                end do
+             end do
+
+          else            
+
+c---------------------------------------------------------------------
+c            if this IS the first cell, we still compute the lhs
+c---------------------------------------------------------------------
+             call lhsz(c)
+          endif
+
+c---------------------------------------------------------------------
+c         perform the Thomas algorithm; first, FORWARD ELIMINATION     
+c---------------------------------------------------------------------
+          n = 0
+
+          do    k = kstart, kend-2
+             do    j = start(2,c), jsize-end(2,c)-1
+                do    i = start(1,c), isize-end(1,c)-1
+                   k1 = k  + 1
+                   k2 = k  + 2
+                   fac1               = 1.d0/lhs(i,j,k,n+3,c)
+                   lhs(i,j,k,n+4,c)   = fac1*lhs(i,j,k,n+4,c)
+                   lhs(i,j,k,n+5,c)   = fac1*lhs(i,j,k,n+5,c)
+                   do    m = 1, 3
+                      rhs(i,j,k,m,c) = fac1*rhs(i,j,k,m,c)
+                   end do
+                   lhs(i,j,k1,n+3,c) = lhs(i,j,k1,n+3,c) -
+     >                         lhs(i,j,k1,n+2,c)*lhs(i,j,k,n+4,c)
+                   lhs(i,j,k1,n+4,c) = lhs(i,j,k1,n+4,c) -
+     >                         lhs(i,j,k1,n+2,c)*lhs(i,j,k,n+5,c)
+                   do    m = 1, 3
+                      rhs(i,j,k1,m,c) = rhs(i,j,k1,m,c) -
+     >                         lhs(i,j,k1,n+2,c)*rhs(i,j,k,m,c)
+                   end do
+                   lhs(i,j,k2,n+2,c) = lhs(i,j,k2,n+2,c) -
+     >                         lhs(i,j,k2,n+1,c)*lhs(i,j,k,n+4,c)
+                   lhs(i,j,k2,n+3,c) = lhs(i,j,k2,n+3,c) -
+     >                         lhs(i,j,k2,n+1,c)*lhs(i,j,k,n+5,c)
+                   do    m = 1, 3
+                      rhs(i,j,k2,m,c) = rhs(i,j,k2,m,c) -
+     >                         lhs(i,j,k2,n+1,c)*rhs(i,j,k,m,c)
+                   end do
+                end do
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c         The last two rows in this grid block are a bit different, 
+c         since they do not have two more rows available for the
+c         elimination of off-diagonal entries
+c---------------------------------------------------------------------
+          k  = kend - 1
+          k1 = kend
+          do    j = start(2,c), jsize-end(2,c)-1
+             do    i = start(1,c), isize-end(1,c)-1
+                fac1               = 1.d0/lhs(i,j,k,n+3,c)
+                lhs(i,j,k,n+4,c)   = fac1*lhs(i,j,k,n+4,c)
+                lhs(i,j,k,n+5,c)   = fac1*lhs(i,j,k,n+5,c)
+                do    m = 1, 3
+                   rhs(i,j,k,m,c) = fac1*rhs(i,j,k,m,c)
+                end do
+                lhs(i,j,k1,n+3,c) = lhs(i,j,k1,n+3,c) -
+     >                      lhs(i,j,k1,n+2,c)*lhs(i,j,k,n+4,c)
+                lhs(i,j,k1,n+4,c) = lhs(i,j,k1,n+4,c) -
+     >                      lhs(i,j,k1,n+2,c)*lhs(i,j,k,n+5,c)
+                do    m = 1, 3
+                   rhs(i,j,k1,m,c) = rhs(i,j,k1,m,c) -
+     >                      lhs(i,j,k1,n+2,c)*rhs(i,j,k,m,c)
+                end do
+c---------------------------------------------------------------------
+c               scale the last row immediately (some of this is
+c               overkill in case this is the last cell)
+c---------------------------------------------------------------------
+                fac2               = 1.d0/lhs(i,j,k1,n+3,c)
+                lhs(i,j,k1,n+4,c) = fac2*lhs(i,j,k1,n+4,c)
+                lhs(i,j,k1,n+5,c) = fac2*lhs(i,j,k1,n+5,c)  
+                do    m = 1, 3
+                   rhs(i,j,k1,m,c) = fac2*rhs(i,j,k1,m,c)
+                end do
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c         do the u+c and the u-c factors               
+c---------------------------------------------------------------------
+          do   m = 4, 5
+             n = (m-3)*5
+             do    k = kstart, kend-2
+                do    j = start(2,c), jsize-end(2,c)-1
+                   do    i = start(1,c), isize-end(1,c)-1
+                   k1 = k  + 1
+                   k2 = k  + 2
+                   fac1               = 1.d0/lhs(i,j,k,n+3,c)
+                   lhs(i,j,k,n+4,c)   = fac1*lhs(i,j,k,n+4,c)
+                   lhs(i,j,k,n+5,c)   = fac1*lhs(i,j,k,n+5,c)
+                   rhs(i,j,k,m,c) = fac1*rhs(i,j,k,m,c)
+                   lhs(i,j,k1,n+3,c) = lhs(i,j,k1,n+3,c) -
+     >                         lhs(i,j,k1,n+2,c)*lhs(i,j,k,n+4,c)
+                   lhs(i,j,k1,n+4,c) = lhs(i,j,k1,n+4,c) -
+     >                         lhs(i,j,k1,n+2,c)*lhs(i,j,k,n+5,c)
+                   rhs(i,j,k1,m,c) = rhs(i,j,k1,m,c) -
+     >                         lhs(i,j,k1,n+2,c)*rhs(i,j,k,m,c)
+                   lhs(i,j,k2,n+2,c) = lhs(i,j,k2,n+2,c) -
+     >                         lhs(i,j,k2,n+1,c)*lhs(i,j,k,n+4,c)
+                   lhs(i,j,k2,n+3,c) = lhs(i,j,k2,n+3,c) -
+     >                         lhs(i,j,k2,n+1,c)*lhs(i,j,k,n+5,c)
+                   rhs(i,j,k2,m,c) = rhs(i,j,k2,m,c) -
+     >                         lhs(i,j,k2,n+1,c)*rhs(i,j,k,m,c)
+                end do
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c            And again the last two rows separately
+c---------------------------------------------------------------------
+             k  = kend - 1
+             k1 = kend
+             do    j = start(2,c), jsize-end(2,c)-1
+                do    i = start(1,c), isize-end(1,c)-1
+                fac1               = 1.d0/lhs(i,j,k,n+3,c)
+                lhs(i,j,k,n+4,c)   = fac1*lhs(i,j,k,n+4,c)
+                lhs(i,j,k,n+5,c)   = fac1*lhs(i,j,k,n+5,c)
+                rhs(i,j,k,m,c)     = fac1*rhs(i,j,k,m,c)
+                lhs(i,j,k1,n+3,c) = lhs(i,j,k1,n+3,c) -
+     >                      lhs(i,j,k1,n+2,c)*lhs(i,j,k,n+4,c)
+                lhs(i,j,k1,n+4,c) = lhs(i,j,k1,n+4,c) -
+     >                      lhs(i,j,k1,n+2,c)*lhs(i,j,k,n+5,c)
+                rhs(i,j,k1,m,c)   = rhs(i,j,k1,m,c) -
+     >                      lhs(i,j,k1,n+2,c)*rhs(i,j,k,m,c)
+c---------------------------------------------------------------------
+c               Scale the last row immediately (some of this is overkill
+c               if this is the last cell)
+c---------------------------------------------------------------------
+                fac2               = 1.d0/lhs(i,j,k1,n+3,c)
+                lhs(i,j,k1,n+4,c) = fac2*lhs(i,j,k1,n+4,c)
+                lhs(i,j,k1,n+5,c) = fac2*lhs(i,j,k1,n+5,c)
+                rhs(i,j,k1,m,c)   = fac2*rhs(i,j,k1,m,c)
+
+             end do
+          end do
+       end do
+
+c---------------------------------------------------------------------
+c         send information to the next processor, except when this
+c         is the last grid block
+c---------------------------------------------------------------------
+
+          if (stage .ne. ncells) then
+
+c---------------------------------------------------------------------
+c            create a running pointer for the send buffer  
+c---------------------------------------------------------------------
+             p = 0
+             n = 0
+             do    j = start(2,c), jsize-end(2,c)-1
+                do    i = start(1,c), isize-end(1,c)-1
+                   do    k = kend-1, kend
+                      out_buffer(p+1) = lhs(i,j,k,n+4,c)
+                      out_buffer(p+2) = lhs(i,j,k,n+5,c)
+                      do    m = 1, 3
+                         out_buffer(p+2+m) = rhs(i,j,k,m,c)
+                      end do
+                      p = p+5
+                   end do
+                end do
+             end do
+
+             do    m = 4, 5
+                n = (m-3)*5
+                do    j = start(2,c), jsize-end(2,c)-1
+                   do    i = start(1,c), isize-end(1,c)-1
+                      do    k = kend-1, kend
+                         out_buffer(p+1) = lhs(i,j,k,n+4,c)
+                         out_buffer(p+2) = lhs(i,j,k,n+5,c)
+                         out_buffer(p+3) = rhs(i,j,k,m,c)
+                         p = p + 3
+                      end do
+                   end do
+                end do
+             end do
+
+
+             if (timeron) call timer_start(t_zcomm)
+             call mpi_isend(out_buffer, 22*buffer_size, 
+     >                     dp_type, successor(3), 
+     >                     DEFAULT_TAG, comm_solve, 
+     >                     requests(2), error)
+             if (timeron) call timer_stop(t_zcomm)
+
+          endif
+       end do
+
+c---------------------------------------------------------------------
+c      now go in the reverse direction                      
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c                         BACKSUBSTITUTION 
+c---------------------------------------------------------------------
+       do    stage = ncells, 1, -1
+          c = slice(3,stage)
+
+          kstart = 0
+          kend   = cell_size(3,c)-1
+
+          isize     = cell_size(1,c)
+          jsize     = cell_size(2,c)
+          ip        = cell_coord(1,c)-1
+          jp        = cell_coord(2,c)-1
+
+          buffer_size = (isize-start(1,c)-end(1,c)) * 
+     >                  (jsize-start(2,c)-end(2,c))
+
+          if (stage .ne. ncells) then
+
+c---------------------------------------------------------------------
+c            if this is not the starting cell in this row of cells, 
+c            wait for a message to be received, containing the 
+c            solution of the previous two stations     
+c---------------------------------------------------------------------
+
+             if (timeron) call timer_start(t_zcomm)
+             call mpi_irecv(in_buffer, 10*buffer_size, 
+     >                      dp_type, successor(3), 
+     >                      DEFAULT_TAG, comm_solve, 
+     >                      requests(1), error)
+             if (timeron) call timer_stop(t_zcomm)
+
+
+c---------------------------------------------------------------------
+c            communication has already been started
+c            while waiting, do the  block-diagonal inversion for the 
+c            cell that was just finished                
+c---------------------------------------------------------------------
+
+             call tzetar(slice(3,stage+1))
+
+c---------------------------------------------------------------------
+c            wait for pending communication to complete
+c---------------------------------------------------------------------
+             if (timeron) call timer_start(t_zcomm)
+             call mpi_waitall(2, requests, statuses, error)
+             if (timeron) call timer_stop(t_zcomm)
+
+c---------------------------------------------------------------------
+c            unpack the buffer for the first three factors         
+c---------------------------------------------------------------------
+             n = 0
+             p = 0
+             k  = kend
+             k1 = k - 1
+             do    m = 1, 3
+                do   j = start(2,c), jsize-end(2,c)-1
+                   do   i = start(1,c), isize-end(1,c)-1
+                      sm1 = in_buffer(p+1)
+                      sm2 = in_buffer(p+2)
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - 
+     >                        lhs(i,j,k,n+4,c)*sm1 -
+     >                        lhs(i,j,k,n+5,c)*sm2
+                      rhs(i,j,k1,m,c) = rhs(i,j,k1,m,c) -
+     >                        lhs(i,j,k1,n+4,c) * rhs(i,j,k,m,c) - 
+     >                        lhs(i,j,k1,n+5,c) * sm1
+                      p = p + 2
+                   end do
+                end do
+             end do
+
+c---------------------------------------------------------------------
+c            now unpack the buffer for the remaining two factors
+c---------------------------------------------------------------------
+             do    m = 4, 5
+                n = (m-3)*5
+                do   j = start(2,c), jsize-end(2,c)-1
+                   do   i = start(1,c), isize-end(1,c)-1
+                      sm1 = in_buffer(p+1)
+                      sm2 = in_buffer(p+2)
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - 
+     >                        lhs(i,j,k,n+4,c)*sm1 -
+     >                        lhs(i,j,k,n+5,c)*sm2
+                      rhs(i,j,k1,m,c) = rhs(i,j,k1,m,c) -
+     >                        lhs(i,j,k1,n+4,c) * rhs(i,j,k,m,c) - 
+     >                        lhs(i,j,k1,n+5,c) * sm1
+                      p = p + 2
+                   end do
+                end do
+             end do
+
+          else
+
+c---------------------------------------------------------------------
+c            now we know this is the first grid block on the back sweep,
+c            so we don't need a message to start the substitution. 
+c---------------------------------------------------------------------
+
+             k  = kend - 1
+             k1 = kend
+             n = 0
+             do   m = 1, 3
+                do   j = start(2,c), jsize-end(2,c)-1
+                   do   i = start(1,c), isize-end(1,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) -
+     >                             lhs(i,j,k,n+4,c)*rhs(i,j,k1,m,c)
+                   end do
+                end do
+             end do
+
+             do    m = 4, 5
+                n = (m-3)*5
+                do   j = start(2,c), jsize-end(2,c)-1
+                   do   i = start(1,c), isize-end(1,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) -
+     >                             lhs(i,j,k,n+4,c)*rhs(i,j,k1,m,c)
+                   end do
+                end do
+             end do
+          endif
+
+c---------------------------------------------------------------------
+c         Whether or not this is the last processor, we always have
+c         to complete the back-substitution 
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c         The first three factors
+c---------------------------------------------------------------------
+          n = 0
+          do   m = 1, 3
+             do   k = kend-2, kstart, -1
+                do   j = start(2,c), jsize-end(2,c)-1
+                   do    i = start(1,c), isize-end(1,c)-1
+                      k1 = k  + 1
+                      k2 = k  + 2
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - 
+     >                          lhs(i,j,k,n+4,c)*rhs(i,j,k1,m,c) -
+     >                          lhs(i,j,k,n+5,c)*rhs(i,j,k2,m,c)
+                   end do
+                end do
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c         And the remaining two
+c---------------------------------------------------------------------
+          do    m = 4, 5
+             n = (m-3)*5
+             do   k = kend-2, kstart, -1
+                do   j = start(2,c), jsize-end(2,c)-1
+                   do    i = start(1,c), isize-end(1,c)-1
+                      k1 = k  + 1
+                      k2 = k  + 2
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - 
+     >                          lhs(i,j,k,n+4,c)*rhs(i,j,k1,m,c) -
+     >                          lhs(i,j,k,n+5,c)*rhs(i,j,k2,m,c)
+                   end do
+                end do
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c         send on information to the previous processor, if needed
+c---------------------------------------------------------------------
+          if (stage .ne.  1) then
+             k  = kstart
+             k1 = kstart + 1
+             p = 0
+             do    m = 1, 5
+                do    j = start(2,c), jsize-end(2,c)-1
+                   do    i = start(1,c), isize-end(1,c)-1
+                      out_buffer(p+1) = rhs(i,j,k,m,c)
+                      out_buffer(p+2) = rhs(i,j,k1,m,c)
+                      p = p + 2
+                   end do
+                end do
+             end do
+
+             if (timeron) call timer_start(t_zcomm)
+             call mpi_isend(out_buffer, 10*buffer_size, 
+     >                     dp_type, predecessor(3), 
+     >                     DEFAULT_TAG, comm_solve, 
+     >                     requests(2), error)
+             if (timeron) call timer_stop(t_zcomm)
+
+          endif
+
+c---------------------------------------------------------------------
+c         If this was the last stage, do the block-diagonal inversion
+c---------------------------------------------------------------------
+          if (stage .eq. 1) call tzetar(c)
+
+       end do
+
+       if (timeron) call timer_stop(t_zsolve)
+
+       return
+       end
+    
+
+
+
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/common/c_print_results.c b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/common/c_print_results.c
new file mode 100644
index 0000000..44bc4d1
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/common/c_print_results.c
@@ -0,0 +1,97 @@
+/*****************************************************************/
+/******     C  _  P  R  I  N  T  _  R  E  S  U  L  T  S     ******/
+/*****************************************************************/
+#include <stdlib.h>
+#include <stdio.h>
+
+void c_print_results( char   *name,
+                      char   class,
+                      int    n1, 
+                      int    n2,
+                      int    n3,
+                      int    niter,
+                      int    nprocs_compiled,
+                      int    nprocs_total,
+                      double t,
+                      double mops,
+		      char   *optype,
+                      int    passed_verification,
+                      char   *npbversion,
+                      char   *compiletime,
+                      char   *mpicc,
+                      char   *clink,
+                      char   *cmpi_lib,
+                      char   *cmpi_inc,
+                      char   *cflags,
+                      char   *clinkflags )
+{
+    char *evalue="1000";
+
+    printf( "\n\n %s Benchmark Completed\n", name ); 
+
+    printf( " Class           =                        %c\n", class );
+
+    if( n3 == 0 ) {
+        long nn = n1;
+        if ( n2 != 0 ) nn *= n2;
+        printf( " Size            =             %12ld\n", nn );   /* as in IS */
+    }
+    else
+        printf( " Size            =              %3dx %3dx %3d\n", n1,n2,n3 );
+
+    printf( " Iterations      =             %12d\n", niter );
+ 
+    printf( " Time in seconds =             %12.2f\n", t );
+
+    printf( " Total processes =             %12d\n", nprocs_total );
+
+    if ( nprocs_compiled != 0 )
+        printf( " Compiled procs  =             %12d\n", nprocs_compiled );
+
+    printf( " Mop/s total     =             %12.2f\n", mops );
+
+    printf( " Mop/s/process   =             %12.2f\n", mops/((float) nprocs_total) );
+
+    printf( " Operation type  = %24s\n", optype);
+
+    if( passed_verification )
+        printf( " Verification    =               SUCCESSFUL\n" );
+    else
+        printf( " Verification    =             UNSUCCESSFUL\n" );
+
+    printf( " Version         =             %12s\n", npbversion );
+
+    printf( " Compile date    =             %12s\n", compiletime );
+
+    printf( "\n Compile options:\n" );
+
+    printf( "    MPICC        = %s\n", mpicc );
+
+    printf( "    CLINK        = %s\n", clink );
+
+    printf( "    CMPI_LIB     = %s\n", cmpi_lib );
+
+    printf( "    CMPI_INC     = %s\n", cmpi_inc );
+
+    printf( "    CFLAGS       = %s\n", cflags );
+
+    printf( "    CLINKFLAGS   = %s\n", clinkflags );
+#ifdef SMP
+    evalue = getenv("MP_SET_NUMTHREADS");
+    printf( "   MULTICPUS = %s\n", evalue );
+#endif
+
+    printf( "\n\n" );
+    printf( " Please send feedbacks and/or the results of this run to:\n\n" );
+    printf( " NPB Development Team\n" );
+    printf( " npb@nas.nasa.gov\n\n\n" );
+/*    printf( " Please send the results of this run to:\n\n" );
+    printf( " NPB Development Team\n" );
+    printf( " Internet: npb@nas.nasa.gov\n \n" );
+    printf( " If email is not available, send this to:\n\n" );
+    printf( " MS T27A-1\n" );
+    printf( " NASA Ames Research Center\n" );
+    printf( " Moffett Field, CA  94035-1000\n\n" );
+    printf( " Fax: 650-604-3957\n\n" );*/
+}
+ 
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/common/c_timers.c b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/common/c_timers.c
new file mode 100644
index 0000000..c8c81e7
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/common/c_timers.c
@@ -0,0 +1,45 @@
+
+#include "mpi.h"
+
+double start[64], elapsed[64];
+
+/*****************************************************************/
+/******            T  I  M  E  R  _  C  L  E  A  R          ******/
+/*****************************************************************/
+void timer_clear( int n )
+{
+    elapsed[n] = 0.0;
+}
+
+
+/*****************************************************************/
+/******            T  I  M  E  R  _  S  T  A  R  T          ******/
+/*****************************************************************/
+void timer_start( int n )
+{
+    start[n] = MPI_Wtime();
+}
+
+
+/*****************************************************************/
+/******            T  I  M  E  R  _  S  T  O  P             ******/
+/*****************************************************************/
+void timer_stop( int n )
+{
+    double t, now;
+
+    now = MPI_Wtime();
+    t = now - start[n];
+    elapsed[n] += t;
+
+}
+
+
+/*****************************************************************/
+/******            T  I  M  E  R  _  R  E  A  D             ******/
+/*****************************************************************/
+double timer_read( int n )
+{
+    return( elapsed[n] );
+}
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/common/print_results.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/common/print_results.f
new file mode 100644
index 0000000..bd953b5
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/common/print_results.f
@@ -0,0 +1,119 @@
+
+      subroutine print_results(name, class, n1, n2, n3, niter, 
+     >               nprocs_compiled, nprocs_total,
+     >               t, mops, optype, verified, npbversion, 
+     >               compiletime, cs1, cs2, cs3, cs4, cs5, cs6, cs7)
+      
+      implicit none
+      character*2 name
+      character*1 class
+      integer n1, n2, n3, niter, nprocs_compiled, nprocs_total, j
+      double precision t, mops
+      character optype*24, size*15
+      logical verified
+      character*(*) npbversion, compiletime, 
+     >              cs1, cs2, cs3, cs4, cs5, cs6, cs7
+
+         write (*, 2) name 
+ 2       format(//, ' ', A2, ' Benchmark Completed.')
+
+         write (*, 3) Class
+ 3       format(' Class           = ', 12x, a12)
+
+c   If this is not a grid-based problem (EP, FT, CG), then
+c   we only print n1, which contains some measure of the
+c   problem size. In that case, n2 and n3 are both zero.
+c   Otherwise, we print the grid size n1xn2xn3
+
+         if ((n2 .eq. 0) .and. (n3 .eq. 0)) then
+            if (name(1:2) .eq. 'EP') then
+               write(size, '(f15.0)' ) 2.d0**n1
+               j = 15
+               if (size(j:j) .eq. '.') j = j - 1
+               write (*,42) size(1:j)
+ 42            format(' Size            = ',9x, a15)
+            else
+               write (*,44) n1
+ 44            format(' Size            = ',12x, i12)
+            endif
+         else
+            write (*, 4) n1,n2,n3
+ 4          format(' Size            =  ',9x, i4,'x',i4,'x',i4)
+         endif
+
+         write (*, 5) niter
+ 5       format(' Iterations      = ', 12x, i12)
+         
+         write (*, 6) t
+ 6       format(' Time in seconds = ',12x, f12.2)
+         
+         write (*,7) nprocs_total
+ 7       format(' Total processes = ', 12x, i12)
+         
+         write (*,8) nprocs_compiled
+ 8       format(' Compiled procs  = ', 12x, i12)
+
+         write (*,9) mops
+ 9       format(' Mop/s total     = ',12x, f12.2)
+
+         write (*,10) mops/float( nprocs_total )
+ 10      format(' Mop/s/process   = ', 12x, f12.2)        
+         
+         write(*, 11) optype
+ 11      format(' Operation type  = ', a24)
+
+         if (verified) then 
+            write(*,12) '  SUCCESSFUL'
+         else
+            write(*,12) 'UNSUCCESSFUL'
+         endif
+ 12      format(' Verification    = ', 12x, a)
+
+         write(*,13) npbversion
+ 13      format(' Version         = ', 12x, a12)
+
+         write(*,14) compiletime
+ 14      format(' Compile date    = ', 12x, a12)
+
+
+         write (*,121) cs1
+ 121     format(/, ' Compile options:', /, 
+     >          '    MPIF77       = ', A)
+
+         write (*,122) cs2
+ 122     format('    FLINK        = ', A)
+
+         write (*,123) cs3
+ 123     format('    FMPI_LIB     = ', A)
+
+         write (*,124) cs4
+ 124     format('    FMPI_INC     = ', A)
+
+         write (*,125) cs5
+ 125     format('    FFLAGS       = ', A)
+
+         write (*,126) cs6
+ 126     format('    FLINKFLAGS   = ', A)
+
+         write(*, 127) cs7
+ 127     format('    RAND         = ', A)
+        
+         write (*,130)
+ 130     format(//' Please send feedbacks and/or'
+     >            ' the results of this run to:'//
+     >            ' NPB Development Team '/
+     >            ' Internet: npb@nas.nasa.gov'//)
+c 130     format(//' Please send the results of this run to:'//
+c     >            ' NPB Development Team '/
+c     >            ' Internet: npb@nas.nasa.gov'/
+c     >            ' '/
+c     >            ' If email is not available, send this to:'//
+c     >            ' MS T27A-1'/
+c     >            ' NASA Ames Research Center'/
+c     >            ' Moffett Field, CA  94035-1000'//
+c     >            ' Fax: 650-604-3957'//)
+
+
+         return
+         end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/common/randdp.c b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/common/randdp.c
new file mode 100644
index 0000000..6766247
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/common/randdp.c
@@ -0,0 +1,64 @@
+//---------------------------------------------------------------------
+//   This function is the C version of the random number generator randdp.f
+//---------------------------------------------------------------------
+
+double	randlc(X, A)
+double *X;
+double *A;
+{
+      static int        KS=0;
+      static double	R23, R46, T23, T46;
+      double		T1, T2, T3, T4;
+      double		A1;
+      double		A2;
+      double		X1;
+      double		X2;
+      double		Z;
+      int     		i, j;
+
+      if (KS == 0) 
+      {
+        R23 = 1.0;
+        R46 = 1.0;
+        T23 = 1.0;
+        T46 = 1.0;
+    
+        for (i=1; i<=23; i++)
+        {
+          R23 = 0.50 * R23;
+          T23 = 2.0 * T23;
+        }
+        for (i=1; i<=46; i++)
+        {
+          R46 = 0.50 * R46;
+          T46 = 2.0 * T46;
+        }
+        KS = 1;
+      }
+
+/*  Break A into two parts such that A = 2^23 * A1 + A2 and set X = N.  */
+
+      T1 = R23 * *A;
+      j  = T1;
+      A1 = j;
+      A2 = *A - T23 * A1;
+
+/*  Break X into two parts such that X = 2^23 * X1 + X2, compute
+    Z = A1 * X2 + A2 * X1  (mod 2^23), and then
+    X = 2^23 * Z + A2 * X2  (mod 2^46).                            */
+
+      T1 = R23 * *X;
+      j  = T1;
+      X1 = j;
+      X2 = *X - T23 * X1;
+      T1 = A1 * X2 + A2 * X1;
+      
+      j  = R23 * T1;
+      T2 = j;
+      Z = T1 - T23 * T2;
+      T3 = T23 * Z + A2 * X2;
+      j  = R46 * T3;
+      T4 = j;
+      *X = T3 - T46 * T4;
+      return(R46 * *X);
+} 
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/common/randdp.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/common/randdp.f
new file mode 100644
index 0000000..64860d9
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/common/randdp.f
@@ -0,0 +1,137 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      double precision function randlc (x, a)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c
+c   This routine returns a uniform pseudorandom double precision number in the
+c   range (0, 1) by using the linear congruential generator
+c
+c   x_{k+1} = a x_k  (mod 2^46)
+c
+c   where 0 < x_k < 2^46 and 0 < a < 2^46.  This scheme generates 2^44 numbers
+c   before repeating.  The argument A is the same as 'a' in the above formula,
+c   and X is the same as x_0.  A and X must be odd double precision integers
+c   in the range (1, 2^46).  The returned value RANDLC is normalized to be
+c   between 0 and 1, i.e. RANDLC = 2^(-46) * x_1.  X is updated to contain
+c   the new seed x_1, so that subsequent calls to RANDLC using the same
+c   arguments will generate a continuous sequence.
+c
+c   This routine should produce the same results on any computer with at least
+c   48 mantissa bits in double precision floating point data.  On systems where
+c   single precision is already 64 bits (e.g. Cray), disable double precision.
+c
+c   David H. Bailey     October 26, 1990
+c
+c---------------------------------------------------------------------
+
+      implicit none
+
+      double precision r23,r46,t23,t46,a,x,t1,t2,t3,t4,a1,a2,x1,x2,z
+      parameter (r23 = 0.5d0 ** 23, r46 = r23 ** 2, t23 = 2.d0 ** 23,
+     >  t46 = t23 ** 2)
+
+c---------------------------------------------------------------------
+c   Break A into two parts such that A = 2^23 * A1 + A2.
+c---------------------------------------------------------------------
+      t1 = r23 * a
+      a1 = int (t1)
+      a2 = a - t23 * a1
+
+c---------------------------------------------------------------------
+c   Break X into two parts such that X = 2^23 * X1 + X2, compute
+c   Z = A1 * X2 + A2 * X1  (mod 2^23), and then
+c   X = 2^23 * Z + A2 * X2  (mod 2^46).
+c---------------------------------------------------------------------
+      t1 = r23 * x
+      x1 = int (t1)
+      x2 = x - t23 * x1
+      t1 = a1 * x2 + a2 * x1
+      t2 = int (r23 * t1)
+      z = t1 - t23 * t2
+      t3 = t23 * z + a2 * x2
+      t4 = int (r46 * t3)
+      x = t3 - t46 * t4
+      randlc = r46 * x
+
+      return
+      end
+
+
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine vranlc (n, x, a, y)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c
+c   This routine generates N uniform pseudorandom double precision numbers in
+c   the range (0, 1) by using the linear congruential generator
+c
+c   x_{k+1} = a x_k  (mod 2^46)
+c
+c   where 0 < x_k < 2^46 and 0 < a < 2^46.  This scheme generates 2^44 numbers
+c   before repeating.  The argument A is the same as 'a' in the above formula,
+c   and X is the same as x_0.  A and X must be odd double precision integers
+c   in the range (1, 2^46).  The N results are placed in Y and are normalized
+c   to be between 0 and 1.  X is updated to contain the new seed, so that
+c   subsequent calls to VRANLC using the same arguments will generate a
+c   continuous sequence.  If N is zero, only initialization is performed, and
+c   the variables X, A and Y are ignored.
+c
+c   This routine is the standard version designed for scalar or RISC systems.
+c   However, it should produce the same results on any single processor
+c   computer with at least 48 mantissa bits in double precision floating point
+c   data.  On systems with 64-bit single precision (e.g. Cray), disable double precision.
+c
+c---------------------------------------------------------------------
+
+      implicit none
+
+      integer i,n
+      double precision y,r23,r46,t23,t46,a,x,t1,t2,t3,t4,a1,a2,x1,x2,z
+      dimension y(*)
+      parameter (r23 = 0.5d0 ** 23, r46 = r23 ** 2, t23 = 2.d0 ** 23,
+     >  t46 = t23 ** 2)
+
+
+c---------------------------------------------------------------------
+c   Break A into two parts such that A = 2^23 * A1 + A2.
+c---------------------------------------------------------------------
+      t1 = r23 * a
+      a1 = int (t1)
+      a2 = a - t23 * a1
+
+c---------------------------------------------------------------------
+c   Generate N results.   This loop is not vectorizable.
+c---------------------------------------------------------------------
+      do i = 1, n
+
+c---------------------------------------------------------------------
+c   Break X into two parts such that X = 2^23 * X1 + X2, compute
+c   Z = A1 * X2 + A2 * X1  (mod 2^23), and then
+c   X = 2^23 * Z + A2 * X2  (mod 2^46).
+c---------------------------------------------------------------------
+        t1 = r23 * x
+        x1 = int (t1)
+        x2 = x - t23 * x1
+        t1 = a1 * x2 + a2 * x1
+        t2 = int (r23 * t1)
+        z = t1 - t23 * t2
+        t3 = t23 * z + a2 * x2
+        t4 = int (r46 * t3)
+        x = t3 - t46 * t4
+        y(i) = r46 * x
+      enddo
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/common/randdpvec.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/common/randdpvec.f
new file mode 100644
index 0000000..c708071
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/common/randdpvec.f
@@ -0,0 +1,186 @@
+c---------------------------------------------------------------------
+      double precision function randlc (x, a)
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c
+c   This routine returns a uniform pseudorandom double precision number in the
+c   range (0, 1) by using the linear congruential generator
+c
+c   x_{k+1} = a x_k  (mod 2^46)
+c
+c   where 0 < x_k < 2^46 and 0 < a < 2^46.  This scheme generates 2^44 numbers
+c   before repeating.  The argument A is the same as 'a' in the above formula,
+c   and X is the same as x_0.  A and X must be odd double precision integers
+c   in the range (1, 2^46).  The returned value RANDLC is normalized to be
+c   between 0 and 1, i.e. RANDLC = 2^(-46) * x_1.  X is updated to contain
+c   the new seed x_1, so that subsequent calls to RANDLC using the same
+c   arguments will generate a continuous sequence.
+c
+c   This routine should produce the same results on any computer with at least
+c   48 mantissa bits in double precision floating point data.  On systems where
+c   single precision is already 64 bits (e.g. Cray), disable double precision.
+c
+c   David H. Bailey     October 26, 1990
+c
+c---------------------------------------------------------------------
+
+      implicit none
+
+      double precision r23,r46,t23,t46,a,x,t1,t2,t3,t4,a1,a2,x1,x2,z
+      parameter (r23 = 0.5d0 ** 23, r46 = r23 ** 2, t23 = 2.d0 ** 23,
+     >  t46 = t23 ** 2)
+
+c---------------------------------------------------------------------
+c   Break A into two parts such that A = 2^23 * A1 + A2.
+c---------------------------------------------------------------------
+      t1 = r23 * a
+      a1 = int (t1)
+      a2 = a - t23 * a1
+
+c---------------------------------------------------------------------
+c   Break X into two parts such that X = 2^23 * X1 + X2, compute
+c   Z = A1 * X2 + A2 * X1  (mod 2^23), and then
+c   X = 2^23 * Z + A2 * X2  (mod 2^46).
+c---------------------------------------------------------------------
+      t1 = r23 * x
+      x1 = int (t1)
+      x2 = x - t23 * x1
+
+
+      t1 = a1 * x2 + a2 * x1
+      t2 = int (r23 * t1)
+      z = t1 - t23 * t2
+      t3 = t23 * z + a2 * x2
+      t4 = int (r46 * t3)
+      x = t3 - t46 * t4
+      randlc = r46 * x
+      return
+      end
+
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine vranlc (n, x, a, y)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   This routine generates N uniform pseudorandom double precision numbers in
+c   the range (0, 1) by using the linear congruential generator
+c   
+c   x_{k+1} = a x_k  (mod 2^46)
+c   
+c   where 0 < x_k < 2^46 and 0 < a < 2^46.  This scheme generates 2^44 numbers
+c   before repeating.  The argument A is the same as 'a' in the above formula,
+c   and X is the same as x_0.  A and X must be odd double precision integers
+c   in the range (1, 2^46).  The N results are placed in Y and are normalized
+c   to be between 0 and 1.  X is updated to contain the new seed, so that
+c   subsequent calls to VRANLC using the same arguments will generate a
+c   continuous sequence.
+c   
+c   This routine generates the output sequence in batches of length NV, for
+c   convenience on vector computers.  This routine should produce the same
+c   results on any computer with at least 48 mantissa bits in double precision
+c   floating point data.  On Cray systems, double precision should be disabled.
+c   
+c   David H. Bailey    August 30, 1990
+c---------------------------------------------------------------------
+
+      integer n
+      double precision x, a, y(*)
+      
+      double precision r23, r46, t23, t46
+      integer nv
+      parameter (r23 = 2.d0 ** (-23), r46 = r23 * r23, t23 = 2.d0 ** 23,
+     >     t46 = t23 * t23, nv = 64)
+      double precision  xv(nv), t1, t2, t3, t4, an, a1, a2, x1, x2, yy
+      integer n1, i, j
+      external randlc
+      double precision randlc
+
+c---------------------------------------------------------------------
+c     Compute the first NV elements of the sequence using RANDLC.
+c---------------------------------------------------------------------
+      t1 = x
+      n1 = min (n, nv)
+
+      do  i = 1, n1
+         xv(i) = t46 * randlc (t1, a)
+      enddo
+
+c---------------------------------------------------------------------
+c     It is not necessary to compute AN, A1 or A2 unless N is greater than NV.
+c---------------------------------------------------------------------
+      if (n .gt. nv) then
+
+c---------------------------------------------------------------------
+c     Compute AN = A ^ NV (mod 2^46) using successive calls to RANDLC.
+c---------------------------------------------------------------------
+         t1 = a
+         t2 = r46 * a
+
+         do  i = 1, nv - 1
+            t2 = randlc (t1, a)
+         enddo
+
+         an = t46 * t2
+
+c---------------------------------------------------------------------
+c     Break AN into two parts such that AN = 2^23 * A1 + A2.
+c---------------------------------------------------------------------
+         t1 = r23 * an
+         a1 = aint (t1)
+         a2 = an - t23 * a1
+      endif
+
+c---------------------------------------------------------------------
+c     Compute N pseudorandom results in batches of size NV.
+c---------------------------------------------------------------------
+      do  j = 0, n - 1, nv
+         n1 = min (nv, n - j)
+
+c---------------------------------------------------------------------
+c     Compute up to NV results based on the current seed vector XV.
+c---------------------------------------------------------------------
+         do  i = 1, n1
+            y(i+j) = r46 * xv(i)
+         enddo
+
+c---------------------------------------------------------------------
+c     If this is the last pass through the batch (j) loop, it is not necessary to
+c     update the XV vector.
+c---------------------------------------------------------------------
+         if (j + n1 .eq. n) goto 150
+
+c---------------------------------------------------------------------
+c     Update the XV vector by multiplying each element by AN (mod 2^46).
+c---------------------------------------------------------------------
+         do  i = 1, nv
+            t1 = r23 * xv(i)
+            x1 = aint (t1)
+            x2 = xv(i) - t23 * x1
+            t1 = a1 * x2 + a2 * x1
+            t2 = aint (r23 * t1)
+            yy = t1 - t23 * t2
+            t3 = t23 * yy + a2 * x2
+            t4 = aint (r46 * t3)
+            xv(i) = t3 - t46 * t4
+         enddo
+
+      enddo
+
+c---------------------------------------------------------------------
+c     Save the last seed in X so that subsequent calls to VRANLC will generate
+c     a continuous sequence.
+c---------------------------------------------------------------------
+ 150  x = xv(n1)
+
+      return
+      end
+
+c----- end of program ------------------------------------------------
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/common/randi8.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/common/randi8.f
new file mode 100644
index 0000000..21ab881
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/common/randi8.f
@@ -0,0 +1,79 @@
+      double precision function randlc(x, a)
+
+c---------------------------------------------------------------------
+c
+c   This routine returns a uniform pseudorandom double precision number in the
+c   range (0, 1) by using the linear congruential generator
+c
+c   x_{k+1} = a x_k  (mod 2^46)
+c
+c   where 0 < x_k < 2^46 and 0 < a < 2^46.  This scheme generates 2^44 numbers
+c   before repeating.  The argument A is the same as 'a' in the above formula,
+c   and X is the same as x_0.  A and X must be odd double precision integers
+c   in the range (1, 2^46).  The returned value RANDLC is normalized to be
+c   between 0 and 1, i.e. RANDLC = 2^(-46) * x_1.  X is updated to contain
+c   the new seed x_1, so that subsequent calls to RANDLC using the same
+c   arguments will generate a continuous sequence.
+
+      implicit none
+      double precision x, a
+      integer*8 i246m1, Lx, La
+      double precision d2m46
+
+      parameter(d2m46=0.5d0**46)
+
+      save i246m1
+      data i246m1/X'00003FFFFFFFFFFF'/
+
+      Lx = X
+      La = A
+
+      Lx   = iand(Lx*La,i246m1)
+      randlc = d2m46*dble(Lx)
+      x    = dble(Lx)
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+
+      SUBROUTINE VRANLC (N, X, A, Y)
+      implicit none
+      integer n, i
+      double precision x, a, y(*)
+      integer*8 i246m1, Lx, La
+      double precision d2m46
+
+c This doesn't work, because the compiler does the calculation in 32
+c bits and overflows. No standard way (without f90 stuff) to specify
+c that the rhs should be done in 64 bit arithmetic. 
+c      parameter(i246m1=2**46-1)
+
+      parameter(d2m46=0.5d0**46)
+
+      save i246m1
+      data i246m1/X'00003FFFFFFFFFFF'/
+
+c Note that the v6 compiler on an R8000 does something stupid with
+c the above. Using the following instead (or various other things)
+c makes the calculation run almost 10 times as fast. 
+c 
+c      save d2m46
+c      data d2m46/0.0d0/
+c      if (d2m46 .eq. 0.0d0) then
+c         d2m46 = 0.5d0**46
+c      endif
+
+      Lx = X
+      La = A
+      do i = 1, N
+         Lx   = iand(Lx*La,i246m1)
+         y(i) = d2m46*dble(Lx)
+      end do
+      x    = dble(Lx)
+
+      return
+      end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/common/randi8_safe.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/common/randi8_safe.f
new file mode 100644
index 0000000..f725b6a
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/common/randi8_safe.f
@@ -0,0 +1,64 @@
+      double precision function randlc(x, a)
+
+c---------------------------------------------------------------------
+c
+c   This routine returns a uniform pseudorandom double precision number in the
+c   range (0, 1) by using the linear congruential generator
+c
+c   x_{k+1} = a x_k  (mod 2^46)
+c
+c   where 0 < x_k < 2^46 and 0 < a < 2^46.  This scheme generates 2^44 numbers
+c   before repeating.  The argument A is the same as 'a' in the above formula,
+c   and X is the same as x_0.  A and X must be odd double precision integers
+c   in the range (1, 2^46).  The returned value RANDLC is normalized to be
+c   between 0 and 1, i.e. RANDLC = 2^(-46) * x_1.  X is updated to contain
+c   the new seed x_1, so that subsequent calls to RANDLC using the same
+c   arguments will generate a continuous sequence.
+
+      implicit none
+      double precision x, a
+      integer*8 Lx, La, a1, a2, x1, x2, xa
+      double precision d2m46
+      parameter(d2m46=0.5d0**46)
+
+      Lx = x
+      La = A
+      a1 = ibits(La, 23, 23)
+      a2 = ibits(La, 0, 23)
+      x1 = ibits(Lx, 23, 23)
+      x2 = ibits(Lx, 0, 23)
+      xa = ishft(ibits(a1*x2+a2*x1, 0, 23), 23) + a2*x2
+      Lx   = ibits(xa,0, 46)
+      x    = dble(Lx)
+      randlc = d2m46*x
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+
+      SUBROUTINE VRANLC (N, X, A, Y)
+      implicit none
+      integer n, i
+      double precision x, a, y(*)
+      integer*8 Lx, La, a1, a2, x1, x2, xa
+      double precision d2m46
+      parameter(d2m46=0.5d0**46)
+
+      Lx = X
+      La = A
+      a1 = ibits(La, 23, 23)
+      a2 = ibits(La, 0, 23)
+      do i = 1, N
+         x1 = ibits(Lx, 23, 23)
+         x2 = ibits(Lx, 0, 23)
+         xa = ishft(ibits(a1*x2+a2*x1, 0, 23), 23) + a2*x2
+         Lx   = ibits(xa,0, 46)
+         y(i) = d2m46*dble(Lx)
+      end do
+      x = dble(Lx)
+      return
+      end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/common/timers.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/common/timers.f
new file mode 100644
index 0000000..7a19ccf
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/common/timers.f
@@ -0,0 +1,78 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      
+      subroutine timer_clear(n)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+      integer n
+      
+      double precision start(64), elapsed(64)
+      common /tt/ start, elapsed
+
+      elapsed(n) = 0.0
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine timer_start(n)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+      integer n
+      include 'mpif.h'
+      double precision start(64), elapsed(64)
+      common /tt/ start, elapsed
+
+      start(n) = MPI_Wtime()
+
+      return
+      end
+      
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine timer_stop(n)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+      integer n
+      include 'mpif.h'
+      double precision start(64), elapsed(64)
+      common /tt/ start, elapsed
+      double precision t, now
+      now = MPI_Wtime()
+      t = now - start(n)
+      elapsed(n) = elapsed(n) + t
+
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      double precision function timer_read(n)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+      integer n
+      double precision start(64), elapsed(64)
+      common /tt/ start, elapsed
+      
+      timer_read = elapsed(n)
+      return
+      end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/README b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/README
new file mode 100644
index 0000000..ae535e9
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/README
@@ -0,0 +1,7 @@
+This directory contains examples of make.def files that were used 
+by the NPB team in testing the benchmarks on different platforms. 
+They can be used as starting points for make.def files for your 
+own platform, but you may need to tailor them for best performance 
+on your installation. A clean template can be found in directory 
+`config'.
+Some examples of suite.def files are also provided.
\ No newline at end of file
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/make.def.dec_alpha b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/make.def.dec_alpha
new file mode 100644
index 0000000..44f0453
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/make.def.dec_alpha
@@ -0,0 +1,18 @@
+#This is for a DEC Alpha 8400. The code will execute on a 
+#single processor
+#Warning: parallel make does not work properly in general
+MPIF77  = f77
+FLINK   = f77
+#Optimization -O5 breaks SP; works fine for all other codes
+FFLAGS  = -O4
+
+MPICC   = cc
+CLINK   = cc
+CFLAGS  = -O5 
+
+include ../config/make.dummy
+
+CC      = cc -g
+BINDIR  = ../bin
+
+RAND   = randi8
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/make.def.ibm_aix64 b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/make.def.ibm_aix64
new file mode 100644
index 0000000..9ecf96a
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/make.def.ibm_aix64
@@ -0,0 +1,162 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP and BT, which are in Fortran, the following must 
+# be defined:
+#
+# MPIF77     - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# FMPI_INC   - any -I arguments required for compiling MPI/Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# FMPI_LIB   - any -L and -l arguments required for linking MPI/Fortran 
+# 
+# compilations are done with $(MPIF77) $(FMPI_INC) $(FFLAGS) or
+#                            $(MPIF77) $(FFLAGS)
+# linking is done with       $(FLINK) $(FMPI_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the fortran compiler used for MPI programs
+#---------------------------------------------------------------------------
+MPIF77 = mpxlf -q64
+# This links MPI fortran programs; usually the same as ${MPIF77}
+FLINK	= $(MPIF77)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker to help link with MPI correctly
+#---------------------------------------------------------------------------
+FMPI_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler to help find 'mpif.h'
+#---------------------------------------------------------------------------
+FMPI_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -O3 -qarch=auto -qtune=auto -qhot -qnosave
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = -O3 -qarch=auto
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS, which is in C, the following must be defined:
+#
+# MPICC      - C compiler 
+# CFLAGS     - C compilation arguments
+# CMPI_INC   - any -I arguments required for compiling MPI/C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# CMPI_LIB   - any -L and -l arguments required for linking MPI/C 
+#
+# compilations are done with $(MPICC) $(CMPI_INC) $(CFLAGS) or
+#                            $(MPICC) $(CFLAGS)
+# linking is done with       $(CLINK) $(CMPI_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for MPI programs
+#---------------------------------------------------------------------------
+MPICC = mpcc -q64
+# This links MPI C programs; usually the same as ${MPICC}
+CLINK	= $(MPICC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker to help link with MPI correctly
+#---------------------------------------------------------------------------
+CMPI_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler to help find 'mpi.h'
+#---------------------------------------------------------------------------
+CMPI_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+#---------------------------------------------------------------------------
+CFLAGS	= -O3 -qarch=auto -qtune=auto -qhot
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = -O3 -qarch=auto
+
+
+#---------------------------------------------------------------------------
+# MPI dummy library:
+#
+# Uncomment if you want to use the MPI dummy library supplied by NAS instead 
+# of the true message-passing library. The include file redefines several of
+# the above macros. It also invokes make in subdirectory MPI_dummy. Make 
+# sure that no spaces or tabs precede include.
+#---------------------------------------------------------------------------
+# include ../config/make.dummy
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+CC	= cc -g
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# Some machines (e.g. Crays) have 128-bit DOUBLE PRECISION numbers, which
+# is twice the precision required for the NPB suite. A compiler flag 
+# (e.g. -dp) can usually be used to change DOUBLE PRECISION variables to
+# 64 bits, but the MPI library may continue to send 128 bits. Short of
+# recompiling MPI, the solution is to use MPI_REAL to send these 64-bit
+# numbers, and MPI_COMPLEX to send their complex counterparts. Uncomment
+# the following line to enable this substitution. 
+# 
+# NOTE: IF THE I/O BENCHMARK IS BEING BUILT, WE USE CONVERTFLAG TO
+#       SPECIFY THE FORTRAN RECORD LENGTH UNIT. IT IS A SYSTEM-SPECIFIC
+#       VALUE (USUALLY 1 OR 4). UNCOMMENT THE SECOND LINE AND SUBSTITUTE
+#       THE CORRECT VALUE FOR "length".
+#       IF BOTH 128-BIT DOUBLE PRECISION NUMBERS AND I/O ARE TO BE ENABLED,
+#       UNCOMMENT THE THIRD LINE AND SUBSTITUTE THE CORRECT VALUE FOR
+#       "length"
+#---------------------------------------------------------------------------
+# CONVERTFLAG	= -DCONVERTDOUBLE
+CONVERTFLAG	= -DFORTRAN_REC_SIZE=1
+# CONVERTFLAG	= -DCONVERTDOUBLE -DFORTRAN_REC_SIZE=length
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/make.def.irix6.2 b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/make.def.irix6.2
new file mode 100644
index 0000000..f764047
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/make.def.irix6.2
@@ -0,0 +1,16 @@
+#This is for a generic single-processor SGI workstation
+MPIF77 = f77
+FLINK	= f77
+FFLAGS	= -O3
+
+MPICC = cc
+CLINK	= cc
+CFLAGS	= -O3 
+
+include ../config/make.dummy
+
+CC	= cc -g
+BINDIR	= ../bin
+
+RAND   = randi8
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/make.def.origin b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/make.def.origin
new file mode 100644
index 0000000..11c63c9
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/make.def.origin
@@ -0,0 +1,20 @@
+# This is for an SGI Origin 2000 or 3000 with vendor MPI. The Fortran
+# record length is specified, so it can also be used for the I/O
+# benchmark.
+MPIF77   = f77 
+FMPI_LIB = -lmpi
+FLINK    = f77 -64
+FFLAGS   = -O3 -64
+
+MPICC    = cc
+CMPI_LIB = -lmpi
+CLINK    = cc
+CFLAGS   = -O3 
+
+CC       = cc -g
+BINDIR   = ../bin
+
+RAND   = randi8
+
+CONVERTFLAG = -DFORTRAN_REC_SIZE=4
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/make.def.pgi_mpich b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/make.def.pgi_mpich
new file mode 100644
index 0000000..3f16a11
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/make.def.pgi_mpich
@@ -0,0 +1,162 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP and BT, which are in Fortran, the following must 
+# be defined:
+#
+# MPIF77     - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# FMPI_INC   - any -I arguments required for compiling MPI/Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# FMPI_LIB   - any -L and -l arguments required for linking MPI/Fortran 
+# 
+# compilations are done with $(MPIF77) $(FMPI_INC) $(FFLAGS) or
+#                            $(MPIF77) $(FFLAGS)
+# linking is done with       $(FLINK) $(FMPI_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the fortran compiler used for MPI programs
+#---------------------------------------------------------------------------
+MPIF77 = mpif90
+# This links MPI fortran programs; usually the same as ${MPIF77}
+FLINK	= $(MPIF77)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker to help link with MPI correctly
+#---------------------------------------------------------------------------
+FMPI_LIB  = 
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler to help find 'mpif.h'
+#---------------------------------------------------------------------------
+FMPI_INC = 
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -O3 -fastsse
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = -O3 -fastsse
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS, which is in C, the following must be defined:
+#
+# MPICC      - C compiler 
+# CFLAGS     - C compilation arguments
+# CMPI_INC   - any -I arguments required for compiling MPI/C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# CMPI_LIB   - any -L and -l arguments required for linking MPI/C 
+#
+# compilations are done with $(MPICC) $(CMPI_INC) $(CFLAGS) or
+#                            $(MPICC) $(CFLAGS)
+# linking is done with       $(CLINK) $(CMPI_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for MPI programs
+#---------------------------------------------------------------------------
+MPICC = mpicc
+# This links MPI C programs; usually the same as ${MPICC}
+CLINK	= $(MPICC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker to help link with MPI correctly
+#---------------------------------------------------------------------------
+CMPI_LIB  = 
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler to help find 'mpi.h'
+#---------------------------------------------------------------------------
+CMPI_INC = 
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+#---------------------------------------------------------------------------
+CFLAGS	= -O3 -fastsse
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = -O3 -fastsse
+
+
+#---------------------------------------------------------------------------
+# MPI dummy library:
+#
+# Uncomment if you want to use the MPI dummy library supplied by NAS instead 
+# of the true message-passing library. The include file redefines several of
+# the above macros. It also invokes make in subdirectory MPI_dummy. Make 
+# sure that no spaces or tabs precede include.
+#---------------------------------------------------------------------------
+# include ../config/make.dummy
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+CC	= pgcc -g
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# Some machines (e.g. Crays) have 128-bit DOUBLE PRECISION numbers, which
+# is twice the precision required for the NPB suite. A compiler flag 
+# (e.g. -dp) can usually be used to change DOUBLE PRECISION variables to
+# 64 bits, but the MPI library may continue to send 128 bits. Short of
+# recompiling MPI, the solution is to use MPI_REAL to send these 64-bit
+# numbers, and MPI_COMPLEX to send their complex counterparts. Uncomment
+# the following line to enable this substitution. 
+# 
+# NOTE: IF THE I/O BENCHMARK IS BEING BUILT, WE USE CONVERTFLAG TO
+#       SPECIFY THE FORTRAN RECORD LENGTH UNIT. IT IS A SYSTEM-SPECIFIC
+#       VALUE (USUALLY 1 OR 4). UNCOMMENT THE SECOND LINE AND SUBSTITUTE
+#       THE CORRECT VALUE FOR "length".
+#       IF BOTH 128-BIT DOUBLE PRECISION NUMBERS AND I/O ARE TO BE ENABLED,
+#       UNCOMMENT THE THIRD LINE AND SUBSTITUTE THE CORRECT VALUE FOR
+#       "length"
+#---------------------------------------------------------------------------
+# CONVERTFLAG	= -DCONVERTDOUBLE
+CONVERTFLAG	= -DFORTRAN_REC_SIZE=1
+# CONVERTFLAG	= -DCONVERTDOUBLE -DFORTRAN_REC_SIZE=length
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/make.def.sgi_altix b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/make.def.sgi_altix
new file mode 100644
index 0000000..fac2815
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/make.def.sgi_altix
@@ -0,0 +1,162 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP and BT, which are in Fortran, the following must 
+# be defined:
+#
+# MPIF77     - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# FMPI_INC   - any -I arguments required for compiling MPI/Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# FMPI_LIB   - any -L and -l arguments required for linking MPI/Fortran 
+# 
+# compilations are done with $(MPIF77) $(FMPI_INC) $(FFLAGS) or
+#                            $(MPIF77) $(FFLAGS)
+# linking is done with       $(FLINK) $(FMPI_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the fortran compiler used for MPI programs
+#---------------------------------------------------------------------------
+MPIF77 = ifort
+# This links MPI fortran programs; usually the same as ${MPIF77}
+FLINK	= $(MPIF77)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker to help link with MPI correctly
+#---------------------------------------------------------------------------
+FMPI_LIB  = -lmpi
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler to help find 'mpif.h'
+#---------------------------------------------------------------------------
+FMPI_INC = 
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -O3 -ip
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = -O3 -ip
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS, which is in C, the following must be defined:
+#
+# MPICC      - C compiler 
+# CFLAGS     - C compilation arguments
+# CMPI_INC   - any -I arguments required for compiling MPI/C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# CMPI_LIB   - any -L and -l arguments required for linking MPI/C 
+#
+# compilations are done with $(MPICC) $(CMPI_INC) $(CFLAGS) or
+#                            $(MPICC) $(CFLAGS)
+# linking is done with       $(CLINK) $(CMPI_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for MPI programs
+#---------------------------------------------------------------------------
+MPICC = icc
+# This links MPI C programs; usually the same as ${MPICC}
+CLINK	= $(MPICC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker to help link with MPI correctly
+#---------------------------------------------------------------------------
+CMPI_LIB  = -lmpi
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler to help find 'mpi.h'
+#---------------------------------------------------------------------------
+CMPI_INC = 
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+#---------------------------------------------------------------------------
+CFLAGS	= -O3 -ip
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = -O3 -ip
+
+
+#---------------------------------------------------------------------------
+# MPI dummy library:
+#
+# Uncomment if you want to use the MPI dummy library supplied by NAS instead 
+# of the true message-passing library. The include file redefines several of
+# the above macros. It also invokes make in subdirectory MPI_dummy. Make 
+# sure that no spaces or tabs precede include.
+#---------------------------------------------------------------------------
+# include ../config/make.dummy
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+CC	= icc -g
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# Some machines (e.g. Crays) have 128-bit DOUBLE PRECISION numbers, which
+# is twice the precision required for the NPB suite. A compiler flag 
+# (e.g. -dp) can usually be used to change DOUBLE PRECISION variables to
+# 64 bits, but the MPI library may continue to send 128 bits. Short of
+# recompiling MPI, the solution is to use MPI_REAL to send these 64-bit
+# numbers, and MPI_COMPLEX to send their complex counterparts. Uncomment
+# the following line to enable this substitution. 
+# 
+# NOTE: IF THE I/O BENCHMARK IS BEING BUILT, WE USE CONVERTFLAG TO
+#       SPECIFY THE FORTRAN RECORD LENGTH UNIT. IT IS A SYSTEM-SPECIFIC
+#       VALUE (USUALLY 1 OR 4). UNCOMMENT THE SECOND LINE AND SUBSTITUTE
+#       THE CORRECT VALUE FOR "length".
+#       IF BOTH 128-BIT DOUBLE PRECISION NUMBERS AND I/O ARE TO BE ENABLED,
+#       UNCOMMENT THE THIRD LINE AND SUBSTITUTE THE CORRECT VALUE FOR
+#       "length"
+#---------------------------------------------------------------------------
+# CONVERTFLAG	= -DCONVERTDOUBLE
+CONVERTFLAG	= -DFORTRAN_REC_SIZE=4
+# CONVERTFLAG	= -DCONVERTDOUBLE -DFORTRAN_REC_SIZE=length
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/make.def.sgi_powerchallenge b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/make.def.sgi_powerchallenge
new file mode 100644
index 0000000..379726d
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/make.def.sgi_powerchallenge
@@ -0,0 +1,16 @@
+# This is for the SGI PowerChallenge Array at NASA Ames. mrf77 and 
+# mrcc are local scripts that invoke the proper MPI library.
+MPIF77 = mrf77
+FLINK  = mrf77
+FFLAGS = -O3 -OPT:fold_arith_limit=1204
+
+MPICC  = mrcc
+CLINK  = mrcc
+CFLAGS = -O3 -OPT:fold_arith_limit=1204
+
+CC     = cc -g
+BINDIR = ../bin
+
+RAND   = randi8
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/make.def.sp2_babbage b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/make.def.sp2_babbage
new file mode 100644
index 0000000..7896d56
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/make.def.sp2_babbage
@@ -0,0 +1,17 @@
+#This is for the IBM SP2 at Ames; mrf77 and mrcc are local scripts
+MPIF77     = mrf77
+FLINK      = mrf77
+FFLAGS     = -O3 
+FLINKFLAGS = -bmaxdata:0x60000000
+
+MPICC      = mrcc
+CLINK      = mrcc
+CFLAGS     = -O3 
+CLINKFLAGS = -bmaxdata:0x60000000
+
+CC         = cc -g
+
+BINDIR     = ../bin
+
+RAND       = randi8
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/make.def.sun_ultra_sparc b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/make.def.sun_ultra_sparc
new file mode 100644
index 0000000..420dfde
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/make.def.sun_ultra_sparc
@@ -0,0 +1,30 @@
+#    This is for a Sun SparcCenter or UltraEnterprise machine 
+MPIF77     = f77
+FLINK      = f77
+FMPI_LIB   = -L<your mpich installation tree>/lib/solaris/ch_lfshmem -lmpi
+FMPI_INC   = -I<your mpich installation tree>/include
+#    sparc10,20 SparcCenter{1,2}000 (uname -m returns sun4m)
+#    and f77 -V returns 4.0 or greater
+# FFLAGS   = -fast -xtarget=super -xO4 -depend
+#    Ultra1,2, UltraEnterprise servers (uname -m returns sun4u)
+FFLAGS     = -fast -xtarget=ultra -xarch=v8plus -xO4 -depend
+FLINKFLAGS = -lmopt -lcopt -lsunmath
+
+MPICC      = cc
+CLINK      = cc
+CMPI_LIB   = -L<your mpich installation tree>/lib/solaris/ch_lfshmem -lmpi
+CMPI_INC   = -I<your mpich installation tree>/include
+#    sparc10,20 SparcCenter{1,2}000 (uname -m returns sun4m)
+#    and cc -V returns 4.0 or greater
+#CFLAGS	   =  -fast -xtarget=super -xO4 -xdepend
+#    Ultra1,2, UltraEnterprise servers (uname -m returns sun4u)
+CFLAGS     =  -fast -xtarget=ultra -xarch=v8plus -xO4 -xdepend
+CLINKFLAGS = -fast
+
+CC         = cc -g
+
+BINDIR     = ../bin
+
+#    Cannot use randi8 or randi8_safe on a 32-bit machine. Use double precision
+RAND       = randdp
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/make.def.t3d_cosmos b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/make.def.t3d_cosmos
new file mode 100644
index 0000000..d3b3bbf
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/make.def.t3d_cosmos
@@ -0,0 +1,25 @@
+#This is for the Cray T3D at the Jet Propulsion Laboratory
+MPIF77     = cf77
+FLINK      = cf77
+FMPI_LIB   = -L/usr/local/mpp/lib -lmpi
+FMPI_INC   = -I/usr/local/mpp/lib/include/mpp
+FFLAGS     = -dp -Wf-onoieeedivide -C cray-t3d 
+#The following flags provide more effective optimization, but may
+#cause the random number generator randi8(_safe) to break in EP
+#FFLAGS    = -dp -Wf-oaggress -Wf-onoieeedivide -C cray-t3d 
+FLINKFLAGS = -Wl-Drdahead=on -C cray-t3d
+
+MPICC      = cc
+CLINK	   = cc
+CMPI_LIB   = -L/usr/local/mpp/lib -lmpi
+CMPI_INC   = -I/usr/local/mpp/lib/include/mpp
+CFLAGS	   = -O3 -Tcray-t3d
+CLINKFLAGS = -Tcray-t3d
+
+CC	   = cc -g -Tcray-ymp
+BINDIR	   = ../bin
+
+CONVERTFLAG= -DCONVERTDOUBLE
+
+RAND       = randi8
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/make.def_sun_mpich b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/make.def_sun_mpich
new file mode 100644
index 0000000..99b0b69
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/make.def_sun_mpich
@@ -0,0 +1,165 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+# (Note these definitions are inconsistent with NPB2.1.)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP and BT, which are in Fortran, the following must 
+# be defined:
+#
+# MPIF77     - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# FMPI_INC   - any -I arguments required for compiling MPI/Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# FMPI_LIB   - any -L and -l arguments required for linking MPI/Fortran 
+# 
+# compilations are done with $(MPIF77) $(FMPI_INC) $(FFLAGS) or
+#                            $(MPIF77) $(FFLAGS)
+# linking is done with       $(FLINK) $(FMPI_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the fortran compiler used for MPI programs
+#---------------------------------------------------------------------------
+MPIF77 = mpif77
+# This links MPI fortran programs; usually the same as ${MPIF77}
+FLINK	= $(MPIF77)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker to help link with MPI correctly
+#---------------------------------------------------------------------------
+FMPI_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler to help find 'mpif.h'
+#---------------------------------------------------------------------------
+FMPI_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -fast
+# FFLAGS = -g
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = -fast
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS, which is in C, the following must be defined:
+#
+# MPICC      - C compiler 
+# CFLAGS     - C compilation arguments
+# CMPI_INC   - any -I arguments required for compiling MPI/C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# CMPI_LIB   - any -L and -l arguments required for linking MPI/C 
+#
+# compilations are done with $(MPICC) $(CMPI_INC) $(CFLAGS) or
+#                            $(MPICC) $(CFLAGS)
+# linking is done with       $(CLINK) $(CMPI_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for MPI programs
+#---------------------------------------------------------------------------
+MPICC = mpicc
+# This links MPI C programs; usually the same as ${MPICC}
+CLINK	= $(MPICC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker to help link with MPI correctly
+#---------------------------------------------------------------------------
+CMPI_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler to help find 'mpi.h'
+#---------------------------------------------------------------------------
+CMPI_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+#---------------------------------------------------------------------------
+CFLAGS	= -fast
+# CFLAGS = -g
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = -fast
+
+
+#---------------------------------------------------------------------------
+# MPI dummy library:
+#
+# Uncomment if you want to use the MPI dummy library supplied by NAS instead 
+# of the true message-passing library. The include file redefines several of
+# the above macros. It also invokes make in subdirectory MPI_dummy. Make 
+# sure that no spaces or tabs precede include.
+#---------------------------------------------------------------------------
+# include ../config/make.dummy
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+CC	= cc -g
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# Some machines (e.g. Crays) have 128-bit DOUBLE PRECISION numbers, which
+# is twice the precision required for the NPB suite. A compiler flag 
+# (e.g. -dp) can usually be used to change DOUBLE PRECISION variables to
+# 64 bits, but the MPI library may continue to send 128 bits. Short of
+# recompiling MPI, the solution is to use MPI_REAL to send these 64-bit
+# numbers, and MPI_COMPLEX to send their complex counterparts. Uncomment
+# the following line to enable this substitution. 
+# 
+# NOTE: IF THE I/O BENCHMARK IS BEING BUILT, WE USE CONVERTFLAG TO
+#       SPECIFY THE FORTRAN RECORD LENGTH UNIT. IT IS A SYSTEM-SPECIFIC
+#       VALUE (USUALLY 1 OR 4). UNCOMMENT THE SECOND LINE AND SUBSTITUTE
+#       THE CORRECT VALUE FOR "length".
+#       IF BOTH 128-BIT DOUBLE PRECISION NUMBERS AND I/O ARE TO BE ENABLED,
+#       UNCOMMENT THE THIRD LINE AND SUBSTITUTE THE CORRECT VALUE FOR
+#       "length"
+#---------------------------------------------------------------------------
+# CONVERTFLAG	= -DCONVERTDOUBLE
+CONVERTFLAG	= -DFORTRAN_REC_SIZE=1
+# CONVERTFLAG	= -DCONVERTDOUBLE -DFORTRAN_REC_SIZE=length
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in Doc/README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/suite.def.bt b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/suite.def.bt
new file mode 100644
index 0000000..f330636
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/suite.def.bt
@@ -0,0 +1,37 @@
+bt	S	1
+bt	S	4
+bt	S	9
+bt	S	16
+bt	A	1
+bt	A	4
+bt	A	9
+bt	A	16
+bt	A	25
+bt	A	36
+bt	A	49
+bt	A	64
+bt	A	81
+bt	A	100
+bt	A	121
+bt	B	1
+bt	B	4
+bt	B	9
+bt	B	16
+bt	B	25
+bt	B	36
+bt	B	49
+bt	B	64
+bt	B	81
+bt	B	100
+bt	B	121
+bt	C	1
+bt	C	4
+bt	C	9
+bt	C	16
+bt	C	25
+bt	C	36
+bt	C	49
+bt	C	64
+bt	C	81
+bt	C	100
+bt	C	121
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/suite.def.cg b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/suite.def.cg
new file mode 100644
index 0000000..393bc50
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/suite.def.cg
@@ -0,0 +1,29 @@
+cg	S	1
+cg	S	2
+cg	S	4
+cg	S	8
+cg	S	16
+cg	A	1
+cg	A	2
+cg	A	4
+cg	A	8
+cg	A	16
+cg	A	32
+cg	A	64
+cg	A	128
+cg	B	1
+cg	B	2
+cg	B	4
+cg	B	8
+cg	B	16
+cg	B	32
+cg	B	64
+cg	B	128
+cg	C	1
+cg	C	2
+cg	C	4
+cg	C	8
+cg	C	16
+cg	C	32
+cg	C	64
+cg	C	128
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/suite.def.ep b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/suite.def.ep
new file mode 100644
index 0000000..e2ca3cd
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/suite.def.ep
@@ -0,0 +1,29 @@
+ep	S	1
+ep	S	2
+ep	S	4
+ep	S	8
+ep	S	16
+ep	A	1
+ep	A	2
+ep	A	4
+ep	A	8
+ep	A	16
+ep	A	32
+ep	A	64
+ep	A	128
+ep	B	1
+ep	B	2
+ep	B	4
+ep	B	8
+ep	B	16
+ep	B	32
+ep	B	64
+ep	B	128
+ep	C	1
+ep	C	2
+ep	C	4
+ep	C	8
+ep	C	16
+ep	C	32
+ep	C	64
+ep	C	128
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/suite.def.ft b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/suite.def.ft
new file mode 100644
index 0000000..6f05189
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/suite.def.ft
@@ -0,0 +1,29 @@
+ft	S	1
+ft	S	2
+ft	S	4
+ft	S	8
+ft	S	16
+ft	A	1
+ft	A	2
+ft	A	4
+ft	A	8
+ft	A	16
+ft	A	32
+ft	A	64
+ft	A	128
+ft	B	1
+ft	B	2
+ft	B	4
+ft	B	8
+ft	B	16
+ft	B	32
+ft	B	64
+ft	B	128
+ft	C	1
+ft	C	2
+ft	C	4
+ft	C	8
+ft	C	16
+ft	C	32
+ft	C	64
+ft	C	128
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/suite.def.is b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/suite.def.is
new file mode 100644
index 0000000..97e898d
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/suite.def.is
@@ -0,0 +1,29 @@
+is	S	1
+is	S	2
+is	S	4
+is	S	8
+is	S	16
+is	A	1
+is	A	2
+is	A	4
+is	A	8
+is	A	16
+is	A	32
+is	A	64
+is	A	128
+is	B	1
+is	B	2
+is	B	4
+is	B	8
+is	B	16
+is	B	32
+is	B	64
+is	B	128
+is	C	1
+is	C	2
+is	C	4
+is	C	8
+is	C	16
+is	C	32
+is	C	64
+is	C	128
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/suite.def.lu b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/suite.def.lu
new file mode 100644
index 0000000..442e0b6
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/suite.def.lu
@@ -0,0 +1,29 @@
+lu	S	1
+lu	S	2
+lu	S	4
+lu	S	8
+lu	S	16
+lu	A	1
+lu	A	2
+lu	A	4
+lu	A	8
+lu	A	16
+lu	A	32
+lu	A	64
+lu	A	128
+lu	B	1
+lu	B	2
+lu	B	4
+lu	B	8
+lu	B	16
+lu	B	32
+lu	B	64
+lu	B	128
+lu	C	1
+lu	C	2
+lu	C	4
+lu	C	8
+lu	C	16
+lu	C	32
+lu	C	64
+lu	C	128
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/suite.def.mg b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/suite.def.mg
new file mode 100644
index 0000000..b5c01d4
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/suite.def.mg
@@ -0,0 +1,29 @@
+mg	S	1
+mg	S	2
+mg	S	4
+mg	S	8
+mg	S	16
+mg	A	1
+mg	A	2
+mg	A	4
+mg	A	8
+mg	A	16
+mg	A	32
+mg	A	64
+mg	A	128
+mg	B	1
+mg	B	2
+mg	B	4
+mg	B	8
+mg	B	16
+mg	B	32
+mg	B	64
+mg	B	128
+mg	C	1
+mg	C	2
+mg	C	4
+mg	C	8
+mg	C	16
+mg	C	32
+mg	C	64
+mg	C	128
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/suite.def.small b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/suite.def.small
new file mode 100644
index 0000000..5a09404
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/suite.def.small
@@ -0,0 +1,8 @@
+bt	S	1
+cg	S	1
+ep	S	1
+ft	S	1
+is	S	1
+lu	S	1
+mg	S	1
+sp	S	1
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/suite.def.sp b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/suite.def.sp
new file mode 100644
index 0000000..f8113a2
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/NAS.samples/suite.def.sp
@@ -0,0 +1,37 @@
+sp	S	1
+sp	S	4
+sp	S	9
+sp	S	16
+sp	A	1
+sp	A	4
+sp	A	9
+sp	A	16
+sp	A	25
+sp	A	36
+sp	A	49
+sp	A	64
+sp	A	81
+sp	A	100
+sp	A	121
+sp	B	1
+sp	B	4
+sp	B	9
+sp	B	16
+sp	B	25
+sp	B	36
+sp	B	49
+sp	B	64
+sp	B	81
+sp	B	100
+sp	B	121
+sp	C	1
+sp	C	4
+sp	C	9
+sp	C	16
+sp	C	25
+sp	C	36
+sp	C	49
+sp	C	64
+sp	C	81
+sp	C	100
+sp	C	121
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/make.def.template b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/make.def.template
new file mode 100644
index 0000000..8cccc29
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/make.def.template
@@ -0,0 +1,162 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP and BT, which are in Fortran, the following must 
+# be defined:
+#
+# MPIF77     - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# FMPI_INC   - any -I arguments required for compiling MPI/Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# FMPI_LIB   - any -L and -l arguments required for linking MPI/Fortran 
+# 
+# compilations are done with $(MPIF77) $(FMPI_INC) $(FFLAGS) or
+#                            $(MPIF77) $(FFLAGS)
+# linking is done with       $(FLINK) $(FMPI_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the fortran compiler used for MPI programs
+#---------------------------------------------------------------------------
+MPIF77 = f77
+# This links MPI fortran programs; usually the same as ${MPIF77}
+FLINK	= $(MPIF77)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker to help link with MPI correctly
+#---------------------------------------------------------------------------
+FMPI_LIB  = -L/usr/local/lib -lmpi
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler to help find 'mpif.h'
+#---------------------------------------------------------------------------
+FMPI_INC = -I/usr/local/include
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -O
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = -O
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS, which is in C, the following must be defined:
+#
+# MPICC      - C compiler 
+# CFLAGS     - C compilation arguments
+# CMPI_INC   - any -I arguments required for compiling MPI/C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# CMPI_LIB   - any -L and -l arguments required for linking MPI/C 
+#
+# compilations are done with $(MPICC) $(CMPI_INC) $(CFLAGS) or
+#                            $(MPICC) $(CFLAGS)
+# linking is done with       $(CLINK) $(CMPI_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for MPI programs
+#---------------------------------------------------------------------------
+MPICC = cc
+# This links MPI C programs; usually the same as ${MPICC}
+CLINK	= $(MPICC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker to help link with MPI correctly
+#---------------------------------------------------------------------------
+CMPI_LIB  = -L/usr/local/lib -lmpi
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler to help find 'mpi.h'
+#---------------------------------------------------------------------------
+CMPI_INC = -I/usr/local/include
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+#---------------------------------------------------------------------------
+CFLAGS	= -O
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = -O
+
+
+#---------------------------------------------------------------------------
+# MPI dummy library:
+#
+# Uncomment if you want to use the MPI dummy library supplied by NAS instead 
+# of the true message-passing library. The include file redefines several of
+# the above macros. It also invokes make in subdirectory MPI_dummy. Make 
+# sure that no spaces or tabs precede include.
+#---------------------------------------------------------------------------
+# include ../config/make.dummy
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+CC	= cc -g
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# Some machines (e.g. Crays) have 128-bit DOUBLE PRECISION numbers, which
+# is twice the precision required for the NPB suite. A compiler flag 
+# (e.g. -dp) can usually be used to change DOUBLE PRECISION variables to
+# 64 bits, but the MPI library may continue to send 128 bits. Short of
+# recompiling MPI, the solution is to use MPI_REAL to send these 64-bit
+# numbers, and MPI_COMPLEX to send their complex counterparts. Uncomment
+# the following line to enable this substitution. 
+# 
+# NOTE: IF THE I/O BENCHMARK IS BEING BUILT, WE USE CONVERTFLAG TO
+#       SPECIFY THE FORTRAN RECORD LENGTH UNIT. IT IS A SYSTEM-SPECIFIC
+#       VALUE (USUALLY 1 OR 4). UNCOMMENT THE SECOND LINE AND SUBSTITUTE
+#       THE CORRECT VALUE FOR "length".
+#       IF BOTH 128-BIT DOUBLE PRECISION NUMBERS AND I/O ARE TO BE ENABLED,
+#       UNCOMMENT THE THIRD LINE AND SUBSTITUTE THE CORRECT VALUE FOR
+#       "length"
+#---------------------------------------------------------------------------
+# CONVERTFLAG	= -DCONVERTDOUBLE
+# CONVERTFLAG	= -DFORTRAN_REC_SIZE=length
+# CONVERTFLAG	= -DCONVERTDOUBLE -DFORTRAN_REC_SIZE=length
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/make.dummy b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/make.dummy
new file mode 100644
index 0000000..16b2350
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/make.dummy
@@ -0,0 +1,7 @@
+FMPI_LIB  = -L../MPI_dummy -lmpi
+FMPI_INC  = -I../MPI_dummy
+CMPI_LIB  = -L../MPI_dummy -lmpi
+CMPI_INC  = -I../MPI_dummy
+default:: ${PROGRAM} libmpi.a
+libmpi.a: 
+	cd ../MPI_dummy; $(MAKE) F77=$(MPIF77) CC=$(MPICC)
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/suite.def.template b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/suite.def.template
new file mode 100644
index 0000000..aea8b23
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/config/suite.def.template
@@ -0,0 +1,24 @@
+# config/suite.def
+# This file is used to build several benchmarks with a single command. 
+# Typing "make suite" in the main directory will build all the benchmarks
+# specified in this file. 
+# Each line of this file contains a benchmark name, class, and number
+# of nodes. The name is one of "cg", "is", "ep", "mg", "ft", "sp", "bt",
+# "lu", and "dt". 
+# The class is one of "S", "W", "A", "B", "C", "D", and "E"
+# (except that no classes C, D and E for DT, and no class E for IS).
+# The number of nodes must be a legal number for a particular
+# benchmark. The utility which parses this file is primitive, so
+# formatting is inflexible. Separate name/class/number by tabs. 
+# Comments start with "#" as the first character on a line. 
+# No blank lines. 
+# The following example builds 1 processor sample sizes of all benchmarks. 
+ft	S	1
+mg	S	1
+sp	S	1
+lu	S	1
+bt	S	1
+is	S	1
+ep	S	1
+cg	S	1
+dt	S	1
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/sys/Makefile b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/sys/Makefile
new file mode 100644
index 0000000..56d1c44
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/sys/Makefile
@@ -0,0 +1,22 @@
+include ../config/make.def
+
+# Note that COMPILE is also defined in make.common and should
+# be the same. We can't include make.common because it has a lot
+# of other garbage. LINK is not defined in make.common because
+# ${MPI_LIB} needs to go at the end of the line. 
+FCOMPILE = $(MPIF77) -c $(FMPI_INC) $(FFLAGS)
+
+all: setparams 
+
+# setparams creates an npbparams.h file for each benchmark
+# configuration. npbparams.h also contains info about how a benchmark
+# was compiled and linked
+
+setparams: setparams.c ../config/make.def
+	$(CC) ${CONVERTFLAG} -o setparams setparams.c
+
+
+clean: 
+	-rm -f setparams setparams.h npbparams.h
+	-rm -f *~ *.o
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/sys/README b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/sys/README
new file mode 100644
index 0000000..3c97c52
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/sys/README
@@ -0,0 +1,39 @@
+This directory contains utilities and files used by the 
+build process. You should not need to change anything
+in this directory. 
+
+Original Files
+--------------
+setparams.c:
+        Source for the setparams program. This program is used internally
+        in the build process to create the file "npbparams.h" for each 
+        benchmark. npbparams.h contains Fortran or C parameters to build a 
+        benchmark for a specific class and number of nodes. The setparams 
+        program is never run directly by a user. Its invocation syntax is 
+        "setparams benchmark-name nprocs class". 
+        It examines the file "npbparams.h" in the current directory. If 
+        the specified parameters are the same as those in the npbparams.h 
+        file, nothing is changed. If the file does not exist or corresponds
+        to a different class/number of nodes, it is (re)built. 
+	One of the more complicated things in npbparams.h is that it 
+        contains, in a Fortran string, the compiler flags used to build a 
+        benchmark, so that a benchmark can print out how it was compiled. 
+
+make.common
+        A makefile segment that is included in each individual benchmark
+        program makefile. It sets up some standard macros (COMPILE, etc) 
+        and makes sure everything is configured correctly (npbparams.h)
+
+Makefile
+        Builds  setparams
+
+README
+        This file. 
+
+
+Created files
+-------------
+
+setparams
+	See descriptions above
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/sys/make.common b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/sys/make.common
new file mode 100644
index 0000000..4469596
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/sys/make.common
@@ -0,0 +1,54 @@
+PROGRAM  = $(BINDIR)/$(BENCHMARK).$(CLASS).$(NPROCS)
+FCOMPILE = $(MPIF77) -c $(FMPI_INC) $(FFLAGS)
+CCOMPILE = $(MPICC)  -c $(CMPI_INC) $(CFLAGS)
+
+# Class "U" is used internally by the setparams program to mean
+# "unknown". This means that if you don't specify CLASS=
+# on the command line, you'll get an error. It would be nice
+# to be able to avoid this, but we'd have to get information
+# from the setparams back to the make program, which isn't easy. 
+CLASS=U
+NPROCS=1
+
+default:: ${PROGRAM}
+
+# This makes sure the configuration utility setparams 
+# is up to date. 
+# Note that this must be run every time, which is why the
+# target does not exist and is not created. 
+# If you create a file called "config" you will break things. 
+config:
+	@cd ../sys; ${MAKE} all
+	../sys/setparams ${BENCHMARK} ${NPROCS} ${CLASS} ${SUBTYPE}
+
+COMMON=../common
+${COMMON}/${RAND}.o: ${COMMON}/${RAND}.f
+	cd ${COMMON}; ${FCOMPILE} ${RAND}.f
+${COMMON}/c_randdp.o: ${COMMON}/randdp.c
+	cd ${COMMON}; ${CCOMPILE} -o c_randdp.o randdp.c
+
+${COMMON}/print_results.o: ${COMMON}/print_results.f
+	cd ${COMMON}; ${FCOMPILE} print_results.f
+
+${COMMON}/c_print_results.o: ${COMMON}/c_print_results.c
+	cd ${COMMON}; ${CCOMPILE} c_print_results.c
+
+${COMMON}/timers.o: ${COMMON}/timers.f
+	cd ${COMMON}; ${FCOMPILE} timers.f
+
+${COMMON}/c_timers.o: ${COMMON}/c_timers.c
+	cd ${COMMON}; ${CCOMPILE} c_timers.c
+
+# Normally setparams updates npbparams.h only if the settings (CLASS/NPROCS)
+# have changed. However, we also want to update if the compile options
+# may have changed (set in ../config/make.def). 
+npbparams.h: ../config/make.def
+	@ echo make.def modified. Rebuilding npbparams.h just in case
+	rm -f npbparams.h
+	../sys/setparams ${BENCHMARK} ${NPROCS} ${CLASS} ${SUBTYPE}
+
+# So that "make benchmark-name" works
+${BENCHMARK}:  default
+${BENCHMARKU}: default
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/sys/print_header b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/sys/print_header
new file mode 100755
index 0000000..4fdb578
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/sys/print_header
@@ -0,0 +1,5 @@
+echo '   ========================================='
+echo '   =      NAS Parallel Benchmarks 3.3      ='
+echo '   =      MPI/F77/C                        ='
+echo '   ========================================='
+echo ''
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/sys/print_instructions b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/sys/print_instructions
new file mode 100755
index 0000000..d2f1999
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/sys/print_instructions
@@ -0,0 +1,26 @@
+echo ''
+echo '   To make a NAS benchmark type '
+echo ''
+echo '         make <benchmark-name> NPROCS=<number> CLASS=<class> [SUBTYPE=<type>]'
+echo ''
+echo '   where <benchmark-name>  is "bt", "cg", "ep", "ft", "is", "lu",'
+echo '                              "mg", or "sp"'
+echo '         <number>          is the number of processors'
+echo '         <class>           is "S", "W", "A", "B", "C", or "D"'
+echo ''
+echo '   Only when making the I/O benchmark:'
+echo ''
+echo '         <benchmark-name>  is "bt"'
+echo '         <number>, <class> as above'
+echo '         <type>            is "full", "simple", "fortran", or "epio"'
+echo ''
+echo '   To make a set of benchmarks, create the file config/suite.def'
+echo '   according to the instructions in config/suite.def.template and type'
+echo ''
+echo '         make suite'
+echo ''
+echo ' ***************************************************************'
+echo ' * Remember to edit the file config/make.def for site specific *'
+echo ' * information as described in the README file                 *'
+echo ' ***************************************************************'
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/sys/setparams.c b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/sys/setparams.c
new file mode 100644
index 0000000..63d2442
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/sys/setparams.c
@@ -0,0 +1,1224 @@
+/* 
+ * This utility configures a NPB to be built for a specific number
+ * of nodes and a specific class. It creates a file "npbparams.h" 
+ * in the source directory. This file keeps state information about 
+ * which size of benchmark is currently being built (so that nothing
+ * is unnecessarily rebuilt) and defines (through PARAMETER statements)
+ * the number of nodes and class for which a benchmark is being built. 
+
+ * The utility takes 3 arguments: 
+ *       setparams benchmark-name nprocs class
+ *    benchmark-name is "sp", "bt", etc
+ *    nprocs is the number of processors to run on
+ *    class is the size of the benchmark
+ * These parameters are checked for the current benchmark. If they
+ * are invalid, this program prints a message and aborts. 
+ * If the parameters are ok, the current npbparams.h (actually just
+ * the first line) is read in. If the new parameters are the same as 
+ * the old, nothing is done, but an exit code is returned to force the
+ * user to specify (otherwise the make procedure succeeds but builds a
+ * binary of the wrong name).  Otherwise the file is rewritten. 
+ * Errors write a message (to stdout) and abort. 
+ * 
+ * This program makes use of two extra benchmark "classes"
+ * class "X" means an invalid specification. It is returned if
+ * there is an error parsing the config file. 
+ * class "U" is an external specification meaning "unknown class"
+ * 
+ * Unfortunately everything has to be case sensitive. This is
+ * because we can always convert lower to upper or v.v. but
+ * can't feed this information back to the makefile, so typing
+ * make CLASS=a and make CLASS=A will produce different binaries.
+ *
+ * 
+ */
+
+#include <sys/types.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <ctype.h>
+#include <string.h>
+#include <time.h>
+
+/*
+ * This is the master version number for this set of 
+ * NPB benchmarks. It is in an obscure place so people
+ * won't accidentally change it. 
+ */
+
+#define VERSION "3.3.1"
+
+/* controls verbose output from setparams */
+/* #define VERBOSE */
+
+#define FILENAME "npbparams.h"
+#define DESC_LINE "c NPROCS = %d CLASS = %c\n"
+#define BT_DESC_LINE "c NPROCS = %d CLASS = %c SUBTYPE = %s\n"
+#define DEF_CLASS_LINE     "#define CLASS '%c'\n"
+#define DEF_NUM_PROCS_LINE "#define NUM_PROCS %d\n"
+#define FINDENT  "        "
+#define CONTINUE "     > "
+
+#ifdef FORTRAN_REC_SIZE
+int fortran_rec_size = FORTRAN_REC_SIZE;
+#else
+int fortran_rec_size = 4;
+#endif
+
+void get_info(int argc, char *argv[], int *typep, int *nprocsp, char *classp,
+	      int* subtypep);
+void check_info(int type, int nprocs, char class);
+void read_info(int type, int *nprocsp, char *classp, int *subtypep);
+void write_info(int type, int nprocs, char class, int subtype);
+void write_sp_info(FILE *fp, int nprocs, char class);
+void write_bt_info(FILE *fp, int nprocs, char class, int io);
+void write_lu_info(FILE *fp, int nprocs, char class);
+void write_mg_info(FILE *fp, int nprocs, char class);
+void write_cg_info(FILE *fp, int nprocs, char class);
+void write_ft_info(FILE *fp, int nprocs, char class);
+void write_ep_info(FILE *fp, int nprocs, char class);
+void write_is_info(FILE *fp, int nprocs, char class);
+void write_dt_info(FILE *fp, int nprocs, char class);
+void write_compiler_info(int type, FILE *fp);
+void write_convertdouble_info(int type, FILE *fp);
+void check_line(char *line, char *label, char *val);
+int  check_include_line(char *line, char *filename);
+void put_string(FILE *fp, char *name, char *val);
+void put_def_string(FILE *fp, char *name, char *val);
+void put_def_variable(FILE *fp, char *name, char *val);
+int isqrt(int i);
+int ilog2(int i);
+int ipow2(int i);
+int isqrt2(int i);
+
+enum benchmark_types {SP, BT, LU, MG, FT, IS, DT, EP, CG};
+enum iotypes { NONE = 0, FULL, SIMPLE, EPIO, FORTRAN};
+
+int main(int argc, char *argv[])
+{
+  int nprocs, nprocs_old, type;
+  char class, class_old;
+  int subtype = -1, old_subtype = -1;
+  
+  /* Get command line arguments. Make sure they're ok. */
+  get_info(argc, argv, &type, &nprocs, &class, &subtype);
+  if (class != 'U') {
+#ifdef VERBOSE
+    printf("setparams: For benchmark %s: number of processors = %d class = %c\n", 
+	   argv[1], nprocs, class); 
+#endif
+    check_info(type, nprocs, class);
+  }
+
+  /* Get old information. */
+  read_info(type, &nprocs_old, &class_old, &old_subtype);
+  if (class != 'U') {
+    if (class_old != 'X') {
+#ifdef VERBOSE
+      printf("setparams:     old settings: number of processors = %d class = %c\n", 
+	     nprocs_old, class_old); 
+#endif
+    }
+  } else {
+    printf("setparams:\n\
+  *********************************************************************\n\
+  * You must specify NPROCS and CLASS to build this benchmark         *\n\
+  * For example, to build a class A benchmark for 4 processors, type  *\n\
+  *       make {benchmark-name} NPROCS=4 CLASS=A                      *\n\
+  *********************************************************************\n\n"); 
+
+    if (class_old != 'X') {
+#ifdef VERBOSE
+      printf("setparams: Previous settings were CLASS=%c NPROCS=%d\n", 
+	     class_old, nprocs_old); 
+#endif
+    }
+    exit(1); /* exit on class==U */
+  }
+
+  /* Write out new information if it's different. */
+  if (nprocs != nprocs_old || class != class_old || subtype != old_subtype) {
+#ifdef VERBOSE
+    printf("setparams: Writing %s\n", FILENAME); 
+#endif
+    write_info(type, nprocs, class, subtype);
+  } else {
+#ifdef VERBOSE
+    printf("setparams: Settings unchanged. %s unmodified\n", FILENAME); 
+#endif
+  }
+
+  return 0;
+}
+
+
+/*
+ *  get_info(): Get parameters from command line 
+ */
+
+void get_info(int argc, char *argv[], int *typep, int *nprocsp, char *classp,
+	      int *subtypep) 
+{
+
+  if (argc < 4) {
+    printf("Usage: %s (%d) benchmark-name nprocs class\n", argv[0], argc);
+    exit(1);
+  }
+
+  *nprocsp = atoi(argv[2]);
+
+  *classp = *argv[3];
+
+  if      (!strcmp(argv[1], "sp") || !strcmp(argv[1], "SP")) *typep = SP;
+  else if (!strcmp(argv[1], "ft") || !strcmp(argv[1], "FT")) *typep = FT;
+  else if (!strcmp(argv[1], "lu") || !strcmp(argv[1], "LU")) *typep = LU;
+  else if (!strcmp(argv[1], "mg") || !strcmp(argv[1], "MG")) *typep = MG;
+  else if (!strcmp(argv[1], "is") || !strcmp(argv[1], "IS")) *typep = IS;
+  else if (!strcmp(argv[1], "dt") || !strcmp(argv[1], "DT")) *typep = DT;
+  else if (!strcmp(argv[1], "ep") || !strcmp(argv[1], "EP")) *typep = EP;
+  else if (!strcmp(argv[1], "cg") || !strcmp(argv[1], "CG")) *typep = CG;
+  else if (!strcmp(argv[1], "bt") || !strcmp(argv[1], "BT")) {
+    *typep = BT;
+    if (argc != 5) {
+      /* printf("Usage: %s (%d) benchmark-name nprocs class\n", argv[0], argc); */
+      /* exit(1); */
+      *subtypep = NONE;
+    } else {
+      if (!strcmp(argv[4], "full") || !strcmp(argv[4], "FULL")) {
+        *subtypep = FULL;
+      } else if (!strcmp(argv[4], "simple") || !strcmp(argv[4], "SIMPLE")) {
+        *subtypep = SIMPLE;
+      } else if (!strcmp(argv[4], "epio") || !strcmp(argv[4], "EPIO")) {
+        *subtypep = EPIO;
+      } else if (!strcmp(argv[4], "fortran") || !strcmp(argv[4], "FORTRAN")) {
+        *subtypep = FORTRAN;
+      } else if (!strcmp(argv[4], "none") || !strcmp(argv[4], "NONE")) {
+        *subtypep = NONE;
+      } else {
+        printf("setparams: Error: unknown btio type %s\n", argv[4]);
+        exit(1);
+      }
+    }
+  } else {
+    printf("setparams: Error: unknown benchmark type %s\n", argv[1]);
+    exit(1);
+  }
+}
+
+/*
+ *  check_info(): Make sure command line data is ok for this benchmark 
+ */
+
+void check_info(int type, int nprocs, char class) 
+{
+  int rootprocs, logprocs; 
+
+  /* check number of processors */
+  if (nprocs <= 0) {
+    printf("setparams: Number of processors must be greater than zero\n");
+    exit(1);
+  }
+  switch(type) {
+
+  case SP:
+  case BT:
+    rootprocs = isqrt(nprocs);
+    if (rootprocs < 0) {
+      printf("setparams: Number of processors %d must be a square (1,4,9,...) for this benchmark\n",
+              nprocs);
+      exit(1);
+    }
+    if (class == 'S' && nprocs > 16) {
+      printf("setparams: BT and SP sample sizes cannot be run on more\n");
+      printf("           than 16 processors because the cell size would be too small.\n");
+      exit(1);
+    }
+    break;
+
+  case LU:
+    rootprocs = isqrt2(nprocs);
+    if (rootprocs < 0) {
+      printf("setparams: Failed to determine proc_grid for nprocs=%d\n", 
+              nprocs);
+      exit(1);
+    }
+    break;
+
+  case CG:
+  case FT:
+  case MG:
+  case IS:
+    logprocs = ilog2(nprocs);
+    if (logprocs < 0) {
+      printf("setparams: Number of processors must be a power of two (1,2,4,...) for this benchmark\n");
+      exit(1);
+    }
+
+    break;
+
+  case EP:
+  case DT:
+    break;
+
+  default:
+    /* never should have gotten this far with a bad name */
+    printf("setparams: (Internal Error) Benchmark type %d unknown to this program\n", type); 
+    exit(1);
+  }
+
+  /* check class */
+  if (class != 'S' && 
+      class != 'W' && 
+      class != 'A' && 
+      class != 'B' && 
+      class != 'C' && 
+      class != 'D' && 
+      class != 'E') {
+    printf("setparams: Unknown benchmark class %c\n", class); 
+    printf("setparams: Allowed classes are \"S\", \"W\", and \"A\" through \"E\"\n");
+    exit(1);
+  }
+
+  if (class == 'E' && (type == IS || type == DT)) {
+    printf("setparams: Benchmark class %c not defined for IS or DT\n", class);
+    exit(1);
+  }
+
+  if (class == 'D' && type == IS && nprocs < 4) {
+    printf("setparams: IS class D size cannot be run on less than 4 processors\n");
+    exit(1);
+  }
+}
+
+
+/* 
+ * read_info(): Read previous information from file. 
+ *              Not an error if file doesn't exist, because this
+ *              may be the first time we're running. 
+ *              Assumes the first line of the file is in a special
+ *              format that we understand (since we wrote it). 
+ */
+
+void read_info(int type, int *nprocsp, char *classp, int *subtypep)
+{
+  int nread = 0;
+  FILE *fp;
+  fp = fopen(FILENAME, "r");
+  if (fp == NULL) {
+#ifdef VERBOSE
+    printf("setparams: INFO: configuration file %s does not exist (yet)\n", FILENAME); 
+#endif
+    goto abort;
+  }
+  
+  /* first line of file contains info (fortran), first two lines (C) */
+
+  switch(type) {
+      case BT: {
+	  char subtype_str[100];
+          nread = fscanf(fp, BT_DESC_LINE, nprocsp, classp, subtype_str);
+          if (nread != 3) {
+            if (nread != 2) {
+              printf("setparams: Error parsing config file %s. Ignoring previous settings\n", FILENAME);
+              goto abort;
+	    }
+	    *subtypep = 0;
+	    break;
+          }
+          if (!strcmp(subtype_str, "full") || !strcmp(subtype_str, "FULL")) {
+		*subtypep = FULL;
+          } else if (!strcmp(subtype_str, "simple") ||
+		     !strcmp(subtype_str, "SIMPLE")) {
+		*subtypep = SIMPLE;
+          } else if (!strcmp(subtype_str, "epio") || !strcmp(subtype_str, "EPIO")) {
+		*subtypep = EPIO;
+          } else if (!strcmp(subtype_str, "fortran") ||
+		     !strcmp(subtype_str, "FORTRAN")) {
+		*subtypep = FORTRAN;
+          } else {
+		*subtypep = -1;
+	  }
+          break;
+      }
+
+      case SP:
+      case FT:
+      case MG:
+      case LU:
+      case EP:
+      case CG:
+          nread = fscanf(fp, DESC_LINE, nprocsp, classp);
+          if (nread != 2) {
+            printf("setparams: Error parsing config file %s. Ignoring previous settings\n", FILENAME);
+            goto abort;
+          }
+          break;
+      case IS:
+      case DT:
+          nread = fscanf(fp, DEF_CLASS_LINE, classp);
+          nread += fscanf(fp, DEF_NUM_PROCS_LINE, nprocsp);
+          if (nread != 2) {
+            printf("setparams: Error parsing config file %s. Ignoring previous settings\n", FILENAME);
+            goto abort;
+          }
+          break;
+      default:
+        /* never should have gotten this far with a bad name */
+        printf("setparams: (Internal Error) Benchmark type %d unknown to this program\n", type); 
+        exit(1);
+  }
+
+  fclose(fp);
+
+
+  return;
+
+ abort:
+  *nprocsp = -1;
+  *classp = 'X';
+  *subtypep = -1;
+  return;
+}
+
+
+/* 
+ * write_info(): Write new information to config file. 
+ *               First line is in a special format so we can read
+ *               it in again. Then comes a warning. The rest is all
+ *               specific to a particular benchmark. 
+ */
+
+void write_info(int type, int nprocs, char class, int subtype) 
+{
+  FILE *fp;
+  char *BT_TYPES[] = {"NONE", "FULL", "SIMPLE", "EPIO", "FORTRAN"};
+
+  fp = fopen(FILENAME, "w");
+  if (fp == NULL) {
+    printf("setparams: Can't open file %s for writing\n", FILENAME);
+    exit(1);
+  }
+
+  switch(type) {
+      case BT:
+          /* Write out the header */
+	  if (subtype == -1 || subtype == 0) {
+            fprintf(fp, DESC_LINE, nprocs, class);
+	  } else {
+            fprintf(fp, BT_DESC_LINE, nprocs, class, BT_TYPES[subtype]);
+	  }
+          /* Print out a warning so bozos don't mess with the file */
+          fprintf(fp, "\
+c  \n\
+c  \n\
+c  This file is generated automatically by the setparams utility.\n\
+c  It sets the number of processors and the class of the NPB\n\
+c  in this directory. Do not modify it by hand.\n\
+c  \n");
+
+          break;
+	
+      case SP:
+      case FT:
+      case MG:
+      case LU:
+      case EP:
+      case CG:
+          /* Write out the header */
+          fprintf(fp, DESC_LINE, nprocs, class);
+          /* Print out a warning so bozos don't mess with the file */
+          fprintf(fp, "\
+c  \n\
+c  \n\
+c  This file is generated automatically by the setparams utility.\n\
+c  It sets the number of processors and the class of the NPB\n\
+c  in this directory. Do not modify it by hand.\n\
+c  \n");
+
+          break;
+      case IS:
+      case DT:
+          fprintf(fp, DEF_CLASS_LINE, class);
+          fprintf(fp, DEF_NUM_PROCS_LINE, nprocs);
+          fprintf(fp, "\
+/*\n\
+   This file is generated automatically by the setparams utility.\n\
+   It sets the number of processors and the class of the NPB\n\
+   in this directory. Do not modify it by hand.   */\n\
+   \n");
+          break;
+      default:
+          printf("setparams: (Internal error): Unknown benchmark type %d\n", 
+                                                                         type);
+          exit(1);
+  }
+
+  /* Now do benchmark-specific stuff */
+  switch(type) {
+  case SP:
+    write_sp_info(fp, nprocs, class);
+    break;
+  case LU:
+    write_lu_info(fp, nprocs, class);
+    break;
+  case MG:
+    write_mg_info(fp, nprocs, class);
+    break;
+  case IS:
+    write_is_info(fp, nprocs, class);  
+    break;
+  case DT:
+    write_dt_info(fp, nprocs, class);  
+    break;
+  case FT:
+    write_ft_info(fp, nprocs, class);
+    break;
+  case EP:
+    write_ep_info(fp, nprocs, class);
+    break;
+  case CG:
+    write_cg_info(fp, nprocs, class);
+    break;
+  case BT:
+    write_bt_info(fp, nprocs, class, subtype);
+    break;
+  default:
+    printf("setparams: (Internal error): Unknown benchmark type %d\n", type);
+    exit(1);
+  }
+  write_convertdouble_info(type, fp);
+  write_compiler_info(type, fp);
+  fclose(fp);
+  return;
+}
+
+
+/* 
+ * write_sp_info(): Write SP specific info to config file
+ */
+
+void write_sp_info(FILE *fp, int nprocs, char class) 
+{
+  int maxcells, problem_size, niter;
+  char *dt;
+  maxcells = isqrt(nprocs);
+  if      (class == 'S') { problem_size = 12;  dt = "0.015d0";   niter = 100; }
+  else if (class == 'W') { problem_size = 36;  dt = "0.0015d0";  niter = 400; }
+  else if (class == 'A') { problem_size = 64;  dt = "0.0015d0";  niter = 400; }
+  else if (class == 'B') { problem_size = 102; dt = "0.001d0";   niter = 400; }
+  else if (class == 'C') { problem_size = 162; dt = "0.00067d0"; niter = 400; }
+  else if (class == 'D') { problem_size = 408; dt = "0.00030d0"; niter = 500; }
+  else if (class == 'E') { problem_size = 1020; dt = "0.0001d0"; niter = 500; }
+  else {
+    printf("setparams: Internal error: invalid class %c\n", class);
+    exit(1);
+  }
+  fprintf(fp, "%sinteger maxcells, problem_size, niter_default\n", FINDENT);
+  fprintf(fp, "%sparameter (maxcells=%d, problem_size=%d, niter_default=%d)\n", 
+	       FINDENT, maxcells, problem_size, niter);
+  fprintf(fp, "%sdouble precision dt_default\n", FINDENT);
+  fprintf(fp, "%sparameter (dt_default = %s)\n", FINDENT, dt);
+}
+  
+/* 
+ * write_bt_info(): Write BT specific info to config file
+ */
+
+void write_bt_info(FILE *fp, int nprocs, char class, int io) 
+{
+  int maxcells, problem_size, niter, wr_interval;
+  char *dt;
+  maxcells = isqrt(nprocs);
+  if      (class == 'S') { problem_size = 12;  dt = "0.010d0";    niter = 60;  }
+  else if (class == 'W') { problem_size = 24;  dt = "0.0008d0";   niter = 200; }
+  else if (class == 'A') { problem_size = 64;  dt = "0.0008d0";   niter = 200; }
+  else if (class == 'B') { problem_size = 102; dt = "0.0003d0";   niter = 200; }
+  else if (class == 'C') { problem_size = 162; dt = "0.0001d0";   niter = 200; }
+  else if (class == 'D') { problem_size = 408; dt = "0.00002d0";  niter = 250; }
+  else if (class == 'E') { problem_size = 1020; dt = "0.4d-5";    niter = 250; }
+  else {
+    printf("setparams: Internal error: invalid class %c\n", class);
+    exit(1);
+  }
+  wr_interval = 5;
+  fprintf(fp, "%sinteger maxcells, problem_size, niter_default\n", FINDENT);
+  fprintf(fp, "%sparameter (maxcells=%d, problem_size=%d, niter_default=%d)\n", 
+	       FINDENT, maxcells, problem_size, niter);
+  fprintf(fp, "%sdouble precision dt_default\n", FINDENT);
+  fprintf(fp, "%sparameter (dt_default = %s)\n", FINDENT, dt);
+  fprintf(fp, "%sinteger wr_default\n", FINDENT);
+  fprintf(fp, "%sparameter (wr_default = %d)\n", FINDENT, wr_interval);
+  fprintf(fp, "%sinteger iotype\n", FINDENT);
+  fprintf(fp, "%sparameter (iotype = %d)\n", FINDENT, io);
+  if (io) {
+    fprintf(fp, "%scharacter*(*) filenm\n", FINDENT);
+    switch (io) {
+	case FULL:
+	    fprintf(fp, "%sparameter (filenm = 'btio.full.out')\n", FINDENT);
+	    break;
+	case SIMPLE:
+	    fprintf(fp, "%sparameter (filenm = 'btio.simple.out')\n", FINDENT);
+	    break;
+	case EPIO:
+	    fprintf(fp, "%sparameter (filenm = 'btio.epio.out')\n", FINDENT);
+	    break;
+	case FORTRAN:
+	    fprintf(fp, "%sparameter (filenm = 'btio.fortran.out')\n", FINDENT);
+	    fprintf(fp, "%sinteger fortran_rec_sz\n", FINDENT);
+	    fprintf(fp, "%sparameter (fortran_rec_sz = %d)\n",
+		    FINDENT, fortran_rec_size);
+	    break;
+	default:
+	    break;
+    }
+  }
+}
+  
+
+
+/* 
+ * write_lu_info(): Write LU specific info to config file
+ */
+
+void write_lu_info(FILE *fp, int nprocs, char class) 
+{
+  int isiz1, isiz2, itmax, inorm, problem_size;
+  int xdiv, ydiv; /* number of cells in x and y direction */
+  char *dt_default;
+
+  if      (class == 'S') { problem_size = 12;  dt_default = "0.5d0";  itmax = 50; }
+  else if (class == 'W') { problem_size = 33;  dt_default = "1.5d-3"; itmax = 300; }
+  else if (class == 'A') { problem_size = 64;  dt_default = "2.0d0";  itmax = 250; }
+  else if (class == 'B') { problem_size = 102; dt_default = "2.0d0";  itmax = 250; }
+  else if (class == 'C') { problem_size = 162; dt_default = "2.0d0";  itmax = 250; }
+  else if (class == 'D') { problem_size = 408; dt_default = "1.0d0";  itmax = 300; }
+  else if (class == 'E') { problem_size = 1020; dt_default = "0.5d0"; itmax = 300; }
+  else {
+    printf("setparams: Internal error: invalid class %c\n", class);
+    exit(1);
+  }
+  inorm = itmax;
+  xdiv = isqrt2(nprocs);
+  ydiv = nprocs/xdiv;
+  isiz1 = problem_size/xdiv; if (isiz1*xdiv < problem_size) isiz1++;
+  isiz2 = problem_size/ydiv; if (isiz2*ydiv < problem_size) isiz2++;
+  
+
+  fprintf(fp, "\nc number of nodes for which this version is compiled\n");
+  fprintf(fp, "%sinteger nnodes_compiled, nnodes_xdim\n", FINDENT);
+  fprintf(fp, "%sparameter (nnodes_compiled=%d, nnodes_xdim=%d)\n",
+          FINDENT, nprocs, xdiv);
+
+  fprintf(fp, "\nc full problem size\n");
+  fprintf(fp, "%sinteger isiz01, isiz02, isiz03\n", FINDENT);
+  fprintf(fp, "%sparameter (isiz01=%d, isiz02=%d, isiz03=%d)\n", 
+	  FINDENT, problem_size, problem_size, problem_size);
+
+  fprintf(fp, "\nc sub-domain array size\n");
+  fprintf(fp, "%sinteger isiz1, isiz2, isiz3\n", FINDENT);
+  fprintf(fp, "%sparameter (isiz1=%d, isiz2=%d, isiz3=isiz03)\n", 
+	       FINDENT, isiz1, isiz2);
+
+  fprintf(fp, "\nc number of iterations and how often to print the norm\n");
+  fprintf(fp, "%sinteger itmax_default, inorm_default\n", FINDENT);
+  fprintf(fp, "%sparameter (itmax_default=%d, inorm_default=%d)\n", 
+	  FINDENT, itmax, inorm);
+
+  fprintf(fp, "%sdouble precision dt_default\n", FINDENT);
+  fprintf(fp, "%sparameter (dt_default = %s)\n", FINDENT, dt_default);
+  
+}
+
+/* 
+ * write_mg_info(): Write MG specific info to config file
+ */
+
+void write_mg_info(FILE *fp, int nprocs, char class) 
+{
+  int problem_size, nit, log2_size, log2_nprocs, lt_default, lm;
+  int ndim1, ndim2, ndim3;
+  if      (class == 'S') { problem_size = 32;   nit = 4; }
+  else if (class == 'W') { problem_size = 128;  nit = 4; }
+  else if (class == 'A') { problem_size = 256;  nit = 4; }
+  else if (class == 'B') { problem_size = 256;  nit = 20; }
+  else if (class == 'C') { problem_size = 512;  nit = 20; }
+  else if (class == 'D') { problem_size = 1024; nit = 50; }
+  else if (class == 'E') { problem_size = 2048; nit = 50; }
+  else {
+    printf("setparams: Internal error: invalid class type %c\n", class);
+    exit(1);
+  }
+  log2_size = ilog2(problem_size);
+  log2_nprocs = ilog2(nprocs);
+  /* lt is log of largest total dimension */
+  lt_default = log2_size;
+  /* log of maximum dimension on a node */
+  lm = log2_size - log2_nprocs/3;
+  ndim1 = lm;
+  ndim3 = log2_size - (log2_nprocs+2)/3;
+  ndim2 = log2_size - (log2_nprocs+1)/3;
+
+  fprintf(fp, "%sinteger nprocs_compiled\n", FINDENT);
+  fprintf(fp, "%sparameter (nprocs_compiled = %d)\n", FINDENT, nprocs);
+  fprintf(fp, "%sinteger nx_default, ny_default, nz_default\n", FINDENT);
+  fprintf(fp, "%sparameter (nx_default=%d, ny_default=%d, nz_default=%d)\n", 
+	  FINDENT, problem_size, problem_size, problem_size);
+  fprintf(fp, "%sinteger nit_default, lm, lt_default\n", FINDENT);
+  fprintf(fp, "%sparameter (nit_default=%d, lm = %d, lt_default=%d)\n", 
+	  FINDENT, nit, lm, lt_default);
+  fprintf(fp, "%sinteger debug_default\n", FINDENT);
+  fprintf(fp, "%sparameter (debug_default=%d)\n", FINDENT, 0);
+  fprintf(fp, "%sinteger ndim1, ndim2, ndim3\n", FINDENT);
+  fprintf(fp, "%sparameter (ndim1 = %d, ndim2 = %d, ndim3 = %d)\n", 
+	  FINDENT, ndim1, ndim2, ndim3);
+}
+
+
+/* 
+ * write_dt_info(): Write DT specific info to config file
+ */
+
+void write_dt_info(FILE *fp, int nprocs, char class) 
+{
+  int num_samples,deviation,num_sources;
+  if      (class == 'S') { num_samples=1728; deviation=128; num_sources=4; }
+  else if (class == 'W') { num_samples=1728*8; deviation=128*2; num_sources=4*2; }
+  else if (class == 'A') { num_samples=1728*64; deviation=128*4; num_sources=4*4; }
+  else if (class == 'B') { num_samples=1728*512; deviation=128*8; num_sources=4*8; }
+  else if (class == 'C') { num_samples=1728*4096; deviation=128*16; num_sources=4*16; }
+  else if (class == 'D') { num_samples=1728*4096*8; deviation=128*32; num_sources=4*32; }
+  else {
+    printf("setparams: Internal error: invalid class type %c\n", class);
+    exit(1);
+  }
+  fprintf(fp, "#define NUM_SAMPLES %d\n", num_samples);
+  fprintf(fp, "#define STD_DEVIATION %d\n", deviation);
+  fprintf(fp, "#define NUM_SOURCES %d\n", num_sources);
+}
+
+/* 
+ * write_is_info(): Write IS specific info to config file
+ */
+
+void write_is_info(FILE *fp, int nprocs, char class) 
+{
+  if( class != 'S' &&
+      class != 'W' &&
+      class != 'A' &&
+      class != 'B' &&
+      class != 'C' &&
+      class != 'D' )
+  {
+    printf("setparams: Internal error: invalid class type %c\n", class);
+    exit(1);
+  }
+}
+
+/* 
+ * write_cg_info(): Write CG specific info to config file
+ */
+
+void write_cg_info(FILE *fp, int nprocs, char class) 
+{
+  int na,nonzer,niter;
+  char *shift,*rcond="1.0d-1";
+  char *shiftS="10.",
+       *shiftW="12.",
+       *shiftA="20.",
+       *shiftB="60.",
+       *shiftC="110.",
+       *shiftD="500.",
+       *shiftE="1.5d3";
+
+  int num_proc_cols, num_proc_rows;
+
+
+  if( class == 'S' )
+  { na=1400;    nonzer=7;  niter=15;  shift=shiftS; }
+  else if( class == 'W' )
+  { na=7000;    nonzer=8;  niter=15;  shift=shiftW; }
+  else if( class == 'A' )
+  { na=14000;   nonzer=11; niter=15;  shift=shiftA; }
+  else if( class == 'B' )
+  { na=75000;   nonzer=13; niter=75;  shift=shiftB; }
+  else if( class == 'C' )
+  { na=150000;  nonzer=15; niter=75;  shift=shiftC; }
+  else if( class == 'D' )
+  { na=1500000; nonzer=21; niter=100; shift=shiftD; }
+  else if( class == 'E' )
+  { na=9000000; nonzer=26; niter=100; shift=shiftE; }
+  else
+  {
+    printf("setparams: Internal error: invalid class type %c\n", class);
+    exit(1);
+  }
+  fprintf( fp, "%sinteger            na, nonzer, niter\n", FINDENT );
+  fprintf( fp, "%sdouble precision   shift, rcond\n", FINDENT );
+  fprintf( fp, "%sparameter(  na=%d,\n", FINDENT, na );
+  fprintf( fp, "%s             nonzer=%d,\n", CONTINUE, nonzer );
+  fprintf( fp, "%s             niter=%d,\n", CONTINUE, niter );
+  fprintf( fp, "%s             shift=%s,\n", CONTINUE, shift );
+  fprintf( fp, "%s             rcond=%s )\n", CONTINUE, rcond );
+
+
+  num_proc_cols = num_proc_rows = ilog2(nprocs)/2;
+  if (num_proc_cols+num_proc_rows != ilog2(nprocs)) num_proc_cols += 1;
+  num_proc_cols = ipow2(num_proc_cols); num_proc_rows = ipow2(num_proc_rows);
+  
+  fprintf( fp, "\nc number of nodes for which this version is compiled\n" );
+  fprintf( fp, "%sinteger    nnodes_compiled\n", FINDENT );
+  fprintf( fp, "%sparameter( nnodes_compiled = %d)\n", FINDENT, nprocs );
+  fprintf( fp, "%sinteger    num_proc_cols, num_proc_rows\n", FINDENT );
+  fprintf( fp, "%sparameter( num_proc_cols=%d, num_proc_rows=%d )\n", 
+                                                          FINDENT,
+                                                          num_proc_cols,
+                                                          num_proc_rows );
+}
+
+
+/* 
+ * write_ft_info(): Write FT specific info to config file
+ */
+
+void write_ft_info(FILE *fp, int nprocs, char class) 
+{
+  /* easiest way (given the way the benchmark is written)
+   * is to specify the number of grid points in each
+   * direction nx, ny, nz. niter is the number of iterations
+   */
+  int nx, ny, nz, maxdim, niter;
+  if      (class == 'S') { nx = 64;   ny = 64;   nz = 64;   niter = 6;}
+  else if (class == 'W') { nx = 128;  ny = 128;  nz = 32;   niter = 6;}
+  else if (class == 'A') { nx = 256;  ny = 256;  nz = 128;  niter = 6;}
+  else if (class == 'B') { nx = 512;  ny = 256;  nz = 256;  niter =20;}
+  else if (class == 'C') { nx = 512;  ny = 512;  nz = 512;  niter =20;}
+  else if (class == 'D') { nx = 2048; ny = 1024; nz = 1024; niter =25;}
+  else if (class == 'E') { nx = 4096; ny = 2048; nz = 2048; niter =25;}
+  else {
+    printf("setparams: Internal error: invalid class type %c\n", class);
+    exit(1);
+  }
+  maxdim = nx;
+  if (ny > maxdim) maxdim = ny;
+  if (nz > maxdim) maxdim = nz;
+  fprintf(fp, "%sinteger nx, ny, nz, maxdim, niter_default, ntdivnp, np_min\n", FINDENT);
+  fprintf(fp, "%sparameter (nx=%d, ny=%d, nz=%d, maxdim=%d)\n", 
+          FINDENT, nx, ny, nz, maxdim);
+  fprintf(fp, "%sparameter (niter_default=%d)\n", FINDENT, niter);
+  fprintf(fp, "%sparameter (np_min = %d)\n", FINDENT, nprocs);
+  fprintf(fp, "%sparameter (ntdivnp=((nx*ny)/np_min)*nz)\n", FINDENT);
+  fprintf(fp, "%sdouble precision ntotal_f\n", FINDENT);
+  fprintf(fp, "%sparameter (ntotal_f=1.d0*nx*ny*nz)\n", FINDENT);
+}
+
+/*
+ * write_ep_info(): Write EP specific info to config file
+ */
+
+void write_ep_info(FILE *fp, int nprocs, char class)
+{
+  /* easiest way (given the way the benchmark is written)
+   * is to specify m, the log base 2 of the number of
+   * random-number pairs generated for the class
+   */
+  int m;
+  if      (class == 'S') { m = 24; }
+  else if (class == 'W') { m = 25; }
+  else if (class == 'A') { m = 28; }
+  else if (class == 'B') { m = 30; }
+  else if (class == 'C') { m = 32; }
+  else if (class == 'D') { m = 36; }
+  else if (class == 'E') { m = 40; }
+  else {
+    printf("setparams: Internal error: invalid class type %c\n", class);
+    exit(1);
+  }
+  /* number of processors given by "npm" */
+
+
+  fprintf(fp, "%scharacter class\n",FINDENT);
+  fprintf(fp, "%sparameter (class =\'%c\')\n",
+                  FINDENT, class);
+  fprintf(fp, "%sinteger m, npm\n", FINDENT);
+  fprintf(fp, "%sparameter (m=%d, npm=%d)\n",
+          FINDENT, m, nprocs);
+}
+
+
+/* 
+ * This is a gross hack to allow the benchmarks to 
+ * print out how they were compiled. Various other ways
+ * of doing this have been tried and they all fail on
+ * some machine - due to a broken "make" program, or
+ * F77 limitations, or whatever. Hopefully this will
+ * always work because it uses very portable C. Unfortunately
+ * it relies on parsing the make.def file - YUK. 
+ * If your machine doesn't have <string.h> or <ctype.h>, happy hacking!
+ * 
+ */
+
+#define VERBOSE
+#define LL 400
+#include <stdio.h>
+#define DEFFILE "../config/make.def"
+#define DEFAULT_MESSAGE "(none)"
+FILE *deffile;
+void write_compiler_info(int type, FILE *fp)
+{
+  char line[LL];
+  char mpif77[LL], flink[LL], fmpi_lib[LL], fmpi_inc[LL], fflags[LL], flinkflags[LL];
+  char compiletime[LL], randfile[LL];
+  char mpicc[LL], cflags[LL], clink[LL], clinkflags[LL],
+       cmpi_lib[LL], cmpi_inc[LL];
+  struct tm *tmp;
+  time_t t;
+  deffile = fopen(DEFFILE, "r");
+  if (deffile == NULL) {
+    printf("\n\
+setparams: File %s doesn't exist. To build the NAS benchmarks\n\
+           you need to create it according to the instructions\n\
+           in the README in the main directory and comments in \n\
+           the file config/make.def.template\n", DEFFILE);
+    exit(1);
+  }
+  strcpy(mpif77, DEFAULT_MESSAGE);
+  strcpy(flink, DEFAULT_MESSAGE);
+  strcpy(fmpi_lib, DEFAULT_MESSAGE);
+  strcpy(fmpi_inc, DEFAULT_MESSAGE);
+  strcpy(fflags, DEFAULT_MESSAGE);
+  strcpy(flinkflags, DEFAULT_MESSAGE);
+  strcpy(randfile, DEFAULT_MESSAGE);
+  strcpy(mpicc, DEFAULT_MESSAGE);
+  strcpy(cflags, DEFAULT_MESSAGE);
+  strcpy(clink, DEFAULT_MESSAGE);
+  strcpy(clinkflags, DEFAULT_MESSAGE);
+  strcpy(cmpi_lib, DEFAULT_MESSAGE);
+  strcpy(cmpi_inc, DEFAULT_MESSAGE);
+
+  while (fgets(line, LL, deffile) != NULL) {
+    if (*line == '#') continue;
+    /* yes, this is inefficient. but it's simple! */
+    check_line(line, "MPIF77", mpif77);
+    check_line(line, "FLINK", flink);
+    check_line(line, "FMPI_LIB", fmpi_lib);
+    check_line(line, "FMPI_INC", fmpi_inc);
+    check_line(line, "FFLAGS", fflags);
+    check_line(line, "FLINKFLAGS", flinkflags);
+    check_line(line, "RAND", randfile);
+    check_line(line, "MPICC", mpicc);
+    check_line(line, "CFLAGS", cflags);
+    check_line(line, "CLINK", clink);
+    check_line(line, "CLINKFLAGS", clinkflags);
+    check_line(line, "CMPI_LIB", cmpi_lib);
+    check_line(line, "CMPI_INC", cmpi_inc);
+    /* if the dummy library is used by including make.dummy, we set the
+       Fortran and C paths to libraries and headers accordingly     */
+    if(check_include_line(line, "../config/make.dummy")) {
+       strcpy(fmpi_lib, "-L../MPI_dummy -lmpi");
+       strcpy(fmpi_inc, "-I../MPI_dummy");
+       strcpy(cmpi_lib, "-L../MPI_dummy -lmpi");
+       strcpy(cmpi_inc, "-I../MPI_dummy");
+    }
+  }
+
+  
+  (void) time(&t);
+  tmp = localtime(&t);
+  (void) strftime(compiletime, (size_t)LL, "%d %b %Y", tmp);
+
+
+  switch(type) {
+      case FT:
+      case SP:
+      case BT:
+      case MG:
+      case LU:
+      case EP:
+      case CG:
+          put_string(fp, "compiletime", compiletime);
+          put_string(fp, "npbversion", VERSION);
+          put_string(fp, "cs1", mpif77);
+          put_string(fp, "cs2", flink);
+          put_string(fp, "cs3", fmpi_lib);
+          put_string(fp, "cs4", fmpi_inc);
+          put_string(fp, "cs5", fflags);
+          put_string(fp, "cs6", flinkflags);
+	  put_string(fp, "cs7", randfile);
+          break;
+      case IS:
+      case DT:
+          put_def_string(fp, "COMPILETIME", compiletime);
+          put_def_string(fp, "NPBVERSION", VERSION);
+          put_def_string(fp, "MPICC", mpicc);
+          put_def_string(fp, "CFLAGS", cflags);
+          put_def_string(fp, "CLINK", clink);
+          put_def_string(fp, "CLINKFLAGS", clinkflags);
+          put_def_string(fp, "CMPI_LIB", cmpi_lib);
+          put_def_string(fp, "CMPI_INC", cmpi_inc);
+          break;
+      default:
+          printf("setparams: (Internal error): Unknown benchmark type %d\n", 
+                                                                         type);
+          exit(1);
+  }
+
+}
+
+void check_line(char *line, char *label, char *val)
+{
+  char *original_line;
+  int n;
+  original_line = line;
+  /* compare beginning of line and label */
+  while (*label != '\0' && *line == *label) {
+    line++; label++; 
+  }
+  /* if *label is not EOS, we must have had a mismatch */
+  if (*label != '\0') return;
+  /* if *line is neither a space nor '=', actual label is longer than test label */
+  if (!isspace(*line) && *line != '=') return ; 
+  /* skip over white space */
+  while (isspace(*line)) line++;
+  /* next char should be '=' */
+  if (*line != '=') return;
+  /* skip over white space */
+  while (isspace(*++line));
+  /* if EOS, nothing was specified */
+  if (*line == '\0') return;
+  /* finally we've come to the value */
+  strcpy(val, line);
+  /* chop off the newline at the end */
+  n = strlen(val)-1;
+  if (n >= 0 && val[n] == '\n')
+    val[n--] = '\0';
+  if (n >= 0 && val[n] == '\r')
+    val[n--] = '\0';
+  /* treat continuation */
+  while (val[n] == '\\' && fgets(original_line, LL, deffile)) {
+     line = original_line;
+     while (isspace(*line)) line++;
+     if (isspace(*original_line)) val[n++] = ' ';
+     while (*line && *line != '\n' && *line != '\r' && n < LL-1)
+       val[n++] = *line++;
+     val[n] = '\0';
+     n--;
+  }
+/*  if (val[strlen(val) - 1] == '\\') {
+    printf("\n\
+setparams: Error in file make.def. Because of the way in which\n\
+           command line arguments are incorporated into the\n\
+           executable benchmark, you can't have any continued\n\
+           lines in the file make.def, that is, lines ending\n\
+           with the character \"\\\". Although it may be ugly, \n\
+           you should be able to reformat without continuation\n\
+           lines. The offending line is\n\
+  %s\n", original_line);
+    exit(1);
+  } */
+}
+
+int check_include_line(char *line, char *filename)
+{
+  char *include_string = "include";
+  /* compare beginning of line and "include" */
+  while (*include_string != '\0' && *line == *include_string) {
+    line++; include_string++; 
+  }
+  /* if *include_string is not EOS, we must have had a mismatch */
+  if (*include_string != '\0') return(0);
+  /* if *line is not a space, first word is not "include" */
+  if (!isspace(*line)) return(0); 
+  /* skip over white space */
+  while (isspace(*++line));
+  /* if EOS, nothing was specified */
+  if (*line == '\0') return(0);
+  /* next keyword should be name of include file in *filename */
+  while (*filename != '\0' && *line == *filename) {
+    line++; filename++; 
+  }  
+  if (*filename != '\0' || 
+      (*line != ' ' && *line != '\0' && *line !='\n')) return(0);
+  else return(1);
+}
+
+
+#define MAXL 46
+void put_string(FILE *fp, char *name, char *val)
+{
+  int len;
+  len = strlen(val);
+  if (len > MAXL) {
+    val[MAXL] = '\0';
+    val[MAXL-1] = '.';
+    val[MAXL-2] = '.';
+    val[MAXL-3] = '.';
+    len = MAXL;
+  }
+  fprintf(fp, "%scharacter*%d %s\n", FINDENT, len, name);
+  fprintf(fp, "%sparameter (%s=\'%s\')\n", FINDENT, name, val);
+}
+
+/* need to escape quote (") in val */
+int fix_string_quote(char *val, char *newval, int maxl)
+{
+  int len;
+  int i, j;
+  len = strlen(val);
+  i = j = 0;
+  while (i < len && j < maxl) {
+    if (val[i] == '"')
+      newval[j++] = '\\';
+    if (j < maxl)
+      newval[j++] = val[i++];
+  }
+  newval[j] = '\0';
+  return j;
+}
+
+/* NOTE: is the ... stuff necessary in C? */
+void put_def_string(FILE *fp, char *name, char *val0)
+{
+  int len;
+  char val[MAXL+3];
+  len = fix_string_quote(val0, val, MAXL+2);
+  if (len > MAXL) {
+    val[MAXL] = '\0';
+    val[MAXL-1] = '.';
+    val[MAXL-2] = '.';
+    val[MAXL-3] = '.';
+    len = MAXL;
+  }
+  fprintf(fp, "#define %s \"%s\"\n", name, val);
+}
+
+void put_def_variable(FILE *fp, char *name, char *val)
+{
+  int len;
+  len = strlen(val);
+  if (len > MAXL) {
+    val[MAXL] = '\0';
+    val[MAXL-1] = '.';
+    val[MAXL-2] = '.';
+    val[MAXL-3] = '.';
+    len = MAXL;
+  }
+  fprintf(fp, "#define %s %s\n", name, val);
+}
+
+
+
+#if 0
+
+/* this version allows arbitrarily long lines but 
+ * some compilers don't like that and they're rarely
+ * useful 
+ */
+
+#define LINELEN 65
+void put_string(FILE *fp, char *name, char *val)
+{
+  int len, nlines, pos, i;
+  char line[100];
+  len = strlen(val);
+  nlines = len/LINELEN;
+  if (nlines*LINELEN < len) nlines++;
+  fprintf(fp, "%scharacter*%d %s\n", FINDENT, nlines*LINELEN, name);
+  fprintf(fp, "%sparameter (%s = \n", FINDENT, name);
+  for (i = 0; i < nlines; i++) {
+    pos = i*LINELEN;
+    if (i == 0) fprintf(fp, "%s\'", CONTINUE);
+    else        fprintf(fp, "%s", CONTINUE);
+    /* number should be same as LINELEN */
+    fprintf(fp, "%.65s", val+pos);
+    if (i == nlines-1) fprintf(fp, "\')\n");
+    else             fprintf(fp, "\n");
+  }
+}
+
+#endif
+
+
+/* integer square root. Return error if argument isn't
+ * a perfect square or is less than or equal to zero 
+ */
+
+int isqrt(int i)
+{
+  int root, square;
+  if (i <= 0) return(-1);
+  square = 0;
+  for (root = 1; square <= i; root++) {
+    square = root*root;
+    if (square == i) return(root);
+  }
+  return(-1);
+}
+/* isqrt2: smallest factor xdim >= sqrt(i) with xdim <= 2*(i/xdim), else -1 */
+int isqrt2(int i)
+{
+  int xdim, ydim, square;
+  if (i <= 0) return(-1);
+  square = 0;
+  for (xdim = 1; square <= i; xdim++) {
+    square = xdim*xdim;
+    if (square == i) return(xdim);
+  }
+  ydim = i / (--xdim);
+  while (xdim*ydim != i && 2*ydim >= xdim) {
+    xdim++;
+    ydim = i / xdim;
+  }
+  if (xdim*ydim == i && 2*ydim >= xdim)
+    return(xdim);
+  return(-1);
+}
+  
+
+/* integer log base two. Return error if argument isn't
+ * a power of two or is less than or equal to zero 
+ */
+
+int ilog2(int i)
+{
+  int log2;
+  int exp2 = 1;
+  if (i <= 0) return(-1);
+
+  for (log2 = 0; log2 < 30; log2++) {
+    if (exp2 == i) return(log2);
+    if (exp2 > i) break;
+    exp2 *= 2;
+  }
+  return(-1);
+}
+
+int ipow2(int i)
+{
+  int pow2 = 1;
+  if (i < 0) return(-1);
+  if (i == 0) return(1);
+  while(i--) pow2 *= 2;
+  return(pow2);
+}
+ 
+
+
+void write_convertdouble_info(int type, FILE *fp)
+{
+  switch(type) {
+  case SP:
+  case BT:
+  case LU:
+  case FT:
+  case MG:
+  case EP:
+  case CG:
+    fprintf(fp, "%slogical  convertdouble\n", FINDENT);
+#ifdef CONVERTDOUBLE
+    fprintf(fp, "%sparameter (convertdouble = .true.)\n", FINDENT);
+#else
+    fprintf(fp, "%sparameter (convertdouble = .false.)\n", FINDENT);
+#endif
+    break;
+  }
+}
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/sys/suite.awk b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/sys/suite.awk
new file mode 100644
index 0000000..2e5fc31
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-MPI/sys/suite.awk
@@ -0,0 +1,20 @@
+BEGIN { SMAKE = "make" } {
+  if ($1 !~ /^#/ &&  NF > 2) {
+    printf "cd `echo %s|tr '[a-z]' '[A-Z]'`; %s clean;", $1, SMAKE;
+    printf "%s CLASS=%s NPROCS=%s", SMAKE, $2, $3;
+    if ( NF > 3 ) {
+      if ( $4 ~ /^vec/ ||  $4 ~ /^VEC/ ) {
+        printf " VERSION=%s", $4;
+        if ( NF > 4 ) {
+          printf " SUBTYPE=%s", $5;
+        }
+      } else {
+        printf " SUBTYPE=%s", $4;
+        if ( NF > 4 ) {
+          printf " VERSION=%s", $5;
+        }
+      }
+    }
+    printf "; cd ..\n";
+  }
+}
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/Makefile b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/Makefile
new file mode 100644
index 0000000..09cf5b8
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/Makefile
@@ -0,0 +1,68 @@
+SHELL=/bin/sh
+BENCHMARK=bt
+BENCHMARKU=BT
+VEC=
+
+include ../config/make.def
+
+
+OBJS = bt.o  initialize.o exact_solution.o exact_rhs.o \
+       set_constants.o adi.o  rhs.o      \
+       x_solve$(VEC).o y_solve$(VEC).o solve_subs.o  \
+       z_solve$(VEC).o add.o error.o verify.o \
+       ${COMMON}/print_results.o ${COMMON}/timers.o ${COMMON}/wtime.o
+ifeq (${HOOKS}, 1)
+	OBJS += ${COMMON}/hooks.o ${COMMON}/m5op_x86.o ${COMMON}/m5_mmap.o
+endif
+
+include ../sys/make.common
+
+# npbparams.h is included by header.h
+# The following rule should do the trick but many make programs (not gmake)
+# will do the wrong thing and rebuild the world every time (because the
+# mod time on header.h is not changed). One solution would be to
+# touch header.h but this might cause confusion if someone has
+# accidentally deleted it. Instead, make the dependency on npbparams.h
+# explicit in all the lines below (even though dependence is indirect).
+
+# header.h: npbparams.h
+
+${PROGRAM}: config
+	@if [ x$(VERSION) = xvec ] ; then	\
+		${MAKE} VEC=_vec exec;		\
+	elif [ x$(VERSION) = xVEC ] ; then	\
+		${MAKE} VEC=_vec exec;		\
+	else					\
+		${MAKE} exec;			\
+	fi
+
+exec: $(OBJS)
+	${FLINK} ${FLINKFLAGS} -no-pie -o ${PROGRAM} ${OBJS} ${F_LIB}
+
+.f.o:
+ifeq (${HOOKS}, 1)
+	${FCOMPILE} -DHOOKS $<
+else
+	${FCOMPILE} $<
+endif
+
+
+bt.o:             bt.f  header.h npbparams.h
+initialize.o:     initialize.f  header.h npbparams.h
+exact_solution.o: exact_solution.f  header.h npbparams.h
+exact_rhs.o:      exact_rhs.f  header.h npbparams.h
+set_constants.o:  set_constants.f  header.h npbparams.h
+adi.o:            adi.f  header.h npbparams.h
+rhs.o:            rhs.f  header.h npbparams.h
+x_solve$(VEC).o:  x_solve$(VEC).f  header.h work_lhs$(VEC).h npbparams.h
+y_solve$(VEC).o:  y_solve$(VEC).f  header.h work_lhs$(VEC).h npbparams.h
+z_solve$(VEC).o:  z_solve$(VEC).f  header.h work_lhs$(VEC).h npbparams.h
+solve_subs.o:     solve_subs.f  npbparams.h
+add.o:            add.f  header.h npbparams.h
+error.o:          error.f  header.h npbparams.h
+verify.o:         verify.f  header.h npbparams.h
+
+clean:
+	- rm -f *.o *~ mputil*
+	- rm -f npbparams.h core
+	- if [ -d rii_files ]; then rm -r rii_files; fi
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/add.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/add.f
new file mode 100644
index 0000000..dcad620
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/add.f
@@ -0,0 +1,31 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine  add
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     addition of update to the vector u
+c---------------------------------------------------------------------
+
+      include 'header.h'
+
+      integer i, j, k, m
+
+      if (timeron) call timer_start(t_add)
+!$omp parallel do default(shared) private(i,j,k,m)
+      do     k = 1, grid_points(3)-2
+         do     j = 1, grid_points(2)-2
+            do     i = 1, grid_points(1)-2
+               do    m = 1, 5
+                  u(m,i,j,k) = u(m,i,j,k) + rhs(m,i,j,k)
+               enddo
+            enddo
+         enddo
+      enddo
+      if (timeron) call timer_stop(t_add)
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/adi.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/adi.f
new file mode 100644
index 0000000..4b45494
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/adi.f
@@ -0,0 +1,21 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine  adi
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      call compute_rhs
+
+      call x_solve
+
+      call y_solve
+
+      call z_solve
+
+      call add
+
+      return
+      end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/bt.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/bt.f
new file mode 100644
index 0000000..a89209c
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/bt.f
@@ -0,0 +1,222 @@
+!-------------------------------------------------------------------------!
+!                                                                         !
+!        N  A  S     P A R A L L E L     B E N C H M A R K S  3.3         !
+!                                                                         !
+!                       O p e n M P     V E R S I O N                     !
+!                                                                         !
+!                                   B T                                   !
+!                                                                         !
+!-------------------------------------------------------------------------!
+!                                                                         !
+!    This benchmark is an OpenMP version of the NPB BT code.              !
+!    It is described in NAS Technical Report 99-011.                      !
+!                                                                         !
+!    Permission to use, copy, distribute and modify this software         !
+!    for any purpose with or without fee is hereby granted.  We           !
+!    request, however, that all derived work reference the NAS            !
+!    Parallel Benchmarks 3.3. This software is provided "as is"           !
+!    without express or implied warranty.                                 !
+!                                                                         !
+!    Information on NPB 3.3, including the technical report, the          !
+!    original specifications, source code, results and information        !
+!    on how to submit new results, is available at:                       !
+!                                                                         !
+!           http://www.nas.nasa.gov/Software/NPB/                         !
+!                                                                         !
+!    Send comments or suggestions to  npb@nas.nasa.gov                    !
+!                                                                         !
+!          NAS Parallel Benchmarks Group                                  !
+!          NASA Ames Research Center                                      !
+!          Mail Stop: T27A-1                                              !
+!          Moffett Field, CA   94035-1000                                 !
+!                                                                         !
+!          E-mail:  npb@nas.nasa.gov                                      !
+!          Fax:     (650) 604-3957                                        !
+!                                                                         !
+!-------------------------------------------------------------------------!
+
+c---------------------------------------------------------------------
+c
+c Authors: R. Van der Wijngaart
+c          T. Harris
+c          M. Yarrow
+c          H. Jin
+c
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+       program BT
+c---------------------------------------------------------------------
+
+       include  'header.h'
+      
+       integer i, niter, step, fstatus
+       double precision navg, mflops, n3
+
+       external timer_read
+       double precision tmax, timer_read, t, trecs(t_last)
+       logical verified
+       character class
+       character t_names(t_last)*8
+!$     integer  omp_get_max_threads
+!$     external omp_get_max_threads
+
+c---------------------------------------------------------------------
+c      Read the input file (if it exists), else take
+c      defaults from parameters
+c---------------------------------------------------------------------
+          
+       open (unit=2,file='timer.flag',status='old', iostat=fstatus)
+       if (fstatus .eq. 0) then
+         timeron = .true.
+         t_names(t_total) = 'total'
+         t_names(t_rhsx) = 'rhsx'
+         t_names(t_rhsy) = 'rhsy'
+         t_names(t_rhsz) = 'rhsz'
+         t_names(t_rhs) = 'rhs'
+         t_names(t_xsolve) = 'xsolve'
+         t_names(t_ysolve) = 'ysolve'
+         t_names(t_zsolve) = 'zsolve'
+         t_names(t_rdis1) = 'redist1'
+         t_names(t_rdis2) = 'redist2'
+         t_names(t_add) = 'add'
+         close(2)
+       else
+         timeron = .false.
+       endif
+
+       write(*, 1000)
+       open (unit=2,file='inputbt.data',status='old', iostat=fstatus)
+
+       if (fstatus .eq. 0) then
+         write(*,233) 
+ 233     format(' Reading from input file inputbt.data')
+         read (2,*) niter
+         read (2,*) dt
+         read (2,*) grid_points(1), grid_points(2), grid_points(3)
+         close(2)
+       else
+         write(*,234) 
+         niter = niter_default
+         dt    = dt_default
+         grid_points(1) = problem_size
+         grid_points(2) = problem_size
+         grid_points(3) = problem_size
+       endif
+ 234   format(' No input file inputbt.data. Using compiled defaults')
+
+       write(*, 1001) grid_points(1), grid_points(2), grid_points(3)
+       write(*, 1002) niter, dt
+!$     write(*, 1003) omp_get_max_threads()
+       write(*, *)
+
+ 1000  format(//, ' NAS Parallel Benchmarks (NPB3.3-OMP)',
+     >            ' - BT Benchmark', /)
+ 1001  format(' Size: ', i4, 'x', i4, 'x', i4)
+ 1002  format(' Iterations: ', i4, '       dt: ', F11.7)
+ 1003  format(' Number of available threads: ', i5)
+
+       if ( (grid_points(1) .gt. IMAX) .or.
+     >      (grid_points(2) .gt. JMAX) .or.
+     >      (grid_points(3) .gt. KMAX) ) then
+             print *, (grid_points(i),i=1,3)
+             print *,' Problem size too big for compiled array sizes'
+             goto 999
+       endif
+
+
+       call set_constants
+
+       do i = 1, t_last
+          call timer_clear(i)
+       end do
+
+       call initialize
+
+       call exact_rhs
+
+c---------------------------------------------------------------------
+c      do one time step to touch all code, and reinitialize
+c---------------------------------------------------------------------
+       call adi
+       call initialize
+
+#ifdef HOOKS
+       call roi_begin
+#endif
+
+       do i = 1, t_last
+          call timer_clear(i)
+       end do
+       call timer_start(1)
+
+       do  step = 1, niter
+
+          if (mod(step, 20) .eq. 0 .or. 
+     >        step .eq. 1) then
+             write(*, 200) step
+ 200         format(' Time step ', i4)
+          endif
+
+          call adi
+
+       end do
+
+       call timer_stop(1)
+       tmax = timer_read(1)
+
+#ifdef HOOKS
+       call roi_end
+#endif
+       
+       call verify(niter, class, verified)
+
+       n3 = 1.0d0*grid_points(1)*grid_points(2)*grid_points(3)
+       navg = (grid_points(1)+grid_points(2)+grid_points(3))/3.0
+       if( tmax .ne. 0. ) then
+          mflops = 1.0e-6*float(niter)*
+     >  (3478.8*n3-17655.7*navg**2+28023.7*navg)
+     >  / tmax
+       else
+          mflops = 0.0
+       endif
+       call print_results('BT', class, grid_points(1), 
+     >  grid_points(2), grid_points(3), niter,
+     >  tmax, mflops, '          floating point', 
+     >  verified, npbversion,compiletime, cs1, cs2, cs3, cs4, cs5, 
+     >  cs6, '(none)')
+
+c---------------------------------------------------------------------
+c      More timers
+c---------------------------------------------------------------------
+       if (.not.timeron) goto 999
+
+       do i=1, t_last
+          trecs(i) = timer_read(i)
+       end do
+       if (tmax .eq. 0.0) tmax = 1.0
+
+       write(*,800)
+ 800   format('  SECTION   Time (secs)')
+       do i=1, t_last
+          write(*,810) t_names(i), trecs(i), trecs(i)*100./tmax
+          if (i.eq.t_rhs) then
+             t = trecs(t_rhsx) + trecs(t_rhsy) + trecs(t_rhsz)
+             write(*,820) 'sub-rhs', t, t*100./tmax
+             t = trecs(t_rhs) - t
+             write(*,820) 'rest-rhs', t, t*100./tmax
+          elseif (i.eq.t_zsolve) then
+             t = trecs(t_zsolve) - trecs(t_rdis1) - trecs(t_rdis2)
+             write(*,820) 'sub-zsol', t, t*100./tmax
+          elseif (i.eq.t_rdis2) then
+             t = trecs(t_rdis1) + trecs(t_rdis2)
+             write(*,820) 'redist', t, t*100./tmax
+          endif
+ 810      format(2x,a8,':',f9.3,'  (',f6.2,'%)')
+ 820      format('    --> ',a8,':',f9.3,'  (',f6.2,'%)')
+       end do
+
+ 999   continue
+
+       end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/error.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/error.f
new file mode 100644
index 0000000..994f1ce
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/error.f
@@ -0,0 +1,114 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine error_norm(rms)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     this subroutine computes the norm of the difference between the
+c     computed solution and the exact solution
+c---------------------------------------------------------------------
+
+      include 'header.h'
+
+      integer i, j, k, m, d
+      double precision xi, eta, zeta, u_exact(5), rms(5), add
+      double precision rms_local(5)
+
+      do m = 1, 5
+         rms(m) = 0.0d0
+      enddo
+
+!$omp parallel default(shared)
+!$omp& private(i,j,k,m,zeta,eta,xi,add,u_exact,rms_local)
+!$omp&        shared(rms)
+      do m = 1, 5
+         rms_local(m) = 0.0d0
+      enddo
+!$omp do
+      do k = 0, grid_points(3)-1
+         zeta = dble(k) * dnzm1
+         do j = 0, grid_points(2)-1
+            eta = dble(j) * dnym1
+            do i = 0, grid_points(1)-1
+               xi = dble(i) * dnxm1
+               call exact_solution(xi, eta, zeta, u_exact)
+
+               do m = 1, 5
+                  add = u(m,i,j,k)-u_exact(m)
+                  rms_local(m) = rms_local(m) + add*add
+               enddo
+            enddo
+         enddo
+      enddo
+!$omp end do nowait
+      do m = 1, 5
+!$omp atomic
+         rms(m) = rms(m) + rms_local(m)
+      enddo
+!$omp end parallel
+
+      do m = 1, 5
+         do d = 1, 3
+            rms(m) = rms(m) / dble(grid_points(d)-2)
+         enddo
+         rms(m) = dsqrt(rms(m))
+      enddo
+
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine rhs_norm(rms)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      include 'header.h'
+
+      integer i, j, k, d, m
+      double precision rms(5), add
+      double precision rms_local(5)
+
+      do m = 1, 5
+         rms(m) = 0.0d0
+      enddo
+
+!$omp parallel default(shared) private(i,j,k,m,add,rms_local)
+!$omp&        shared(rms)
+      do m = 1, 5
+         rms_local(m) = 0.0d0
+      enddo
+!$omp do
+      do k = 1, grid_points(3)-2
+         do j = 1, grid_points(2)-2
+            do i = 1, grid_points(1)-2
+               do m = 1, 5
+                  add = rhs(m,i,j,k)
+                  rms_local(m) = rms_local(m) + add*add
+               enddo 
+            enddo 
+         enddo 
+      enddo 
+!$omp end do nowait
+      do m = 1, 5
+!$omp atomic
+         rms(m) = rms(m) + rms_local(m)
+      enddo
+!$omp end parallel
+
+      do m = 1, 5
+         do d = 1, 3
+            rms(m) = rms(m) / dble(grid_points(d)-2)
+         enddo 
+         rms(m) = dsqrt(rms(m))
+      enddo 
+
+      return
+      end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/exact_rhs.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/exact_rhs.f
new file mode 100644
index 0000000..83bfe4f
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/exact_rhs.f
@@ -0,0 +1,350 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine exact_rhs
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     compute the right hand side based on exact solution
+c---------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision dtemp(5), xi, eta, zeta, dtpp
+      integer m, i, j, k, ip1, im1, jp1, jm1, km1, kp1
+
+!$omp parallel default(shared) private(i,j,k,m,zeta,eta,xi,
+!$omp&  dtpp,im1,ip1,jm1,jp1,km1,kp1,dtemp)
+c---------------------------------------------------------------------
+c     initialize                                  
+c---------------------------------------------------------------------
+!$omp do schedule(static)
+      do k= 0, grid_points(3)-1
+         do j = 0, grid_points(2)-1
+            do i = 0, grid_points(1)-1
+               do m = 1, 5
+                  forcing(m,i,j,k) = 0.0d0
+               enddo
+            enddo
+         enddo
+      enddo
+
+c---------------------------------------------------------------------
+c     xi-direction flux differences                      
+c---------------------------------------------------------------------
+!$omp do schedule(static)
+      do k = 1, grid_points(3)-2
+         zeta = dble(k) * dnzm1
+         do j = 1, grid_points(2)-2
+            eta = dble(j) * dnym1
+
+            do i=0, grid_points(1)-1
+               xi = dble(i) * dnxm1
+
+               call exact_solution(xi, eta, zeta, dtemp)
+               do m = 1, 5
+                  ue(i,m) = dtemp(m)
+               enddo
+
+               dtpp = 1.0d0 / dtemp(1)
+
+               do m = 2, 5
+                  buf(i,m) = dtpp * dtemp(m)
+               enddo
+
+               cuf(i)   = buf(i,2) * buf(i,2)
+               buf(i,1) = cuf(i) + buf(i,3) * buf(i,3) + 
+     >                 buf(i,4) * buf(i,4) 
+               q(i) = 0.5d0*(buf(i,2)*ue(i,2) + buf(i,3)*ue(i,3) +
+     >                 buf(i,4)*ue(i,4))
+
+            enddo
+               
+            do i = 1, grid_points(1)-2
+               im1 = i-1
+               ip1 = i+1
+
+               forcing(1,i,j,k) = forcing(1,i,j,k) -
+     >                 tx2*( ue(ip1,2)-ue(im1,2) )+
+     >                 dx1tx1*(ue(ip1,1)-2.0d0*ue(i,1)+ue(im1,1))
+
+               forcing(2,i,j,k) = forcing(2,i,j,k) - tx2 * (
+     >                 (ue(ip1,2)*buf(ip1,2)+c2*(ue(ip1,5)-q(ip1)))-
+     >                 (ue(im1,2)*buf(im1,2)+c2*(ue(im1,5)-q(im1))))+
+     >                 xxcon1*(buf(ip1,2)-2.0d0*buf(i,2)+buf(im1,2))+
+     >                 dx2tx1*( ue(ip1,2)-2.0d0* ue(i,2)+ue(im1,2))
+
+               forcing(3,i,j,k) = forcing(3,i,j,k) - tx2 * (
+     >                 ue(ip1,3)*buf(ip1,2)-ue(im1,3)*buf(im1,2))+
+     >                 xxcon2*(buf(ip1,3)-2.0d0*buf(i,3)+buf(im1,3))+
+     >                 dx3tx1*( ue(ip1,3)-2.0d0*ue(i,3) +ue(im1,3))
+                  
+               forcing(4,i,j,k) = forcing(4,i,j,k) - tx2*(
+     >                 ue(ip1,4)*buf(ip1,2)-ue(im1,4)*buf(im1,2))+
+     >                 xxcon2*(buf(ip1,4)-2.0d0*buf(i,4)+buf(im1,4))+
+     >                 dx4tx1*( ue(ip1,4)-2.0d0* ue(i,4)+ ue(im1,4))
+
+               forcing(5,i,j,k) = forcing(5,i,j,k) - tx2*(
+     >                 buf(ip1,2)*(c1*ue(ip1,5)-c2*q(ip1))-
+     >                 buf(im1,2)*(c1*ue(im1,5)-c2*q(im1)))+
+     >                 0.5d0*xxcon3*(buf(ip1,1)-2.0d0*buf(i,1)+
+     >                 buf(im1,1))+
+     >                 xxcon4*(cuf(ip1)-2.0d0*cuf(i)+cuf(im1))+
+     >                 xxcon5*(buf(ip1,5)-2.0d0*buf(i,5)+buf(im1,5))+
+     >                 dx5tx1*( ue(ip1,5)-2.0d0* ue(i,5)+ ue(im1,5))
+            enddo
+
+c---------------------------------------------------------------------
+c     Fourth-order dissipation                         
+c---------------------------------------------------------------------
+
+            do m = 1, 5
+               i = 1
+               forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                    (5.0d0*ue(i,m) - 4.0d0*ue(i+1,m) +ue(i+2,m))
+               i = 2
+               forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                    (-4.0d0*ue(i-1,m) + 6.0d0*ue(i,m) -
+     >                    4.0d0*ue(i+1,m) +       ue(i+2,m))
+            enddo
+
+            do i = 3, grid_points(1)-4
+               do m = 1, 5
+                  forcing(m,i,j,k) = forcing(m,i,j,k) - dssp*
+     >                    (ue(i-2,m) - 4.0d0*ue(i-1,m) +
+     >                    6.0d0*ue(i,m) - 4.0d0*ue(i+1,m) + ue(i+2,m))
+               enddo
+            enddo
+
+            do m = 1, 5
+               i = grid_points(1)-3
+               forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                    (ue(i-2,m) - 4.0d0*ue(i-1,m) +
+     >                    6.0d0*ue(i,m) - 4.0d0*ue(i+1,m))
+               i = grid_points(1)-2
+               forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                    (ue(i-2,m) - 4.0d0*ue(i-1,m) + 5.0d0*ue(i,m))
+            enddo
+
+         enddo
+      enddo
+!$omp end do nowait
+
+c---------------------------------------------------------------------
+c     eta-direction flux differences             
+c---------------------------------------------------------------------
+!$omp do schedule(static)
+      do k = 1, grid_points(3)-2          
+         zeta = dble(k) * dnzm1
+         do i=1, grid_points(1)-2
+            xi = dble(i) * dnxm1
+
+            do j=0, grid_points(2)-1
+               eta = dble(j) * dnym1
+
+               call exact_solution(xi, eta, zeta, dtemp)
+               do m = 1, 5 
+                  ue(j,m) = dtemp(m)
+               enddo
+                  
+               dtpp = 1.0d0/dtemp(1)
+
+               do m = 2, 5
+                  buf(j,m) = dtpp * dtemp(m)
+               enddo
+
+               cuf(j)   = buf(j,3) * buf(j,3)
+               buf(j,1) = cuf(j) + buf(j,2) * buf(j,2) + 
+     >                 buf(j,4) * buf(j,4)
+               q(j) = 0.5d0*(buf(j,2)*ue(j,2) + buf(j,3)*ue(j,3) +
+     >                 buf(j,4)*ue(j,4))
+            enddo
+
+            do j = 1, grid_points(2)-2
+               jm1 = j-1
+               jp1 = j+1
+                  
+               forcing(1,i,j,k) = forcing(1,i,j,k) -
+     >                 ty2*( ue(jp1,3)-ue(jm1,3) )+
+     >                 dy1ty1*(ue(jp1,1)-2.0d0*ue(j,1)+ue(jm1,1))
+
+               forcing(2,i,j,k) = forcing(2,i,j,k) - ty2*(
+     >                 ue(jp1,2)*buf(jp1,3)-ue(jm1,2)*buf(jm1,3))+
+     >                 yycon2*(buf(jp1,2)-2.0d0*buf(j,2)+buf(jm1,2))+
+     >                 dy2ty1*( ue(jp1,2)-2.0d0* ue(j,2)+ ue(jm1,2))
+
+               forcing(3,i,j,k) = forcing(3,i,j,k) - ty2*(
+     >                 (ue(jp1,3)*buf(jp1,3)+c2*(ue(jp1,5)-q(jp1)))-
+     >                 (ue(jm1,3)*buf(jm1,3)+c2*(ue(jm1,5)-q(jm1))))+
+     >                 yycon1*(buf(jp1,3)-2.0d0*buf(j,3)+buf(jm1,3))+
+     >                 dy3ty1*( ue(jp1,3)-2.0d0*ue(j,3) +ue(jm1,3))
+
+               forcing(4,i,j,k) = forcing(4,i,j,k) - ty2*(
+     >                 ue(jp1,4)*buf(jp1,3)-ue(jm1,4)*buf(jm1,3))+
+     >                 yycon2*(buf(jp1,4)-2.0d0*buf(j,4)+buf(jm1,4))+
+     >                 dy4ty1*( ue(jp1,4)-2.0d0*ue(j,4)+ ue(jm1,4))
+
+               forcing(5,i,j,k) = forcing(5,i,j,k) - ty2*(
+     >                 buf(jp1,3)*(c1*ue(jp1,5)-c2*q(jp1))-
+     >                 buf(jm1,3)*(c1*ue(jm1,5)-c2*q(jm1)))+
+     >                 0.5d0*yycon3*(buf(jp1,1)-2.0d0*buf(j,1)+
+     >                 buf(jm1,1))+
+     >                 yycon4*(cuf(jp1)-2.0d0*cuf(j)+cuf(jm1))+
+     >                 yycon5*(buf(jp1,5)-2.0d0*buf(j,5)+buf(jm1,5))+
+     >                 dy5ty1*(ue(jp1,5)-2.0d0*ue(j,5)+ue(jm1,5))
+            enddo
+
+c---------------------------------------------------------------------
+c     Fourth-order dissipation                      
+c---------------------------------------------------------------------
+            do m = 1, 5
+               j = 1
+               forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                    (5.0d0*ue(j,m) - 4.0d0*ue(j+1,m) +ue(j+2,m))
+               j = 2
+               forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                    (-4.0d0*ue(j-1,m) + 6.0d0*ue(j,m) -
+     >                    4.0d0*ue(j+1,m) +       ue(j+2,m))
+            enddo
+
+            do j = 3, grid_points(2)-4
+               do m = 1, 5
+                  forcing(m,i,j,k) = forcing(m,i,j,k) - dssp*
+     >                    (ue(j-2,m) - 4.0d0*ue(j-1,m) +
+     >                    6.0d0*ue(j,m) - 4.0d0*ue(j+1,m) + ue(j+2,m))
+               enddo
+            enddo
+
+            do m = 1, 5
+               j = grid_points(2)-3
+               forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                    (ue(j-2,m) - 4.0d0*ue(j-1,m) +
+     >                    6.0d0*ue(j,m) - 4.0d0*ue(j+1,m))
+               j = grid_points(2)-2
+               forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                    (ue(j-2,m) - 4.0d0*ue(j-1,m) + 5.0d0*ue(j,m))
+
+            enddo
+
+         enddo
+      enddo
+
+c---------------------------------------------------------------------
+c     zeta-direction flux differences                      
+c---------------------------------------------------------------------
+!$omp do schedule(static)
+      do j=1, grid_points(2)-2
+         eta = dble(j) * dnym1
+         do i = 1, grid_points(1)-2
+            xi = dble(i) * dnxm1
+
+            do k=0, grid_points(3)-1
+               zeta = dble(k) * dnzm1
+
+               call exact_solution(xi, eta, zeta, dtemp)
+               do m = 1, 5
+                  ue(k,m) = dtemp(m)
+               enddo
+
+               dtpp = 1.0d0/dtemp(1)
+
+               do m = 2, 5
+                  buf(k,m) = dtpp * dtemp(m)
+               enddo
+
+               cuf(k)   = buf(k,4) * buf(k,4)
+               buf(k,1) = cuf(k) + buf(k,2) * buf(k,2) + 
+     >                 buf(k,3) * buf(k,3)
+               q(k) = 0.5d0*(buf(k,2)*ue(k,2) + buf(k,3)*ue(k,3) +
+     >                 buf(k,4)*ue(k,4))
+            enddo
+
+            do k=1, grid_points(3)-2
+               km1 = k-1
+               kp1 = k+1
+                  
+               forcing(1,i,j,k) = forcing(1,i,j,k) -
+     >                 tz2*( ue(kp1,4)-ue(km1,4) )+
+     >                 dz1tz1*(ue(kp1,1)-2.0d0*ue(k,1)+ue(km1,1))
+
+               forcing(2,i,j,k) = forcing(2,i,j,k) - tz2 * (
+     >                 ue(kp1,2)*buf(kp1,4)-ue(km1,2)*buf(km1,4))+
+     >                 zzcon2*(buf(kp1,2)-2.0d0*buf(k,2)+buf(km1,2))+
+     >                 dz2tz1*( ue(kp1,2)-2.0d0* ue(k,2)+ ue(km1,2))
+
+               forcing(3,i,j,k) = forcing(3,i,j,k) - tz2 * (
+     >                 ue(kp1,3)*buf(kp1,4)-ue(km1,3)*buf(km1,4))+
+     >                 zzcon2*(buf(kp1,3)-2.0d0*buf(k,3)+buf(km1,3))+
+     >                 dz3tz1*(ue(kp1,3)-2.0d0*ue(k,3)+ue(km1,3))
+
+               forcing(4,i,j,k) = forcing(4,i,j,k) - tz2 * (
+     >                 (ue(kp1,4)*buf(kp1,4)+c2*(ue(kp1,5)-q(kp1)))-
+     >                 (ue(km1,4)*buf(km1,4)+c2*(ue(km1,5)-q(km1))))+
+     >                 zzcon1*(buf(kp1,4)-2.0d0*buf(k,4)+buf(km1,4))+
+     >                 dz4tz1*( ue(kp1,4)-2.0d0*ue(k,4) +ue(km1,4))
+
+               forcing(5,i,j,k) = forcing(5,i,j,k) - tz2 * (
+     >                 buf(kp1,4)*(c1*ue(kp1,5)-c2*q(kp1))-
+     >                 buf(km1,4)*(c1*ue(km1,5)-c2*q(km1)))+
+     >                 0.5d0*zzcon3*(buf(kp1,1)-2.0d0*buf(k,1)
+     >                 +buf(km1,1))+
+     >                 zzcon4*(cuf(kp1)-2.0d0*cuf(k)+cuf(km1))+
+     >                 zzcon5*(buf(kp1,5)-2.0d0*buf(k,5)+buf(km1,5))+
+     >                 dz5tz1*( ue(kp1,5)-2.0d0*ue(k,5)+ ue(km1,5))
+            enddo
+
+c---------------------------------------------------------------------
+c     Fourth-order dissipation                        
+c---------------------------------------------------------------------
+            do m = 1, 5
+               k = 1
+               forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                    (5.0d0*ue(k,m) - 4.0d0*ue(k+1,m) +ue(k+2,m))
+               k = 2
+               forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                    (-4.0d0*ue(k-1,m) + 6.0d0*ue(k,m) -
+     >                    4.0d0*ue(k+1,m) +       ue(k+2,m))
+            enddo
+
+            do k = 3, grid_points(3)-4
+               do m = 1, 5
+                  forcing(m,i,j,k) = forcing(m,i,j,k) - dssp*
+     >                    (ue(k-2,m) - 4.0d0*ue(k-1,m) +
+     >                    6.0d0*ue(k,m) - 4.0d0*ue(k+1,m) + ue(k+2,m))
+               enddo
+            enddo
+
+            do m = 1, 5
+               k = grid_points(3)-3
+               forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                    (ue(k-2,m) - 4.0d0*ue(k-1,m) +
+     >                    6.0d0*ue(k,m) - 4.0d0*ue(k+1,m))
+               k = grid_points(3)-2
+               forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                    (ue(k-2,m) - 4.0d0*ue(k-1,m) + 5.0d0*ue(k,m))
+            enddo
+
+         enddo
+      enddo
+
+c---------------------------------------------------------------------
+c     now change the sign of the forcing function
+c---------------------------------------------------------------------
+!$omp do schedule(static)
+      do k = 1, grid_points(3)-2
+         do j = 1, grid_points(2)-2
+            do i = 1, grid_points(1)-2
+               do m = 1, 5
+                  forcing(m,i,j,k) = -1.d0 * forcing(m,i,j,k)
+               enddo
+            enddo
+         enddo
+      enddo
+!$omp end do nowait
+!$omp end parallel
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/exact_solution.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/exact_solution.f
new file mode 100644
index 0000000..b093b46
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/exact_solution.f
@@ -0,0 +1,29 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine exact_solution(xi,eta,zeta,dtemp)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     this subroutine returns the exact solution at point xi, eta, zeta
+c---------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision  xi, eta, zeta, dtemp(5)
+      integer m
+
+      do m = 1, 5
+         dtemp(m) =  ce(m,1) +
+     >     xi*(ce(m,2) + xi*(ce(m,5) + xi*(ce(m,8) + xi*ce(m,11)))) +
+     >     eta*(ce(m,3) + eta*(ce(m,6) + eta*(ce(m,9) + eta*ce(m,12))))+
+     >     zeta*(ce(m,4) + zeta*(ce(m,7) + zeta*(ce(m,10) + 
+     >     zeta*ce(m,13))))
+      enddo
+
+      return
+      end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/header.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/header.h
new file mode 100644
index 0000000..3528271
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/header.h
@@ -0,0 +1,106 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+c
+c  header.h
+c
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+ 
+      implicit none
+
+c---------------------------------------------------------------------
+c The following include file is generated automatically by the
+c "setparams" utility. It defines 
+c      maxcells:      the square root of the maximum number of processors
+c      problem_size:  12, 64, 102, 162 (for class S, A, B, C)
+c      dt_default:    default time step for this problem size if no
+c                     config file
+c      niter_default: default number of iterations for this problem size
+c---------------------------------------------------------------------
+
+      include 'npbparams.h'
+
+      integer           aa, bb, cc, BLOCK_SIZE
+      parameter        (aa=1, bb=2, cc=3, BLOCK_SIZE=5)
+
+      integer           grid_points(3)
+      double precision  elapsed_time
+      common /global/   elapsed_time, grid_points, timeron
+
+      double precision  tx1, tx2, tx3, ty1, ty2, ty3, tz1, tz2, tz3, 
+     >                  dx1, dx2, dx3, dx4, dx5, dy1, dy2, dy3, dy4, 
+     >                  dy5, dz1, dz2, dz3, dz4, dz5, dssp, dt, 
+     >                  ce(5,13), dxmax, dymax, dzmax, xxcon1, xxcon2, 
+     >                  xxcon3, xxcon4, xxcon5, dx1tx1, dx2tx1, dx3tx1,
+     >                  dx4tx1, dx5tx1, yycon1, yycon2, yycon3, yycon4,
+     >                  yycon5, dy1ty1, dy2ty1, dy3ty1, dy4ty1, dy5ty1,
+     >                  zzcon1, zzcon2, zzcon3, zzcon4, zzcon5, dz1tz1, 
+     >                  dz2tz1, dz3tz1, dz4tz1, dz5tz1, dnxm1, dnym1, 
+     >                  dnzm1, c1c2, c1c5, c3c4, c1345, conz1, c1, c2, 
+     >                  c3, c4, c5, c4dssp, c5dssp, dtdssp, dttx1,
+     >                  dttx2, dtty1, dtty2, dttz1, dttz2, c2dttx1, 
+     >                  c2dtty1, c2dttz1, comz1, comz4, comz5, comz6, 
+     >                  c3c4tx3, c3c4ty3, c3c4tz3, c2iv, con43, con16
+
+      common /constants/ tx1, tx2, tx3, ty1, ty2, ty3, tz1, tz2, tz3,
+     >                  dx1, dx2, dx3, dx4, dx5, dy1, dy2, dy3, dy4, 
+     >                  dy5, dz1, dz2, dz3, dz4, dz5, dssp, dt, 
+     >                  ce, dxmax, dymax, dzmax, xxcon1, xxcon2, 
+     >                  xxcon3, xxcon4, xxcon5, dx1tx1, dx2tx1, dx3tx1,
+     >                  dx4tx1, dx5tx1, yycon1, yycon2, yycon3, yycon4,
+     >                  yycon5, dy1ty1, dy2ty1, dy3ty1, dy4ty1, dy5ty1,
+     >                  zzcon1, zzcon2, zzcon3, zzcon4, zzcon5, dz1tz1, 
+     >                  dz2tz1, dz3tz1, dz4tz1, dz5tz1, dnxm1, dnym1, 
+     >                  dnzm1, c1c2, c1c5, c3c4, c1345, conz1, c1, c2, 
+     >                  c3, c4, c5, c4dssp, c5dssp, dtdssp, dttx1,
+     >                  dttx2, dtty1, dtty2, dttz1, dttz2, c2dttx1, 
+     >                  c2dtty1, c2dttz1, comz1, comz4, comz5, comz6, 
+     >                  c3c4tx3, c3c4ty3, c3c4tz3, c2iv, con43, con16
+
+      integer IMAX, JMAX, KMAX, IMAXP, JMAXP
+
+      parameter (IMAX=problem_size,JMAX=problem_size,KMAX=problem_size)
+      parameter (IMAXP=IMAX/2*2,JMAXP=JMAX/2*2)
+
+c
+c   to improve cache performance, grid dimensions padded by 1 
+c   for even number sizes only.
+c
+      double precision 
+     >   us      (   0:IMAXP, 0:JMAXP, 0:KMAX-1),
+     >   vs      (   0:IMAXP, 0:JMAXP, 0:KMAX-1),
+     >   ws      (   0:IMAXP, 0:JMAXP, 0:KMAX-1),
+     >   qs      (   0:IMAXP, 0:JMAXP, 0:KMAX-1),
+     >   rho_i   (   0:IMAXP, 0:JMAXP, 0:KMAX-1),
+     >   square  (   0:IMAXP, 0:JMAXP, 0:KMAX-1),
+     >   forcing (5, 0:IMAXP, 0:JMAXP, 0:KMAX-1),
+     >   u       (5, 0:IMAXP, 0:JMAXP, 0:KMAX-1),
+     >   rhs     (5, 0:IMAXP, 0:JMAXP, 0:KMAX-1)
+      common /fields/  u, us, vs, ws, qs, rho_i, square, 
+     >                 rhs, forcing
+
+      double precision cuf(0:problem_size),   q  (0:problem_size),
+     >                 ue (0:problem_size,5), buf(0:problem_size,5)
+      common /work_1d/ cuf, q, ue, buf
+!$omp threadprivate (/work_1d/)
+c
+
+c-----------------------------------------------------------------------
+c   Timer constants
+c-----------------------------------------------------------------------
+      integer t_rhsx,t_rhsy,t_rhsz,t_xsolve,t_ysolve,t_zsolve,
+     >        t_rdis1,t_rdis2,t_add,
+     >        t_rhs,t_last,t_total
+      logical timeron
+      parameter (t_total = 1)
+      parameter (t_rhsx = 2)
+      parameter (t_rhsy = 3)
+      parameter (t_rhsz = 4)
+      parameter (t_rhs = 5)
+      parameter (t_xsolve = 6)
+      parameter (t_ysolve = 7)
+      parameter (t_zsolve = 8)
+      parameter (t_rdis1 = 9)
+      parameter (t_rdis2 = 10)
+      parameter (t_add = 11)
+      parameter (t_last = 11)
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/initialize.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/initialize.f
new file mode 100644
index 0000000..b65603e
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/initialize.f
@@ -0,0 +1,245 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine  initialize
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     This subroutine initializes the field variable u using 
+c     tri-linear transfinite interpolation of the boundary values     
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      
+      integer i, j, k, m, ix, iy, iz
+      double precision  xi, eta, zeta, Pface(5,3,2), Pxi, Peta, 
+     >     Pzeta, temp(5)
+
+
+!$omp parallel default(shared)
+!$omp& private(i,j,k,m,zeta,eta,xi,ix,iy,iz,Pface,Pxi,Peta,Pzeta,temp)
+c---------------------------------------------------------------------
+c  Later (in compute_rhs) we compute 1/u for every element. A few of 
+c  the corner elements are not used, but it is convenient (and faster)
+c  to compute the whole thing with a simple loop. Make sure those 
+c  values are nonzero by initializing the whole thing here. 
+c---------------------------------------------------------------------
+!$omp do schedule(static)
+      do k = 0, grid_points(3)-1
+         do j = 0, grid_points(2)-1
+            do i = 0, grid_points(1)-1
+               do m = 1, 5
+                  u(m,i,j,k) = 1.0d0
+               end do
+            end do
+         end do
+      end do
+!$omp end do
+c---------------------------------------------------------------------
+
+
+c---------------------------------------------------------------------
+c     first store the "interpolated" values everywhere on the grid    
+c---------------------------------------------------------------------
+
+!$omp do schedule(static)
+      do k = 0, grid_points(3)-1
+         zeta = dble(k) * dnzm1
+         do j = 0, grid_points(2)-1
+            eta = dble(j) * dnym1
+            do i = 0, grid_points(1)-1
+               xi = dble(i) * dnxm1
+                  
+               do ix = 1, 2
+                  call exact_solution(dble(ix-1), eta, zeta, 
+     >                    Pface(1,1,ix))
+               enddo
+
+               do iy = 1, 2
+                  call exact_solution(xi, dble(iy-1) , zeta, 
+     >                    Pface(1,2,iy))
+               enddo
+
+               do iz = 1, 2
+                  call exact_solution(xi, eta, dble(iz-1),   
+     >                    Pface(1,3,iz))
+               enddo
+
+               do m = 1, 5
+                  Pxi   = xi   * Pface(m,1,2) + 
+     >                    (1.0d0-xi)   * Pface(m,1,1)
+                  Peta  = eta  * Pface(m,2,2) + 
+     >                    (1.0d0-eta)  * Pface(m,2,1)
+                  Pzeta = zeta * Pface(m,3,2) + 
+     >                    (1.0d0-zeta) * Pface(m,3,1)
+                     
+                  u(m,i,j,k) = Pxi + Peta + Pzeta - 
+     >                    Pxi*Peta - Pxi*Pzeta - Peta*Pzeta + 
+     >                    Pxi*Peta*Pzeta
+
+               enddo
+            enddo
+         enddo
+      enddo
+!$omp end do nowait
+
+c---------------------------------------------------------------------
+c     now store the exact values on the boundaries        
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     west face                                                  
+c---------------------------------------------------------------------
+      i = 0
+      xi = 0.0d0
+!$omp do schedule(static)
+      do k = 0, grid_points(3)-1
+         zeta = dble(k) * dnzm1
+         do j = 0, grid_points(2)-1
+            eta = dble(j) * dnym1
+            call exact_solution(xi, eta, zeta, temp)
+            do m = 1, 5
+               u(m,i,j,k) = temp(m)
+            enddo
+         enddo
+      enddo
+!$omp end do nowait
+
+c---------------------------------------------------------------------
+c     east face                                                      
+c---------------------------------------------------------------------
+
+      i = grid_points(1)-1
+      xi = 1.0d0
+!$omp do schedule(static)
+      do k = 0, grid_points(3)-1
+         zeta = dble(k) * dnzm1
+         do j = 0, grid_points(2)-1
+            eta = dble(j) * dnym1
+            call exact_solution(xi, eta, zeta, temp)
+            do m = 1, 5
+               u(m,i,j,k) = temp(m)
+            enddo
+         enddo
+      enddo
+!$omp end do nowait
+
+c---------------------------------------------------------------------
+c     south face                                                 
+c---------------------------------------------------------------------
+      j = 0
+      eta = 0.0d0
+!$omp do schedule(static)
+      do k = 0, grid_points(3)-1
+         zeta = dble(k) * dnzm1
+         do i = 0, grid_points(1)-1
+            xi = dble(i) * dnxm1
+            call exact_solution(xi, eta, zeta, temp)
+            do m = 1, 5
+               u(m,i,j,k) = temp(m)
+            enddo
+         enddo
+      enddo
+!$omp end do nowait
+
+
+c---------------------------------------------------------------------
+c     north face                                    
+c---------------------------------------------------------------------
+      j = grid_points(2)-1
+      eta = 1.0d0
+!$omp do schedule(static)
+      do k = 0, grid_points(3)-1
+         zeta = dble(k) * dnzm1
+         do i = 0, grid_points(1)-1
+            xi = dble(i) * dnxm1
+            call exact_solution(xi, eta, zeta, temp)
+            do m = 1, 5
+               u(m,i,j,k) = temp(m)
+            enddo
+         enddo
+      enddo
+!$omp end do
+
+c---------------------------------------------------------------------
+c     bottom face                                       
+c---------------------------------------------------------------------
+      k = 0
+      zeta = 0.0d0
+!$omp do schedule(static)
+      do j = 0, grid_points(2)-1
+         eta = dble(j) * dnym1
+         do i = 0, grid_points(1)-1
+            xi = dble(i) * dnxm1
+            call exact_solution(xi, eta, zeta, temp)
+            do m = 1, 5
+               u(m,i,j,k) = temp(m)
+            enddo
+         enddo
+      enddo
+!$omp end do nowait
+
+c---------------------------------------------------------------------
+c     top face     
+c---------------------------------------------------------------------
+      k = grid_points(3)-1
+      zeta = 1.0d0
+!$omp do schedule(static)
+      do j = 0, grid_points(2)-1
+         eta = dble(j) * dnym1
+         do i = 0, grid_points(1)-1
+            xi = dble(i) * dnxm1
+            call exact_solution(xi, eta, zeta, temp)
+            do m = 1, 5
+               u(m,i,j,k) = temp(m)
+            enddo
+         enddo
+      enddo
+!$omp end do nowait
+!$omp end parallel
+
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine lhsinit(lhs, ni)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      
+      integer i, m, n, ni
+      double precision lhs(5,5,3,0:ni)
+
+c---------------------------------------------------------------------
+c     zero the left hand side at the end points i = 0 and i = ni, and
+c     set their diagonal values to 1. This is overkill, but convenient
+c---------------------------------------------------------------------
+      i = 0
+      do m = 1, 5
+         do n = 1, 5
+            lhs(m,n,1,i) = 0.0d0
+            lhs(m,n,2,i) = 0.0d0
+            lhs(m,n,3,i) = 0.0d0
+         end do
+         lhs(m,m,2,i) = 1.0d0
+      end do
+      i = ni
+      do m = 1, 5
+         do n = 1, 5
+            lhs(m,n,1,i) = 0.0d0
+            lhs(m,n,2,i) = 0.0d0
+            lhs(m,n,3,i) = 0.0d0
+         end do
+         lhs(m,m,2,i) = 1.0d0
+      end do
+
+      return
+      end
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/inputbt.data.sample b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/inputbt.data.sample
new file mode 100644
index 0000000..d47ca91
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/inputbt.data.sample
@@ -0,0 +1,3 @@
+60       number of time steps
+0.01d0   dt for class A = 0.0008d0. class B = 0.0003d0  class C = 0.0001d0
+12 12 12
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/rhs.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/rhs.f
new file mode 100644
index 0000000..535bdba
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/rhs.f
@@ -0,0 +1,434 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine compute_rhs
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      include 'header.h'
+
+      integer i, j, k, m
+      double precision rho_inv, uijk, up1, um1, vijk, vp1, vm1,
+     >     wijk, wp1, wm1
+
+
+      if (timeron) call timer_start(t_rhs)
+!$omp parallel default(shared) private(i,j,k,m,rho_inv,uijk,up1,um1,
+!$omp&   vijk,vp1,vm1,wijk,wp1,wm1)
+c---------------------------------------------------------------------
+c     compute the reciprocal of density, the velocity components,
+c     and the kinetic energy terms (square and qs).
+c---------------------------------------------------------------------
+!$omp do schedule(static)
+      do k = 0, grid_points(3)-1
+         do j = 0, grid_points(2)-1
+            do i = 0, grid_points(1)-1
+               rho_inv = 1.0d0/u(1,i,j,k)
+               rho_i(i,j,k) = rho_inv
+               us(i,j,k) = u(2,i,j,k) * rho_inv
+               vs(i,j,k) = u(3,i,j,k) * rho_inv
+               ws(i,j,k) = u(4,i,j,k) * rho_inv
+               square(i,j,k)     = 0.5d0* (
+     >                 u(2,i,j,k)*u(2,i,j,k) + 
+     >                 u(3,i,j,k)*u(3,i,j,k) +
+     >                 u(4,i,j,k)*u(4,i,j,k) ) * rho_inv
+               qs(i,j,k) = square(i,j,k) * rho_inv
+            enddo
+         enddo
+      enddo
+!$omp end do nowait
+
+c---------------------------------------------------------------------
+c copy the exact forcing term to the right hand side;  because 
+c this forcing term is known, we can store it on the whole grid
+c including the boundary                   
+c---------------------------------------------------------------------
+
+!$omp do schedule(static)
+      do k = 0, grid_points(3)-1
+         do j = 0, grid_points(2)-1
+            do i = 0, grid_points(1)-1
+               do m = 1, 5
+                  rhs(m,i,j,k) = forcing(m,i,j,k)
+               enddo
+            enddo
+         enddo
+      enddo
+!$omp end do
+
+
+!$omp master
+      if (timeron) call timer_start(t_rhsx)
+!$omp end master
+c---------------------------------------------------------------------
+c     compute xi-direction fluxes 
+c---------------------------------------------------------------------
+!$omp do schedule(static)
+      do k = 1, grid_points(3)-2
+         do j = 1, grid_points(2)-2
+            do i = 1, grid_points(1)-2
+               uijk = us(i,j,k)
+               up1  = us(i+1,j,k)
+               um1  = us(i-1,j,k)
+
+               rhs(1,i,j,k) = rhs(1,i,j,k) + dx1tx1 * 
+     >                 (u(1,i+1,j,k) - 2.0d0*u(1,i,j,k) + 
+     >                 u(1,i-1,j,k)) -
+     >                 tx2 * (u(2,i+1,j,k) - u(2,i-1,j,k))
+
+               rhs(2,i,j,k) = rhs(2,i,j,k) + dx2tx1 * 
+     >                 (u(2,i+1,j,k) - 2.0d0*u(2,i,j,k) + 
+     >                 u(2,i-1,j,k)) +
+     >                 xxcon2*con43 * (up1 - 2.0d0*uijk + um1) -
+     >                 tx2 * (u(2,i+1,j,k)*up1 - 
+     >                 u(2,i-1,j,k)*um1 +
+     >                 (u(5,i+1,j,k)- square(i+1,j,k)-
+     >                 u(5,i-1,j,k)+ square(i-1,j,k))*
+     >                 c2)
+
+               rhs(3,i,j,k) = rhs(3,i,j,k) + dx3tx1 * 
+     >                 (u(3,i+1,j,k) - 2.0d0*u(3,i,j,k) +
+     >                 u(3,i-1,j,k)) +
+     >                 xxcon2 * (vs(i+1,j,k) - 2.0d0*vs(i,j,k) +
+     >                 vs(i-1,j,k)) -
+     >                 tx2 * (u(3,i+1,j,k)*up1 - 
+     >                 u(3,i-1,j,k)*um1)
+
+               rhs(4,i,j,k) = rhs(4,i,j,k) + dx4tx1 * 
+     >                 (u(4,i+1,j,k) - 2.0d0*u(4,i,j,k) +
+     >                 u(4,i-1,j,k)) +
+     >                 xxcon2 * (ws(i+1,j,k) - 2.0d0*ws(i,j,k) +
+     >                 ws(i-1,j,k)) -
+     >                 tx2 * (u(4,i+1,j,k)*up1 - 
+     >                 u(4,i-1,j,k)*um1)
+
+               rhs(5,i,j,k) = rhs(5,i,j,k) + dx5tx1 * 
+     >                 (u(5,i+1,j,k) - 2.0d0*u(5,i,j,k) +
+     >                 u(5,i-1,j,k)) +
+     >                 xxcon3 * (qs(i+1,j,k) - 2.0d0*qs(i,j,k) +
+     >                 qs(i-1,j,k)) +
+     >                 xxcon4 * (up1*up1 -       2.0d0*uijk*uijk + 
+     >                 um1*um1) +
+     >                 xxcon5 * (u(5,i+1,j,k)*rho_i(i+1,j,k) - 
+     >                 2.0d0*u(5,i,j,k)*rho_i(i,j,k) +
+     >                 u(5,i-1,j,k)*rho_i(i-1,j,k)) -
+     >                 tx2 * ( (c1*u(5,i+1,j,k) - 
+     >                 c2*square(i+1,j,k))*up1 -
+     >                 (c1*u(5,i-1,j,k) - 
+     >                 c2*square(i-1,j,k))*um1 )
+            enddo
+         enddo
+
+c---------------------------------------------------------------------
+c     add fourth order xi-direction dissipation               
+c---------------------------------------------------------------------
+         do j = 1, grid_points(2)-2
+            i = 1
+            do m = 1, 5
+               rhs(m,i,j,k) = rhs(m,i,j,k)- dssp * 
+     >                    ( 5.0d0*u(m,i,j,k) - 4.0d0*u(m,i+1,j,k) +
+     >                    u(m,i+2,j,k))
+            enddo
+
+            i = 2
+            do m = 1, 5
+               rhs(m,i,j,k) = rhs(m,i,j,k) - dssp * 
+     >                    (-4.0d0*u(m,i-1,j,k) + 6.0d0*u(m,i,j,k) -
+     >                    4.0d0*u(m,i+1,j,k) + u(m,i+2,j,k))
+            enddo
+         enddo
+
+         do j = 1, grid_points(2)-2
+            do i = 3,grid_points(1)-4
+               do m = 1, 5
+                  rhs(m,i,j,k) = rhs(m,i,j,k) - dssp * 
+     >                    (  u(m,i-2,j,k) - 4.0d0*u(m,i-1,j,k) + 
+     >                    6.0d0*u(m,i,j,k) - 4.0d0*u(m,i+1,j,k) + 
+     >                    u(m,i+2,j,k) )
+               enddo
+            enddo
+         enddo
+         
+         do j = 1, grid_points(2)-2
+            i = grid_points(1)-3
+            do m = 1, 5
+               rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *
+     >                    ( u(m,i-2,j,k) - 4.0d0*u(m,i-1,j,k) + 
+     >                    6.0d0*u(m,i,j,k) - 4.0d0*u(m,i+1,j,k) )
+            enddo
+
+            i = grid_points(1)-2
+            do m = 1, 5
+               rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *
+     >                    ( u(m,i-2,j,k) - 4.d0*u(m,i-1,j,k) +
+     >                    5.d0*u(m,i,j,k) )
+            enddo
+         enddo
+      enddo
+!$omp end do nowait
+!$omp master
+      if (timeron) call timer_stop(t_rhsx)
+
+      if (timeron) call timer_start(t_rhsy)
+!$omp end master
+c---------------------------------------------------------------------
+c     compute eta-direction fluxes 
+c---------------------------------------------------------------------
+!$omp do schedule(static)
+      do k = 1, grid_points(3)-2
+         do j = 1, grid_points(2)-2
+            do i = 1, grid_points(1)-2
+               vijk = vs(i,j,k)
+               vp1  = vs(i,j+1,k)
+               vm1  = vs(i,j-1,k)
+               rhs(1,i,j,k) = rhs(1,i,j,k) + dy1ty1 * 
+     >                 (u(1,i,j+1,k) - 2.0d0*u(1,i,j,k) + 
+     >                 u(1,i,j-1,k)) -
+     >                 ty2 * (u(3,i,j+1,k) - u(3,i,j-1,k))
+               rhs(2,i,j,k) = rhs(2,i,j,k) + dy2ty1 * 
+     >                 (u(2,i,j+1,k) - 2.0d0*u(2,i,j,k) + 
+     >                 u(2,i,j-1,k)) +
+     >                 yycon2 * (us(i,j+1,k) - 2.0d0*us(i,j,k) + 
+     >                 us(i,j-1,k)) -
+     >                 ty2 * (u(2,i,j+1,k)*vp1 - 
+     >                 u(2,i,j-1,k)*vm1)
+               rhs(3,i,j,k) = rhs(3,i,j,k) + dy3ty1 * 
+     >                 (u(3,i,j+1,k) - 2.0d0*u(3,i,j,k) + 
+     >                 u(3,i,j-1,k)) +
+     >                 yycon2*con43 * (vp1 - 2.0d0*vijk + vm1) -
+     >                 ty2 * (u(3,i,j+1,k)*vp1 - 
+     >                 u(3,i,j-1,k)*vm1 +
+     >                 (u(5,i,j+1,k) - square(i,j+1,k) - 
+     >                 u(5,i,j-1,k) + square(i,j-1,k))
+     >                 *c2)
+               rhs(4,i,j,k) = rhs(4,i,j,k) + dy4ty1 * 
+     >                 (u(4,i,j+1,k) - 2.0d0*u(4,i,j,k) + 
+     >                 u(4,i,j-1,k)) +
+     >                 yycon2 * (ws(i,j+1,k) - 2.0d0*ws(i,j,k) + 
+     >                 ws(i,j-1,k)) -
+     >                 ty2 * (u(4,i,j+1,k)*vp1 - 
+     >                 u(4,i,j-1,k)*vm1)
+               rhs(5,i,j,k) = rhs(5,i,j,k) + dy5ty1 * 
+     >                 (u(5,i,j+1,k) - 2.0d0*u(5,i,j,k) + 
+     >                 u(5,i,j-1,k)) +
+     >                 yycon3 * (qs(i,j+1,k) - 2.0d0*qs(i,j,k) + 
+     >                 qs(i,j-1,k)) +
+     >                 yycon4 * (vp1*vp1       - 2.0d0*vijk*vijk + 
+     >                 vm1*vm1) +
+     >                 yycon5 * (u(5,i,j+1,k)*rho_i(i,j+1,k) - 
+     >                 2.0d0*u(5,i,j,k)*rho_i(i,j,k) +
+     >                 u(5,i,j-1,k)*rho_i(i,j-1,k)) -
+     >                 ty2 * ((c1*u(5,i,j+1,k) - 
+     >                 c2*square(i,j+1,k)) * vp1 -
+     >                 (c1*u(5,i,j-1,k) - 
+     >                 c2*square(i,j-1,k)) * vm1)
+            enddo
+         enddo
+
+c---------------------------------------------------------------------
+c     add fourth order eta-direction dissipation         
+c---------------------------------------------------------------------
+         j = 1
+         do i = 1, grid_points(1)-2
+            do m = 1, 5
+               rhs(m,i,j,k) = rhs(m,i,j,k)- dssp * 
+     >                    ( 5.0d0*u(m,i,j,k) - 4.0d0*u(m,i,j+1,k) +
+     >                    u(m,i,j+2,k))
+            enddo
+         enddo
+
+         j = 2
+         do i = 1, grid_points(1)-2
+            do m = 1, 5
+               rhs(m,i,j,k) = rhs(m,i,j,k) - dssp * 
+     >                    (-4.0d0*u(m,i,j-1,k) + 6.0d0*u(m,i,j,k) -
+     >                    4.0d0*u(m,i,j+1,k) + u(m,i,j+2,k))
+            enddo
+         enddo
+
+         do j = 3, grid_points(2)-4
+            do i = 1,grid_points(1)-2
+               do m = 1, 5
+                  rhs(m,i,j,k) = rhs(m,i,j,k) - dssp * 
+     >                    (  u(m,i,j-2,k) - 4.0d0*u(m,i,j-1,k) + 
+     >                    6.0d0*u(m,i,j,k) - 4.0d0*u(m,i,j+1,k) + 
+     >                    u(m,i,j+2,k) )
+               enddo
+            enddo
+         enddo
+         
+         j = grid_points(2)-3
+         do i = 1, grid_points(1)-2
+            do m = 1, 5
+               rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *
+     >                    ( u(m,i,j-2,k) - 4.0d0*u(m,i,j-1,k) + 
+     >                    6.0d0*u(m,i,j,k) - 4.0d0*u(m,i,j+1,k) )
+            enddo
+         enddo
+
+         j = grid_points(2)-2
+         do i = 1, grid_points(1)-2
+            do m = 1, 5
+               rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *
+     >                    ( u(m,i,j-2,k) - 4.d0*u(m,i,j-1,k) +
+     >                    5.d0*u(m,i,j,k) )
+            enddo
+         enddo
+      enddo
+!$omp end do
+!$omp master
+      if (timeron) call timer_stop(t_rhsy)
+
+      if (timeron) call timer_start(t_rhsz)
+!$omp end master
+c---------------------------------------------------------------------
+c     compute zeta-direction fluxes 
+c---------------------------------------------------------------------
+!$omp do schedule(static)
+      do k = 1, grid_points(3)-2
+         do j = 1, grid_points(2)-2
+            do i = 1, grid_points(1)-2
+               wijk = ws(i,j,k)
+               wp1  = ws(i,j,k+1)
+               wm1  = ws(i,j,k-1)
+
+               rhs(1,i,j,k) = rhs(1,i,j,k) + dz1tz1 * 
+     >                 (u(1,i,j,k+1) - 2.0d0*u(1,i,j,k) + 
+     >                 u(1,i,j,k-1)) -
+     >                 tz2 * (u(4,i,j,k+1) - u(4,i,j,k-1))
+               rhs(2,i,j,k) = rhs(2,i,j,k) + dz2tz1 * 
+     >                 (u(2,i,j,k+1) - 2.0d0*u(2,i,j,k) + 
+     >                 u(2,i,j,k-1)) +
+     >                 zzcon2 * (us(i,j,k+1) - 2.0d0*us(i,j,k) + 
+     >                 us(i,j,k-1)) -
+     >                 tz2 * (u(2,i,j,k+1)*wp1 - 
+     >                 u(2,i,j,k-1)*wm1)
+               rhs(3,i,j,k) = rhs(3,i,j,k) + dz3tz1 * 
+     >                 (u(3,i,j,k+1) - 2.0d0*u(3,i,j,k) + 
+     >                 u(3,i,j,k-1)) +
+     >                 zzcon2 * (vs(i,j,k+1) - 2.0d0*vs(i,j,k) + 
+     >                 vs(i,j,k-1)) -
+     >                 tz2 * (u(3,i,j,k+1)*wp1 - 
+     >                 u(3,i,j,k-1)*wm1)
+               rhs(4,i,j,k) = rhs(4,i,j,k) + dz4tz1 * 
+     >                 (u(4,i,j,k+1) - 2.0d0*u(4,i,j,k) + 
+     >                 u(4,i,j,k-1)) +
+     >                 zzcon2*con43 * (wp1 - 2.0d0*wijk + wm1) -
+     >                 tz2 * (u(4,i,j,k+1)*wp1 - 
+     >                 u(4,i,j,k-1)*wm1 +
+     >                 (u(5,i,j,k+1) - square(i,j,k+1) - 
+     >                 u(5,i,j,k-1) + square(i,j,k-1))
+     >                 *c2)
+               rhs(5,i,j,k) = rhs(5,i,j,k) + dz5tz1 * 
+     >                 (u(5,i,j,k+1) - 2.0d0*u(5,i,j,k) + 
+     >                 u(5,i,j,k-1)) +
+     >                 zzcon3 * (qs(i,j,k+1) - 2.0d0*qs(i,j,k) + 
+     >                 qs(i,j,k-1)) +
+     >                 zzcon4 * (wp1*wp1 - 2.0d0*wijk*wijk + 
+     >                 wm1*wm1) +
+     >                 zzcon5 * (u(5,i,j,k+1)*rho_i(i,j,k+1) - 
+     >                 2.0d0*u(5,i,j,k)*rho_i(i,j,k) +
+     >                 u(5,i,j,k-1)*rho_i(i,j,k-1)) -
+     >                 tz2 * ( (c1*u(5,i,j,k+1) - 
+     >                 c2*square(i,j,k+1))*wp1 -
+     >                 (c1*u(5,i,j,k-1) - 
+     >                 c2*square(i,j,k-1))*wm1)
+            enddo
+         enddo
+      enddo
+!$omp end do
+
+c---------------------------------------------------------------------
+c     add fourth order zeta-direction dissipation                
+c---------------------------------------------------------------------
+      k = 1
+!$omp do schedule(static)
+      do j = 1, grid_points(2)-2
+         do i = 1, grid_points(1)-2
+            do m = 1, 5
+               rhs(m,i,j,k) = rhs(m,i,j,k)- dssp * 
+     >                    ( 5.0d0*u(m,i,j,k) - 4.0d0*u(m,i,j,k+1) +
+     >                    u(m,i,j,k+2))
+            enddo
+         enddo
+      enddo
+!$omp end do nowait
+
+      k = 2
+!$omp do schedule(static)
+      do j = 1, grid_points(2)-2
+         do i = 1, grid_points(1)-2
+            do m = 1, 5
+               rhs(m,i,j,k) = rhs(m,i,j,k) - dssp * 
+     >                    (-4.0d0*u(m,i,j,k-1) + 6.0d0*u(m,i,j,k) -
+     >                    4.0d0*u(m,i,j,k+1) + u(m,i,j,k+2))
+            enddo
+         enddo
+      enddo
+!$omp end do nowait
+
+!$omp do schedule(static)
+      do k = 3, grid_points(3)-4
+         do j = 1, grid_points(2)-2
+            do i = 1,grid_points(1)-2
+               do m = 1, 5
+                  rhs(m,i,j,k) = rhs(m,i,j,k) - dssp * 
+     >                    (  u(m,i,j,k-2) - 4.0d0*u(m,i,j,k-1) + 
+     >                    6.0d0*u(m,i,j,k) - 4.0d0*u(m,i,j,k+1) + 
+     >                    u(m,i,j,k+2) )
+               enddo
+            enddo
+         enddo
+      enddo
+!$omp end do nowait
+         
+      k = grid_points(3)-3
+!$omp do schedule(static)
+      do j = 1, grid_points(2)-2
+         do i = 1, grid_points(1)-2
+            do m = 1, 5
+               rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *
+     >                    ( u(m,i,j,k-2) - 4.0d0*u(m,i,j,k-1) + 
+     >                    6.0d0*u(m,i,j,k) - 4.0d0*u(m,i,j,k+1) )
+            enddo
+         enddo
+      enddo
+!$omp end do nowait
+
+      k = grid_points(3)-2
+!$omp do schedule(static)
+      do j = 1, grid_points(2)-2
+         do i = 1, grid_points(1)-2
+            do m = 1, 5
+               rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *
+     >                    ( u(m,i,j,k-2) - 4.d0*u(m,i,j,k-1) +
+     >                    5.d0*u(m,i,j,k) )
+            enddo
+         enddo
+      enddo
+!$omp end do
+!$omp master
+      if (timeron) call timer_stop(t_rhsz)
+!$omp end master
+
+!$omp do schedule(static)
+      do k = 1, grid_points(3)-2
+         do j = 1, grid_points(2)-2
+            do i = 1, grid_points(1)-2
+               do m = 1, 5
+                  rhs(m,i,j,k) = rhs(m,i,j,k) * dt
+               enddo
+            enddo
+         enddo
+      enddo
+!$omp end do nowait
+!$omp end parallel
+      if (timeron) call timer_stop(t_rhs)
+
+      return
+      end
+
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/set_constants.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/set_constants.f
new file mode 100644
index 0000000..6492e42
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/set_constants.f
@@ -0,0 +1,200 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine  set_constants
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      
+      ce(1,1)  = 2.0d0
+      ce(1,2)  = 0.0d0
+      ce(1,3)  = 0.0d0
+      ce(1,4)  = 4.0d0
+      ce(1,5)  = 5.0d0
+      ce(1,6)  = 3.0d0
+      ce(1,7)  = 0.5d0
+      ce(1,8)  = 0.02d0
+      ce(1,9)  = 0.01d0
+      ce(1,10) = 0.03d0
+      ce(1,11) = 0.5d0
+      ce(1,12) = 0.4d0
+      ce(1,13) = 0.3d0
+      
+      ce(2,1)  = 1.0d0
+      ce(2,2)  = 0.0d0
+      ce(2,3)  = 0.0d0
+      ce(2,4)  = 0.0d0
+      ce(2,5)  = 1.0d0
+      ce(2,6)  = 2.0d0
+      ce(2,7)  = 3.0d0
+      ce(2,8)  = 0.01d0
+      ce(2,9)  = 0.03d0
+      ce(2,10) = 0.02d0
+      ce(2,11) = 0.4d0
+      ce(2,12) = 0.3d0
+      ce(2,13) = 0.5d0
+
+      ce(3,1)  = 2.0d0
+      ce(3,2)  = 2.0d0
+      ce(3,3)  = 0.0d0
+      ce(3,4)  = 0.0d0
+      ce(3,5)  = 0.0d0
+      ce(3,6)  = 2.0d0
+      ce(3,7)  = 3.0d0
+      ce(3,8)  = 0.04d0
+      ce(3,9)  = 0.03d0
+      ce(3,10) = 0.05d0
+      ce(3,11) = 0.3d0
+      ce(3,12) = 0.5d0
+      ce(3,13) = 0.4d0
+
+      ce(4,1)  = 2.0d0
+      ce(4,2)  = 2.0d0
+      ce(4,3)  = 0.0d0
+      ce(4,4)  = 0.0d0
+      ce(4,5)  = 0.0d0
+      ce(4,6)  = 2.0d0
+      ce(4,7)  = 3.0d0
+      ce(4,8)  = 0.03d0
+      ce(4,9)  = 0.05d0
+      ce(4,10) = 0.04d0
+      ce(4,11) = 0.2d0
+      ce(4,12) = 0.1d0
+      ce(4,13) = 0.3d0
+
+      ce(5,1)  = 5.0d0
+      ce(5,2)  = 4.0d0
+      ce(5,3)  = 3.0d0
+      ce(5,4)  = 2.0d0
+      ce(5,5)  = 0.1d0
+      ce(5,6)  = 0.4d0
+      ce(5,7)  = 0.3d0
+      ce(5,8)  = 0.05d0
+      ce(5,9)  = 0.04d0
+      ce(5,10) = 0.03d0
+      ce(5,11) = 0.1d0
+      ce(5,12) = 0.3d0
+      ce(5,13) = 0.2d0
+
+      c1 = 1.4d0
+      c2 = 0.4d0
+      c3 = 0.1d0
+      c4 = 1.0d0
+      c5 = 1.4d0
+
+      dnxm1 = 1.0d0 / dble(grid_points(1)-1)
+      dnym1 = 1.0d0 / dble(grid_points(2)-1)
+      dnzm1 = 1.0d0 / dble(grid_points(3)-1)
+
+      c1c2 = c1 * c2
+      c1c5 = c1 * c5
+      c3c4 = c3 * c4
+      c1345 = c1c5 * c3c4
+
+      conz1 = (1.0d0-c1c5)
+
+      tx1 = 1.0d0 / (dnxm1 * dnxm1)
+      tx2 = 1.0d0 / (2.0d0 * dnxm1)
+      tx3 = 1.0d0 / dnxm1
+
+      ty1 = 1.0d0 / (dnym1 * dnym1)
+      ty2 = 1.0d0 / (2.0d0 * dnym1)
+      ty3 = 1.0d0 / dnym1
+      
+      tz1 = 1.0d0 / (dnzm1 * dnzm1)
+      tz2 = 1.0d0 / (2.0d0 * dnzm1)
+      tz3 = 1.0d0 / dnzm1
+
+      dx1 = 0.75d0
+      dx2 = 0.75d0
+      dx3 = 0.75d0
+      dx4 = 0.75d0
+      dx5 = 0.75d0
+
+      dy1 = 0.75d0
+      dy2 = 0.75d0
+      dy3 = 0.75d0
+      dy4 = 0.75d0
+      dy5 = 0.75d0
+
+      dz1 = 1.0d0
+      dz2 = 1.0d0
+      dz3 = 1.0d0
+      dz4 = 1.0d0
+      dz5 = 1.0d0
+
+      dxmax = dmax1(dx3, dx4)
+      dymax = dmax1(dy2, dy4)
+      dzmax = dmax1(dz2, dz3)
+
+      dssp = 0.25d0 * dmax1(dx1, dmax1(dy1, dz1) )
+
+      c4dssp = 4.0d0 * dssp
+      c5dssp = 5.0d0 * dssp
+
+      dttx1 = dt*tx1
+      dttx2 = dt*tx2
+      dtty1 = dt*ty1
+      dtty2 = dt*ty2
+      dttz1 = dt*tz1
+      dttz2 = dt*tz2
+
+      c2dttx1 = 2.0d0*dttx1
+      c2dtty1 = 2.0d0*dtty1
+      c2dttz1 = 2.0d0*dttz1
+
+      dtdssp = dt*dssp
+
+      comz1  = dtdssp
+      comz4  = 4.0d0*dtdssp
+      comz5  = 5.0d0*dtdssp
+      comz6  = 6.0d0*dtdssp
+
+      c3c4tx3 = c3c4*tx3
+      c3c4ty3 = c3c4*ty3
+      c3c4tz3 = c3c4*tz3
+
+      dx1tx1 = dx1*tx1
+      dx2tx1 = dx2*tx1
+      dx3tx1 = dx3*tx1
+      dx4tx1 = dx4*tx1
+      dx5tx1 = dx5*tx1
+      
+      dy1ty1 = dy1*ty1
+      dy2ty1 = dy2*ty1
+      dy3ty1 = dy3*ty1
+      dy4ty1 = dy4*ty1
+      dy5ty1 = dy5*ty1
+      
+      dz1tz1 = dz1*tz1
+      dz2tz1 = dz2*tz1
+      dz3tz1 = dz3*tz1
+      dz4tz1 = dz4*tz1
+      dz5tz1 = dz5*tz1
+
+      c2iv  = 2.5d0
+      con43 = 4.0d0/3.0d0
+      con16 = 1.0d0/6.0d0
+      
+      xxcon1 = c3c4tx3*con43*tx3
+      xxcon2 = c3c4tx3*tx3
+      xxcon3 = c3c4tx3*conz1*tx3
+      xxcon4 = c3c4tx3*con16*tx3
+      xxcon5 = c3c4tx3*c1c5*tx3
+
+      yycon1 = c3c4ty3*con43*ty3
+      yycon2 = c3c4ty3*ty3
+      yycon3 = c3c4ty3*conz1*ty3
+      yycon4 = c3c4ty3*con16*ty3
+      yycon5 = c3c4ty3*c1c5*ty3
+
+      zzcon1 = c3c4tz3*con43*tz3
+      zzcon2 = c3c4tz3*tz3
+      zzcon3 = c3c4tz3*conz1*tz3
+      zzcon4 = c3c4tz3*con16*tz3
+      zzcon5 = c3c4tz3*c1c5*tz3
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/solve_subs.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/solve_subs.f
new file mode 100644
index 0000000..b2e5479
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/solve_subs.f
@@ -0,0 +1,642 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine matvec_sub(ablock,avec,bvec)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     subtracts bvec=bvec - ablock*avec
+c---------------------------------------------------------------------
+
+      implicit none
+
+      double precision ablock,avec,bvec
+      dimension ablock(5,5),avec(5),bvec(5)
+
+c---------------------------------------------------------------------
+c            rhs(i,ic,jc,kc) = rhs(i,ic,jc,kc) 
+c     $           - lhs(i,1,ablock,ia)*
+c---------------------------------------------------------------------
+         bvec(1) = bvec(1) - ablock(1,1)*avec(1)
+     >                     - ablock(1,2)*avec(2)
+     >                     - ablock(1,3)*avec(3)
+     >                     - ablock(1,4)*avec(4)
+     >                     - ablock(1,5)*avec(5)
+         bvec(2) = bvec(2) - ablock(2,1)*avec(1)
+     >                     - ablock(2,2)*avec(2)
+     >                     - ablock(2,3)*avec(3)
+     >                     - ablock(2,4)*avec(4)
+     >                     - ablock(2,5)*avec(5)
+         bvec(3) = bvec(3) - ablock(3,1)*avec(1)
+     >                     - ablock(3,2)*avec(2)
+     >                     - ablock(3,3)*avec(3)
+     >                     - ablock(3,4)*avec(4)
+     >                     - ablock(3,5)*avec(5)
+         bvec(4) = bvec(4) - ablock(4,1)*avec(1)
+     >                     - ablock(4,2)*avec(2)
+     >                     - ablock(4,3)*avec(3)
+     >                     - ablock(4,4)*avec(4)
+     >                     - ablock(4,5)*avec(5)
+         bvec(5) = bvec(5) - ablock(5,1)*avec(1)
+     >                     - ablock(5,2)*avec(2)
+     >                     - ablock(5,3)*avec(3)
+     >                     - ablock(5,4)*avec(4)
+     >                     - ablock(5,5)*avec(5)
+
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine matmul_sub(ablock, bblock, cblock)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     computes cblock = cblock - ablock*bblock (5x5 blocks)
+c---------------------------------------------------------------------
+
+      implicit none
+
+      double precision ablock, bblock, cblock
+      dimension ablock(5,5), bblock(5,5), cblock(5,5)
+
+
+         cblock(1,1) = cblock(1,1) - ablock(1,1)*bblock(1,1)
+     >                             - ablock(1,2)*bblock(2,1)
+     >                             - ablock(1,3)*bblock(3,1)
+     >                             - ablock(1,4)*bblock(4,1)
+     >                             - ablock(1,5)*bblock(5,1)
+         cblock(2,1) = cblock(2,1) - ablock(2,1)*bblock(1,1)
+     >                             - ablock(2,2)*bblock(2,1)
+     >                             - ablock(2,3)*bblock(3,1)
+     >                             - ablock(2,4)*bblock(4,1)
+     >                             - ablock(2,5)*bblock(5,1)
+         cblock(3,1) = cblock(3,1) - ablock(3,1)*bblock(1,1)
+     >                             - ablock(3,2)*bblock(2,1)
+     >                             - ablock(3,3)*bblock(3,1)
+     >                             - ablock(3,4)*bblock(4,1)
+     >                             - ablock(3,5)*bblock(5,1)
+         cblock(4,1) = cblock(4,1) - ablock(4,1)*bblock(1,1)
+     >                             - ablock(4,2)*bblock(2,1)
+     >                             - ablock(4,3)*bblock(3,1)
+     >                             - ablock(4,4)*bblock(4,1)
+     >                             - ablock(4,5)*bblock(5,1)
+         cblock(5,1) = cblock(5,1) - ablock(5,1)*bblock(1,1)
+     >                             - ablock(5,2)*bblock(2,1)
+     >                             - ablock(5,3)*bblock(3,1)
+     >                             - ablock(5,4)*bblock(4,1)
+     >                             - ablock(5,5)*bblock(5,1)
+         cblock(1,2) = cblock(1,2) - ablock(1,1)*bblock(1,2)
+     >                             - ablock(1,2)*bblock(2,2)
+     >                             - ablock(1,3)*bblock(3,2)
+     >                             - ablock(1,4)*bblock(4,2)
+     >                             - ablock(1,5)*bblock(5,2)
+         cblock(2,2) = cblock(2,2) - ablock(2,1)*bblock(1,2)
+     >                             - ablock(2,2)*bblock(2,2)
+     >                             - ablock(2,3)*bblock(3,2)
+     >                             - ablock(2,4)*bblock(4,2)
+     >                             - ablock(2,5)*bblock(5,2)
+         cblock(3,2) = cblock(3,2) - ablock(3,1)*bblock(1,2)
+     >                             - ablock(3,2)*bblock(2,2)
+     >                             - ablock(3,3)*bblock(3,2)
+     >                             - ablock(3,4)*bblock(4,2)
+     >                             - ablock(3,5)*bblock(5,2)
+         cblock(4,2) = cblock(4,2) - ablock(4,1)*bblock(1,2)
+     >                             - ablock(4,2)*bblock(2,2)
+     >                             - ablock(4,3)*bblock(3,2)
+     >                             - ablock(4,4)*bblock(4,2)
+     >                             - ablock(4,5)*bblock(5,2)
+         cblock(5,2) = cblock(5,2) - ablock(5,1)*bblock(1,2)
+     >                             - ablock(5,2)*bblock(2,2)
+     >                             - ablock(5,3)*bblock(3,2)
+     >                             - ablock(5,4)*bblock(4,2)
+     >                             - ablock(5,5)*bblock(5,2)
+         cblock(1,3) = cblock(1,3) - ablock(1,1)*bblock(1,3)
+     >                             - ablock(1,2)*bblock(2,3)
+     >                             - ablock(1,3)*bblock(3,3)
+     >                             - ablock(1,4)*bblock(4,3)
+     >                             - ablock(1,5)*bblock(5,3)
+         cblock(2,3) = cblock(2,3) - ablock(2,1)*bblock(1,3)
+     >                             - ablock(2,2)*bblock(2,3)
+     >                             - ablock(2,3)*bblock(3,3)
+     >                             - ablock(2,4)*bblock(4,3)
+     >                             - ablock(2,5)*bblock(5,3)
+         cblock(3,3) = cblock(3,3) - ablock(3,1)*bblock(1,3)
+     >                             - ablock(3,2)*bblock(2,3)
+     >                             - ablock(3,3)*bblock(3,3)
+     >                             - ablock(3,4)*bblock(4,3)
+     >                             - ablock(3,5)*bblock(5,3)
+         cblock(4,3) = cblock(4,3) - ablock(4,1)*bblock(1,3)
+     >                             - ablock(4,2)*bblock(2,3)
+     >                             - ablock(4,3)*bblock(3,3)
+     >                             - ablock(4,4)*bblock(4,3)
+     >                             - ablock(4,5)*bblock(5,3)
+         cblock(5,3) = cblock(5,3) - ablock(5,1)*bblock(1,3)
+     >                             - ablock(5,2)*bblock(2,3)
+     >                             - ablock(5,3)*bblock(3,3)
+     >                             - ablock(5,4)*bblock(4,3)
+     >                             - ablock(5,5)*bblock(5,3)
+         cblock(1,4) = cblock(1,4) - ablock(1,1)*bblock(1,4)
+     >                             - ablock(1,2)*bblock(2,4)
+     >                             - ablock(1,3)*bblock(3,4)
+     >                             - ablock(1,4)*bblock(4,4)
+     >                             - ablock(1,5)*bblock(5,4)
+         cblock(2,4) = cblock(2,4) - ablock(2,1)*bblock(1,4)
+     >                             - ablock(2,2)*bblock(2,4)
+     >                             - ablock(2,3)*bblock(3,4)
+     >                             - ablock(2,4)*bblock(4,4)
+     >                             - ablock(2,5)*bblock(5,4)
+         cblock(3,4) = cblock(3,4) - ablock(3,1)*bblock(1,4)
+     >                             - ablock(3,2)*bblock(2,4)
+     >                             - ablock(3,3)*bblock(3,4)
+     >                             - ablock(3,4)*bblock(4,4)
+     >                             - ablock(3,5)*bblock(5,4)
+         cblock(4,4) = cblock(4,4) - ablock(4,1)*bblock(1,4)
+     >                             - ablock(4,2)*bblock(2,4)
+     >                             - ablock(4,3)*bblock(3,4)
+     >                             - ablock(4,4)*bblock(4,4)
+     >                             - ablock(4,5)*bblock(5,4)
+         cblock(5,4) = cblock(5,4) - ablock(5,1)*bblock(1,4)
+     >                             - ablock(5,2)*bblock(2,4)
+     >                             - ablock(5,3)*bblock(3,4)
+     >                             - ablock(5,4)*bblock(4,4)
+     >                             - ablock(5,5)*bblock(5,4)
+         cblock(1,5) = cblock(1,5) - ablock(1,1)*bblock(1,5)
+     >                             - ablock(1,2)*bblock(2,5)
+     >                             - ablock(1,3)*bblock(3,5)
+     >                             - ablock(1,4)*bblock(4,5)
+     >                             - ablock(1,5)*bblock(5,5)
+         cblock(2,5) = cblock(2,5) - ablock(2,1)*bblock(1,5)
+     >                             - ablock(2,2)*bblock(2,5)
+     >                             - ablock(2,3)*bblock(3,5)
+     >                             - ablock(2,4)*bblock(4,5)
+     >                             - ablock(2,5)*bblock(5,5)
+         cblock(3,5) = cblock(3,5) - ablock(3,1)*bblock(1,5)
+     >                             - ablock(3,2)*bblock(2,5)
+     >                             - ablock(3,3)*bblock(3,5)
+     >                             - ablock(3,4)*bblock(4,5)
+     >                             - ablock(3,5)*bblock(5,5)
+         cblock(4,5) = cblock(4,5) - ablock(4,1)*bblock(1,5)
+     >                             - ablock(4,2)*bblock(2,5)
+     >                             - ablock(4,3)*bblock(3,5)
+     >                             - ablock(4,4)*bblock(4,5)
+     >                             - ablock(4,5)*bblock(5,5)
+         cblock(5,5) = cblock(5,5) - ablock(5,1)*bblock(1,5)
+     >                             - ablock(5,2)*bblock(2,5)
+     >                             - ablock(5,3)*bblock(3,5)
+     >                             - ablock(5,4)*bblock(4,5)
+     >                             - ablock(5,5)*bblock(5,5)
+
+              
+      return
+      end
+
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine binvcrhs( lhs,c,r )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     Gauss-Jordan without pivoting: c = lhs**(-1)*c, r = lhs**(-1)*r
+c---------------------------------------------------------------------
+
+      implicit none
+
+      double precision pivot, coeff, lhs
+      dimension lhs(5,5)
+      double precision c(5,5), r(5)
+
+c---------------------------------------------------------------------
+c     
+c---------------------------------------------------------------------
+
+      pivot = 1.00d0/lhs(1,1)
+      lhs(1,2) = lhs(1,2)*pivot
+      lhs(1,3) = lhs(1,3)*pivot
+      lhs(1,4) = lhs(1,4)*pivot
+      lhs(1,5) = lhs(1,5)*pivot
+      c(1,1) = c(1,1)*pivot
+      c(1,2) = c(1,2)*pivot
+      c(1,3) = c(1,3)*pivot
+      c(1,4) = c(1,4)*pivot
+      c(1,5) = c(1,5)*pivot
+      r(1)   = r(1)  *pivot
+
+      coeff = lhs(2,1)
+      lhs(2,2)= lhs(2,2) - coeff*lhs(1,2)
+      lhs(2,3)= lhs(2,3) - coeff*lhs(1,3)
+      lhs(2,4)= lhs(2,4) - coeff*lhs(1,4)
+      lhs(2,5)= lhs(2,5) - coeff*lhs(1,5)
+      c(2,1) = c(2,1) - coeff*c(1,1)
+      c(2,2) = c(2,2) - coeff*c(1,2)
+      c(2,3) = c(2,3) - coeff*c(1,3)
+      c(2,4) = c(2,4) - coeff*c(1,4)
+      c(2,5) = c(2,5) - coeff*c(1,5)
+      r(2)   = r(2)   - coeff*r(1)
+
+      coeff = lhs(3,1)
+      lhs(3,2)= lhs(3,2) - coeff*lhs(1,2)
+      lhs(3,3)= lhs(3,3) - coeff*lhs(1,3)
+      lhs(3,4)= lhs(3,4) - coeff*lhs(1,4)
+      lhs(3,5)= lhs(3,5) - coeff*lhs(1,5)
+      c(3,1) = c(3,1) - coeff*c(1,1)
+      c(3,2) = c(3,2) - coeff*c(1,2)
+      c(3,3) = c(3,3) - coeff*c(1,3)
+      c(3,4) = c(3,4) - coeff*c(1,4)
+      c(3,5) = c(3,5) - coeff*c(1,5)
+      r(3)   = r(3)   - coeff*r(1)
+
+      coeff = lhs(4,1)
+      lhs(4,2)= lhs(4,2) - coeff*lhs(1,2)
+      lhs(4,3)= lhs(4,3) - coeff*lhs(1,3)
+      lhs(4,4)= lhs(4,4) - coeff*lhs(1,4)
+      lhs(4,5)= lhs(4,5) - coeff*lhs(1,5)
+      c(4,1) = c(4,1) - coeff*c(1,1)
+      c(4,2) = c(4,2) - coeff*c(1,2)
+      c(4,3) = c(4,3) - coeff*c(1,3)
+      c(4,4) = c(4,4) - coeff*c(1,4)
+      c(4,5) = c(4,5) - coeff*c(1,5)
+      r(4)   = r(4)   - coeff*r(1)
+
+      coeff = lhs(5,1)
+      lhs(5,2)= lhs(5,2) - coeff*lhs(1,2)
+      lhs(5,3)= lhs(5,3) - coeff*lhs(1,3)
+      lhs(5,4)= lhs(5,4) - coeff*lhs(1,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(1,5)
+      c(5,1) = c(5,1) - coeff*c(1,1)
+      c(5,2) = c(5,2) - coeff*c(1,2)
+      c(5,3) = c(5,3) - coeff*c(1,3)
+      c(5,4) = c(5,4) - coeff*c(1,4)
+      c(5,5) = c(5,5) - coeff*c(1,5)
+      r(5)   = r(5)   - coeff*r(1)
+
+
+      pivot = 1.00d0/lhs(2,2)
+      lhs(2,3) = lhs(2,3)*pivot
+      lhs(2,4) = lhs(2,4)*pivot
+      lhs(2,5) = lhs(2,5)*pivot
+      c(2,1) = c(2,1)*pivot
+      c(2,2) = c(2,2)*pivot
+      c(2,3) = c(2,3)*pivot
+      c(2,4) = c(2,4)*pivot
+      c(2,5) = c(2,5)*pivot
+      r(2)   = r(2)  *pivot
+
+      coeff = lhs(1,2)
+      lhs(1,3)= lhs(1,3) - coeff*lhs(2,3)
+      lhs(1,4)= lhs(1,4) - coeff*lhs(2,4)
+      lhs(1,5)= lhs(1,5) - coeff*lhs(2,5)
+      c(1,1) = c(1,1) - coeff*c(2,1)
+      c(1,2) = c(1,2) - coeff*c(2,2)
+      c(1,3) = c(1,3) - coeff*c(2,3)
+      c(1,4) = c(1,4) - coeff*c(2,4)
+      c(1,5) = c(1,5) - coeff*c(2,5)
+      r(1)   = r(1)   - coeff*r(2)
+
+      coeff = lhs(3,2)
+      lhs(3,3)= lhs(3,3) - coeff*lhs(2,3)
+      lhs(3,4)= lhs(3,4) - coeff*lhs(2,4)
+      lhs(3,5)= lhs(3,5) - coeff*lhs(2,5)
+      c(3,1) = c(3,1) - coeff*c(2,1)
+      c(3,2) = c(3,2) - coeff*c(2,2)
+      c(3,3) = c(3,3) - coeff*c(2,3)
+      c(3,4) = c(3,4) - coeff*c(2,4)
+      c(3,5) = c(3,5) - coeff*c(2,5)
+      r(3)   = r(3)   - coeff*r(2)
+
+      coeff = lhs(4,2)
+      lhs(4,3)= lhs(4,3) - coeff*lhs(2,3)
+      lhs(4,4)= lhs(4,4) - coeff*lhs(2,4)
+      lhs(4,5)= lhs(4,5) - coeff*lhs(2,5)
+      c(4,1) = c(4,1) - coeff*c(2,1)
+      c(4,2) = c(4,2) - coeff*c(2,2)
+      c(4,3) = c(4,3) - coeff*c(2,3)
+      c(4,4) = c(4,4) - coeff*c(2,4)
+      c(4,5) = c(4,5) - coeff*c(2,5)
+      r(4)   = r(4)   - coeff*r(2)
+
+      coeff = lhs(5,2)
+      lhs(5,3)= lhs(5,3) - coeff*lhs(2,3)
+      lhs(5,4)= lhs(5,4) - coeff*lhs(2,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(2,5)
+      c(5,1) = c(5,1) - coeff*c(2,1)
+      c(5,2) = c(5,2) - coeff*c(2,2)
+      c(5,3) = c(5,3) - coeff*c(2,3)
+      c(5,4) = c(5,4) - coeff*c(2,4)
+      c(5,5) = c(5,5) - coeff*c(2,5)
+      r(5)   = r(5)   - coeff*r(2)
+
+
+      pivot = 1.00d0/lhs(3,3)
+      lhs(3,4) = lhs(3,4)*pivot
+      lhs(3,5) = lhs(3,5)*pivot
+      c(3,1) = c(3,1)*pivot
+      c(3,2) = c(3,2)*pivot
+      c(3,3) = c(3,3)*pivot
+      c(3,4) = c(3,4)*pivot
+      c(3,5) = c(3,5)*pivot
+      r(3)   = r(3)  *pivot
+
+      coeff = lhs(1,3)
+      lhs(1,4)= lhs(1,4) - coeff*lhs(3,4)
+      lhs(1,5)= lhs(1,5) - coeff*lhs(3,5)
+      c(1,1) = c(1,1) - coeff*c(3,1)
+      c(1,2) = c(1,2) - coeff*c(3,2)
+      c(1,3) = c(1,3) - coeff*c(3,3)
+      c(1,4) = c(1,4) - coeff*c(3,4)
+      c(1,5) = c(1,5) - coeff*c(3,5)
+      r(1)   = r(1)   - coeff*r(3)
+
+      coeff = lhs(2,3)
+      lhs(2,4)= lhs(2,4) - coeff*lhs(3,4)
+      lhs(2,5)= lhs(2,5) - coeff*lhs(3,5)
+      c(2,1) = c(2,1) - coeff*c(3,1)
+      c(2,2) = c(2,2) - coeff*c(3,2)
+      c(2,3) = c(2,3) - coeff*c(3,3)
+      c(2,4) = c(2,4) - coeff*c(3,4)
+      c(2,5) = c(2,5) - coeff*c(3,5)
+      r(2)   = r(2)   - coeff*r(3)
+
+      coeff = lhs(4,3)
+      lhs(4,4)= lhs(4,4) - coeff*lhs(3,4)
+      lhs(4,5)= lhs(4,5) - coeff*lhs(3,5)
+      c(4,1) = c(4,1) - coeff*c(3,1)
+      c(4,2) = c(4,2) - coeff*c(3,2)
+      c(4,3) = c(4,3) - coeff*c(3,3)
+      c(4,4) = c(4,4) - coeff*c(3,4)
+      c(4,5) = c(4,5) - coeff*c(3,5)
+      r(4)   = r(4)   - coeff*r(3)
+
+      coeff = lhs(5,3)
+      lhs(5,4)= lhs(5,4) - coeff*lhs(3,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(3,5)
+      c(5,1) = c(5,1) - coeff*c(3,1)
+      c(5,2) = c(5,2) - coeff*c(3,2)
+      c(5,3) = c(5,3) - coeff*c(3,3)
+      c(5,4) = c(5,4) - coeff*c(3,4)
+      c(5,5) = c(5,5) - coeff*c(3,5)
+      r(5)   = r(5)   - coeff*r(3)
+
+
+      pivot = 1.00d0/lhs(4,4)
+      lhs(4,5) = lhs(4,5)*pivot
+      c(4,1) = c(4,1)*pivot
+      c(4,2) = c(4,2)*pivot
+      c(4,3) = c(4,3)*pivot
+      c(4,4) = c(4,4)*pivot
+      c(4,5) = c(4,5)*pivot
+      r(4)   = r(4)  *pivot
+
+      coeff = lhs(1,4)
+      lhs(1,5)= lhs(1,5) - coeff*lhs(4,5)
+      c(1,1) = c(1,1) - coeff*c(4,1)
+      c(1,2) = c(1,2) - coeff*c(4,2)
+      c(1,3) = c(1,3) - coeff*c(4,3)
+      c(1,4) = c(1,4) - coeff*c(4,4)
+      c(1,5) = c(1,5) - coeff*c(4,5)
+      r(1)   = r(1)   - coeff*r(4)
+
+      coeff = lhs(2,4)
+      lhs(2,5)= lhs(2,5) - coeff*lhs(4,5)
+      c(2,1) = c(2,1) - coeff*c(4,1)
+      c(2,2) = c(2,2) - coeff*c(4,2)
+      c(2,3) = c(2,3) - coeff*c(4,3)
+      c(2,4) = c(2,4) - coeff*c(4,4)
+      c(2,5) = c(2,5) - coeff*c(4,5)
+      r(2)   = r(2)   - coeff*r(4)
+
+      coeff = lhs(3,4)
+      lhs(3,5)= lhs(3,5) - coeff*lhs(4,5)
+      c(3,1) = c(3,1) - coeff*c(4,1)
+      c(3,2) = c(3,2) - coeff*c(4,2)
+      c(3,3) = c(3,3) - coeff*c(4,3)
+      c(3,4) = c(3,4) - coeff*c(4,4)
+      c(3,5) = c(3,5) - coeff*c(4,5)
+      r(3)   = r(3)   - coeff*r(4)
+
+      coeff = lhs(5,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(4,5)
+      c(5,1) = c(5,1) - coeff*c(4,1)
+      c(5,2) = c(5,2) - coeff*c(4,2)
+      c(5,3) = c(5,3) - coeff*c(4,3)
+      c(5,4) = c(5,4) - coeff*c(4,4)
+      c(5,5) = c(5,5) - coeff*c(4,5)
+      r(5)   = r(5)   - coeff*r(4)
+
+
+      pivot = 1.00d0/lhs(5,5)
+      c(5,1) = c(5,1)*pivot
+      c(5,2) = c(5,2)*pivot
+      c(5,3) = c(5,3)*pivot
+      c(5,4) = c(5,4)*pivot
+      c(5,5) = c(5,5)*pivot
+      r(5)   = r(5)  *pivot
+
+      coeff = lhs(1,5)
+      c(1,1) = c(1,1) - coeff*c(5,1)
+      c(1,2) = c(1,2) - coeff*c(5,2)
+      c(1,3) = c(1,3) - coeff*c(5,3)
+      c(1,4) = c(1,4) - coeff*c(5,4)
+      c(1,5) = c(1,5) - coeff*c(5,5)
+      r(1)   = r(1)   - coeff*r(5)
+
+      coeff = lhs(2,5)
+      c(2,1) = c(2,1) - coeff*c(5,1)
+      c(2,2) = c(2,2) - coeff*c(5,2)
+      c(2,3) = c(2,3) - coeff*c(5,3)
+      c(2,4) = c(2,4) - coeff*c(5,4)
+      c(2,5) = c(2,5) - coeff*c(5,5)
+      r(2)   = r(2)   - coeff*r(5)
+
+      coeff = lhs(3,5)
+      c(3,1) = c(3,1) - coeff*c(5,1)
+      c(3,2) = c(3,2) - coeff*c(5,2)
+      c(3,3) = c(3,3) - coeff*c(5,3)
+      c(3,4) = c(3,4) - coeff*c(5,4)
+      c(3,5) = c(3,5) - coeff*c(5,5)
+      r(3)   = r(3)   - coeff*r(5)
+
+      coeff = lhs(4,5)
+      c(4,1) = c(4,1) - coeff*c(5,1)
+      c(4,2) = c(4,2) - coeff*c(5,2)
+      c(4,3) = c(4,3) - coeff*c(5,3)
+      c(4,4) = c(4,4) - coeff*c(5,4)
+      c(4,5) = c(4,5) - coeff*c(5,5)
+      r(4)   = r(4)   - coeff*r(5)
+
+
+      return
+      end
+
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine binvrhs( lhs,r )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     Gauss-Jordan without pivoting: r = lhs**(-1)*r
+c---------------------------------------------------------------------
+
+      implicit none
+
+      double precision pivot, coeff, lhs
+      dimension lhs(5,5)
+      double precision r(5)
+
+c---------------------------------------------------------------------
+c     
+c---------------------------------------------------------------------
+
+
+      pivot = 1.00d0/lhs(1,1)
+      lhs(1,2) = lhs(1,2)*pivot
+      lhs(1,3) = lhs(1,3)*pivot
+      lhs(1,4) = lhs(1,4)*pivot
+      lhs(1,5) = lhs(1,5)*pivot
+      r(1)   = r(1)  *pivot
+
+      coeff = lhs(2,1)
+      lhs(2,2)= lhs(2,2) - coeff*lhs(1,2)
+      lhs(2,3)= lhs(2,3) - coeff*lhs(1,3)
+      lhs(2,4)= lhs(2,4) - coeff*lhs(1,4)
+      lhs(2,5)= lhs(2,5) - coeff*lhs(1,5)
+      r(2)   = r(2)   - coeff*r(1)
+
+      coeff = lhs(3,1)
+      lhs(3,2)= lhs(3,2) - coeff*lhs(1,2)
+      lhs(3,3)= lhs(3,3) - coeff*lhs(1,3)
+      lhs(3,4)= lhs(3,4) - coeff*lhs(1,4)
+      lhs(3,5)= lhs(3,5) - coeff*lhs(1,5)
+      r(3)   = r(3)   - coeff*r(1)
+
+      coeff = lhs(4,1)
+      lhs(4,2)= lhs(4,2) - coeff*lhs(1,2)
+      lhs(4,3)= lhs(4,3) - coeff*lhs(1,3)
+      lhs(4,4)= lhs(4,4) - coeff*lhs(1,4)
+      lhs(4,5)= lhs(4,5) - coeff*lhs(1,5)
+      r(4)   = r(4)   - coeff*r(1)
+
+      coeff = lhs(5,1)
+      lhs(5,2)= lhs(5,2) - coeff*lhs(1,2)
+      lhs(5,3)= lhs(5,3) - coeff*lhs(1,3)
+      lhs(5,4)= lhs(5,4) - coeff*lhs(1,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(1,5)
+      r(5)   = r(5)   - coeff*r(1)
+
+
+      pivot = 1.00d0/lhs(2,2)
+      lhs(2,3) = lhs(2,3)*pivot
+      lhs(2,4) = lhs(2,4)*pivot
+      lhs(2,5) = lhs(2,5)*pivot
+      r(2)   = r(2)  *pivot
+
+      coeff = lhs(1,2)
+      lhs(1,3)= lhs(1,3) - coeff*lhs(2,3)
+      lhs(1,4)= lhs(1,4) - coeff*lhs(2,4)
+      lhs(1,5)= lhs(1,5) - coeff*lhs(2,5)
+      r(1)   = r(1)   - coeff*r(2)
+
+      coeff = lhs(3,2)
+      lhs(3,3)= lhs(3,3) - coeff*lhs(2,3)
+      lhs(3,4)= lhs(3,4) - coeff*lhs(2,4)
+      lhs(3,5)= lhs(3,5) - coeff*lhs(2,5)
+      r(3)   = r(3)   - coeff*r(2)
+
+      coeff = lhs(4,2)
+      lhs(4,3)= lhs(4,3) - coeff*lhs(2,3)
+      lhs(4,4)= lhs(4,4) - coeff*lhs(2,4)
+      lhs(4,5)= lhs(4,5) - coeff*lhs(2,5)
+      r(4)   = r(4)   - coeff*r(2)
+
+      coeff = lhs(5,2)
+      lhs(5,3)= lhs(5,3) - coeff*lhs(2,3)
+      lhs(5,4)= lhs(5,4) - coeff*lhs(2,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(2,5)
+      r(5)   = r(5)   - coeff*r(2)
+
+
+      pivot = 1.00d0/lhs(3,3)
+      lhs(3,4) = lhs(3,4)*pivot
+      lhs(3,5) = lhs(3,5)*pivot
+      r(3)   = r(3)  *pivot
+
+      coeff = lhs(1,3)
+      lhs(1,4)= lhs(1,4) - coeff*lhs(3,4)
+      lhs(1,5)= lhs(1,5) - coeff*lhs(3,5)
+      r(1)   = r(1)   - coeff*r(3)
+
+      coeff = lhs(2,3)
+      lhs(2,4)= lhs(2,4) - coeff*lhs(3,4)
+      lhs(2,5)= lhs(2,5) - coeff*lhs(3,5)
+      r(2)   = r(2)   - coeff*r(3)
+
+      coeff = lhs(4,3)
+      lhs(4,4)= lhs(4,4) - coeff*lhs(3,4)
+      lhs(4,5)= lhs(4,5) - coeff*lhs(3,5)
+      r(4)   = r(4)   - coeff*r(3)
+
+      coeff = lhs(5,3)
+      lhs(5,4)= lhs(5,4) - coeff*lhs(3,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(3,5)
+      r(5)   = r(5)   - coeff*r(3)
+
+
+      pivot = 1.00d0/lhs(4,4)
+      lhs(4,5) = lhs(4,5)*pivot
+      r(4)   = r(4)  *pivot
+
+      coeff = lhs(1,4)
+      lhs(1,5)= lhs(1,5) - coeff*lhs(4,5)
+      r(1)   = r(1)   - coeff*r(4)
+
+      coeff = lhs(2,4)
+      lhs(2,5)= lhs(2,5) - coeff*lhs(4,5)
+      r(2)   = r(2)   - coeff*r(4)
+
+      coeff = lhs(3,4)
+      lhs(3,5)= lhs(3,5) - coeff*lhs(4,5)
+      r(3)   = r(3)   - coeff*r(4)
+
+      coeff = lhs(5,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(4,5)
+      r(5)   = r(5)   - coeff*r(4)
+
+
+      pivot = 1.00d0/lhs(5,5)
+      r(5)   = r(5)  *pivot
+
+      coeff = lhs(1,5)
+      r(1)   = r(1)   - coeff*r(5)
+
+      coeff = lhs(2,5)
+      r(2)   = r(2)   - coeff*r(5)
+
+      coeff = lhs(3,5)
+      r(3)   = r(3)   - coeff*r(5)
+
+      coeff = lhs(4,5)
+      r(4)   = r(4)   - coeff*r(5)
+
+
+      return
+      end
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/verify.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/verify.f
new file mode 100644
index 0000000..52551bf
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/verify.f
@@ -0,0 +1,358 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+        subroutine verify(no_time_steps, class, verified)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c  verification routine                         
+c---------------------------------------------------------------------
+
+        include 'header.h'
+
+        double precision xcrref(5),xceref(5),xcrdif(5),xcedif(5), 
+     >                   epsilon, xce(5), xcr(5), dtref
+        integer m, no_time_steps
+        character class
+        logical verified
+
+c---------------------------------------------------------------------
+c   tolerance level
+c---------------------------------------------------------------------
+        epsilon = 1.0d-08
+
+
+c---------------------------------------------------------------------
+c   compute the error norm and the residual norm (scaled by 1/dt)
+c---------------------------------------------------------------------
+        call error_norm(xce)
+        call compute_rhs
+
+        call rhs_norm(xcr)
+
+        do m = 1, 5
+           xcr(m) = xcr(m) / dt
+        enddo
+
+
+        class = 'U'
+        verified = .true.
+
+        do m = 1,5
+           xcrref(m) = 1.0d0
+           xceref(m) = 1.0d0
+        end do
+
+c---------------------------------------------------------------------
+c    reference data for 12X12X12 grids after 60 time steps, with DT = 1.0d-02
+c---------------------------------------------------------------------
+        if ( (grid_points(1)  .eq. 12     ) .and. 
+     >       (grid_points(2)  .eq. 12     ) .and.
+     >       (grid_points(3)  .eq. 12     ) .and.
+     >       (no_time_steps   .eq. 60    ))  then
+
+           class = 'S'
+           dtref = 1.0d-2
+
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+         xcrref(1) = 1.7034283709541311d-01
+         xcrref(2) = 1.2975252070034097d-02
+         xcrref(3) = 3.2527926989486055d-02
+         xcrref(4) = 2.6436421275166801d-02
+         xcrref(5) = 1.9211784131744430d-01
+
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+         xceref(1) = 4.9976913345811579d-04
+         xceref(2) = 4.5195666782961927d-05
+         xceref(3) = 7.3973765172921357d-05
+         xceref(4) = 7.3821238632439731d-05
+         xceref(5) = 8.9269630987491446d-04
+
+c---------------------------------------------------------------------
+c    reference data for 24X24X24 grids after 200 time steps, with DT = 0.8d-3
+c---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 24) .and. 
+     >           (grid_points(2) .eq. 24) .and.
+     >           (grid_points(3) .eq. 24) .and.
+     >           (no_time_steps .eq. 200) ) then
+
+           class = 'W'
+           dtref = 0.8d-3
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+           xcrref(1) = 0.1125590409344d+03
+           xcrref(2) = 0.1180007595731d+02
+           xcrref(3) = 0.2710329767846d+02
+           xcrref(4) = 0.2469174937669d+02
+           xcrref(5) = 0.2638427874317d+03
+
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+           xceref(1) = 0.4419655736008d+01
+           xceref(2) = 0.4638531260002d+00
+           xceref(3) = 0.1011551749967d+01
+           xceref(4) = 0.9235878729944d+00
+           xceref(5) = 0.1018045837718d+02
+
+
+c---------------------------------------------------------------------
+c    reference data for 64X64X64 grids after 200 time steps, with DT = 0.8d-3
+c---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 64) .and. 
+     >           (grid_points(2) .eq. 64) .and.
+     >           (grid_points(3) .eq. 64) .and.
+     >           (no_time_steps .eq. 200) ) then
+
+           class = 'A'
+           dtref = 0.8d-3
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+         xcrref(1) = 1.0806346714637264d+02
+         xcrref(2) = 1.1319730901220813d+01
+         xcrref(3) = 2.5974354511582465d+01
+         xcrref(4) = 2.3665622544678910d+01
+         xcrref(5) = 2.5278963211748344d+02
+
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+         xceref(1) = 4.2348416040525025d+00
+         xceref(2) = 4.4390282496995698d-01
+         xceref(3) = 9.6692480136345650d-01
+         xceref(4) = 8.8302063039765474d-01
+         xceref(5) = 9.7379901770829278d+00
+
+c---------------------------------------------------------------------
+c    reference data for 102X102X102 grids after 200 time steps,
+c    with DT = 3.0d-04
+c---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 102) .and. 
+     >           (grid_points(2) .eq. 102) .and.
+     >           (grid_points(3) .eq. 102) .and.
+     >           (no_time_steps .eq. 200) ) then
+
+           class = 'B'
+           dtref = 3.0d-4
+
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+         xcrref(1) = 1.4233597229287254d+03
+         xcrref(2) = 9.9330522590150238d+01
+         xcrref(3) = 3.5646025644535285d+02
+         xcrref(4) = 3.2485447959084092d+02
+         xcrref(5) = 3.2707541254659363d+03
+
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+         xceref(1) = 5.2969847140936856d+01
+         xceref(2) = 4.4632896115670668d+00
+         xceref(3) = 1.3122573342210174d+01
+         xceref(4) = 1.2006925323559144d+01
+         xceref(5) = 1.2459576151035986d+02
+
+c---------------------------------------------------------------------
+c    reference data for 162X162X162 grids after 200 time steps,
+c    with DT = 1.0d-04
+c---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 162) .and. 
+     >           (grid_points(2) .eq. 162) .and.
+     >           (grid_points(3) .eq. 162) .and.
+     >           (no_time_steps .eq. 200) ) then
+
+           class = 'C'
+           dtref = 1.0d-4
+
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+         xcrref(1) = 0.62398116551764615d+04
+         xcrref(2) = 0.50793239190423964d+03
+         xcrref(3) = 0.15423530093013596d+04
+         xcrref(4) = 0.13302387929291190d+04
+         xcrref(5) = 0.11604087428436455d+05
+
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+         xceref(1) = 0.16462008369091265d+03
+         xceref(2) = 0.11497107903824313d+02
+         xceref(3) = 0.41207446207461508d+02
+         xceref(4) = 0.37087651059694167d+02
+         xceref(5) = 0.36211053051841265d+03
+
+c---------------------------------------------------------------------
+c    reference data for 408x408x408 grids after 250 time steps,
+c    with DT = 0.2d-04
+c---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 408) .and. 
+     >           (grid_points(2) .eq. 408) .and.
+     >           (grid_points(3) .eq. 408) .and.
+     >           (no_time_steps .eq. 250) ) then
+
+           class = 'D'
+           dtref = 0.2d-4
+
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+         xcrref(1) = 0.2533188551738d+05
+         xcrref(2) = 0.2346393716980d+04
+         xcrref(3) = 0.6294554366904d+04
+         xcrref(4) = 0.5352565376030d+04
+         xcrref(5) = 0.3905864038618d+05
+
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+
+         xceref(1) = 0.3100009377557d+03
+         xceref(2) = 0.2424086324913d+02
+         xceref(3) = 0.7782212022645d+02
+         xceref(4) = 0.6835623860116d+02
+         xceref(5) = 0.6065737200368d+03
+
+
+c---------------------------------------------------------------------
+c    reference data for 1020x1020x1020 grids after 250 time steps,
+c    with DT = 0.4d-05
+c---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 1020) .and. 
+     >           (grid_points(2) .eq. 1020) .and.
+     >           (grid_points(3) .eq. 1020) .and.
+     >           (no_time_steps .eq. 250) ) then
+
+           class = 'E'
+           dtref = 0.4d-5
+
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+         xcrref(1) = 0.9795372484517d+05
+         xcrref(2) = 0.9739814511521d+04
+         xcrref(3) = 0.2467606342965d+05
+         xcrref(4) = 0.2092419572860d+05
+         xcrref(5) = 0.1392138856939d+06
+
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+
+         xceref(1) = 0.4327562208414d+03
+         xceref(2) = 0.3699051964887d+02
+         xceref(3) = 0.1089845040954d+03
+         xceref(4) = 0.9462517622043d+02
+         xceref(5) = 0.7765512765309d+03
+
+
+        else
+           verified = .false.
+        endif
+
+c---------------------------------------------------------------------
+c    verification test for residuals if gridsize is one of 
+c    the defined grid sizes above (class .ne. 'U')
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c    Compute the difference of solution values and the known reference values.
+c---------------------------------------------------------------------
+        do m = 1, 5
+           
+           xcrdif(m) = dabs((xcr(m)-xcrref(m))/xcrref(m)) 
+           xcedif(m) = dabs((xce(m)-xceref(m))/xceref(m))
+           
+        enddo
+
+c---------------------------------------------------------------------
+c    Output the comparison of computed results to known cases.
+c---------------------------------------------------------------------
+
+        if (class .ne. 'U') then
+           write(*, 1990) class
+ 1990      format(' Verification being performed for class ', a)
+           write (*,2000) epsilon
+ 2000      format(' accuracy setting for epsilon = ', E20.13)
+           verified = (dabs(dt-dtref) .le. epsilon)
+           if (.not.verified) then  
+              class = 'U'
+              write (*,1000) dtref
+ 1000         format(' DT does not match the reference value of ', 
+     >                 E15.8)
+           endif
+        else 
+           write(*, 1995)
+ 1995      format(' Unknown class')
+        endif
+
+
+        if (class .ne. 'U') then
+           write (*, 2001) 
+        else
+           write (*, 2005)
+        endif
+
+ 2001   format(' Comparison of RMS-norms of residual')
+ 2005   format(' RMS-norms of residual')
+        do m = 1, 5
+           if (class .eq. 'U') then
+              write(*, 2015) m, xcr(m)
+           else if (xcrdif(m) .le. epsilon) then
+              write (*,2011) m,xcr(m),xcrref(m),xcrdif(m)
+           else 
+              verified = .false.
+              write (*,2010) m,xcr(m),xcrref(m),xcrdif(m)
+           endif
+        enddo
+
+        if (class .ne. 'U') then
+           write (*,2002)
+        else
+           write (*,2006)
+        endif
+ 2002   format(' Comparison of RMS-norms of solution error')
+ 2006   format(' RMS-norms of solution error')
+        
+        do m = 1, 5
+           if (class .eq. 'U') then
+              write(*, 2015) m, xce(m)
+           else if (xcedif(m) .le. epsilon) then
+              write (*,2011) m,xce(m),xceref(m),xcedif(m)
+           else
+              verified = .false.
+              write (*,2010) m,xce(m),xceref(m),xcedif(m)
+           endif
+        enddo
+        
+ 2010   format(' FAILURE: ', i2, E20.13, E20.13, E20.13)
+ 2011   format('          ', i2, E20.13, E20.13, E20.13)
+ 2015   format('          ', i2, E20.13)
+        
+        if (class .eq. 'U') then
+           write(*, 2022)
+           write(*, 2023)
+ 2022      format(' No reference values provided')
+ 2023      format(' No verification performed')
+        else if (verified) then
+           write(*, 2020)
+ 2020      format(' Verification Successful')
+        else
+           write(*, 2021)
+ 2021      format(' Verification failed')
+        endif
+
+        return
+
+
+        end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/work_lhs.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/work_lhs.h
new file mode 100644
index 0000000..dc80893
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/work_lhs.h
@@ -0,0 +1,14 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+c
+c  work_lhs.h
+c
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+c
+      double precision fjac(5, 5,    0:problem_size),
+     >                 njac(5, 5,    0:problem_size),
+     >                 lhs (5, 5, 3, 0:problem_size),
+     >                 tmp1, tmp2, tmp3
+      common /work_lhs/ fjac, njac, lhs, tmp1, tmp2, tmp3
+!$omp threadprivate (/work_lhs/)
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/work_lhs_vec.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/work_lhs_vec.h
new file mode 100644
index 0000000..6aedbb8
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/work_lhs_vec.h
@@ -0,0 +1,14 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+c
+c  work_lhs_vec.h
+c
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+c
+      double precision fjac(5, 5,    0:problem_size, 0:problem_size),
+     >                 njac(5, 5,    0:problem_size, 0:problem_size),
+     >                 lhs (5, 5, 3, 0:problem_size, 0:problem_size),
+     >                 tmp1, tmp2, tmp3
+      common /work_lhs/ fjac, njac, lhs, tmp1, tmp2, tmp3
+!$omp threadprivate (/work_lhs/)
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/x_solve.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/x_solve.f
new file mode 100644
index 0000000..da65271
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/x_solve.f
@@ -0,0 +1,401 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine x_solve
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     
+c     Performs line solves in X direction by first factoring
+c     the block-tridiagonal matrix into an upper triangular matrix, 
+c     and then performing back substitution to solve for the unknown
+c     vectors of each line.  
+c     
+c     Make sure we treat elements zero to cell_size in the direction
+c     of the sweep.
+c     
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'work_lhs.h'
+
+      integer i,j,k,m,n,isize
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      if (timeron) call timer_start(t_xsolve)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     This function computes the left hand side in the xi-direction
+c---------------------------------------------------------------------
+
+      isize = grid_points(1)-1
+
+c---------------------------------------------------------------------
+c     determine a (labeled f) and n jacobians
+c---------------------------------------------------------------------
+!$omp parallel do default(shared) shared(isize)
+!$omp& private(i,j,k,m,n)
+      do k = 1, grid_points(3)-2
+         do j = 1, grid_points(2)-2
+            do i = 0, isize
+
+               tmp1 = rho_i(i,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+c---------------------------------------------------------------------
+c     
+c---------------------------------------------------------------------
+               fjac(1,1,i) = 0.0d+00
+               fjac(1,2,i) = 1.0d+00
+               fjac(1,3,i) = 0.0d+00
+               fjac(1,4,i) = 0.0d+00
+               fjac(1,5,i) = 0.0d+00
+
+               fjac(2,1,i) = -(u(2,i,j,k) * tmp2 * 
+     >              u(2,i,j,k))
+     >              + c2 * qs(i,j,k)
+               fjac(2,2,i) = ( 2.0d+00 - c2 )
+     >              * ( u(2,i,j,k) / u(1,i,j,k) )
+               fjac(2,3,i) = - c2 * ( u(3,i,j,k) * tmp1 )
+               fjac(2,4,i) = - c2 * ( u(4,i,j,k) * tmp1 )
+               fjac(2,5,i) = c2
+
+               fjac(3,1,i) = - ( u(2,i,j,k)*u(3,i,j,k) ) * tmp2
+               fjac(3,2,i) = u(3,i,j,k) * tmp1
+               fjac(3,3,i) = u(2,i,j,k) * tmp1
+               fjac(3,4,i) = 0.0d+00
+               fjac(3,5,i) = 0.0d+00
+
+               fjac(4,1,i) = - ( u(2,i,j,k)*u(4,i,j,k) ) * tmp2
+               fjac(4,2,i) = u(4,i,j,k) * tmp1
+               fjac(4,3,i) = 0.0d+00
+               fjac(4,4,i) = u(2,i,j,k) * tmp1
+               fjac(4,5,i) = 0.0d+00
+
+               fjac(5,1,i) = ( c2 * 2.0d0 * square(i,j,k)
+     >              - c1 * u(5,i,j,k) )
+     >              * ( u(2,i,j,k) * tmp2 )
+               fjac(5,2,i) = c1 *  u(5,i,j,k) * tmp1 
+     >              - c2
+     >              * ( u(2,i,j,k)*u(2,i,j,k) * tmp2
+     >              + qs(i,j,k) )
+               fjac(5,3,i) = - c2 * ( u(3,i,j,k)*u(2,i,j,k) )
+     >              * tmp2
+               fjac(5,4,i) = - c2 * ( u(4,i,j,k)*u(2,i,j,k) )
+     >              * tmp2
+               fjac(5,5,i) = c1 * ( u(2,i,j,k) * tmp1 )
+
+               njac(1,1,i) = 0.0d+00
+               njac(1,2,i) = 0.0d+00
+               njac(1,3,i) = 0.0d+00
+               njac(1,4,i) = 0.0d+00
+               njac(1,5,i) = 0.0d+00
+
+               njac(2,1,i) = - con43 * c3c4 * tmp2 * u(2,i,j,k)
+               njac(2,2,i) =   con43 * c3c4 * tmp1
+               njac(2,3,i) =   0.0d+00
+               njac(2,4,i) =   0.0d+00
+               njac(2,5,i) =   0.0d+00
+
+               njac(3,1,i) = - c3c4 * tmp2 * u(3,i,j,k)
+               njac(3,2,i) =   0.0d+00
+               njac(3,3,i) =   c3c4 * tmp1
+               njac(3,4,i) =   0.0d+00
+               njac(3,5,i) =   0.0d+00
+
+               njac(4,1,i) = - c3c4 * tmp2 * u(4,i,j,k)
+               njac(4,2,i) =   0.0d+00 
+               njac(4,3,i) =   0.0d+00
+               njac(4,4,i) =   c3c4 * tmp1
+               njac(4,5,i) =   0.0d+00
+
+               njac(5,1,i) = - ( con43 * c3c4
+     >              - c1345 ) * tmp3 * (u(2,i,j,k)**2)
+     >              - ( c3c4 - c1345 ) * tmp3 * (u(3,i,j,k)**2)
+     >              - ( c3c4 - c1345 ) * tmp3 * (u(4,i,j,k)**2)
+     >              - c1345 * tmp2 * u(5,i,j,k)
+
+               njac(5,2,i) = ( con43 * c3c4
+     >              - c1345 ) * tmp2 * u(2,i,j,k)
+               njac(5,3,i) = ( c3c4 - c1345 ) * tmp2 * u(3,i,j,k)
+               njac(5,4,i) = ( c3c4 - c1345 ) * tmp2 * u(4,i,j,k)
+               njac(5,5,i) = ( c1345 ) * tmp1
+
+            enddo
+c---------------------------------------------------------------------
+c     now jacobians set, so form left hand side in x direction
+c---------------------------------------------------------------------
+            call lhsinit(lhs, isize)
+            do i = 1, isize-1
+
+               tmp1 = dt * tx1
+               tmp2 = dt * tx2
+
+               lhs(1,1,aa,i) = - tmp2 * fjac(1,1,i-1)
+     >              - tmp1 * njac(1,1,i-1)
+     >              - tmp1 * dx1 
+               lhs(1,2,aa,i) = - tmp2 * fjac(1,2,i-1)
+     >              - tmp1 * njac(1,2,i-1)
+               lhs(1,3,aa,i) = - tmp2 * fjac(1,3,i-1)
+     >              - tmp1 * njac(1,3,i-1)
+               lhs(1,4,aa,i) = - tmp2 * fjac(1,4,i-1)
+     >              - tmp1 * njac(1,4,i-1)
+               lhs(1,5,aa,i) = - tmp2 * fjac(1,5,i-1)
+     >              - tmp1 * njac(1,5,i-1)
+
+               lhs(2,1,aa,i) = - tmp2 * fjac(2,1,i-1)
+     >              - tmp1 * njac(2,1,i-1)
+               lhs(2,2,aa,i) = - tmp2 * fjac(2,2,i-1)
+     >              - tmp1 * njac(2,2,i-1)
+     >              - tmp1 * dx2
+               lhs(2,3,aa,i) = - tmp2 * fjac(2,3,i-1)
+     >              - tmp1 * njac(2,3,i-1)
+               lhs(2,4,aa,i) = - tmp2 * fjac(2,4,i-1)
+     >              - tmp1 * njac(2,4,i-1)
+               lhs(2,5,aa,i) = - tmp2 * fjac(2,5,i-1)
+     >              - tmp1 * njac(2,5,i-1)
+
+               lhs(3,1,aa,i) = - tmp2 * fjac(3,1,i-1)
+     >              - tmp1 * njac(3,1,i-1)
+               lhs(3,2,aa,i) = - tmp2 * fjac(3,2,i-1)
+     >              - tmp1 * njac(3,2,i-1)
+               lhs(3,3,aa,i) = - tmp2 * fjac(3,3,i-1)
+     >              - tmp1 * njac(3,3,i-1)
+     >              - tmp1 * dx3 
+               lhs(3,4,aa,i) = - tmp2 * fjac(3,4,i-1)
+     >              - tmp1 * njac(3,4,i-1)
+               lhs(3,5,aa,i) = - tmp2 * fjac(3,5,i-1)
+     >              - tmp1 * njac(3,5,i-1)
+
+               lhs(4,1,aa,i) = - tmp2 * fjac(4,1,i-1)
+     >              - tmp1 * njac(4,1,i-1)
+               lhs(4,2,aa,i) = - tmp2 * fjac(4,2,i-1)
+     >              - tmp1 * njac(4,2,i-1)
+               lhs(4,3,aa,i) = - tmp2 * fjac(4,3,i-1)
+     >              - tmp1 * njac(4,3,i-1)
+               lhs(4,4,aa,i) = - tmp2 * fjac(4,4,i-1)
+     >              - tmp1 * njac(4,4,i-1)
+     >              - tmp1 * dx4
+               lhs(4,5,aa,i) = - tmp2 * fjac(4,5,i-1)
+     >              - tmp1 * njac(4,5,i-1)
+
+               lhs(5,1,aa,i) = - tmp2 * fjac(5,1,i-1)
+     >              - tmp1 * njac(5,1,i-1)
+               lhs(5,2,aa,i) = - tmp2 * fjac(5,2,i-1)
+     >              - tmp1 * njac(5,2,i-1)
+               lhs(5,3,aa,i) = - tmp2 * fjac(5,3,i-1)
+     >              - tmp1 * njac(5,3,i-1)
+               lhs(5,4,aa,i) = - tmp2 * fjac(5,4,i-1)
+     >              - tmp1 * njac(5,4,i-1)
+               lhs(5,5,aa,i) = - tmp2 * fjac(5,5,i-1)
+     >              - tmp1 * njac(5,5,i-1)
+     >              - tmp1 * dx5
+
+               lhs(1,1,bb,i) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(1,1,i)
+     >              + tmp1 * 2.0d+00 * dx1
+               lhs(1,2,bb,i) = tmp1 * 2.0d+00 * njac(1,2,i)
+               lhs(1,3,bb,i) = tmp1 * 2.0d+00 * njac(1,3,i)
+               lhs(1,4,bb,i) = tmp1 * 2.0d+00 * njac(1,4,i)
+               lhs(1,5,bb,i) = tmp1 * 2.0d+00 * njac(1,5,i)
+
+               lhs(2,1,bb,i) = tmp1 * 2.0d+00 * njac(2,1,i)
+               lhs(2,2,bb,i) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(2,2,i)
+     >              + tmp1 * 2.0d+00 * dx2
+               lhs(2,3,bb,i) = tmp1 * 2.0d+00 * njac(2,3,i)
+               lhs(2,4,bb,i) = tmp1 * 2.0d+00 * njac(2,4,i)
+               lhs(2,5,bb,i) = tmp1 * 2.0d+00 * njac(2,5,i)
+
+               lhs(3,1,bb,i) = tmp1 * 2.0d+00 * njac(3,1,i)
+               lhs(3,2,bb,i) = tmp1 * 2.0d+00 * njac(3,2,i)
+               lhs(3,3,bb,i) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(3,3,i)
+     >              + tmp1 * 2.0d+00 * dx3
+               lhs(3,4,bb,i) = tmp1 * 2.0d+00 * njac(3,4,i)
+               lhs(3,5,bb,i) = tmp1 * 2.0d+00 * njac(3,5,i)
+
+               lhs(4,1,bb,i) = tmp1 * 2.0d+00 * njac(4,1,i)
+               lhs(4,2,bb,i) = tmp1 * 2.0d+00 * njac(4,2,i)
+               lhs(4,3,bb,i) = tmp1 * 2.0d+00 * njac(4,3,i)
+               lhs(4,4,bb,i) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(4,4,i)
+     >              + tmp1 * 2.0d+00 * dx4
+               lhs(4,5,bb,i) = tmp1 * 2.0d+00 * njac(4,5,i)
+
+               lhs(5,1,bb,i) = tmp1 * 2.0d+00 * njac(5,1,i)
+               lhs(5,2,bb,i) = tmp1 * 2.0d+00 * njac(5,2,i)
+               lhs(5,3,bb,i) = tmp1 * 2.0d+00 * njac(5,3,i)
+               lhs(5,4,bb,i) = tmp1 * 2.0d+00 * njac(5,4,i)
+               lhs(5,5,bb,i) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(5,5,i)
+     >              + tmp1 * 2.0d+00 * dx5
+
+               lhs(1,1,cc,i) =  tmp2 * fjac(1,1,i+1)
+     >              - tmp1 * njac(1,1,i+1)
+     >              - tmp1 * dx1
+               lhs(1,2,cc,i) =  tmp2 * fjac(1,2,i+1)
+     >              - tmp1 * njac(1,2,i+1)
+               lhs(1,3,cc,i) =  tmp2 * fjac(1,3,i+1)
+     >              - tmp1 * njac(1,3,i+1)
+               lhs(1,4,cc,i) =  tmp2 * fjac(1,4,i+1)
+     >              - tmp1 * njac(1,4,i+1)
+               lhs(1,5,cc,i) =  tmp2 * fjac(1,5,i+1)
+     >              - tmp1 * njac(1,5,i+1)
+
+               lhs(2,1,cc,i) =  tmp2 * fjac(2,1,i+1)
+     >              - tmp1 * njac(2,1,i+1)
+               lhs(2,2,cc,i) =  tmp2 * fjac(2,2,i+1)
+     >              - tmp1 * njac(2,2,i+1)
+     >              - tmp1 * dx2
+               lhs(2,3,cc,i) =  tmp2 * fjac(2,3,i+1)
+     >              - tmp1 * njac(2,3,i+1)
+               lhs(2,4,cc,i) =  tmp2 * fjac(2,4,i+1)
+     >              - tmp1 * njac(2,4,i+1)
+               lhs(2,5,cc,i) =  tmp2 * fjac(2,5,i+1)
+     >              - tmp1 * njac(2,5,i+1)
+
+               lhs(3,1,cc,i) =  tmp2 * fjac(3,1,i+1)
+     >              - tmp1 * njac(3,1,i+1)
+               lhs(3,2,cc,i) =  tmp2 * fjac(3,2,i+1)
+     >              - tmp1 * njac(3,2,i+1)
+               lhs(3,3,cc,i) =  tmp2 * fjac(3,3,i+1)
+     >              - tmp1 * njac(3,3,i+1)
+     >              - tmp1 * dx3
+               lhs(3,4,cc,i) =  tmp2 * fjac(3,4,i+1)
+     >              - tmp1 * njac(3,4,i+1)
+               lhs(3,5,cc,i) =  tmp2 * fjac(3,5,i+1)
+     >              - tmp1 * njac(3,5,i+1)
+
+               lhs(4,1,cc,i) =  tmp2 * fjac(4,1,i+1)
+     >              - tmp1 * njac(4,1,i+1)
+               lhs(4,2,cc,i) =  tmp2 * fjac(4,2,i+1)
+     >              - tmp1 * njac(4,2,i+1)
+               lhs(4,3,cc,i) =  tmp2 * fjac(4,3,i+1)
+     >              - tmp1 * njac(4,3,i+1)
+               lhs(4,4,cc,i) =  tmp2 * fjac(4,4,i+1)
+     >              - tmp1 * njac(4,4,i+1)
+     >              - tmp1 * dx4
+               lhs(4,5,cc,i) =  tmp2 * fjac(4,5,i+1)
+     >              - tmp1 * njac(4,5,i+1)
+
+               lhs(5,1,cc,i) =  tmp2 * fjac(5,1,i+1)
+     >              - tmp1 * njac(5,1,i+1)
+               lhs(5,2,cc,i) =  tmp2 * fjac(5,2,i+1)
+     >              - tmp1 * njac(5,2,i+1)
+               lhs(5,3,cc,i) =  tmp2 * fjac(5,3,i+1)
+     >              - tmp1 * njac(5,3,i+1)
+               lhs(5,4,cc,i) =  tmp2 * fjac(5,4,i+1)
+     >              - tmp1 * njac(5,4,i+1)
+               lhs(5,5,cc,i) =  tmp2 * fjac(5,5,i+1)
+     >              - tmp1 * njac(5,5,i+1)
+     >              - tmp1 * dx5
+
+            enddo
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     performs Gaussian elimination on this cell.
+c     
+c     assumes that unpacking routines for non-first cells 
+c     preload C' and rhs' from previous cell.
+c     
+c     assumed send happens outside this routine, but that
+c     c'(IMAX) and rhs'(IMAX) will be sent to next cell
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     outermost do loops - sweeping in i direction
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     multiply c(0,j,k) by b_inverse and copy back to c
+c     multiply rhs(0) by b_inverse(0) and copy to rhs
+c---------------------------------------------------------------------
+            call binvcrhs( lhs(1,1,bb,0),
+     >                        lhs(1,1,cc,0),
+     >                        rhs(1,0,j,k) )
+
+c---------------------------------------------------------------------
+c     begin innermost do loop
+c     do all the elements of the cell unless last 
+c---------------------------------------------------------------------
+            do i=1,isize-1
+
+c---------------------------------------------------------------------
+c     rhs(i) = rhs(i) - A*rhs(i-1)
+c---------------------------------------------------------------------
+               call matvec_sub(lhs(1,1,aa,i),
+     >                         rhs(1,i-1,j,k),rhs(1,i,j,k))
+
+c---------------------------------------------------------------------
+c     B(i) = B(i) - C(i-1)*A(i)
+c---------------------------------------------------------------------
+               call matmul_sub(lhs(1,1,aa,i),
+     >                         lhs(1,1,cc,i-1),
+     >                         lhs(1,1,bb,i))
+
+
+c---------------------------------------------------------------------
+c     multiply c(i,j,k) by b_inverse and copy back to c
+c     multiply rhs(i,j,k) by b_inverse(i,j,k) and copy to rhs
+c---------------------------------------------------------------------
+               call binvcrhs( lhs(1,1,bb,i),
+     >                        lhs(1,1,cc,i),
+     >                        rhs(1,i,j,k) )
+
+            enddo
+
+c---------------------------------------------------------------------
+c     rhs(isize) = rhs(isize) - A*rhs(isize-1)
+c---------------------------------------------------------------------
+            call matvec_sub(lhs(1,1,aa,isize),
+     >                         rhs(1,isize-1,j,k),rhs(1,isize,j,k))
+
+c---------------------------------------------------------------------
+c     B(isize) = B(isize) - C(isize-1)*A(isize)
+c---------------------------------------------------------------------
+            call matmul_sub(lhs(1,1,aa,isize),
+     >                         lhs(1,1,cc,isize-1),
+     >                         lhs(1,1,bb,isize))
+
+c---------------------------------------------------------------------
+c     multiply rhs() by b_inverse() and copy to rhs
+c---------------------------------------------------------------------
+            call binvrhs( lhs(1,1,bb,isize),
+     >                       rhs(1,isize,j,k) )
+
+
+c---------------------------------------------------------------------
+c     back solve: if last cell, then generate U(isize)=rhs(isize)
+c     else assume U(isize) is loaded in unpack backsub_info
+c     so just use it
+c     after call u(istart) will be sent to next cell
+c---------------------------------------------------------------------
+
+            do i=isize-1,0,-1
+               do m=1,BLOCK_SIZE
+                  do n=1,BLOCK_SIZE
+                     rhs(m,i,j,k) = rhs(m,i,j,k) 
+     >                    - lhs(m,n,cc,i)*rhs(n,i+1,j,k)
+                  enddo
+               enddo
+            enddo
+
+         enddo
+      enddo
+      if (timeron) call timer_stop(t_xsolve)
+
+      return
+      end
+      
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/x_solve_vec.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/x_solve_vec.f
new file mode 100644
index 0000000..c8d5792
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/x_solve_vec.f
@@ -0,0 +1,434 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine x_solve
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     
+c     Performs line solves in X direction by first factoring
+c     the block-tridiagonal matrix into an upper triangular matrix, 
+c     and then performing back substitution to solve for the unknown
+c     vectors of each line.  
+c     
+c     Make sure we treat elements zero to cell_size in the direction
+c     of the sweep.
+c     
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'work_lhs_vec.h'
+
+      integer i,j,k,m,n,isize
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      if (timeron) call timer_start(t_xsolve)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     This function computes the left hand side in the xi-direction
+c---------------------------------------------------------------------
+
+      isize = grid_points(1)-1
+
+c---------------------------------------------------------------------
+c     determine a (labeled f) and n jacobians
+c---------------------------------------------------------------------
+!$omp parallel do default(shared) shared(isize)
+!$omp& private(i,j,k,m,n)
+      do k = 1, grid_points(3)-2
+         do j = 1, grid_points(2)-2
+            do i = 0, isize
+
+               tmp1 = rho_i(i,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+c---------------------------------------------------------------------
+c     
+c---------------------------------------------------------------------
+               fjac(1,1,i,j) = 0.0d+00
+               fjac(1,2,i,j) = 1.0d+00
+               fjac(1,3,i,j) = 0.0d+00
+               fjac(1,4,i,j) = 0.0d+00
+               fjac(1,5,i,j) = 0.0d+00
+
+               fjac(2,1,i,j) = -(u(2,i,j,k) * tmp2 * 
+     >              u(2,i,j,k))
+     >              + c2 * qs(i,j,k)
+               fjac(2,2,i,j) = ( 2.0d+00 - c2 )
+     >              * ( u(2,i,j,k) / u(1,i,j,k) )
+               fjac(2,3,i,j) = - c2 * ( u(3,i,j,k) * tmp1 )
+               fjac(2,4,i,j) = - c2 * ( u(4,i,j,k) * tmp1 )
+               fjac(2,5,i,j) = c2
+
+               fjac(3,1,i,j) = - ( u(2,i,j,k)*u(3,i,j,k) ) * tmp2
+               fjac(3,2,i,j) = u(3,i,j,k) * tmp1
+               fjac(3,3,i,j) = u(2,i,j,k) * tmp1
+               fjac(3,4,i,j) = 0.0d+00
+               fjac(3,5,i,j) = 0.0d+00
+
+               fjac(4,1,i,j) = - ( u(2,i,j,k)*u(4,i,j,k) ) * tmp2
+               fjac(4,2,i,j) = u(4,i,j,k) * tmp1
+               fjac(4,3,i,j) = 0.0d+00
+               fjac(4,4,i,j) = u(2,i,j,k) * tmp1
+               fjac(4,5,i,j) = 0.0d+00
+
+               fjac(5,1,i,j) = ( c2 * 2.0d0 * square(i,j,k)
+     >              - c1 * u(5,i,j,k) )
+     >              * ( u(2,i,j,k) * tmp2 )
+               fjac(5,2,i,j) = c1 *  u(5,i,j,k) * tmp1 
+     >              - c2
+     >              * ( u(2,i,j,k)*u(2,i,j,k) * tmp2
+     >              + qs(i,j,k) )
+               fjac(5,3,i,j) = - c2 * ( u(3,i,j,k)*u(2,i,j,k) )
+     >              * tmp2
+               fjac(5,4,i,j) = - c2 * ( u(4,i,j,k)*u(2,i,j,k) )
+     >              * tmp2
+               fjac(5,5,i,j) = c1 * ( u(2,i,j,k) * tmp1 )
+
+               njac(1,1,i,j) = 0.0d+00
+               njac(1,2,i,j) = 0.0d+00
+               njac(1,3,i,j) = 0.0d+00
+               njac(1,4,i,j) = 0.0d+00
+               njac(1,5,i,j) = 0.0d+00
+
+               njac(2,1,i,j) = - con43 * c3c4 * tmp2 * u(2,i,j,k)
+               njac(2,2,i,j) =   con43 * c3c4 * tmp1
+               njac(2,3,i,j) =   0.0d+00
+               njac(2,4,i,j) =   0.0d+00
+               njac(2,5,i,j) =   0.0d+00
+
+               njac(3,1,i,j) = - c3c4 * tmp2 * u(3,i,j,k)
+               njac(3,2,i,j) =   0.0d+00
+               njac(3,3,i,j) =   c3c4 * tmp1
+               njac(3,4,i,j) =   0.0d+00
+               njac(3,5,i,j) =   0.0d+00
+
+               njac(4,1,i,j) = - c3c4 * tmp2 * u(4,i,j,k)
+               njac(4,2,i,j) =   0.0d+00 
+               njac(4,3,i,j) =   0.0d+00
+               njac(4,4,i,j) =   c3c4 * tmp1
+               njac(4,5,i,j) =   0.0d+00
+
+               njac(5,1,i,j) = - ( con43 * c3c4
+     >              - c1345 ) * tmp3 * (u(2,i,j,k)**2)
+     >              - ( c3c4 - c1345 ) * tmp3 * (u(3,i,j,k)**2)
+     >              - ( c3c4 - c1345 ) * tmp3 * (u(4,i,j,k)**2)
+     >              - c1345 * tmp2 * u(5,i,j,k)
+
+               njac(5,2,i,j) = ( con43 * c3c4
+     >              - c1345 ) * tmp2 * u(2,i,j,k)
+               njac(5,3,i,j) = ( c3c4 - c1345 ) * tmp2 * u(3,i,j,k)
+               njac(5,4,i,j) = ( c3c4 - c1345 ) * tmp2 * u(4,i,j,k)
+               njac(5,5,i,j) = ( c1345 ) * tmp1
+
+            enddo
+         enddo
+
+c---------------------------------------------------------------------
+c     zero the left hand side for starters
+c     set diagonal values to 1. This is overkill, but convenient
+c---------------------------------------------------------------------
+         do j = 1, grid_points(2)-2
+            do m = 1, 5
+               do n = 1, 5
+                  lhs(m,n,aa,0,j) = 0.0d0
+                  lhs(m,n,bb,0,j) = 0.0d0
+                  lhs(m,n,cc,0,j) = 0.0d0
+                  lhs(m,n,aa,isize,j) = 0.0d0
+                  lhs(m,n,bb,isize,j) = 0.0d0
+                  lhs(m,n,cc,isize,j) = 0.0d0
+               end do
+               lhs(m,m,bb,0,j) = 1.0d0
+               lhs(m,m,bb,isize,j) = 1.0d0
+            end do
+         enddo
+
+c---------------------------------------------------------------------
+c     now jacobians set, so form left hand side in x direction
+c---------------------------------------------------------------------
+         do j = 1, grid_points(2)-2
+            do i = 1, isize-1
+
+               tmp1 = dt * tx1
+               tmp2 = dt * tx2
+
+               lhs(1,1,aa,i,j) = - tmp2 * fjac(1,1,i-1,j)
+     >              - tmp1 * njac(1,1,i-1,j)
+     >              - tmp1 * dx1 
+               lhs(1,2,aa,i,j) = - tmp2 * fjac(1,2,i-1,j)
+     >              - tmp1 * njac(1,2,i-1,j)
+               lhs(1,3,aa,i,j) = - tmp2 * fjac(1,3,i-1,j)
+     >              - tmp1 * njac(1,3,i-1,j)
+               lhs(1,4,aa,i,j) = - tmp2 * fjac(1,4,i-1,j)
+     >              - tmp1 * njac(1,4,i-1,j)
+               lhs(1,5,aa,i,j) = - tmp2 * fjac(1,5,i-1,j)
+     >              - tmp1 * njac(1,5,i-1,j)
+
+               lhs(2,1,aa,i,j) = - tmp2 * fjac(2,1,i-1,j)
+     >              - tmp1 * njac(2,1,i-1,j)
+               lhs(2,2,aa,i,j) = - tmp2 * fjac(2,2,i-1,j)
+     >              - tmp1 * njac(2,2,i-1,j)
+     >              - tmp1 * dx2
+               lhs(2,3,aa,i,j) = - tmp2 * fjac(2,3,i-1,j)
+     >              - tmp1 * njac(2,3,i-1,j)
+               lhs(2,4,aa,i,j) = - tmp2 * fjac(2,4,i-1,j)
+     >              - tmp1 * njac(2,4,i-1,j)
+               lhs(2,5,aa,i,j) = - tmp2 * fjac(2,5,i-1,j)
+     >              - tmp1 * njac(2,5,i-1,j)
+
+               lhs(3,1,aa,i,j) = - tmp2 * fjac(3,1,i-1,j)
+     >              - tmp1 * njac(3,1,i-1,j)
+               lhs(3,2,aa,i,j) = - tmp2 * fjac(3,2,i-1,j)
+     >              - tmp1 * njac(3,2,i-1,j)
+               lhs(3,3,aa,i,j) = - tmp2 * fjac(3,3,i-1,j)
+     >              - tmp1 * njac(3,3,i-1,j)
+     >              - tmp1 * dx3 
+               lhs(3,4,aa,i,j) = - tmp2 * fjac(3,4,i-1,j)
+     >              - tmp1 * njac(3,4,i-1,j)
+               lhs(3,5,aa,i,j) = - tmp2 * fjac(3,5,i-1,j)
+     >              - tmp1 * njac(3,5,i-1,j)
+
+               lhs(4,1,aa,i,j) = - tmp2 * fjac(4,1,i-1,j)
+     >              - tmp1 * njac(4,1,i-1,j)
+               lhs(4,2,aa,i,j) = - tmp2 * fjac(4,2,i-1,j)
+     >              - tmp1 * njac(4,2,i-1,j)
+               lhs(4,3,aa,i,j) = - tmp2 * fjac(4,3,i-1,j)
+     >              - tmp1 * njac(4,3,i-1,j)
+               lhs(4,4,aa,i,j) = - tmp2 * fjac(4,4,i-1,j)
+     >              - tmp1 * njac(4,4,i-1,j)
+     >              - tmp1 * dx4
+               lhs(4,5,aa,i,j) = - tmp2 * fjac(4,5,i-1,j)
+     >              - tmp1 * njac(4,5,i-1,j)
+
+               lhs(5,1,aa,i,j) = - tmp2 * fjac(5,1,i-1,j)
+     >              - tmp1 * njac(5,1,i-1,j)
+               lhs(5,2,aa,i,j) = - tmp2 * fjac(5,2,i-1,j)
+     >              - tmp1 * njac(5,2,i-1,j)
+               lhs(5,3,aa,i,j) = - tmp2 * fjac(5,3,i-1,j)
+     >              - tmp1 * njac(5,3,i-1,j)
+               lhs(5,4,aa,i,j) = - tmp2 * fjac(5,4,i-1,j)
+     >              - tmp1 * njac(5,4,i-1,j)
+               lhs(5,5,aa,i,j) = - tmp2 * fjac(5,5,i-1,j)
+     >              - tmp1 * njac(5,5,i-1,j)
+     >              - tmp1 * dx5
+
+               lhs(1,1,bb,i,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(1,1,i,j)
+     >              + tmp1 * 2.0d+00 * dx1
+               lhs(1,2,bb,i,j) = tmp1 * 2.0d+00 * njac(1,2,i,j)
+               lhs(1,3,bb,i,j) = tmp1 * 2.0d+00 * njac(1,3,i,j)
+               lhs(1,4,bb,i,j) = tmp1 * 2.0d+00 * njac(1,4,i,j)
+               lhs(1,5,bb,i,j) = tmp1 * 2.0d+00 * njac(1,5,i,j)
+
+               lhs(2,1,bb,i,j) = tmp1 * 2.0d+00 * njac(2,1,i,j)
+               lhs(2,2,bb,i,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(2,2,i,j)
+     >              + tmp1 * 2.0d+00 * dx2
+               lhs(2,3,bb,i,j) = tmp1 * 2.0d+00 * njac(2,3,i,j)
+               lhs(2,4,bb,i,j) = tmp1 * 2.0d+00 * njac(2,4,i,j)
+               lhs(2,5,bb,i,j) = tmp1 * 2.0d+00 * njac(2,5,i,j)
+
+               lhs(3,1,bb,i,j) = tmp1 * 2.0d+00 * njac(3,1,i,j)
+               lhs(3,2,bb,i,j) = tmp1 * 2.0d+00 * njac(3,2,i,j)
+               lhs(3,3,bb,i,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(3,3,i,j)
+     >              + tmp1 * 2.0d+00 * dx3
+               lhs(3,4,bb,i,j) = tmp1 * 2.0d+00 * njac(3,4,i,j)
+               lhs(3,5,bb,i,j) = tmp1 * 2.0d+00 * njac(3,5,i,j)
+
+               lhs(4,1,bb,i,j) = tmp1 * 2.0d+00 * njac(4,1,i,j)
+               lhs(4,2,bb,i,j) = tmp1 * 2.0d+00 * njac(4,2,i,j)
+               lhs(4,3,bb,i,j) = tmp1 * 2.0d+00 * njac(4,3,i,j)
+               lhs(4,4,bb,i,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(4,4,i,j)
+     >              + tmp1 * 2.0d+00 * dx4
+               lhs(4,5,bb,i,j) = tmp1 * 2.0d+00 * njac(4,5,i,j)
+
+               lhs(5,1,bb,i,j) = tmp1 * 2.0d+00 * njac(5,1,i,j)
+               lhs(5,2,bb,i,j) = tmp1 * 2.0d+00 * njac(5,2,i,j)
+               lhs(5,3,bb,i,j) = tmp1 * 2.0d+00 * njac(5,3,i,j)
+               lhs(5,4,bb,i,j) = tmp1 * 2.0d+00 * njac(5,4,i,j)
+               lhs(5,5,bb,i,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(5,5,i,j)
+     >              + tmp1 * 2.0d+00 * dx5
+
+               lhs(1,1,cc,i,j) =  tmp2 * fjac(1,1,i+1,j)
+     >              - tmp1 * njac(1,1,i+1,j)
+     >              - tmp1 * dx1
+               lhs(1,2,cc,i,j) =  tmp2 * fjac(1,2,i+1,j)
+     >              - tmp1 * njac(1,2,i+1,j)
+               lhs(1,3,cc,i,j) =  tmp2 * fjac(1,3,i+1,j)
+     >              - tmp1 * njac(1,3,i+1,j)
+               lhs(1,4,cc,i,j) =  tmp2 * fjac(1,4,i+1,j)
+     >              - tmp1 * njac(1,4,i+1,j)
+               lhs(1,5,cc,i,j) =  tmp2 * fjac(1,5,i+1,j)
+     >              - tmp1 * njac(1,5,i+1,j)
+
+               lhs(2,1,cc,i,j) =  tmp2 * fjac(2,1,i+1,j)
+     >              - tmp1 * njac(2,1,i+1,j)
+               lhs(2,2,cc,i,j) =  tmp2 * fjac(2,2,i+1,j)
+     >              - tmp1 * njac(2,2,i+1,j)
+     >              - tmp1 * dx2
+               lhs(2,3,cc,i,j) =  tmp2 * fjac(2,3,i+1,j)
+     >              - tmp1 * njac(2,3,i+1,j)
+               lhs(2,4,cc,i,j) =  tmp2 * fjac(2,4,i+1,j)
+     >              - tmp1 * njac(2,4,i+1,j)
+               lhs(2,5,cc,i,j) =  tmp2 * fjac(2,5,i+1,j)
+     >              - tmp1 * njac(2,5,i+1,j)
+
+               lhs(3,1,cc,i,j) =  tmp2 * fjac(3,1,i+1,j)
+     >              - tmp1 * njac(3,1,i+1,j)
+               lhs(3,2,cc,i,j) =  tmp2 * fjac(3,2,i+1,j)
+     >              - tmp1 * njac(3,2,i+1,j)
+               lhs(3,3,cc,i,j) =  tmp2 * fjac(3,3,i+1,j)
+     >              - tmp1 * njac(3,3,i+1,j)
+     >              - tmp1 * dx3
+               lhs(3,4,cc,i,j) =  tmp2 * fjac(3,4,i+1,j)
+     >              - tmp1 * njac(3,4,i+1,j)
+               lhs(3,5,cc,i,j) =  tmp2 * fjac(3,5,i+1,j)
+     >              - tmp1 * njac(3,5,i+1,j)
+
+               lhs(4,1,cc,i,j) =  tmp2 * fjac(4,1,i+1,j)
+     >              - tmp1 * njac(4,1,i+1,j)
+               lhs(4,2,cc,i,j) =  tmp2 * fjac(4,2,i+1,j)
+     >              - tmp1 * njac(4,2,i+1,j)
+               lhs(4,3,cc,i,j) =  tmp2 * fjac(4,3,i+1,j)
+     >              - tmp1 * njac(4,3,i+1,j)
+               lhs(4,4,cc,i,j) =  tmp2 * fjac(4,4,i+1,j)
+     >              - tmp1 * njac(4,4,i+1,j)
+     >              - tmp1 * dx4
+               lhs(4,5,cc,i,j) =  tmp2 * fjac(4,5,i+1,j)
+     >              - tmp1 * njac(4,5,i+1,j)
+
+               lhs(5,1,cc,i,j) =  tmp2 * fjac(5,1,i+1,j)
+     >              - tmp1 * njac(5,1,i+1,j)
+               lhs(5,2,cc,i,j) =  tmp2 * fjac(5,2,i+1,j)
+     >              - tmp1 * njac(5,2,i+1,j)
+               lhs(5,3,cc,i,j) =  tmp2 * fjac(5,3,i+1,j)
+     >              - tmp1 * njac(5,3,i+1,j)
+               lhs(5,4,cc,i,j) =  tmp2 * fjac(5,4,i+1,j)
+     >              - tmp1 * njac(5,4,i+1,j)
+               lhs(5,5,cc,i,j) =  tmp2 * fjac(5,5,i+1,j)
+     >              - tmp1 * njac(5,5,i+1,j)
+     >              - tmp1 * dx5
+
+            enddo
+         enddo
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     performs Gaussian elimination on this cell.
+c     
+c     assumes that unpacking routines for non-first cells 
+c     preload C' and rhs' from previous cell.
+c     
+c     assumes send happens outside this routine, but that
+c     c'(IMAX) and rhs'(IMAX) will be sent to next cell
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     outer most do loops - sweeping in i direction
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     multiply c(0,j,k) by b_inverse and copy back to c
+c     multiply rhs(0) by b_inverse(0) and copy to rhs
+c---------------------------------------------------------------------
+!dir$ ivdep
+         do j = 1, grid_points(2)-2
+            call binvcrhs( lhs(1,1,bb,0,j),
+     >                        lhs(1,1,cc,0,j),
+     >                        rhs(1,0,j,k) )
+         enddo
+
+c---------------------------------------------------------------------
+c     begin inner most do loop
+c     do all the elements of the cell unless last 
+c---------------------------------------------------------------------
+!dir$ ivdep
+!dir$ interchange(i,j)
+         do j = 1, grid_points(2)-2
+            do i=1,isize-1
+
+c---------------------------------------------------------------------
+c     rhs(i) = rhs(i) - A*rhs(i-1)
+c---------------------------------------------------------------------
+               call matvec_sub(lhs(1,1,aa,i,j),
+     >                         rhs(1,i-1,j,k),rhs(1,i,j,k))
+
+c---------------------------------------------------------------------
+c     B(i) = B(i) - A(i)*C(i-1)
+c---------------------------------------------------------------------
+               call matmul_sub(lhs(1,1,aa,i,j),
+     >                         lhs(1,1,cc,i-1,j),
+     >                         lhs(1,1,bb,i,j))
+
+
+c---------------------------------------------------------------------
+c     multiply c(i,j,k) by b_inverse and copy back to c
+c     multiply rhs(1,j,k) by b_inverse(1,j,k) and copy to rhs
+c---------------------------------------------------------------------
+               call binvcrhs( lhs(1,1,bb,i,j),
+     >                        lhs(1,1,cc,i,j),
+     >                        rhs(1,i,j,k) )
+
+            enddo
+         enddo
+
+c---------------------------------------------------------------------
+c     rhs(isize) = rhs(isize) - A*rhs(isize-1)
+c---------------------------------------------------------------------
+!dir$ ivdep
+         do j = 1, grid_points(2)-2
+            call matvec_sub(lhs(1,1,aa,isize,j),
+     >                         rhs(1,isize-1,j,k),rhs(1,isize,j,k))
+
+c---------------------------------------------------------------------
+c     B(isize) = B(isize) - A(isize)*C(isize-1)
+c---------------------------------------------------------------------
+            call matmul_sub(lhs(1,1,aa,isize,j),
+     >                         lhs(1,1,cc,isize-1,j),
+     >                         lhs(1,1,bb,isize,j))
+
+c---------------------------------------------------------------------
+c     multiply rhs() by b_inverse() and copy to rhs
+c---------------------------------------------------------------------
+            call binvrhs( lhs(1,1,bb,isize,j),
+     >                       rhs(1,isize,j,k) )
+         enddo
+
+
+c---------------------------------------------------------------------
+c     back solve: if last cell, then generate U(isize)=rhs(isize)
+c     else assume U(isize) is loaded in un pack backsub_info
+c     so just use it
+c     after call u(istart) will be sent to next cell
+c---------------------------------------------------------------------
+
+         do j = 1, grid_points(2)-2
+            do i=isize-1,0,-1
+               do m=1,BLOCK_SIZE
+                  do n=1,BLOCK_SIZE
+                     rhs(m,i,j,k) = rhs(m,i,j,k) 
+     >                    - lhs(m,n,cc,i,j)*rhs(n,i+1,j,k)
+                  enddo
+               enddo
+            enddo
+         enddo
+
+      enddo
+      if (timeron) call timer_stop(t_xsolve)
+
+      return
+      end
+      
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/y_solve.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/y_solve.f
new file mode 100644
index 0000000..7f0f6fb
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/y_solve.f
@@ -0,0 +1,401 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine y_solve
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     Performs line solves in Y direction by first factoring
+c     the block-tridiagonal matrix into an upper triangular matrix, 
+c     and then performing back substitution to solve for the unknown
+c     vectors of each line.  
+c     
+c     Make sure we treat elements zero to cell_size in the direction
+c     of the sweep.
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'work_lhs.h'
+
+      integer i, j, k, m, n, jsize
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      if (timeron) call timer_start(t_ysolve)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     This function computes the left hand side for the three y-factors   
+c---------------------------------------------------------------------
+
+      jsize = grid_points(2)-1
+
+c---------------------------------------------------------------------
+c     Compute the indices for storing the tri-diagonal matrix;
+c     determine a (labeled f) and n jacobians for cell c
+c---------------------------------------------------------------------
+!$omp parallel do default(shared) shared(jsize)
+!$omp& private(i,j,k,m,n)
+      do k = 1, grid_points(3)-2
+         do i = 1, grid_points(1)-2
+            do j = 0, jsize
+
+               tmp1 = rho_i(i,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               fjac(1,1,j) = 0.0d+00
+               fjac(1,2,j) = 0.0d+00
+               fjac(1,3,j) = 1.0d+00
+               fjac(1,4,j) = 0.0d+00
+               fjac(1,5,j) = 0.0d+00
+
+               fjac(2,1,j) = - ( u(2,i,j,k)*u(3,i,j,k) )
+     >              * tmp2
+               fjac(2,2,j) = u(3,i,j,k) * tmp1
+               fjac(2,3,j) = u(2,i,j,k) * tmp1
+               fjac(2,4,j) = 0.0d+00
+               fjac(2,5,j) = 0.0d+00
+
+               fjac(3,1,j) = - ( u(3,i,j,k)*u(3,i,j,k)*tmp2)
+     >              + c2 * qs(i,j,k)
+               fjac(3,2,j) = - c2 *  u(2,i,j,k) * tmp1
+               fjac(3,3,j) = ( 2.0d+00 - c2 )
+     >              *  u(3,i,j,k) * tmp1 
+               fjac(3,4,j) = - c2 * u(4,i,j,k) * tmp1 
+               fjac(3,5,j) = c2
+
+               fjac(4,1,j) = - ( u(3,i,j,k)*u(4,i,j,k) )
+     >              * tmp2
+               fjac(4,2,j) = 0.0d+00
+               fjac(4,3,j) = u(4,i,j,k) * tmp1
+               fjac(4,4,j) = u(3,i,j,k) * tmp1
+               fjac(4,5,j) = 0.0d+00
+
+               fjac(5,1,j) = ( c2 * 2.0d0 * square(i,j,k)
+     >              - c1 * u(5,i,j,k) )
+     >              * u(3,i,j,k) * tmp2
+               fjac(5,2,j) = - c2 * u(2,i,j,k)*u(3,i,j,k) 
+     >              * tmp2
+               fjac(5,3,j) = c1 * u(5,i,j,k) * tmp1 
+     >              - c2 
+     >              * ( qs(i,j,k)
+     >              + u(3,i,j,k)*u(3,i,j,k) * tmp2 )
+               fjac(5,4,j) = - c2 * ( u(3,i,j,k)*u(4,i,j,k) )
+     >              * tmp2
+               fjac(5,5,j) = c1 * u(3,i,j,k) * tmp1 
+
+               njac(1,1,j) = 0.0d+00
+               njac(1,2,j) = 0.0d+00
+               njac(1,3,j) = 0.0d+00
+               njac(1,4,j) = 0.0d+00
+               njac(1,5,j) = 0.0d+00
+
+               njac(2,1,j) = - c3c4 * tmp2 * u(2,i,j,k)
+               njac(2,2,j) =   c3c4 * tmp1
+               njac(2,3,j) =   0.0d+00
+               njac(2,4,j) =   0.0d+00
+               njac(2,5,j) =   0.0d+00
+
+               njac(3,1,j) = - con43 * c3c4 * tmp2 * u(3,i,j,k)
+               njac(3,2,j) =   0.0d+00
+               njac(3,3,j) =   con43 * c3c4 * tmp1
+               njac(3,4,j) =   0.0d+00
+               njac(3,5,j) =   0.0d+00
+
+               njac(4,1,j) = - c3c4 * tmp2 * u(4,i,j,k)
+               njac(4,2,j) =   0.0d+00
+               njac(4,3,j) =   0.0d+00
+               njac(4,4,j) =   c3c4 * tmp1
+               njac(4,5,j) =   0.0d+00
+
+               njac(5,1,j) = - (  c3c4
+     >              - c1345 ) * tmp3 * (u(2,i,j,k)**2)
+     >              - ( con43 * c3c4
+     >              - c1345 ) * tmp3 * (u(3,i,j,k)**2)
+     >              - ( c3c4 - c1345 ) * tmp3 * (u(4,i,j,k)**2)
+     >              - c1345 * tmp2 * u(5,i,j,k)
+
+               njac(5,2,j) = (  c3c4 - c1345 ) * tmp2 * u(2,i,j,k)
+               njac(5,3,j) = ( con43 * c3c4
+     >              - c1345 ) * tmp2 * u(3,i,j,k)
+               njac(5,4,j) = ( c3c4 - c1345 ) * tmp2 * u(4,i,j,k)
+               njac(5,5,j) = ( c1345 ) * tmp1
+
+            enddo
+
+c---------------------------------------------------------------------
+c     now jacobians set, so form left hand side in y direction
+c---------------------------------------------------------------------
+            call lhsinit(lhs, jsize)
+            do j = 1, jsize-1
+
+               tmp1 = dt * ty1
+               tmp2 = dt * ty2
+
+               lhs(1,1,aa,j) = - tmp2 * fjac(1,1,j-1)
+     >              - tmp1 * njac(1,1,j-1)
+     >              - tmp1 * dy1 
+               lhs(1,2,aa,j) = - tmp2 * fjac(1,2,j-1)
+     >              - tmp1 * njac(1,2,j-1)
+               lhs(1,3,aa,j) = - tmp2 * fjac(1,3,j-1)
+     >              - tmp1 * njac(1,3,j-1)
+               lhs(1,4,aa,j) = - tmp2 * fjac(1,4,j-1)
+     >              - tmp1 * njac(1,4,j-1)
+               lhs(1,5,aa,j) = - tmp2 * fjac(1,5,j-1)
+     >              - tmp1 * njac(1,5,j-1)
+
+               lhs(2,1,aa,j) = - tmp2 * fjac(2,1,j-1)
+     >              - tmp1 * njac(2,1,j-1)
+               lhs(2,2,aa,j) = - tmp2 * fjac(2,2,j-1)
+     >              - tmp1 * njac(2,2,j-1)
+     >              - tmp1 * dy2
+               lhs(2,3,aa,j) = - tmp2 * fjac(2,3,j-1)
+     >              - tmp1 * njac(2,3,j-1)
+               lhs(2,4,aa,j) = - tmp2 * fjac(2,4,j-1)
+     >              - tmp1 * njac(2,4,j-1)
+               lhs(2,5,aa,j) = - tmp2 * fjac(2,5,j-1)
+     >              - tmp1 * njac(2,5,j-1)
+
+               lhs(3,1,aa,j) = - tmp2 * fjac(3,1,j-1)
+     >              - tmp1 * njac(3,1,j-1)
+               lhs(3,2,aa,j) = - tmp2 * fjac(3,2,j-1)
+     >              - tmp1 * njac(3,2,j-1)
+               lhs(3,3,aa,j) = - tmp2 * fjac(3,3,j-1)
+     >              - tmp1 * njac(3,3,j-1)
+     >              - tmp1 * dy3 
+               lhs(3,4,aa,j) = - tmp2 * fjac(3,4,j-1)
+     >              - tmp1 * njac(3,4,j-1)
+               lhs(3,5,aa,j) = - tmp2 * fjac(3,5,j-1)
+     >              - tmp1 * njac(3,5,j-1)
+
+               lhs(4,1,aa,j) = - tmp2 * fjac(4,1,j-1)
+     >              - tmp1 * njac(4,1,j-1)
+               lhs(4,2,aa,j) = - tmp2 * fjac(4,2,j-1)
+     >              - tmp1 * njac(4,2,j-1)
+               lhs(4,3,aa,j) = - tmp2 * fjac(4,3,j-1)
+     >              - tmp1 * njac(4,3,j-1)
+               lhs(4,4,aa,j) = - tmp2 * fjac(4,4,j-1)
+     >              - tmp1 * njac(4,4,j-1)
+     >              - tmp1 * dy4
+               lhs(4,5,aa,j) = - tmp2 * fjac(4,5,j-1)
+     >              - tmp1 * njac(4,5,j-1)
+
+               lhs(5,1,aa,j) = - tmp2 * fjac(5,1,j-1)
+     >              - tmp1 * njac(5,1,j-1)
+               lhs(5,2,aa,j) = - tmp2 * fjac(5,2,j-1)
+     >              - tmp1 * njac(5,2,j-1)
+               lhs(5,3,aa,j) = - tmp2 * fjac(5,3,j-1)
+     >              - tmp1 * njac(5,3,j-1)
+               lhs(5,4,aa,j) = - tmp2 * fjac(5,4,j-1)
+     >              - tmp1 * njac(5,4,j-1)
+               lhs(5,5,aa,j) = - tmp2 * fjac(5,5,j-1)
+     >              - tmp1 * njac(5,5,j-1)
+     >              - tmp1 * dy5
+
+               lhs(1,1,bb,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(1,1,j)
+     >              + tmp1 * 2.0d+00 * dy1
+               lhs(1,2,bb,j) = tmp1 * 2.0d+00 * njac(1,2,j)
+               lhs(1,3,bb,j) = tmp1 * 2.0d+00 * njac(1,3,j)
+               lhs(1,4,bb,j) = tmp1 * 2.0d+00 * njac(1,4,j)
+               lhs(1,5,bb,j) = tmp1 * 2.0d+00 * njac(1,5,j)
+
+               lhs(2,1,bb,j) = tmp1 * 2.0d+00 * njac(2,1,j)
+               lhs(2,2,bb,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(2,2,j)
+     >              + tmp1 * 2.0d+00 * dy2
+               lhs(2,3,bb,j) = tmp1 * 2.0d+00 * njac(2,3,j)
+               lhs(2,4,bb,j) = tmp1 * 2.0d+00 * njac(2,4,j)
+               lhs(2,5,bb,j) = tmp1 * 2.0d+00 * njac(2,5,j)
+
+               lhs(3,1,bb,j) = tmp1 * 2.0d+00 * njac(3,1,j)
+               lhs(3,2,bb,j) = tmp1 * 2.0d+00 * njac(3,2,j)
+               lhs(3,3,bb,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(3,3,j)
+     >              + tmp1 * 2.0d+00 * dy3
+               lhs(3,4,bb,j) = tmp1 * 2.0d+00 * njac(3,4,j)
+               lhs(3,5,bb,j) = tmp1 * 2.0d+00 * njac(3,5,j)
+
+               lhs(4,1,bb,j) = tmp1 * 2.0d+00 * njac(4,1,j)
+               lhs(4,2,bb,j) = tmp1 * 2.0d+00 * njac(4,2,j)
+               lhs(4,3,bb,j) = tmp1 * 2.0d+00 * njac(4,3,j)
+               lhs(4,4,bb,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(4,4,j)
+     >              + tmp1 * 2.0d+00 * dy4
+               lhs(4,5,bb,j) = tmp1 * 2.0d+00 * njac(4,5,j)
+
+               lhs(5,1,bb,j) = tmp1 * 2.0d+00 * njac(5,1,j)
+               lhs(5,2,bb,j) = tmp1 * 2.0d+00 * njac(5,2,j)
+               lhs(5,3,bb,j) = tmp1 * 2.0d+00 * njac(5,3,j)
+               lhs(5,4,bb,j) = tmp1 * 2.0d+00 * njac(5,4,j)
+               lhs(5,5,bb,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(5,5,j) 
+     >              + tmp1 * 2.0d+00 * dy5
+
+               lhs(1,1,cc,j) =  tmp2 * fjac(1,1,j+1)
+     >              - tmp1 * njac(1,1,j+1)
+     >              - tmp1 * dy1
+               lhs(1,2,cc,j) =  tmp2 * fjac(1,2,j+1)
+     >              - tmp1 * njac(1,2,j+1)
+               lhs(1,3,cc,j) =  tmp2 * fjac(1,3,j+1)
+     >              - tmp1 * njac(1,3,j+1)
+               lhs(1,4,cc,j) =  tmp2 * fjac(1,4,j+1)
+     >              - tmp1 * njac(1,4,j+1)
+               lhs(1,5,cc,j) =  tmp2 * fjac(1,5,j+1)
+     >              - tmp1 * njac(1,5,j+1)
+
+               lhs(2,1,cc,j) =  tmp2 * fjac(2,1,j+1)
+     >              - tmp1 * njac(2,1,j+1)
+               lhs(2,2,cc,j) =  tmp2 * fjac(2,2,j+1)
+     >              - tmp1 * njac(2,2,j+1)
+     >              - tmp1 * dy2
+               lhs(2,3,cc,j) =  tmp2 * fjac(2,3,j+1)
+     >              - tmp1 * njac(2,3,j+1)
+               lhs(2,4,cc,j) =  tmp2 * fjac(2,4,j+1)
+     >              - tmp1 * njac(2,4,j+1)
+               lhs(2,5,cc,j) =  tmp2 * fjac(2,5,j+1)
+     >              - tmp1 * njac(2,5,j+1)
+
+               lhs(3,1,cc,j) =  tmp2 * fjac(3,1,j+1)
+     >              - tmp1 * njac(3,1,j+1)
+               lhs(3,2,cc,j) =  tmp2 * fjac(3,2,j+1)
+     >              - tmp1 * njac(3,2,j+1)
+               lhs(3,3,cc,j) =  tmp2 * fjac(3,3,j+1)
+     >              - tmp1 * njac(3,3,j+1)
+     >              - tmp1 * dy3
+               lhs(3,4,cc,j) =  tmp2 * fjac(3,4,j+1)
+     >              - tmp1 * njac(3,4,j+1)
+               lhs(3,5,cc,j) =  tmp2 * fjac(3,5,j+1)
+     >              - tmp1 * njac(3,5,j+1)
+
+               lhs(4,1,cc,j) =  tmp2 * fjac(4,1,j+1)
+     >              - tmp1 * njac(4,1,j+1)
+               lhs(4,2,cc,j) =  tmp2 * fjac(4,2,j+1)
+     >              - tmp1 * njac(4,2,j+1)
+               lhs(4,3,cc,j) =  tmp2 * fjac(4,3,j+1)
+     >              - tmp1 * njac(4,3,j+1)
+               lhs(4,4,cc,j) =  tmp2 * fjac(4,4,j+1)
+     >              - tmp1 * njac(4,4,j+1)
+     >              - tmp1 * dy4
+               lhs(4,5,cc,j) =  tmp2 * fjac(4,5,j+1)
+     >              - tmp1 * njac(4,5,j+1)
+
+               lhs(5,1,cc,j) =  tmp2 * fjac(5,1,j+1)
+     >              - tmp1 * njac(5,1,j+1)
+               lhs(5,2,cc,j) =  tmp2 * fjac(5,2,j+1)
+     >              - tmp1 * njac(5,2,j+1)
+               lhs(5,3,cc,j) =  tmp2 * fjac(5,3,j+1)
+     >              - tmp1 * njac(5,3,j+1)
+               lhs(5,4,cc,j) =  tmp2 * fjac(5,4,j+1)
+     >              - tmp1 * njac(5,4,j+1)
+               lhs(5,5,cc,j) =  tmp2 * fjac(5,5,j+1)
+     >              - tmp1 * njac(5,5,j+1)
+     >              - tmp1 * dy5
+
+            enddo
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     performs Gaussian elimination on this cell.
+c     
+c     assumes that unpacking routines for non-first cells 
+c     preload C' and rhs' from previous cell.
+c     
+c     assumes send happens outside this routine, but that
+c     c'(JMAX) and rhs'(JMAX) will be sent to next cell
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     multiply c(i,0,k) by b_inverse and copy back to c
+c     multiply rhs(0) by b_inverse(0) and copy to rhs
+c---------------------------------------------------------------------
+            call binvcrhs( lhs(1,1,bb,0),
+     >                        lhs(1,1,cc,0),
+     >                        rhs(1,i,0,k) )
+
+c---------------------------------------------------------------------
+c     begin inner most do loop
+c     do all the elements of the cell unless last 
+c---------------------------------------------------------------------
+            do j=1,jsize-1
+
+c---------------------------------------------------------------------
+c     subtract A*lhs_vector(j-1) from lhs_vector(j)
+c     
+c     rhs(j) = rhs(j) - A*rhs(j-1)
+c---------------------------------------------------------------------
+               call matvec_sub(lhs(1,1,aa,j),
+     >                         rhs(1,i,j-1,k),rhs(1,i,j,k))
+
+c---------------------------------------------------------------------
+c     B(j) = B(j) - A(j)*C(j-1)
+c---------------------------------------------------------------------
+               call matmul_sub(lhs(1,1,aa,j),
+     >                         lhs(1,1,cc,j-1),
+     >                         lhs(1,1,bb,j))
+
+c---------------------------------------------------------------------
+c     multiply c(i,j,k) by b_inverse and copy back to c
+c     multiply rhs(i,1,k) by b_inverse(i,1,k) and copy to rhs
+c---------------------------------------------------------------------
+               call binvcrhs( lhs(1,1,bb,j),
+     >                        lhs(1,1,cc,j),
+     >                        rhs(1,i,j,k) )
+
+            enddo
+
+
+c---------------------------------------------------------------------
+c     rhs(jsize) = rhs(jsize) - A*rhs(jsize-1)
+c---------------------------------------------------------------------
+            call matvec_sub(lhs(1,1,aa,jsize),
+     >                         rhs(1,i,jsize-1,k),rhs(1,i,jsize,k))
+
+c---------------------------------------------------------------------
+c     B(jsize) = B(jsize) - A(jsize)*C(jsize-1)
+c     call matmul_sub(aa,i,jsize,k,c,
+c     $              cc,i,jsize-1,k,c,bb,i,jsize,k)
+c---------------------------------------------------------------------
+            call matmul_sub(lhs(1,1,aa,jsize),
+     >                         lhs(1,1,cc,jsize-1),
+     >                         lhs(1,1,bb,jsize))
+
+c---------------------------------------------------------------------
+c     multiply rhs(jsize) by b_inverse(jsize) and copy to rhs
+c---------------------------------------------------------------------
+            call binvrhs( lhs(1,1,bb,jsize),
+     >                       rhs(1,i,jsize,k) )
+
+
+c---------------------------------------------------------------------
+c     back solve: if last cell, then generate U(jsize)=rhs(jsize)
+c     else assume U(jsize) is loaded in un pack backsub_info
+c     so just use it
+c     after call u(jstart) will be sent to next cell
+c---------------------------------------------------------------------
+      
+            do j=jsize-1,0,-1
+               do m=1,BLOCK_SIZE
+                  do n=1,BLOCK_SIZE
+                     rhs(m,i,j,k) = rhs(m,i,j,k) 
+     >                    - lhs(m,n,cc,j)*rhs(n,i,j+1,k)
+                  enddo
+               enddo
+            enddo
+
+         enddo
+      enddo
+      if (timeron) call timer_stop(t_ysolve)
+
+      return
+      end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/y_solve_vec.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/y_solve_vec.f
new file mode 100644
index 0000000..93e565a
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/y_solve_vec.f
@@ -0,0 +1,432 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine y_solve
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     Performs line solves in Y direction by first factoring
+c     the block-tridiagonal matrix into an upper triangular matrix, 
+c     and then performing back substitution to solve for the unknown
+c     vectors of each line.  
+c     
+c     Make sure we treat elements zero to cell_size in the direction
+c     of the sweep.
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'work_lhs_vec.h'
+
+      integer i, j, k, m, n, jsize
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      if (timeron) call timer_start(t_ysolve)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     This function computes the left hand side for the three y-factors 
+c---------------------------------------------------------------------
+
+      jsize = grid_points(2)-1
+
+c---------------------------------------------------------------------
+c     Compute the indices for storing the tri-diagonal matrix;
+c     determine a (labeled f) and n jacobians for cell c
+c---------------------------------------------------------------------
+!$omp parallel do default(shared) shared(jsize)
+!$omp& private(i,j,k,m,n)
+      do k = 1, grid_points(3)-2
+         do j = 0, jsize
+            do i = 1, grid_points(1)-2
+
+               tmp1 = rho_i(i,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               fjac(1,1,i,j) = 0.0d+00
+               fjac(1,2,i,j) = 0.0d+00
+               fjac(1,3,i,j) = 1.0d+00
+               fjac(1,4,i,j) = 0.0d+00
+               fjac(1,5,i,j) = 0.0d+00
+
+               fjac(2,1,i,j) = - ( u(2,i,j,k)*u(3,i,j,k) )
+     >              * tmp2
+               fjac(2,2,i,j) = u(3,i,j,k) * tmp1
+               fjac(2,3,i,j) = u(2,i,j,k) * tmp1
+               fjac(2,4,i,j) = 0.0d+00
+               fjac(2,5,i,j) = 0.0d+00
+
+               fjac(3,1,i,j) = - ( u(3,i,j,k)*u(3,i,j,k)*tmp2)
+     >              + c2 * qs(i,j,k)
+               fjac(3,2,i,j) = - c2 *  u(2,i,j,k) * tmp1
+               fjac(3,3,i,j) = ( 2.0d+00 - c2 )
+     >              *  u(3,i,j,k) * tmp1 
+               fjac(3,4,i,j) = - c2 * u(4,i,j,k) * tmp1 
+               fjac(3,5,i,j) = c2
+
+               fjac(4,1,i,j) = - ( u(3,i,j,k)*u(4,i,j,k) )
+     >              * tmp2
+               fjac(4,2,i,j) = 0.0d+00
+               fjac(4,3,i,j) = u(4,i,j,k) * tmp1
+               fjac(4,4,i,j) = u(3,i,j,k) * tmp1
+               fjac(4,5,i,j) = 0.0d+00
+
+               fjac(5,1,i,j) = ( c2 * 2.0d0 * square(i,j,k)
+     >              - c1 * u(5,i,j,k) )
+     >              * u(3,i,j,k) * tmp2
+               fjac(5,2,i,j) = - c2 * u(2,i,j,k)*u(3,i,j,k) 
+     >              * tmp2
+               fjac(5,3,i,j) = c1 * u(5,i,j,k) * tmp1 
+     >              - c2 
+     >              * ( qs(i,j,k)
+     >              + u(3,i,j,k)*u(3,i,j,k) * tmp2 )
+               fjac(5,4,i,j) = - c2 * ( u(3,i,j,k)*u(4,i,j,k) )
+     >              * tmp2
+               fjac(5,5,i,j) = c1 * u(3,i,j,k) * tmp1 
+
+               njac(1,1,i,j) = 0.0d+00
+               njac(1,2,i,j) = 0.0d+00
+               njac(1,3,i,j) = 0.0d+00
+               njac(1,4,i,j) = 0.0d+00
+               njac(1,5,i,j) = 0.0d+00
+
+               njac(2,1,i,j) = - c3c4 * tmp2 * u(2,i,j,k)
+               njac(2,2,i,j) =   c3c4 * tmp1
+               njac(2,3,i,j) =   0.0d+00
+               njac(2,4,i,j) =   0.0d+00
+               njac(2,5,i,j) =   0.0d+00
+
+               njac(3,1,i,j) = - con43 * c3c4 * tmp2 * u(3,i,j,k)
+               njac(3,2,i,j) =   0.0d+00
+               njac(3,3,i,j) =   con43 * c3c4 * tmp1
+               njac(3,4,i,j) =   0.0d+00
+               njac(3,5,i,j) =   0.0d+00
+
+               njac(4,1,i,j) = - c3c4 * tmp2 * u(4,i,j,k)
+               njac(4,2,i,j) =   0.0d+00
+               njac(4,3,i,j) =   0.0d+00
+               njac(4,4,i,j) =   c3c4 * tmp1
+               njac(4,5,i,j) =   0.0d+00
+
+               njac(5,1,i,j) = - (  c3c4
+     >              - c1345 ) * tmp3 * (u(2,i,j,k)**2)
+     >              - ( con43 * c3c4
+     >              - c1345 ) * tmp3 * (u(3,i,j,k)**2)
+     >              - ( c3c4 - c1345 ) * tmp3 * (u(4,i,j,k)**2)
+     >              - c1345 * tmp2 * u(5,i,j,k)
+
+               njac(5,2,i,j) = (  c3c4 - c1345 ) * tmp2 * u(2,i,j,k)
+               njac(5,3,i,j) = ( con43 * c3c4
+     >              - c1345 ) * tmp2 * u(3,i,j,k)
+               njac(5,4,i,j) = ( c3c4 - c1345 ) * tmp2 * u(4,i,j,k)
+               njac(5,5,i,j) = ( c1345 ) * tmp1
+
+            enddo
+         enddo
+
+c---------------------------------------------------------------------
+c     zero the whole left hand side for starters
+c     set all diagonal values to 1. This is overkill, but convenient
+c---------------------------------------------------------------------
+         do i = 1, grid_points(1)-2
+            do m = 1, 5
+               do n = 1, 5
+                  lhs(m,n,aa,i,0) = 0.0d0
+                  lhs(m,n,bb,i,0) = 0.0d0
+                  lhs(m,n,cc,i,0) = 0.0d0
+                  lhs(m,n,aa,i,jsize) = 0.0d0
+                  lhs(m,n,bb,i,jsize) = 0.0d0
+                  lhs(m,n,cc,i,jsize) = 0.0d0
+               end do
+               lhs(m,m,bb,i,0) = 1.0d0
+               lhs(m,m,bb,i,jsize) = 1.0d0
+            end do
+         enddo
+
+c---------------------------------------------------------------------
+c     now jacobians set, so form left hand side in y direction
+c---------------------------------------------------------------------
+         do j = 1, jsize-1
+            do i = 1, grid_points(1)-2
+
+               tmp1 = dt * ty1
+               tmp2 = dt * ty2
+
+               lhs(1,1,aa,i,j) = - tmp2 * fjac(1,1,i,j-1)
+     >              - tmp1 * njac(1,1,i,j-1)
+     >              - tmp1 * dy1 
+               lhs(1,2,aa,i,j) = - tmp2 * fjac(1,2,i,j-1)
+     >              - tmp1 * njac(1,2,i,j-1)
+               lhs(1,3,aa,i,j) = - tmp2 * fjac(1,3,i,j-1)
+     >              - tmp1 * njac(1,3,i,j-1)
+               lhs(1,4,aa,i,j) = - tmp2 * fjac(1,4,i,j-1)
+     >              - tmp1 * njac(1,4,i,j-1)
+               lhs(1,5,aa,i,j) = - tmp2 * fjac(1,5,i,j-1)
+     >              - tmp1 * njac(1,5,i,j-1)
+
+               lhs(2,1,aa,i,j) = - tmp2 * fjac(2,1,i,j-1)
+     >              - tmp1 * njac(2,1,i,j-1)
+               lhs(2,2,aa,i,j) = - tmp2 * fjac(2,2,i,j-1)
+     >              - tmp1 * njac(2,2,i,j-1)
+     >              - tmp1 * dy2
+               lhs(2,3,aa,i,j) = - tmp2 * fjac(2,3,i,j-1)
+     >              - tmp1 * njac(2,3,i,j-1)
+               lhs(2,4,aa,i,j) = - tmp2 * fjac(2,4,i,j-1)
+     >              - tmp1 * njac(2,4,i,j-1)
+               lhs(2,5,aa,i,j) = - tmp2 * fjac(2,5,i,j-1)
+     >              - tmp1 * njac(2,5,i,j-1)
+
+               lhs(3,1,aa,i,j) = - tmp2 * fjac(3,1,i,j-1)
+     >              - tmp1 * njac(3,1,i,j-1)
+               lhs(3,2,aa,i,j) = - tmp2 * fjac(3,2,i,j-1)
+     >              - tmp1 * njac(3,2,i,j-1)
+               lhs(3,3,aa,i,j) = - tmp2 * fjac(3,3,i,j-1)
+     >              - tmp1 * njac(3,3,i,j-1)
+     >              - tmp1 * dy3 
+               lhs(3,4,aa,i,j) = - tmp2 * fjac(3,4,i,j-1)
+     >              - tmp1 * njac(3,4,i,j-1)
+               lhs(3,5,aa,i,j) = - tmp2 * fjac(3,5,i,j-1)
+     >              - tmp1 * njac(3,5,i,j-1)
+
+               lhs(4,1,aa,i,j) = - tmp2 * fjac(4,1,i,j-1)
+     >              - tmp1 * njac(4,1,i,j-1)
+               lhs(4,2,aa,i,j) = - tmp2 * fjac(4,2,i,j-1)
+     >              - tmp1 * njac(4,2,i,j-1)
+               lhs(4,3,aa,i,j) = - tmp2 * fjac(4,3,i,j-1)
+     >              - tmp1 * njac(4,3,i,j-1)
+               lhs(4,4,aa,i,j) = - tmp2 * fjac(4,4,i,j-1)
+     >              - tmp1 * njac(4,4,i,j-1)
+     >              - tmp1 * dy4
+               lhs(4,5,aa,i,j) = - tmp2 * fjac(4,5,i,j-1)
+     >              - tmp1 * njac(4,5,i,j-1)
+
+               lhs(5,1,aa,i,j) = - tmp2 * fjac(5,1,i,j-1)
+     >              - tmp1 * njac(5,1,i,j-1)
+               lhs(5,2,aa,i,j) = - tmp2 * fjac(5,2,i,j-1)
+     >              - tmp1 * njac(5,2,i,j-1)
+               lhs(5,3,aa,i,j) = - tmp2 * fjac(5,3,i,j-1)
+     >              - tmp1 * njac(5,3,i,j-1)
+               lhs(5,4,aa,i,j) = - tmp2 * fjac(5,4,i,j-1)
+     >              - tmp1 * njac(5,4,i,j-1)
+               lhs(5,5,aa,i,j) = - tmp2 * fjac(5,5,i,j-1)
+     >              - tmp1 * njac(5,5,i,j-1)
+     >              - tmp1 * dy5
+
+               lhs(1,1,bb,i,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(1,1,i,j)
+     >              + tmp1 * 2.0d+00 * dy1
+               lhs(1,2,bb,i,j) = tmp1 * 2.0d+00 * njac(1,2,i,j)
+               lhs(1,3,bb,i,j) = tmp1 * 2.0d+00 * njac(1,3,i,j)
+               lhs(1,4,bb,i,j) = tmp1 * 2.0d+00 * njac(1,4,i,j)
+               lhs(1,5,bb,i,j) = tmp1 * 2.0d+00 * njac(1,5,i,j)
+
+               lhs(2,1,bb,i,j) = tmp1 * 2.0d+00 * njac(2,1,i,j)
+               lhs(2,2,bb,i,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(2,2,i,j)
+     >              + tmp1 * 2.0d+00 * dy2
+               lhs(2,3,bb,i,j) = tmp1 * 2.0d+00 * njac(2,3,i,j)
+               lhs(2,4,bb,i,j) = tmp1 * 2.0d+00 * njac(2,4,i,j)
+               lhs(2,5,bb,i,j) = tmp1 * 2.0d+00 * njac(2,5,i,j)
+
+               lhs(3,1,bb,i,j) = tmp1 * 2.0d+00 * njac(3,1,i,j)
+               lhs(3,2,bb,i,j) = tmp1 * 2.0d+00 * njac(3,2,i,j)
+               lhs(3,3,bb,i,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(3,3,i,j)
+     >              + tmp1 * 2.0d+00 * dy3
+               lhs(3,4,bb,i,j) = tmp1 * 2.0d+00 * njac(3,4,i,j)
+               lhs(3,5,bb,i,j) = tmp1 * 2.0d+00 * njac(3,5,i,j)
+
+               lhs(4,1,bb,i,j) = tmp1 * 2.0d+00 * njac(4,1,i,j)
+               lhs(4,2,bb,i,j) = tmp1 * 2.0d+00 * njac(4,2,i,j)
+               lhs(4,3,bb,i,j) = tmp1 * 2.0d+00 * njac(4,3,i,j)
+               lhs(4,4,bb,i,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(4,4,i,j)
+     >              + tmp1 * 2.0d+00 * dy4
+               lhs(4,5,bb,i,j) = tmp1 * 2.0d+00 * njac(4,5,i,j)
+
+               lhs(5,1,bb,i,j) = tmp1 * 2.0d+00 * njac(5,1,i,j)
+               lhs(5,2,bb,i,j) = tmp1 * 2.0d+00 * njac(5,2,i,j)
+               lhs(5,3,bb,i,j) = tmp1 * 2.0d+00 * njac(5,3,i,j)
+               lhs(5,4,bb,i,j) = tmp1 * 2.0d+00 * njac(5,4,i,j)
+               lhs(5,5,bb,i,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(5,5,i,j) 
+     >              + tmp1 * 2.0d+00 * dy5
+
+               lhs(1,1,cc,i,j) =  tmp2 * fjac(1,1,i,j+1)
+     >              - tmp1 * njac(1,1,i,j+1)
+     >              - tmp1 * dy1
+               lhs(1,2,cc,i,j) =  tmp2 * fjac(1,2,i,j+1)
+     >              - tmp1 * njac(1,2,i,j+1)
+               lhs(1,3,cc,i,j) =  tmp2 * fjac(1,3,i,j+1)
+     >              - tmp1 * njac(1,3,i,j+1)
+               lhs(1,4,cc,i,j) =  tmp2 * fjac(1,4,i,j+1)
+     >              - tmp1 * njac(1,4,i,j+1)
+               lhs(1,5,cc,i,j) =  tmp2 * fjac(1,5,i,j+1)
+     >              - tmp1 * njac(1,5,i,j+1)
+
+               lhs(2,1,cc,i,j) =  tmp2 * fjac(2,1,i,j+1)
+     >              - tmp1 * njac(2,1,i,j+1)
+               lhs(2,2,cc,i,j) =  tmp2 * fjac(2,2,i,j+1)
+     >              - tmp1 * njac(2,2,i,j+1)
+     >              - tmp1 * dy2
+               lhs(2,3,cc,i,j) =  tmp2 * fjac(2,3,i,j+1)
+     >              - tmp1 * njac(2,3,i,j+1)
+               lhs(2,4,cc,i,j) =  tmp2 * fjac(2,4,i,j+1)
+     >              - tmp1 * njac(2,4,i,j+1)
+               lhs(2,5,cc,i,j) =  tmp2 * fjac(2,5,i,j+1)
+     >              - tmp1 * njac(2,5,i,j+1)
+
+               lhs(3,1,cc,i,j) =  tmp2 * fjac(3,1,i,j+1)
+     >              - tmp1 * njac(3,1,i,j+1)
+               lhs(3,2,cc,i,j) =  tmp2 * fjac(3,2,i,j+1)
+     >              - tmp1 * njac(3,2,i,j+1)
+               lhs(3,3,cc,i,j) =  tmp2 * fjac(3,3,i,j+1)
+     >              - tmp1 * njac(3,3,i,j+1)
+     >              - tmp1 * dy3
+               lhs(3,4,cc,i,j) =  tmp2 * fjac(3,4,i,j+1)
+     >              - tmp1 * njac(3,4,i,j+1)
+               lhs(3,5,cc,i,j) =  tmp2 * fjac(3,5,i,j+1)
+     >              - tmp1 * njac(3,5,i,j+1)
+
+               lhs(4,1,cc,i,j) =  tmp2 * fjac(4,1,i,j+1)
+     >              - tmp1 * njac(4,1,i,j+1)
+               lhs(4,2,cc,i,j) =  tmp2 * fjac(4,2,i,j+1)
+     >              - tmp1 * njac(4,2,i,j+1)
+               lhs(4,3,cc,i,j) =  tmp2 * fjac(4,3,i,j+1)
+     >              - tmp1 * njac(4,3,i,j+1)
+               lhs(4,4,cc,i,j) =  tmp2 * fjac(4,4,i,j+1)
+     >              - tmp1 * njac(4,4,i,j+1)
+     >              - tmp1 * dy4
+               lhs(4,5,cc,i,j) =  tmp2 * fjac(4,5,i,j+1)
+     >              - tmp1 * njac(4,5,i,j+1)
+
+               lhs(5,1,cc,i,j) =  tmp2 * fjac(5,1,i,j+1)
+     >              - tmp1 * njac(5,1,i,j+1)
+               lhs(5,2,cc,i,j) =  tmp2 * fjac(5,2,i,j+1)
+     >              - tmp1 * njac(5,2,i,j+1)
+               lhs(5,3,cc,i,j) =  tmp2 * fjac(5,3,i,j+1)
+     >              - tmp1 * njac(5,3,i,j+1)
+               lhs(5,4,cc,i,j) =  tmp2 * fjac(5,4,i,j+1)
+     >              - tmp1 * njac(5,4,i,j+1)
+               lhs(5,5,cc,i,j) =  tmp2 * fjac(5,5,i,j+1)
+     >              - tmp1 * njac(5,5,i,j+1)
+     >              - tmp1 * dy5
+
+            enddo
+         enddo
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     performs gaussian elimination on this cell.
+c     
+c     assumes that unpacking routines for non-first cells 
+c     preload C' and rhs' from previous cell.
+c     
+c     assumed send happens outside this routine, but that
+c     c'(JMAX) and rhs'(JMAX) will be sent to next cell
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     multiply c(i,0,k) by b_inverse and copy back to c
+c     multiply rhs(0) by b_inverse(0) and copy to rhs
+c---------------------------------------------------------------------
+!dir$ ivdep
+         do i = 1, grid_points(1)-2
+            call binvcrhs( lhs(1,1,bb,i,0),
+     >                        lhs(1,1,cc,i,0),
+     >                        rhs(1,i,0,k) )
+         enddo
+
+c---------------------------------------------------------------------
+c     begin innermost do loop
+c     do all the elements of the cell unless last 
+c---------------------------------------------------------------------
+         do j=1,jsize-1
+!dir$ ivdep
+            do i = 1, grid_points(1)-2
+
+c---------------------------------------------------------------------
+c     subtract A*lhs_vector(j-1) from lhs_vector(j)
+c     
+c     rhs(j) = rhs(j) - A*rhs(j-1)
+c---------------------------------------------------------------------
+               call matvec_sub(lhs(1,1,aa,i,j),
+     >                         rhs(1,i,j-1,k),rhs(1,i,j,k))
+
+c---------------------------------------------------------------------
+c     B(j) = B(j) - C(j-1)*A(j)
+c---------------------------------------------------------------------
+               call matmul_sub(lhs(1,1,aa,i,j),
+     >                         lhs(1,1,cc,i,j-1),
+     >                         lhs(1,1,bb,i,j))
+
+c---------------------------------------------------------------------
+c     multiply c(i,j,k) by b_inverse and copy back to c
+c     multiply rhs(i,j,k) by b_inverse(i,j,k) and copy to rhs
+c---------------------------------------------------------------------
+               call binvcrhs( lhs(1,1,bb,i,j),
+     >                        lhs(1,1,cc,i,j),
+     >                        rhs(1,i,j,k) )
+
+            enddo
+         enddo
+
+
+c---------------------------------------------------------------------
+c     rhs(jsize) = rhs(jsize) - A*rhs(jsize-1)
+c---------------------------------------------------------------------
+!dir$ ivdep
+         do i = 1, grid_points(1)-2
+            call matvec_sub(lhs(1,1,aa,i,jsize),
+     >                         rhs(1,i,jsize-1,k),rhs(1,i,jsize,k))
+
+c---------------------------------------------------------------------
+c     B(jsize) = B(jsize) - C(jsize-1)*A(jsize)
+c     call matmul_sub(aa,i,jsize,k,c,
+c     $              cc,i,jsize-1,k,c,bb,i,jsize,k)
+c---------------------------------------------------------------------
+            call matmul_sub(lhs(1,1,aa,i,jsize),
+     >                         lhs(1,1,cc,i,jsize-1),
+     >                         lhs(1,1,bb,i,jsize))
+
+c---------------------------------------------------------------------
+c     multiply rhs(jsize) by b_inverse(jsize) and copy to rhs
+c---------------------------------------------------------------------
+            call binvrhs( lhs(1,1,bb,i,jsize),
+     >                       rhs(1,i,jsize,k) )
+         enddo
+
+
+c---------------------------------------------------------------------
+c     back solve: if last cell, then generate U(jsize)=rhs(jsize)
+c     else assume U(jsize) is loaded in unpack_backsub_info
+c     so just use it
+c     after call u(jstart) will be sent to next cell
+c---------------------------------------------------------------------
+      
+         do j=jsize-1,0,-1
+            do i = 1, grid_points(1)-2
+               do m=1,BLOCK_SIZE
+                  do n=1,BLOCK_SIZE
+                     rhs(m,i,j,k) = rhs(m,i,j,k) 
+     >                    - lhs(m,n,cc,i,j)*rhs(n,i,j+1,k)
+                  enddo
+               enddo
+            enddo
+         enddo
+
+      enddo
+      if (timeron) call timer_stop(t_ysolve)
+
+      return
+      end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/z_solve.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/z_solve.f
new file mode 100644
index 0000000..3c77f6a
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/z_solve.f
@@ -0,0 +1,412 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine z_solve
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     Performs line solves in Z direction by first factoring
+c     the block-tridiagonal matrix into an upper triangular matrix, 
+c     and then performing back substitution to solve for the unknown
+c     vectors of each line.  
+c     
+c     Make sure we treat elements zero to cell_size in the direction
+c     of the sweep.
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'work_lhs.h'
+
+      integer i, j, k, m, n, ksize
+      
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      if (timeron) call timer_start(t_zsolve)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     This function computes the left hand side for the three z-factors   
+c---------------------------------------------------------------------
+
+      ksize = grid_points(3)-1
+
+c---------------------------------------------------------------------
+c     Compute the indices for storing the block-diagonal matrix;
+c     determine c (labeled f) and s jacobians
+c---------------------------------------------------------------------
+!$omp parallel do default(shared) shared(ksize)
+!$omp& private(i,j,k,m,n)
+      do j = 1, grid_points(2)-2
+         do i = 1, grid_points(1)-2
+            do k = 0, ksize
+
+               tmp1 = 1.0d+00 / u(1,i,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               fjac(1,1,k) = 0.0d+00
+               fjac(1,2,k) = 0.0d+00
+               fjac(1,3,k) = 0.0d+00
+               fjac(1,4,k) = 1.0d+00
+               fjac(1,5,k) = 0.0d+00
+
+               fjac(2,1,k) = - ( u(2,i,j,k)*u(4,i,j,k) ) 
+     >              * tmp2 
+               fjac(2,2,k) = u(4,i,j,k) * tmp1
+               fjac(2,3,k) = 0.0d+00
+               fjac(2,4,k) = u(2,i,j,k) * tmp1
+               fjac(2,5,k) = 0.0d+00
+
+               fjac(3,1,k) = - ( u(3,i,j,k)*u(4,i,j,k) )
+     >              * tmp2 
+               fjac(3,2,k) = 0.0d+00
+               fjac(3,3,k) = u(4,i,j,k) * tmp1
+               fjac(3,4,k) = u(3,i,j,k) * tmp1
+               fjac(3,5,k) = 0.0d+00
+
+               fjac(4,1,k) = - (u(4,i,j,k)*u(4,i,j,k) * tmp2 ) 
+     >              + c2 * qs(i,j,k)
+               fjac(4,2,k) = - c2 *  u(2,i,j,k) * tmp1 
+               fjac(4,3,k) = - c2 *  u(3,i,j,k) * tmp1
+               fjac(4,4,k) = ( 2.0d+00 - c2 )
+     >              *  u(4,i,j,k) * tmp1 
+               fjac(4,5,k) = c2
+
+               fjac(5,1,k) = ( c2 * 2.0d0 * square(i,j,k) 
+     >              - c1 * u(5,i,j,k) )
+     >              * u(4,i,j,k) * tmp2
+               fjac(5,2,k) = - c2 * ( u(2,i,j,k)*u(4,i,j,k) )
+     >              * tmp2 
+               fjac(5,3,k) = - c2 * ( u(3,i,j,k)*u(4,i,j,k) )
+     >              * tmp2
+               fjac(5,4,k) = c1 * ( u(5,i,j,k) * tmp1 )
+     >              - c2
+     >              * ( qs(i,j,k)
+     >              + u(4,i,j,k)*u(4,i,j,k) * tmp2 )
+               fjac(5,5,k) = c1 * u(4,i,j,k) * tmp1
+
+               njac(1,1,k) = 0.0d+00
+               njac(1,2,k) = 0.0d+00
+               njac(1,3,k) = 0.0d+00
+               njac(1,4,k) = 0.0d+00
+               njac(1,5,k) = 0.0d+00
+
+               njac(2,1,k) = - c3c4 * tmp2 * u(2,i,j,k)
+               njac(2,2,k) =   c3c4 * tmp1
+               njac(2,3,k) =   0.0d+00
+               njac(2,4,k) =   0.0d+00
+               njac(2,5,k) =   0.0d+00
+
+               njac(3,1,k) = - c3c4 * tmp2 * u(3,i,j,k)
+               njac(3,2,k) =   0.0d+00
+               njac(3,3,k) =   c3c4 * tmp1
+               njac(3,4,k) =   0.0d+00
+               njac(3,5,k) =   0.0d+00
+
+               njac(4,1,k) = - con43 * c3c4 * tmp2 * u(4,i,j,k)
+               njac(4,2,k) =   0.0d+00
+               njac(4,3,k) =   0.0d+00
+               njac(4,4,k) =   con43 * c3 * c4 * tmp1
+               njac(4,5,k) =   0.0d+00
+
+               njac(5,1,k) = - (  c3c4
+     >              - c1345 ) * tmp3 * (u(2,i,j,k)**2)
+     >              - ( c3c4 - c1345 ) * tmp3 * (u(3,i,j,k)**2)
+     >              - ( con43 * c3c4
+     >              - c1345 ) * tmp3 * (u(4,i,j,k)**2)
+     >              - c1345 * tmp2 * u(5,i,j,k)
+
+               njac(5,2,k) = (  c3c4 - c1345 ) * tmp2 * u(2,i,j,k)
+               njac(5,3,k) = (  c3c4 - c1345 ) * tmp2 * u(3,i,j,k)
+               njac(5,4,k) = ( con43 * c3c4
+     >              - c1345 ) * tmp2 * u(4,i,j,k)
+               njac(5,5,k) = ( c1345 )* tmp1
+
+
+            enddo
+
+c---------------------------------------------------------------------
+c     now jacobians set, so form left hand side in z direction
+c---------------------------------------------------------------------
+            call lhsinit(lhs, ksize)
+            do k = 1, ksize-1
+
+               tmp1 = dt * tz1
+               tmp2 = dt * tz2
+
+               lhs(1,1,aa,k) = - tmp2 * fjac(1,1,k-1)
+     >              - tmp1 * njac(1,1,k-1)
+     >              - tmp1 * dz1 
+               lhs(1,2,aa,k) = - tmp2 * fjac(1,2,k-1)
+     >              - tmp1 * njac(1,2,k-1)
+               lhs(1,3,aa,k) = - tmp2 * fjac(1,3,k-1)
+     >              - tmp1 * njac(1,3,k-1)
+               lhs(1,4,aa,k) = - tmp2 * fjac(1,4,k-1)
+     >              - tmp1 * njac(1,4,k-1)
+               lhs(1,5,aa,k) = - tmp2 * fjac(1,5,k-1)
+     >              - tmp1 * njac(1,5,k-1)
+
+               lhs(2,1,aa,k) = - tmp2 * fjac(2,1,k-1)
+     >              - tmp1 * njac(2,1,k-1)
+               lhs(2,2,aa,k) = - tmp2 * fjac(2,2,k-1)
+     >              - tmp1 * njac(2,2,k-1)
+     >              - tmp1 * dz2
+               lhs(2,3,aa,k) = - tmp2 * fjac(2,3,k-1)
+     >              - tmp1 * njac(2,3,k-1)
+               lhs(2,4,aa,k) = - tmp2 * fjac(2,4,k-1)
+     >              - tmp1 * njac(2,4,k-1)
+               lhs(2,5,aa,k) = - tmp2 * fjac(2,5,k-1)
+     >              - tmp1 * njac(2,5,k-1)
+
+               lhs(3,1,aa,k) = - tmp2 * fjac(3,1,k-1)
+     >              - tmp1 * njac(3,1,k-1)
+               lhs(3,2,aa,k) = - tmp2 * fjac(3,2,k-1)
+     >              - tmp1 * njac(3,2,k-1)
+               lhs(3,3,aa,k) = - tmp2 * fjac(3,3,k-1)
+     >              - tmp1 * njac(3,3,k-1)
+     >              - tmp1 * dz3 
+               lhs(3,4,aa,k) = - tmp2 * fjac(3,4,k-1)
+     >              - tmp1 * njac(3,4,k-1)
+               lhs(3,5,aa,k) = - tmp2 * fjac(3,5,k-1)
+     >              - tmp1 * njac(3,5,k-1)
+
+               lhs(4,1,aa,k) = - tmp2 * fjac(4,1,k-1)
+     >              - tmp1 * njac(4,1,k-1)
+               lhs(4,2,aa,k) = - tmp2 * fjac(4,2,k-1)
+     >              - tmp1 * njac(4,2,k-1)
+               lhs(4,3,aa,k) = - tmp2 * fjac(4,3,k-1)
+     >              - tmp1 * njac(4,3,k-1)
+               lhs(4,4,aa,k) = - tmp2 * fjac(4,4,k-1)
+     >              - tmp1 * njac(4,4,k-1)
+     >              - tmp1 * dz4
+               lhs(4,5,aa,k) = - tmp2 * fjac(4,5,k-1)
+     >              - tmp1 * njac(4,5,k-1)
+
+               lhs(5,1,aa,k) = - tmp2 * fjac(5,1,k-1)
+     >              - tmp1 * njac(5,1,k-1)
+               lhs(5,2,aa,k) = - tmp2 * fjac(5,2,k-1)
+     >              - tmp1 * njac(5,2,k-1)
+               lhs(5,3,aa,k) = - tmp2 * fjac(5,3,k-1)
+     >              - tmp1 * njac(5,3,k-1)
+               lhs(5,4,aa,k) = - tmp2 * fjac(5,4,k-1)
+     >              - tmp1 * njac(5,4,k-1)
+               lhs(5,5,aa,k) = - tmp2 * fjac(5,5,k-1)
+     >              - tmp1 * njac(5,5,k-1)
+     >              - tmp1 * dz5
+
+               lhs(1,1,bb,k) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(1,1,k)
+     >              + tmp1 * 2.0d+00 * dz1
+               lhs(1,2,bb,k) = tmp1 * 2.0d+00 * njac(1,2,k)
+               lhs(1,3,bb,k) = tmp1 * 2.0d+00 * njac(1,3,k)
+               lhs(1,4,bb,k) = tmp1 * 2.0d+00 * njac(1,4,k)
+               lhs(1,5,bb,k) = tmp1 * 2.0d+00 * njac(1,5,k)
+
+               lhs(2,1,bb,k) = tmp1 * 2.0d+00 * njac(2,1,k)
+               lhs(2,2,bb,k) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(2,2,k)
+     >              + tmp1 * 2.0d+00 * dz2
+               lhs(2,3,bb,k) = tmp1 * 2.0d+00 * njac(2,3,k)
+               lhs(2,4,bb,k) = tmp1 * 2.0d+00 * njac(2,4,k)
+               lhs(2,5,bb,k) = tmp1 * 2.0d+00 * njac(2,5,k)
+
+               lhs(3,1,bb,k) = tmp1 * 2.0d+00 * njac(3,1,k)
+               lhs(3,2,bb,k) = tmp1 * 2.0d+00 * njac(3,2,k)
+               lhs(3,3,bb,k) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(3,3,k)
+     >              + tmp1 * 2.0d+00 * dz3
+               lhs(3,4,bb,k) = tmp1 * 2.0d+00 * njac(3,4,k)
+               lhs(3,5,bb,k) = tmp1 * 2.0d+00 * njac(3,5,k)
+
+               lhs(4,1,bb,k) = tmp1 * 2.0d+00 * njac(4,1,k)
+               lhs(4,2,bb,k) = tmp1 * 2.0d+00 * njac(4,2,k)
+               lhs(4,3,bb,k) = tmp1 * 2.0d+00 * njac(4,3,k)
+               lhs(4,4,bb,k) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(4,4,k)
+     >              + tmp1 * 2.0d+00 * dz4
+               lhs(4,5,bb,k) = tmp1 * 2.0d+00 * njac(4,5,k)
+
+               lhs(5,1,bb,k) = tmp1 * 2.0d+00 * njac(5,1,k)
+               lhs(5,2,bb,k) = tmp1 * 2.0d+00 * njac(5,2,k)
+               lhs(5,3,bb,k) = tmp1 * 2.0d+00 * njac(5,3,k)
+               lhs(5,4,bb,k) = tmp1 * 2.0d+00 * njac(5,4,k)
+               lhs(5,5,bb,k) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(5,5,k) 
+     >              + tmp1 * 2.0d+00 * dz5
+
+               lhs(1,1,cc,k) =  tmp2 * fjac(1,1,k+1)
+     >              - tmp1 * njac(1,1,k+1)
+     >              - tmp1 * dz1
+               lhs(1,2,cc,k) =  tmp2 * fjac(1,2,k+1)
+     >              - tmp1 * njac(1,2,k+1)
+               lhs(1,3,cc,k) =  tmp2 * fjac(1,3,k+1)
+     >              - tmp1 * njac(1,3,k+1)
+               lhs(1,4,cc,k) =  tmp2 * fjac(1,4,k+1)
+     >              - tmp1 * njac(1,4,k+1)
+               lhs(1,5,cc,k) =  tmp2 * fjac(1,5,k+1)
+     >              - tmp1 * njac(1,5,k+1)
+
+               lhs(2,1,cc,k) =  tmp2 * fjac(2,1,k+1)
+     >              - tmp1 * njac(2,1,k+1)
+               lhs(2,2,cc,k) =  tmp2 * fjac(2,2,k+1)
+     >              - tmp1 * njac(2,2,k+1)
+     >              - tmp1 * dz2
+               lhs(2,3,cc,k) =  tmp2 * fjac(2,3,k+1)
+     >              - tmp1 * njac(2,3,k+1)
+               lhs(2,4,cc,k) =  tmp2 * fjac(2,4,k+1)
+     >              - tmp1 * njac(2,4,k+1)
+               lhs(2,5,cc,k) =  tmp2 * fjac(2,5,k+1)
+     >              - tmp1 * njac(2,5,k+1)
+
+               lhs(3,1,cc,k) =  tmp2 * fjac(3,1,k+1)
+     >              - tmp1 * njac(3,1,k+1)
+               lhs(3,2,cc,k) =  tmp2 * fjac(3,2,k+1)
+     >              - tmp1 * njac(3,2,k+1)
+               lhs(3,3,cc,k) =  tmp2 * fjac(3,3,k+1)
+     >              - tmp1 * njac(3,3,k+1)
+     >              - tmp1 * dz3
+               lhs(3,4,cc,k) =  tmp2 * fjac(3,4,k+1)
+     >              - tmp1 * njac(3,4,k+1)
+               lhs(3,5,cc,k) =  tmp2 * fjac(3,5,k+1)
+     >              - tmp1 * njac(3,5,k+1)
+
+               lhs(4,1,cc,k) =  tmp2 * fjac(4,1,k+1)
+     >              - tmp1 * njac(4,1,k+1)
+               lhs(4,2,cc,k) =  tmp2 * fjac(4,2,k+1)
+     >              - tmp1 * njac(4,2,k+1)
+               lhs(4,3,cc,k) =  tmp2 * fjac(4,3,k+1)
+     >              - tmp1 * njac(4,3,k+1)
+               lhs(4,4,cc,k) =  tmp2 * fjac(4,4,k+1)
+     >              - tmp1 * njac(4,4,k+1)
+     >              - tmp1 * dz4
+               lhs(4,5,cc,k) =  tmp2 * fjac(4,5,k+1)
+     >              - tmp1 * njac(4,5,k+1)
+
+               lhs(5,1,cc,k) =  tmp2 * fjac(5,1,k+1)
+     >              - tmp1 * njac(5,1,k+1)
+               lhs(5,2,cc,k) =  tmp2 * fjac(5,2,k+1)
+     >              - tmp1 * njac(5,2,k+1)
+               lhs(5,3,cc,k) =  tmp2 * fjac(5,3,k+1)
+     >              - tmp1 * njac(5,3,k+1)
+               lhs(5,4,cc,k) =  tmp2 * fjac(5,4,k+1)
+     >              - tmp1 * njac(5,4,k+1)
+               lhs(5,5,cc,k) =  tmp2 * fjac(5,5,k+1)
+     >              - tmp1 * njac(5,5,k+1)
+     >              - tmp1 * dz5
+
+            enddo
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     performs gaussian elimination on this cell.
+c     
+c     assumes that unpacking routines for non-first cells 
+c     preload C' and rhs' from previous cell.
+c     
+c     assumed send happens outside this routine, but that
+c     c'(KMAX) and rhs'(KMAX) will be sent to next cell.
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     outermost do loops - sweeping in i direction
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     multiply c(i,j,0) by b_inverse and copy back to c
+c     multiply rhs(0) by b_inverse(0) and copy to rhs
+c---------------------------------------------------------------------
+            call binvcrhs( lhs(1,1,bb,0),
+     >                        lhs(1,1,cc,0),
+     >                        rhs(1,i,j,0) )
+
+
+c---------------------------------------------------------------------
+c     begin innermost do loop
+c     do all the elements of the cell unless last 
+c---------------------------------------------------------------------
+            do k=1,ksize-1
+
+c---------------------------------------------------------------------
+c     subtract A*lhs_vector(k-1) from lhs_vector(k)
+c     
+c     rhs(k) = rhs(k) - A*rhs(k-1)
+c---------------------------------------------------------------------
+               call matvec_sub(lhs(1,1,aa,k),
+     >                         rhs(1,i,j,k-1),rhs(1,i,j,k))
+
+c---------------------------------------------------------------------
+c     B(k) = B(k) - C(k-1)*A(k)
+c     call matmul_sub(aa,i,j,k,c,cc,i,j,k-1,c,bb,i,j,k)
+c---------------------------------------------------------------------
+               call matmul_sub(lhs(1,1,aa,k),
+     >                         lhs(1,1,cc,k-1),
+     >                         lhs(1,1,bb,k))
+
+c---------------------------------------------------------------------
+c     multiply c(i,j,k) by b_inverse and copy back to c
+c     multiply rhs(i,j,k) by b_inverse(i,j,k) and copy to rhs
+c---------------------------------------------------------------------
+               call binvcrhs( lhs(1,1,bb,k),
+     >                        lhs(1,1,cc,k),
+     >                        rhs(1,i,j,k) )
+
+            enddo
+
+c---------------------------------------------------------------------
+c     Now finish up special cases for last cell
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     rhs(ksize) = rhs(ksize) - A*rhs(ksize-1)
+c---------------------------------------------------------------------
+            call matvec_sub(lhs(1,1,aa,ksize),
+     >                         rhs(1,i,j,ksize-1),rhs(1,i,j,ksize))
+
+c---------------------------------------------------------------------
+c     B(ksize) = B(ksize) - C(ksize-1)*A(ksize)
+c     call matmul_sub(aa,i,j,ksize,c,
+c     $              cc,i,j,ksize-1,c,bb,i,j,ksize)
+c---------------------------------------------------------------------
+            call matmul_sub(lhs(1,1,aa,ksize),
+     >                         lhs(1,1,cc,ksize-1),
+     >                         lhs(1,1,bb,ksize))
+
+c---------------------------------------------------------------------
+c     multiply rhs(ksize) by b_inverse(ksize) and copy to rhs
+c---------------------------------------------------------------------
+            call binvrhs( lhs(1,1,bb,ksize),
+     >                       rhs(1,i,j,ksize) )
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     back solve: if last cell, then generate U(ksize)=rhs(ksize)
+c     else assume U(ksize) is loaded in un pack backsub_info
+c     so just use it
+c     after call u(kstart) will be sent to next cell
+c---------------------------------------------------------------------
+
+            do k=ksize-1,0,-1
+               do m=1,BLOCK_SIZE
+                  do n=1,BLOCK_SIZE
+                     rhs(m,i,j,k) = rhs(m,i,j,k) 
+     >                    - lhs(m,n,cc,k)*rhs(n,i,j,k+1)
+                  enddo
+               enddo
+            enddo
+
+         enddo
+      enddo
+      if (timeron) call timer_stop(t_zsolve)
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/z_solve_vec.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/z_solve_vec.f
new file mode 100644
index 0000000..0c00842
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/BT/z_solve_vec.f
@@ -0,0 +1,443 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine z_solve
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     Performs line solves in Z direction by first factoring
+c     the block-tridiagonal matrix into an upper triangular matrix, 
+c     and then performing back substitution to solve for the unknown
+c     vectors of each line.  
+c     
+c     Make sure we treat elements zero to cell_size in the direction
+c     of the sweep.
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'work_lhs_vec.h'
+
+      integer i, j, k, m, n, ksize
+      
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      if (timeron) call timer_start(t_zsolve)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     This function computes the left hand side for the three z-factors 
+c---------------------------------------------------------------------
+
+      ksize = grid_points(3)-1
+
+c---------------------------------------------------------------------
+c     Compute the indices for storing the block-diagonal matrix;
+c     determine c (labeled f) and s jacobians
+c---------------------------------------------------------------------
+!$omp parallel do default(shared) shared(ksize)
+!$omp& private(i,j,k,m,n)
+      do j = 1, grid_points(2)-2
+         do k = 0, ksize
+            do i = 1, grid_points(1)-2
+
+               tmp1 = 1.0d+00 / u(1,i,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               fjac(1,1,i,k) = 0.0d+00
+               fjac(1,2,i,k) = 0.0d+00
+               fjac(1,3,i,k) = 0.0d+00
+               fjac(1,4,i,k) = 1.0d+00
+               fjac(1,5,i,k) = 0.0d+00
+
+               fjac(2,1,i,k) = - ( u(2,i,j,k)*u(4,i,j,k) ) 
+     >              * tmp2 
+               fjac(2,2,i,k) = u(4,i,j,k) * tmp1
+               fjac(2,3,i,k) = 0.0d+00
+               fjac(2,4,i,k) = u(2,i,j,k) * tmp1
+               fjac(2,5,i,k) = 0.0d+00
+
+               fjac(3,1,i,k) = - ( u(3,i,j,k)*u(4,i,j,k) )
+     >              * tmp2 
+               fjac(3,2,i,k) = 0.0d+00
+               fjac(3,3,i,k) = u(4,i,j,k) * tmp1
+               fjac(3,4,i,k) = u(3,i,j,k) * tmp1
+               fjac(3,5,i,k) = 0.0d+00
+
+               fjac(4,1,i,k) = - (u(4,i,j,k)*u(4,i,j,k) * tmp2 ) 
+     >              + c2 * qs(i,j,k)
+               fjac(4,2,i,k) = - c2 *  u(2,i,j,k) * tmp1 
+               fjac(4,3,i,k) = - c2 *  u(3,i,j,k) * tmp1
+               fjac(4,4,i,k) = ( 2.0d+00 - c2 )
+     >              *  u(4,i,j,k) * tmp1 
+               fjac(4,5,i,k) = c2
+
+               fjac(5,1,i,k) = ( c2 * 2.0d0 * square(i,j,k) 
+     >              - c1 * u(5,i,j,k) )
+     >              * u(4,i,j,k) * tmp2
+               fjac(5,2,i,k) = - c2 * ( u(2,i,j,k)*u(4,i,j,k) )
+     >              * tmp2 
+               fjac(5,3,i,k) = - c2 * ( u(3,i,j,k)*u(4,i,j,k) )
+     >              * tmp2
+               fjac(5,4,i,k) = c1 * ( u(5,i,j,k) * tmp1 )
+     >              - c2
+     >              * ( qs(i,j,k)
+     >              + u(4,i,j,k)*u(4,i,j,k) * tmp2 )
+               fjac(5,5,i,k) = c1 * u(4,i,j,k) * tmp1
+
+               njac(1,1,i,k) = 0.0d+00
+               njac(1,2,i,k) = 0.0d+00
+               njac(1,3,i,k) = 0.0d+00
+               njac(1,4,i,k) = 0.0d+00
+               njac(1,5,i,k) = 0.0d+00
+
+               njac(2,1,i,k) = - c3c4 * tmp2 * u(2,i,j,k)
+               njac(2,2,i,k) =   c3c4 * tmp1
+               njac(2,3,i,k) =   0.0d+00
+               njac(2,4,i,k) =   0.0d+00
+               njac(2,5,i,k) =   0.0d+00
+
+               njac(3,1,i,k) = - c3c4 * tmp2 * u(3,i,j,k)
+               njac(3,2,i,k) =   0.0d+00
+               njac(3,3,i,k) =   c3c4 * tmp1
+               njac(3,4,i,k) =   0.0d+00
+               njac(3,5,i,k) =   0.0d+00
+
+               njac(4,1,i,k) = - con43 * c3c4 * tmp2 * u(4,i,j,k)
+               njac(4,2,i,k) =   0.0d+00
+               njac(4,3,i,k) =   0.0d+00
+               njac(4,4,i,k) =   con43 * c3 * c4 * tmp1
+               njac(4,5,i,k) =   0.0d+00
+
+               njac(5,1,i,k) = - (  c3c4
+     >              - c1345 ) * tmp3 * (u(2,i,j,k)**2)
+     >              - ( c3c4 - c1345 ) * tmp3 * (u(3,i,j,k)**2)
+     >              - ( con43 * c3c4
+     >              - c1345 ) * tmp3 * (u(4,i,j,k)**2)
+     >              - c1345 * tmp2 * u(5,i,j,k)
+
+               njac(5,2,i,k) = (  c3c4 - c1345 ) * tmp2 * u(2,i,j,k)
+               njac(5,3,i,k) = (  c3c4 - c1345 ) * tmp2 * u(3,i,j,k)
+               njac(5,4,i,k) = ( con43 * c3c4
+     >              - c1345 ) * tmp2 * u(4,i,j,k)
+               njac(5,5,i,k) = ( c1345 )* tmp1
+
+
+            enddo
+         enddo
+
+c---------------------------------------------------------------------
+c     zero the whole left hand side for starters
+c     set all diagonal values to 1. This is overkill, but convenient
+c---------------------------------------------------------------------
+         do i = 1, grid_points(1)-2
+            do m = 1, 5
+               do n = 1, 5
+                  lhs(m,n,aa,i,0) = 0.0d0
+                  lhs(m,n,bb,i,0) = 0.0d0
+                  lhs(m,n,cc,i,0) = 0.0d0
+                  lhs(m,n,aa,i,ksize) = 0.0d0
+                  lhs(m,n,bb,i,ksize) = 0.0d0
+                  lhs(m,n,cc,i,ksize) = 0.0d0
+               end do
+               lhs(m,m,bb,i,0) = 1.0d0
+               lhs(m,m,bb,i,ksize) = 1.0d0
+            end do
+         enddo
+
+c---------------------------------------------------------------------
+c     now jacobians set, so form left hand side in z direction
+c---------------------------------------------------------------------
+         do k = 1, ksize-1
+            do i = 1, grid_points(1)-2
+
+               tmp1 = dt * tz1
+               tmp2 = dt * tz2
+
+               lhs(1,1,aa,i,k) = - tmp2 * fjac(1,1,i,k-1)
+     >              - tmp1 * njac(1,1,i,k-1)
+     >              - tmp1 * dz1 
+               lhs(1,2,aa,i,k) = - tmp2 * fjac(1,2,i,k-1)
+     >              - tmp1 * njac(1,2,i,k-1)
+               lhs(1,3,aa,i,k) = - tmp2 * fjac(1,3,i,k-1)
+     >              - tmp1 * njac(1,3,i,k-1)
+               lhs(1,4,aa,i,k) = - tmp2 * fjac(1,4,i,k-1)
+     >              - tmp1 * njac(1,4,i,k-1)
+               lhs(1,5,aa,i,k) = - tmp2 * fjac(1,5,i,k-1)
+     >              - tmp1 * njac(1,5,i,k-1)
+
+               lhs(2,1,aa,i,k) = - tmp2 * fjac(2,1,i,k-1)
+     >              - tmp1 * njac(2,1,i,k-1)
+               lhs(2,2,aa,i,k) = - tmp2 * fjac(2,2,i,k-1)
+     >              - tmp1 * njac(2,2,i,k-1)
+     >              - tmp1 * dz2
+               lhs(2,3,aa,i,k) = - tmp2 * fjac(2,3,i,k-1)
+     >              - tmp1 * njac(2,3,i,k-1)
+               lhs(2,4,aa,i,k) = - tmp2 * fjac(2,4,i,k-1)
+     >              - tmp1 * njac(2,4,i,k-1)
+               lhs(2,5,aa,i,k) = - tmp2 * fjac(2,5,i,k-1)
+     >              - tmp1 * njac(2,5,i,k-1)
+
+               lhs(3,1,aa,i,k) = - tmp2 * fjac(3,1,i,k-1)
+     >              - tmp1 * njac(3,1,i,k-1)
+               lhs(3,2,aa,i,k) = - tmp2 * fjac(3,2,i,k-1)
+     >              - tmp1 * njac(3,2,i,k-1)
+               lhs(3,3,aa,i,k) = - tmp2 * fjac(3,3,i,k-1)
+     >              - tmp1 * njac(3,3,i,k-1)
+     >              - tmp1 * dz3 
+               lhs(3,4,aa,i,k) = - tmp2 * fjac(3,4,i,k-1)
+     >              - tmp1 * njac(3,4,i,k-1)
+               lhs(3,5,aa,i,k) = - tmp2 * fjac(3,5,i,k-1)
+     >              - tmp1 * njac(3,5,i,k-1)
+
+               lhs(4,1,aa,i,k) = - tmp2 * fjac(4,1,i,k-1)
+     >              - tmp1 * njac(4,1,i,k-1)
+               lhs(4,2,aa,i,k) = - tmp2 * fjac(4,2,i,k-1)
+     >              - tmp1 * njac(4,2,i,k-1)
+               lhs(4,3,aa,i,k) = - tmp2 * fjac(4,3,i,k-1)
+     >              - tmp1 * njac(4,3,i,k-1)
+               lhs(4,4,aa,i,k) = - tmp2 * fjac(4,4,i,k-1)
+     >              - tmp1 * njac(4,4,i,k-1)
+     >              - tmp1 * dz4
+               lhs(4,5,aa,i,k) = - tmp2 * fjac(4,5,i,k-1)
+     >              - tmp1 * njac(4,5,i,k-1)
+
+               lhs(5,1,aa,i,k) = - tmp2 * fjac(5,1,i,k-1)
+     >              - tmp1 * njac(5,1,i,k-1)
+               lhs(5,2,aa,i,k) = - tmp2 * fjac(5,2,i,k-1)
+     >              - tmp1 * njac(5,2,i,k-1)
+               lhs(5,3,aa,i,k) = - tmp2 * fjac(5,3,i,k-1)
+     >              - tmp1 * njac(5,3,i,k-1)
+               lhs(5,4,aa,i,k) = - tmp2 * fjac(5,4,i,k-1)
+     >              - tmp1 * njac(5,4,i,k-1)
+               lhs(5,5,aa,i,k) = - tmp2 * fjac(5,5,i,k-1)
+     >              - tmp1 * njac(5,5,i,k-1)
+     >              - tmp1 * dz5
+
+               lhs(1,1,bb,i,k) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(1,1,i,k)
+     >              + tmp1 * 2.0d+00 * dz1
+               lhs(1,2,bb,i,k) = tmp1 * 2.0d+00 * njac(1,2,i,k)
+               lhs(1,3,bb,i,k) = tmp1 * 2.0d+00 * njac(1,3,i,k)
+               lhs(1,4,bb,i,k) = tmp1 * 2.0d+00 * njac(1,4,i,k)
+               lhs(1,5,bb,i,k) = tmp1 * 2.0d+00 * njac(1,5,i,k)
+
+               lhs(2,1,bb,i,k) = tmp1 * 2.0d+00 * njac(2,1,i,k)
+               lhs(2,2,bb,i,k) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(2,2,i,k)
+     >              + tmp1 * 2.0d+00 * dz2
+               lhs(2,3,bb,i,k) = tmp1 * 2.0d+00 * njac(2,3,i,k)
+               lhs(2,4,bb,i,k) = tmp1 * 2.0d+00 * njac(2,4,i,k)
+               lhs(2,5,bb,i,k) = tmp1 * 2.0d+00 * njac(2,5,i,k)
+
+               lhs(3,1,bb,i,k) = tmp1 * 2.0d+00 * njac(3,1,i,k)
+               lhs(3,2,bb,i,k) = tmp1 * 2.0d+00 * njac(3,2,i,k)
+               lhs(3,3,bb,i,k) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(3,3,i,k)
+     >              + tmp1 * 2.0d+00 * dz3
+               lhs(3,4,bb,i,k) = tmp1 * 2.0d+00 * njac(3,4,i,k)
+               lhs(3,5,bb,i,k) = tmp1 * 2.0d+00 * njac(3,5,i,k)
+
+               lhs(4,1,bb,i,k) = tmp1 * 2.0d+00 * njac(4,1,i,k)
+               lhs(4,2,bb,i,k) = tmp1 * 2.0d+00 * njac(4,2,i,k)
+               lhs(4,3,bb,i,k) = tmp1 * 2.0d+00 * njac(4,3,i,k)
+               lhs(4,4,bb,i,k) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(4,4,i,k)
+     >              + tmp1 * 2.0d+00 * dz4
+               lhs(4,5,bb,i,k) = tmp1 * 2.0d+00 * njac(4,5,i,k)
+
+               lhs(5,1,bb,i,k) = tmp1 * 2.0d+00 * njac(5,1,i,k)
+               lhs(5,2,bb,i,k) = tmp1 * 2.0d+00 * njac(5,2,i,k)
+               lhs(5,3,bb,i,k) = tmp1 * 2.0d+00 * njac(5,3,i,k)
+               lhs(5,4,bb,i,k) = tmp1 * 2.0d+00 * njac(5,4,i,k)
+               lhs(5,5,bb,i,k) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(5,5,i,k) 
+     >              + tmp1 * 2.0d+00 * dz5
+
+               lhs(1,1,cc,i,k) =  tmp2 * fjac(1,1,i,k+1)
+     >              - tmp1 * njac(1,1,i,k+1)
+     >              - tmp1 * dz1
+               lhs(1,2,cc,i,k) =  tmp2 * fjac(1,2,i,k+1)
+     >              - tmp1 * njac(1,2,i,k+1)
+               lhs(1,3,cc,i,k) =  tmp2 * fjac(1,3,i,k+1)
+     >              - tmp1 * njac(1,3,i,k+1)
+               lhs(1,4,cc,i,k) =  tmp2 * fjac(1,4,i,k+1)
+     >              - tmp1 * njac(1,4,i,k+1)
+               lhs(1,5,cc,i,k) =  tmp2 * fjac(1,5,i,k+1)
+     >              - tmp1 * njac(1,5,i,k+1)
+
+               lhs(2,1,cc,i,k) =  tmp2 * fjac(2,1,i,k+1)
+     >              - tmp1 * njac(2,1,i,k+1)
+               lhs(2,2,cc,i,k) =  tmp2 * fjac(2,2,i,k+1)
+     >              - tmp1 * njac(2,2,i,k+1)
+     >              - tmp1 * dz2
+               lhs(2,3,cc,i,k) =  tmp2 * fjac(2,3,i,k+1)
+     >              - tmp1 * njac(2,3,i,k+1)
+               lhs(2,4,cc,i,k) =  tmp2 * fjac(2,4,i,k+1)
+     >              - tmp1 * njac(2,4,i,k+1)
+               lhs(2,5,cc,i,k) =  tmp2 * fjac(2,5,i,k+1)
+     >              - tmp1 * njac(2,5,i,k+1)
+
+               lhs(3,1,cc,i,k) =  tmp2 * fjac(3,1,i,k+1)
+     >              - tmp1 * njac(3,1,i,k+1)
+               lhs(3,2,cc,i,k) =  tmp2 * fjac(3,2,i,k+1)
+     >              - tmp1 * njac(3,2,i,k+1)
+               lhs(3,3,cc,i,k) =  tmp2 * fjac(3,3,i,k+1)
+     >              - tmp1 * njac(3,3,i,k+1)
+     >              - tmp1 * dz3
+               lhs(3,4,cc,i,k) =  tmp2 * fjac(3,4,i,k+1)
+     >              - tmp1 * njac(3,4,i,k+1)
+               lhs(3,5,cc,i,k) =  tmp2 * fjac(3,5,i,k+1)
+     >              - tmp1 * njac(3,5,i,k+1)
+
+               lhs(4,1,cc,i,k) =  tmp2 * fjac(4,1,i,k+1)
+     >              - tmp1 * njac(4,1,i,k+1)
+               lhs(4,2,cc,i,k) =  tmp2 * fjac(4,2,i,k+1)
+     >              - tmp1 * njac(4,2,i,k+1)
+               lhs(4,3,cc,i,k) =  tmp2 * fjac(4,3,i,k+1)
+     >              - tmp1 * njac(4,3,i,k+1)
+               lhs(4,4,cc,i,k) =  tmp2 * fjac(4,4,i,k+1)
+     >              - tmp1 * njac(4,4,i,k+1)
+     >              - tmp1 * dz4
+               lhs(4,5,cc,i,k) =  tmp2 * fjac(4,5,i,k+1)
+     >              - tmp1 * njac(4,5,i,k+1)
+
+               lhs(5,1,cc,i,k) =  tmp2 * fjac(5,1,i,k+1)
+     >              - tmp1 * njac(5,1,i,k+1)
+               lhs(5,2,cc,i,k) =  tmp2 * fjac(5,2,i,k+1)
+     >              - tmp1 * njac(5,2,i,k+1)
+               lhs(5,3,cc,i,k) =  tmp2 * fjac(5,3,i,k+1)
+     >              - tmp1 * njac(5,3,i,k+1)
+               lhs(5,4,cc,i,k) =  tmp2 * fjac(5,4,i,k+1)
+     >              - tmp1 * njac(5,4,i,k+1)
+               lhs(5,5,cc,i,k) =  tmp2 * fjac(5,5,i,k+1)
+     >              - tmp1 * njac(5,5,i,k+1)
+     >              - tmp1 * dz5
+
+            enddo
+         enddo
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     performs Gaussian elimination on this cell.
+c     
+c     assumes that unpacking routines for non-first cells 
+c     preload C' and rhs' from previous cell.
+c     
+c     assumed send happens outside this routine, but that
+c     c'(KMAX) and rhs'(KMAX) will be sent to next cell.
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     outer most do loops - sweeping in i direction
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     multiply c(i,j,0) by b_inverse and copy back to c
+c     multiply rhs(0) by b_inverse(0) and copy to rhs
+c---------------------------------------------------------------------
+!dir$ ivdep
+         do i = 1, grid_points(1)-2
+            call binvcrhs( lhs(1,1,bb,i,0),
+     >                        lhs(1,1,cc,i,0),
+     >                        rhs(1,i,j,0) )
+         enddo
+
+
+c---------------------------------------------------------------------
+c     begin inner most do loop
+c     do all the elements of the cell unless last 
+c---------------------------------------------------------------------
+         do k=1,ksize-1
+!dir$ ivdep
+            do i = 1, grid_points(1)-2
+
+c---------------------------------------------------------------------
+c     subtract A*lhs_vector(k-1) from lhs_vector(k)
+c     
+c     rhs(k) = rhs(k) - A*rhs(k-1)
+c---------------------------------------------------------------------
+               call matvec_sub(lhs(1,1,aa,i,k),
+     >                         rhs(1,i,j,k-1),rhs(1,i,j,k))
+
+c---------------------------------------------------------------------
+c     B(k) = B(k) - C(k-1)*A(k)
+c     call matmul_sub(aa,i,j,k,c,cc,i,j,k-1,c,bb,i,j,k)
+c---------------------------------------------------------------------
+               call matmul_sub(lhs(1,1,aa,i,k),
+     >                         lhs(1,1,cc,i,k-1),
+     >                         lhs(1,1,bb,i,k))
+
+c---------------------------------------------------------------------
+c     multiply c(i,j,k) by b_inverse and copy back to c
+c     multiply rhs(i,j,1) by b_inverse(i,j,1) and copy to rhs
+c---------------------------------------------------------------------
+               call binvcrhs( lhs(1,1,bb,i,k),
+     >                        lhs(1,1,cc,i,k),
+     >                        rhs(1,i,j,k) )
+
+            enddo
+         enddo
+
+c---------------------------------------------------------------------
+c     Now finish up special cases for last cell
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     rhs(ksize) = rhs(ksize) - A*rhs(ksize-1)
+c---------------------------------------------------------------------
+!dir$ ivdep
+         do i = 1, grid_points(1)-2
+            call matvec_sub(lhs(1,1,aa,i,ksize),
+     >                         rhs(1,i,j,ksize-1),rhs(1,i,j,ksize))
+
+c---------------------------------------------------------------------
+c     B(ksize) = B(ksize) - C(ksize-1)*A(ksize)
+c     call matmul_sub(aa,i,j,ksize,c,
+c     $              cc,i,j,ksize-1,c,bb,i,j,ksize)
+c---------------------------------------------------------------------
+            call matmul_sub(lhs(1,1,aa,i,ksize),
+     >                         lhs(1,1,cc,i,ksize-1),
+     >                         lhs(1,1,bb,i,ksize))
+
+c---------------------------------------------------------------------
+c     multiply rhs(ksize) by b_inverse(ksize) and copy to rhs
+c---------------------------------------------------------------------
+            call binvrhs( lhs(1,1,bb,i,ksize),
+     >                       rhs(1,i,j,ksize) )
+         enddo
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     back solve: if last cell, then generate U(ksize)=rhs(ksize)
+c     else assume U(ksize) is loaded in un pack backsub_info
+c     so just use it
+c     after call u(kstart) will be sent to next cell
+c---------------------------------------------------------------------
+
+         do k=ksize-1,0,-1
+            do i = 1, grid_points(1)-2
+               do m=1,BLOCK_SIZE
+                  do n=1,BLOCK_SIZE
+                     rhs(m,i,j,k) = rhs(m,i,j,k) 
+     >                    - lhs(m,n,cc,i,k)*rhs(n,i,j,k+1)
+                  enddo
+               enddo
+            enddo
+         enddo
+
+      enddo
+      if (timeron) call timer_stop(t_zsolve)
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/CG/Makefile b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/CG/Makefile
new file mode 100644
index 0000000..7120dcc
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/CG/Makefile
@@ -0,0 +1,29 @@
+SHELL=/bin/sh
+BENCHMARK=cg
+BENCHMARKU=CG
+
+include ../config/make.def
+
+OBJS = cg.o ${COMMON}/print_results.o  \
+       ${COMMON}/${RAND}.o ${COMMON}/timers.o ${COMMON}/wtime.o
+ifeq (${HOOKS}, 1)
+        OBJS += ${COMMON}/hooks.o ${COMMON}/m5op_x86.o ${COMMON}/m5_mmap.o
+endif
+
+
+include ../sys/make.common
+
+${PROGRAM}: config ${OBJS}
+	${FLINK} ${FLINKFLAGS} -no-pie -o ${PROGRAM} ${OBJS} ${F_LIB}
+
+cg.o:		cg.f  globals.h npbparams.h
+ifeq (${HOOKS}, 1)
+	        ${FCOMPILE} -DHOOKS cg.f
+else
+	        ${FCOMPILE} cg.f
+endif
+
+clean:
+	- rm -f *.o *~
+	- rm -f npbparams.h core
+	- if [ -d rii_files ]; then rm -r rii_files; fi
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/CG/README.carefully b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/CG/README.carefully
new file mode 100644
index 0000000..cdcc366
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/CG/README.carefully
@@ -0,0 +1,16 @@
+Note: please observe that in the routine conj_grad three 
+implementations of the sparse matrix-vector multiply have
+been supplied.  The default matrix-vector multiply is not
+loop unrolled.  The alternate implementations are unrolled
+to a depth of 2 and unrolled to a depth of 8.  Please
+experiment with these to find the fastest for your particular
+architecture.  If reporting timing results, any of these three may
+be used without penalty.
+
+Performance examples:
+The non-unrolled version of the multiply is actually (slightly: 
+maybe 5%) faster on the sp2-66MHz-WN on 16 nodes than is the 
+unrolled-by-2 version below.   On the Cray t3d, the reverse is true, 
+i.e., the unrolled-by-two version is some 10% faster.  
+The unrolled-by-8 version below is significantly faster
+on the Cray t3d - overall speed of code is 1.5 times faster.
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/CG/cg.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/CG/cg.f
new file mode 100644
index 0000000..b91ad5f
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/CG/cg.f
@@ -0,0 +1,1162 @@
+!-------------------------------------------------------------------------!
+!                                                                         !
+!        N  A  S     P A R A L L E L     B E N C H M A R K S  3.3         !
+!                                                                         !
+!                       O p e n M P     V E R S I O N                     !
+!                                                                         !
+!                                   C G                                   !
+!                                                                         !
+!-------------------------------------------------------------------------!
+!                                                                         !
+!    This benchmark is an OpenMP version of the NPB CG code.              !
+!    It is described in NAS Technical Report 99-011.                      !
+!                                                                         !
+!    Permission to use, copy, distribute and modify this software         !
+!    for any purpose with or without fee is hereby granted.  We           !
+!    request, however, that all derived work reference the NAS            !
+!    Parallel Benchmarks 3.3. This software is provided "as is"           !
+!    without express or implied warranty.                                 !
+!                                                                         !
+!    Information on NPB 3.3, including the technical report, the          !
+!    original specifications, source code, results and information        !
+!    on how to submit new results, is available at:                       !
+!                                                                         !
+!           http://www.nas.nasa.gov/Software/NPB/                         !
+!                                                                         !
+!    Send comments or suggestions to  npb@nas.nasa.gov                    !
+!                                                                         !
+!          NAS Parallel Benchmarks Group                                  !
+!          NASA Ames Research Center                                      !
+!          Mail Stop: T27A-1                                              !
+!          Moffett Field, CA   94035-1000                                 !
+!                                                                         !
+!          E-mail:  npb@nas.nasa.gov                                      !
+!          Fax:     (650) 604-3957                                        !
+!                                                                         !
+!-------------------------------------------------------------------------!
+
+
+c---------------------------------------------------------------------
+c
+c Authors: M. Yarrow
+c          C. Kuszmaul
+c          H. Jin
+c
+c---------------------------------------------------------------------
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      program cg
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+
+      implicit none
+
+      include 'globals.h'
+
+
+      common / main_int_mem /  colidx,     rowstr,
+     >                         iv,         arow,     acol
+      integer                  colidx(nz), rowstr(na+1),
+     >                         iv(nz+na),  arow(na), acol(naz)
+
+
+      common / main_flt_mem /  v,     aelt,     a,
+     >                         x,
+     >                         z,
+     >                         p,
+     >                         q,
+     >                         r
+      double precision         v(nz), aelt(naz), a(nz),
+     >                         x(na+2),
+     >                         z(na+2),
+     >                         p(na+2),
+     >                         q(na+2),
+     >                         r(na+2)
+
+
+
+      integer            i, j, k, it
+
+      double precision   zeta, randlc
+      external           randlc
+      double precision   rnorm
+      double precision   norm_temp1,norm_temp2
+
+      double precision   t, mflops, tmax
+      character          class
+      logical            verified
+      double precision   zeta_verify_value, epsilon, err
+
+      integer   fstatus
+      character t_names(t_last)*8
+!$    integer   omp_get_max_threads
+!$    external  omp_get_max_threads
+
+      do i = 1, T_last
+         call timer_clear( i )
+      end do
+
+      open(unit=2, file='timer.flag', status='old', iostat=fstatus)
+      if (fstatus .eq. 0) then
+         timeron = .true.
+         t_names(t_init) = 'init'
+         t_names(t_bench) = 'benchmk'
+         t_names(t_conj_grad) = 'conjgd'
+         close(2)
+      else
+         timeron = .false.
+      endif
+
+      call timer_start( T_init )
+
+      firstrow = 1
+      lastrow  = na
+      firstcol = 1
+      lastcol  = na
+
+
+      if( na .eq. 1400 .and. 
+     &    nonzer .eq. 7 .and. 
+     &    niter .eq. 15 .and.
+     &    shift .eq. 10.d0 ) then
+         class = 'S'
+         zeta_verify_value = 8.5971775078648d0
+      else if( na .eq. 7000 .and. 
+     &         nonzer .eq. 8 .and. 
+     &         niter .eq. 15 .and.
+     &         shift .eq. 12.d0 ) then
+         class = 'W'
+         zeta_verify_value = 10.362595087124d0
+      else if( na .eq. 14000 .and. 
+     &         nonzer .eq. 11 .and. 
+     &         niter .eq. 15 .and.
+     &         shift .eq. 20.d0 ) then
+         class = 'A'
+         zeta_verify_value = 17.130235054029d0
+      else if( na .eq. 75000 .and. 
+     &         nonzer .eq. 13 .and. 
+     &         niter .eq. 75 .and.
+     &         shift .eq. 60.d0 ) then
+         class = 'B'
+         zeta_verify_value = 22.712745482631d0
+      else if( na .eq. 150000 .and. 
+     &         nonzer .eq. 15 .and. 
+     &         niter .eq. 75 .and.
+     &         shift .eq. 110.d0 ) then
+         class = 'C'
+         zeta_verify_value = 28.973605592845d0
+      else if( na .eq. 1500000 .and. 
+     &         nonzer .eq. 21 .and. 
+     &         niter .eq. 100 .and.
+     &         shift .eq. 500.d0 ) then
+         class = 'D'
+         zeta_verify_value = 52.514532105794d0
+      else if( na .eq. 9000000 .and. 
+     &         nonzer .eq. 26 .and. 
+     &         niter .eq. 100 .and.
+     &         shift .eq. 1.5d3 ) then
+         class = 'E'
+         zeta_verify_value = 77.522164599383d0
+      else
+         class = 'U'
+      endif
+
+      write( *,1000 ) 
+      write( *,1001 ) na
+      write( *,1002 ) niter
+!$    write( *,1003 ) omp_get_max_threads()
+      write( *,* )
+ 1000 format(//,' NAS Parallel Benchmarks (NPB3.3-OMP)',
+     >          ' - CG Benchmark', /)
+ 1001 format(' Size: ', i11 )
+ 1002 format(' Iterations:                  ', i5 )
+ 1003 format(' Number of available threads: ', i5)
+
+      naa = na
+      nzz = nz
+
+
+c---------------------------------------------------------------------
+c  Initialize random number generator
+c---------------------------------------------------------------------
+!$omp parallel default(shared) private(i,j,k,zeta)
+      tran    = 314159265.0D0
+      amult   = 1220703125.0D0
+      zeta    = randlc( tran, amult )
+
+c---------------------------------------------------------------------
+c  
+c---------------------------------------------------------------------
+      call makea(naa, nzz, a, colidx, rowstr, 
+     >           firstrow, lastrow, firstcol, lastcol, 
+     >           arow, acol, aelt, v, iv)
+!$omp barrier
+
+
+c---------------------------------------------------------------------
+c  Note: as a result of the above call to makea:
+c        values of j used in indexing rowstr go from 1 --> lastrow-firstrow+1
+c        values of colidx which are col indexes go from firstcol --> lastcol
+c        So:
+c        Shift the col index vals from actual (firstcol --> lastcol ) 
+c        to local, i.e., (1 --> lastcol-firstcol+1)
+c---------------------------------------------------------------------
+!$omp do
+      do j=1,lastrow-firstrow+1
+         do k=rowstr(j),rowstr(j+1)-1
+            colidx(k) = colidx(k) - firstcol + 1
+         enddo
+      enddo
+!$omp end do nowait
+
+c---------------------------------------------------------------------
+c  set starting vector to (1, 1, .... 1)
+c---------------------------------------------------------------------
+!$omp do
+      do i = 1, na+1
+         x(i) = 1.0D0
+      enddo
+!$omp end do nowait
+!$omp do
+      do j=1, lastcol-firstcol+1
+         q(j) = 0.0d0
+         z(j) = 0.0d0
+         r(j) = 0.0d0
+         p(j) = 0.0d0
+      enddo
+!$omp end do nowait
+!$omp end parallel
+
+      zeta  = 0.0d0
+
+c---------------------------------------------------------------------
+c---->
+c  Do one iteration untimed to init all code and data page tables
+c---->                    (then reinit, start timing, to niter its)
+c---------------------------------------------------------------------
+      do it = 1, 1
+
+c---------------------------------------------------------------------
+c  The call to the conjugate gradient routine:
+c---------------------------------------------------------------------
+         call conj_grad ( colidx,
+     >                    rowstr,
+     >                    x,
+     >                    z,
+     >                    a,
+     >                    p,
+     >                    q,
+     >                    r,
+     >                    rnorm )
+
+c---------------------------------------------------------------------
+c  zeta = shift + 1/(x.z)
+c  So, first: (x.z)
+c  Also, find norm of z
+c  So, first: (z.z)
+c---------------------------------------------------------------------
+         norm_temp1 = 0.0d0
+         norm_temp2 = 0.0d0
+!$omp parallel do default(shared) private(j)
+!$omp& reduction(+:norm_temp1,norm_temp2)
+         do j=1, lastcol-firstcol+1
+            norm_temp1 = norm_temp1 + x(j)*z(j)
+            norm_temp2 = norm_temp2 + z(j)*z(j)
+         enddo
+!$omp end parallel do
+
+         norm_temp2 = 1.0d0 / sqrt( norm_temp2 )
+
+
+c---------------------------------------------------------------------
+c  Normalize z to obtain x
+c---------------------------------------------------------------------
+!$omp parallel do default(shared) private(j)
+         do j=1, lastcol-firstcol+1      
+            x(j) = norm_temp2*z(j)    
+         enddo                           
+!$omp end parallel do
+
+
+      enddo                              ! end of do one iteration untimed
+
+
+c---------------------------------------------------------------------
+c  set starting vector to (1, 1, .... 1)
+c---------------------------------------------------------------------
+c
+c  
+c
+!$omp parallel do default(shared) private(i)
+      do i = 1, na+1
+         x(i) = 1.0D0
+      enddo
+!$omp end parallel do
+
+      zeta  = 0.0d0
+
+      call timer_stop( T_init )
+
+      write (*, 2000) timer_read(T_init)
+ 2000 format(' Initialization time = ',f15.3,' seconds')
+
+#ifdef HOOKS
+      call roi_begin
+#endif
+      call timer_start( T_bench )
+
+c---------------------------------------------------------------------
+c---->
+c  Main Iteration for inverse power method
+c---->
+c---------------------------------------------------------------------
+      do it = 1, niter
+
+c---------------------------------------------------------------------
+c  The call to the conjugate gradient routine:
+c---------------------------------------------------------------------
+         if ( timeron ) call timer_start( T_conj_grad )
+         call conj_grad ( colidx,
+     >                    rowstr,
+     >                    x,
+     >                    z,
+     >                    a,
+     >                    p,
+     >                    q,
+     >                    r,
+     >                    rnorm )
+         if ( timeron ) call timer_stop( T_conj_grad )
+
+
+c---------------------------------------------------------------------
+c  zeta = shift + 1/(x.z)
+c  So, first: (x.z)
+c  Also, find norm of z
+c  So, first: (z.z)
+c---------------------------------------------------------------------
+         norm_temp1 = 0.0d0
+         norm_temp2 = 0.0d0
+!$omp parallel do default(shared) private(j)
+!$omp& reduction(+:norm_temp1,norm_temp2)
+         do j=1, lastcol-firstcol+1
+            norm_temp1 = norm_temp1 + x(j)*z(j)
+            norm_temp2 = norm_temp2 + z(j)*z(j)
+         enddo
+!$omp end parallel do
+
+
+         norm_temp2 = 1.0d0 / sqrt( norm_temp2 )
+
+
+         zeta = shift + 1.0d0 / norm_temp1
+         if( it .eq. 1 ) write( *,9000 )
+         write( *,9001 ) it, rnorm, zeta
+
+ 9000    format( /,'   iteration           ||r||                 zeta' )
+ 9001    format( 4x, i5, 7x, e20.14, f20.13 )
+
+c---------------------------------------------------------------------
+c  Normalize z to obtain x
+c---------------------------------------------------------------------
+!$omp parallel do default(shared) private(j)
+         do j=1, lastcol-firstcol+1      
+            x(j) = norm_temp2*z(j)    
+         enddo                           
+!$omp end parallel do
+
+
+      enddo                              ! end of main iter inv pow meth
+
+      call timer_stop( T_bench )
+
+c---------------------------------------------------------------------
+c  End of timed section
+c---------------------------------------------------------------------
+#ifdef HOOKS
+      call roi_end
+#endif
+
+      t = timer_read( T_bench )
+
+
+      write(*,100)
+ 100  format(' Benchmark completed ')
+
+      epsilon = 1.d-10
+      if (class .ne. 'U') then
+
+c         err = abs( zeta - zeta_verify_value)
+         err = abs( zeta - zeta_verify_value )/zeta_verify_value
+         if( err .le. epsilon ) then
+            verified = .TRUE.
+            write(*, 200)
+            write(*, 201) zeta
+            write(*, 202) err
+ 200        format(' VERIFICATION SUCCESSFUL ')
+ 201        format(' Zeta is    ', E20.13)
+ 202        format(' Error is   ', E20.13)
+         else
+            verified = .FALSE.
+            write(*, 300) 
+            write(*, 301) zeta
+            write(*, 302) zeta_verify_value
+ 300        format(' VERIFICATION FAILED')
+ 301        format(' Zeta                ', E20.13)
+ 302        format(' The correct zeta is ', E20.13)
+         endif
+      else
+         verified = .FALSE.
+         write (*, 400)
+         write (*, 401)
+         write (*, 201) zeta
+ 400     format(' Problem size unknown')
+ 401     format(' NO VERIFICATION PERFORMED')
+      endif
+
+
+      if( t .ne. 0. ) then
+         mflops = float( 2*niter*na )
+     &               * ( 3.+float( nonzer*(nonzer+1) )
+     &                 + 25.*(5.+float( nonzer*(nonzer+1) ))
+     &                 + 3. ) / t / 1000000.0
+      else
+         mflops = 0.0
+      endif
+
+
+         call print_results('CG', class, na, 0, 0,
+     >                      niter, t,
+     >                      mflops, '          floating point', 
+     >                      verified, npbversion, compiletime,
+     >                      cs1, cs2, cs3, cs4, cs5, cs6, cs7)
+
+
+
+ 600  format( i4, 2e19.12)
+
+
+c---------------------------------------------------------------------
+c      More timers
+c---------------------------------------------------------------------
+      if (.not.timeron) goto 999
+
+      tmax = timer_read(T_bench)
+      if (tmax .eq. 0.0) tmax = 1.0
+
+      write(*,800)
+ 800  format('  SECTION   Time (secs)')
+      do i=1, t_last
+         t = timer_read(i)
+         if (i.eq.t_init) then
+            write(*,810) t_names(i), t
+         else
+            write(*,810) t_names(i), t, t*100./tmax
+            if (i.eq.t_conj_grad) then
+               t = tmax - t
+               write(*,820) 'rest', t, t*100./tmax
+            endif
+         endif
+ 810     format(2x,a8,':',f9.3:'  (',f6.2,'%)')
+ 820     format('    --> ',a8,':',f9.3,'  (',f6.2,'%)')
+      end do
+
+ 999  continue
+
+
+      end                              ! end main
+
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      subroutine conj_grad ( colidx,
+     >                       rowstr,
+     >                       x,
+     >                       z,
+     >                       a,
+     >                       p,
+     >                       q,
+     >                       r,
+     >                       rnorm )
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c  Floating point arrays here are named as in NPB1 spec discussion of 
+c  CG algorithm
+c---------------------------------------------------------------------
+ 
+      implicit none
+
+
+      include 'globals.h'
+
+
+      double precision   x(*),
+     >                   z(*),
+     >                   a(nzz)
+      integer            colidx(nzz), rowstr(naa+1)
+
+      double precision   p(*),
+     >                   q(*),
+     >                   r(*)
+
+
+      integer   j, k
+      integer   cgit, cgitmax
+
+      double precision   d, sum, rho, rho0, alpha, beta, rnorm, suml
+
+      data      cgitmax / 25 /
+
+
+      rho = 0.0d0
+      sum = 0.0d0
+
+!$omp parallel default(shared) private(j,k,cgit,suml,alpha,beta)
+!$omp&  shared(d,rho0,rho,sum)
+
+c---------------------------------------------------------------------
+c  Initialize the CG algorithm:
+c---------------------------------------------------------------------
+!$omp do
+      do j=1,naa+1
+         q(j) = 0.0d0
+         z(j) = 0.0d0
+         r(j) = x(j)
+         p(j) = r(j)
+      enddo
+!$omp end do
+
+
+c---------------------------------------------------------------------
+c  rho = r.r
+c  Now, obtain the norm of r: First, sum squares of r elements locally...
+c---------------------------------------------------------------------
+!$omp do reduction(+:rho)
+      do j=1, lastcol-firstcol+1
+         rho = rho + r(j)*r(j)
+      enddo
+!$omp end do
+
+c---------------------------------------------------------------------
+c---->
+c  The conj grad iteration loop
+c---->
+c---------------------------------------------------------------------
+      do cgit = 1, cgitmax
+
+!$omp master
+c---------------------------------------------------------------------
+c  Save a temporary of rho and initialize reduction variables
+c---------------------------------------------------------------------
+         rho0 = rho
+         d = 0.d0
+         rho = 0.d0
+!$omp end master
+!$omp barrier
+
+c---------------------------------------------------------------------
+c  q = A.p
+c  The partition submatrix-vector multiply: use workspace w
+c---------------------------------------------------------------------
+C
+C  NOTE: this version of the multiply is actually (slightly: maybe 5%) 
+C        faster on the sp2 on 16 nodes than is the unrolled-by-2 version 
+C        below.   On the Cray t3d, the reverse is true, i.e., the 
+C        unrolled-by-two version is some 10% faster.  
+C        The unrolled-by-8 version below is significantly faster
+C        on the Cray t3d - overall speed of code is 1.5 times faster.
+C
+!$omp do
+         do j=1,lastrow-firstrow+1
+            suml = 0.d0
+            do k=rowstr(j),rowstr(j+1)-1
+               suml = suml + a(k)*p(colidx(k))
+            enddo
+            q(j) = suml
+         enddo
+!$omp end do
+
+CC          do j=1,lastrow-firstrow+1
+CC             i = rowstr(j) 
+CC             iresidue = mod( rowstr(j+1)-i, 2 )
+CC             sum1 = 0.d0
+CC             sum2 = 0.d0
+CC             if( iresidue .eq. 1 )
+CC      &          sum1 = sum1 + a(i)*p(colidx(i))
+CC             do k=i+iresidue, rowstr(j+1)-2, 2
+CC                sum1 = sum1 + a(k)  *p(colidx(k))
+CC                sum2 = sum2 + a(k+1)*p(colidx(k+1))
+CC             enddo
+CC             q(j) = sum1 + sum2
+CC          enddo
+
+CC          do j=1,lastrow-firstrow+1
+CC             i = rowstr(j) 
+CC             iresidue = mod( rowstr(j+1)-i, 8 )
+CC             suml = 0.d0
+CC             do k=i,i+iresidue-1
+CC                suml = suml +  a(k)*p(colidx(k))
+CC             enddo
+CC             do k=i+iresidue, rowstr(j+1)-8, 8
+CC                suml = suml + a(k  )*p(colidx(k  ))
+CC      &                   + a(k+1)*p(colidx(k+1))
+CC      &                   + a(k+2)*p(colidx(k+2))
+CC      &                   + a(k+3)*p(colidx(k+3))
+CC      &                   + a(k+4)*p(colidx(k+4))
+CC      &                   + a(k+5)*p(colidx(k+5))
+CC      &                   + a(k+6)*p(colidx(k+6))
+CC      &                   + a(k+7)*p(colidx(k+7))
+CC             enddo
+CC             q(j) = suml
+CC          enddo
+            
+
+
+c---------------------------------------------------------------------
+c  Obtain p.q
+c---------------------------------------------------------------------
+!$omp do reduction(+:d)
+         do j=1, lastcol-firstcol+1
+            d = d + p(j)*q(j)
+         enddo
+!$omp end do
+
+
+c---------------------------------------------------------------------
+c  Obtain alpha = rho / (p.q)
+c---------------------------------------------------------------------
+         alpha = rho0 / d
+
+c---------------------------------------------------------------------
+c  Obtain z = z + alpha*p
+c  and    r = r - alpha*q
+c---------------------------------------------------------------------
+!$omp do reduction(+:rho)
+         do j=1, lastcol-firstcol+1
+            z(j) = z(j) + alpha*p(j)
+            r(j) = r(j) - alpha*q(j)
+c         enddo
+            
+c---------------------------------------------------------------------
+c  rho = r.r
+c  Now, obtain the norm of r: First, sum squares of r elements locally...
+c---------------------------------------------------------------------
+c         do j=1, lastcol-firstcol+1
+            rho = rho + r(j)*r(j)
+         enddo
+!$omp end do
+
+c---------------------------------------------------------------------
+c  Obtain beta:
+c---------------------------------------------------------------------
+         beta = rho / rho0
+
+c---------------------------------------------------------------------
+c  p = r + beta*p
+c---------------------------------------------------------------------
+!$omp do
+         do j=1, lastcol-firstcol+1
+            p(j) = r(j) + beta*p(j)
+         enddo
+!$omp end do
+
+
+      enddo                             ! end of do cgit=1,cgitmax
+
+
+c---------------------------------------------------------------------
+c  Compute residual norm explicitly:  ||r|| = ||x - A.z||
+c  First, form A.z
+c  The partition submatrix-vector multiply
+c---------------------------------------------------------------------
+!$omp do
+      do j=1,lastrow-firstrow+1
+         suml = 0.d0
+         do k=rowstr(j),rowstr(j+1)-1
+            suml = suml + a(k)*z(colidx(k))
+         enddo
+         r(j) = suml
+      enddo
+!$omp end do
+
+
+c---------------------------------------------------------------------
+c  At this point, r contains A.z
+c---------------------------------------------------------------------
+!$omp do reduction(+:sum)
+      do j=1, lastcol-firstcol+1
+         suml = x(j) - r(j)         
+         sum  = sum + suml*suml
+      enddo
+!$omp end do nowait
+!$omp end parallel
+
+      rnorm = sqrt( sum )
+
+
+
+      return
+      end                               ! end of routine conj_grad
+
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      subroutine makea( n, nz, a, colidx, rowstr, 
+     >                  firstrow, lastrow, firstcol, lastcol,
+     >                  arow, acol, aelt, v, iv )
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit            none
+      include             'npbparams.h'
+
+      integer             n, nz
+      integer             firstrow, lastrow, firstcol, lastcol
+      integer             colidx(nz), rowstr(n+1)
+      integer             iv(n+nz), arow(n), acol(nonzer+1,n)
+      double precision    aelt(nonzer+1,n), v(nz)
+      double precision    a(nz)
+
+c---------------------------------------------------------------------
+c       generate the test problem for benchmark 6
+c       makea generates a sparse matrix with a
+c       prescribed sparsity distribution
+c
+c       parameter    type        usage
+c
+c       input
+c
+c       n            i           number of cols/rows of matrix
+c       nz           i           nonzeros as declared array size
+c       rcond        r*8         condition number
+c       shift        r*8         main diagonal shift
+c
+c       output
+c
+c       a            r*8         array for nonzeros
+c       colidx       i           col indices
+c       rowstr       i           row pointers
+c
+c       workspace
+c
+c       iv, arow, acol i
+c       v, aelt        r*8
+c---------------------------------------------------------------------
+
+      integer          i, iouter, ivelt, nzv, nn1
+      integer          ivc(nonzer+1)
+      double precision vc(nonzer+1)
+
+      integer          myid, num_threads, ilow, ihigh
+      common /tinfo/   myid, num_threads, ilow, ihigh
+!$omp threadprivate (/tinfo/)
+
+c---------------------------------------------------------------------
+c      nonzer is approximately  (int(sqrt(nnza /n)));
+c---------------------------------------------------------------------
+
+      external          sparse, sprnvc, vecset
+!$    integer           omp_get_num_threads, omp_get_thread_num
+!$    external          omp_get_num_threads, omp_get_thread_num
+      integer           max_threads
+      parameter        (max_threads=1024)
+      integer           work, last_n(0:max_threads)
+      save              last_n
+
+c---------------------------------------------------------------------
+c    nn1 is the smallest power of two not less than n
+c---------------------------------------------------------------------
+
+      nn1 = 1
+ 50   continue
+        nn1 = 2 * nn1
+        if (nn1 .lt. n) goto 50
+
+c---------------------------------------------------------------------
+c  Generate nonzero positions and save for the use in sparse.
+c---------------------------------------------------------------------
+      num_threads = 1
+!$    num_threads = omp_get_num_threads()
+      myid = 0
+!$    myid  = omp_get_thread_num()
+      if (num_threads .gt. max_threads) then
+         if (myid .eq. 0) write(*,100) num_threads, max_threads
+100      format(' Warning: num_threads',i6,
+     &          ' exceeded an internal limit',i6)
+         num_threads = max_threads
+      endif
+      work  = (n + num_threads - 1)/num_threads
+      ilow  = work * myid + 1
+      ihigh = ilow + work - 1
+      if (ihigh .gt. n) ihigh = n
+
+      do iouter = 1, ihigh
+         nzv = nonzer
+         call sprnvc( n, nzv, nn1, vc, ivc )
+         if ( iouter .ge. ilow ) then
+            call vecset( n, vc, ivc, nzv, iouter, .5D0 )
+            arow(iouter) = nzv
+            do ivelt = 1, nzv
+               acol(ivelt, iouter) = ivc(ivelt)
+               aelt(ivelt, iouter) = vc(ivelt)
+            enddo
+         endif
+      enddo
+!$omp barrier
+
+c---------------------------------------------------------------------
+c       ... make the sparse matrix from list of elements with duplicates
+c           (v and iv are used as  workspace)
+c---------------------------------------------------------------------
+      call sparse( a, colidx, rowstr, n, nz, nonzer, arow, acol, 
+     >             aelt, firstrow, lastrow, last_n, 
+     >             v, iv(1), iv(nz+1), rcond, shift )
+      return
+
+      end
+c-------end   of makea------------------------------
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      subroutine sparse( a, colidx, rowstr, n, nz, nonzer, arow, acol, 
+     >                   aelt, firstrow, lastrow, last_n, 
+     >                   v, iv, nzloc, rcond, shift )
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit           none
+      integer            colidx(*), rowstr(*), iv(*)
+      integer            firstrow, lastrow, last_n(0:*)
+      integer            n, nz, nonzer, arow(*), acol(nonzer+1,*)
+      double precision   a(*), aelt(nonzer+1,*), v(*), rcond, shift
+
+c---------------------------------------------------------------------
+c       rows range from firstrow to lastrow
+c       the rowstr pointers are defined for nrows = lastrow-firstrow+1 values
+c---------------------------------------------------------------------
+      integer            nzloc(n), nrows
+
+      integer          myid, num_threads, ilow, ihigh
+      common /tinfo/   myid, num_threads, ilow, ihigh
+!$omp threadprivate (/tinfo/)
+
+c---------------------------------------------------
+c       generate a sparse matrix from a list of
+c       [col, row, element] triples
+c---------------------------------------------------
+
+      integer            i, j, j1, j2, nza, k, kk, nzrow, jcol
+      double precision   xi, size, scale, ratio, va
+
+c---------------------------------------------------------------------
+c    how many rows of result
+c---------------------------------------------------------------------
+      nrows = lastrow - firstrow + 1
+      j1 = ilow + 1
+      j2 = ihigh + 1
+
+c---------------------------------------------------------------------
+c     ...count the number of triples in each row
+c---------------------------------------------------------------------
+      do j = j1, j2
+         rowstr(j) = 0
+      enddo
+
+      do i = 1, n
+         do nza = 1, arow(i)
+            j = acol(nza, i)
+            if (j.ge.ilow .and. j.le.ihigh) then
+               j = j + 1
+               rowstr(j) = rowstr(j) + arow(i)
+            endif
+         end do
+      end do
+
+      if (myid .eq. 0) then
+         rowstr(1) = 1
+         j1 = 1
+      endif
+      do j = j1+1, j2
+         rowstr(j) = rowstr(j) + rowstr(j-1)
+      enddo
+      if (myid .lt. num_threads) last_n(myid) = rowstr(j2)
+!$omp barrier
+
+      nzrow = 0
+      if (myid .lt. num_threads) then
+         do i = 0, myid-1
+            nzrow = nzrow + last_n(i)
+         end do
+      endif
+      if (nzrow .gt. 0) then
+         do j = j1, j2
+            rowstr(j) = rowstr(j) + nzrow
+         enddo
+      endif
+!$omp barrier
+      nza = rowstr(nrows+1) - 1
+
+c---------------------------------------------------------------------
+c     ... rowstr(j) now is the location of the first nonzero
+c           of row j of a
+c---------------------------------------------------------------------
+
+      if (nza .gt. nz) then
+!$omp master
+         write(*,*) 'Space for matrix elements exceeded in sparse'
+         write(*,*) 'nza, nzmax = ',nza, nz
+!$omp end master
+         stop
+      endif
+
+
+c---------------------------------------------------------------------
+c     ... preload data pages
+c---------------------------------------------------------------------
+      do j = ilow, ihigh
+         do k = rowstr(j), rowstr(j+1)-1
+             v(k) = 0.d0
+             iv(k) = 0
+         enddo
+         nzloc(j) = 0
+      enddo
+
+c---------------------------------------------------------------------
+c     ... generate actual values by summing duplicates
+c---------------------------------------------------------------------
+
+      size = 1.0D0
+      ratio = rcond ** (1.0D0 / dfloat(n))
+
+      do i = 1, n
+         do nza = 1, arow(i)
+            j = acol(nza, i)
+
+            if (j .lt. ilow .or. j .gt. ihigh) goto 60
+
+            scale = size * aelt(nza, i)
+            do nzrow = 1, arow(i)
+               jcol = acol(nzrow, i)
+               va = aelt(nzrow, i) * scale
+
+c---------------------------------------------------------------------
+c       ... add the identity * rcond to the generated matrix to bound
+c           the smallest eigenvalue from below by rcond
+c---------------------------------------------------------------------
+               if (jcol .eq. j .and. j .eq. i) then
+                  va = va + rcond - shift
+               endif
+
+               do k = rowstr(j), rowstr(j+1)-1
+                  if (iv(k) .gt. jcol) then
+c---------------------------------------------------------------------
+c       ... insert colidx here orderly
+c---------------------------------------------------------------------
+                     do kk = rowstr(j+1)-2, k, -1
+                        if (iv(kk) .gt. 0) then
+                           v(kk+1)  = v(kk)
+                           iv(kk+1) = iv(kk)
+                        endif
+                     enddo
+                     iv(k) = jcol
+                     v(k)  = 0.d0
+                     goto 40
+                  else if (iv(k) .eq. 0) then
+                     iv(k) = jcol
+                     goto 40
+                  else if (iv(k) .eq. jcol) then
+c---------------------------------------------------------------------
+c       ... mark the duplicated entry
+c---------------------------------------------------------------------
+                     nzloc(j) = nzloc(j) + 1
+                     goto 40
+                  endif
+               enddo
+               print *,'internal error in sparse: i=',i
+               stop
+   40          continue
+               v(k) = v(k) + va
+            enddo
+   60       continue
+         enddo
+         size = size * ratio
+      enddo
+!$omp barrier
+
+
+c---------------------------------------------------------------------
+c       ... remove empty entries and generate final results
+c---------------------------------------------------------------------
+      do j = ilow+1, ihigh
+         nzloc(j) = nzloc(j) + nzloc(j-1)
+      enddo
+      if (myid .lt. num_threads) last_n(myid) = nzloc(ihigh)
+!$omp barrier
+
+      nzrow = 0
+      if (myid .lt. num_threads) then
+         do i = 0, myid-1
+            nzrow = nzrow + last_n(i)
+         end do
+      endif
+      if (nzrow .gt. 0) then
+         do j = ilow, ihigh
+            nzloc(j) = nzloc(j) + nzrow
+         enddo
+      endif
+!$omp barrier
+
+!$omp do
+      do j = 1, nrows
+         if (j .gt. 1) then
+            j1 = rowstr(j) - nzloc(j-1)
+         else
+            j1 = 1
+         endif
+         j2 = rowstr(j+1) - nzloc(j) - 1
+         nza = rowstr(j)
+         do k = j1, j2
+            a(k) = v(nza)
+            colidx(k) = iv(nza)
+            nza = nza + 1
+         enddo
+      enddo
+!$omp end do
+!$omp do
+      do j = 2, nrows+1
+         rowstr(j) = rowstr(j) - nzloc(j-1)
+      enddo
+!$omp end do
+      nza = rowstr(nrows+1) - 1
+
+
+CC       write (*, 11000) nza
+      return
+11000   format ( //,'final nonzero count in sparse ',
+     1            /,'number of nonzeros       = ', i16 )
+      end
+c-------end   of sparse-----------------------------
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      subroutine sprnvc( n, nz, nn1, v, iv )
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit           none
+      double precision   v(*)
+      integer            n, nz, nn1, iv(*)
+      common /urando/    amult, tran
+      double precision   amult, tran
+!$omp threadprivate (/urando/)
+
+
+c---------------------------------------------------------------------
+c       generate a sparse n-vector (v, iv)
+c       having nzv nonzeros
+c
+c       mark(i) is set to 1 if position i is nonzero.
+c       mark is all zero on entry and is reset to all zero before exit
+c       this corrects a performance bug found by John G. Lewis, caused by
+c       reinitialization of mark on every one of the n calls to sprnvc
+c---------------------------------------------------------------------
+
+        integer            nzv, ii, i, icnvrt
+
+        external           randlc, icnvrt
+        double precision   randlc, vecelt, vecloc
+
+
+        nzv = 0
+
+100     continue
+        if (nzv .ge. nz) goto 110
+
+         vecelt = randlc( tran, amult )
+
+c---------------------------------------------------------------------
+c   generate an integer between 1 and n in a portable manner
+c---------------------------------------------------------------------
+         vecloc = randlc(tran, amult)
+         i = icnvrt(vecloc, nn1) + 1
+         if (i .gt. n) goto 100
+
+c---------------------------------------------------------------------
+c  was this integer generated already?
+c---------------------------------------------------------------------
+         do ii = 1, nzv
+            if (iv(ii) .eq. i) goto 100
+         enddo
+         nzv = nzv + 1
+         v(nzv) = vecelt
+         iv(nzv) = i
+         goto 100
+110     continue
+
+      return
+      end
+c-------end   of sprnvc-----------------------------
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      function icnvrt(x, ipwr2)
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit           none
+      double precision   x
+      integer            ipwr2, icnvrt
+
+c---------------------------------------------------------------------
+c    scale a double precision number x in (0,1) by a power of 2 and chop it
+c---------------------------------------------------------------------
+      icnvrt = int(ipwr2 * x)
+
+      return
+      end
+c-------end   of icnvrt-----------------------------
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      subroutine vecset(n, v, iv, nzv, i, val)
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit           none
+      integer            n, iv(*), nzv, i, k
+      double precision   v(*), val
+
+c---------------------------------------------------------------------
+c       set ith element of sparse vector (v, iv) with
+c       nzv nonzeros to val
+c---------------------------------------------------------------------
+
+      logical set
+
+      set = .false.
+      do k = 1, nzv
+         if (iv(k) .eq. i) then
+            v(k) = val
+            set  = .true.
+         endif
+      enddo
+      if (.not. set) then
+         nzv     = nzv + 1
+         v(nzv)  = val
+         iv(nzv) = i
+      endif
+      return
+      end
+c-------end   of vecset-----------------------------
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/CG/globals.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/CG/globals.h
new file mode 100644
index 0000000..313fd10
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/CG/globals.h
@@ -0,0 +1,106 @@
+      include 'npbparams.h'
+
+c---------------------------------------------------------------------
+c  Note: please observe that in the routine conj_grad three 
+c  implementations of the sparse matrix-vector multiply have
+c  been supplied.  The default matrix-vector multiply is not
+c  loop unrolled.  The alternate implementations are unrolled
+c  to a depth of 2 and unrolled to a depth of 8.  Please
+c  experiment with these to find the fastest for your particular
+c  architecture.  If reporting timing results, any of these three may
+c  be used without penalty.
+c---------------------------------------------------------------------
+
+
+c---------------------------------------------------------------------
+c  Class specific parameters: 
+c  They appear here for reference only.
+c  These are their values; however, this info is imported from the npbparams.h
+c  include file, which is written by the sys/setparams.c program.
+c---------------------------------------------------------------------
+
+C----------
+C  Class S:
+C----------
+CC       parameter( na=1400, 
+CC      >           nonzer=7, 
+CC      >           shift=10., 
+CC      >           niter=15,
+CC      >           rcond=1.0d-1 )
+C----------
+C  Class W:
+C----------
+CC       parameter( na=7000,
+CC      >           nonzer=8, 
+CC      >           shift=12., 
+CC      >           niter=15,
+CC      >           rcond=1.0d-1 )
+C----------
+C  Class A:
+C----------
+CC       parameter( na=14000,
+CC      >           nonzer=11, 
+CC      >           shift=20., 
+CC      >           niter=15,
+CC      >           rcond=1.0d-1 )
+C----------
+C  Class B:
+C----------
+CC       parameter( na=75000, 
+CC      >           nonzer=13, 
+CC      >           shift=60., 
+CC      >           niter=75,
+CC      >           rcond=1.0d-1 )
+C----------
+C  Class C:
+C----------
+CC       parameter( na=150000, 
+CC      >           nonzer=15, 
+CC      >           shift=110., 
+CC      >           niter=75,
+CC      >           rcond=1.0d-1 )
+C----------
+C  Class D:
+C----------
+CC       parameter( na=1500000, 
+CC      >           nonzer=21, 
+CC      >           shift=500., 
+CC      >           niter=100,
+CC      >           rcond=1.0d-1 )
+C----------
+C  Class E:
+C----------
+CC       parameter( na=9000000, 
+CC      >           nonzer=26, 
+CC      >           shift=1500., 
+CC      >           niter=100,
+CC      >           rcond=1.0d-1 )
+
+
+      integer    nz, naz
+      parameter( nz = na*(nonzer+1)*(nonzer+1) )
+      parameter( naz = na*(nonzer+1) )
+
+
+      common / partit_size  /  naa, nzz, 
+     >                         firstrow, 
+     >                         lastrow, 
+     >                         firstcol, 
+     >                         lastcol
+      integer                  naa, nzz, 
+     >                         firstrow, 
+     >                         lastrow, 
+     >                         firstcol, 
+     >                         lastcol
+
+      common /urando/          amult, tran
+      double precision         amult, tran
+!$omp threadprivate (/urando/)
+
+      external         timer_read
+      double precision timer_read
+
+      integer T_init, T_bench, T_conj_grad, T_last
+      parameter (T_init=1, T_bench=2, T_conj_grad=3, T_last=3)
+      logical timeron
+      common /timers/ timeron
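Reviewer note (not part of the patch): the two `parameter` statements in globals.h derive the array bounds `nz` and `naz` from the class values `na` and `nonzer`. The C sketch below evaluates the same two formulas so the sizes can be checked by hand. The function names `cg_nz` and `cg_naz` are hypothetical and exist only for this illustration; 64-bit arithmetic is an assumption made here so that the largest class can be evaluated without overflow.

```c
#include <stdint.h>

/* Hypothetical helpers mirroring the formulas in globals.h:
 *   nz  = na*(nonzer+1)*(nonzer+1)
 *   naz = na*(nonzer+1)
 * int64_t is used so that large classes do not overflow in this sketch. */
int64_t cg_nz(int64_t na, int64_t nonzer)
{
    return na * (nonzer + 1) * (nonzer + 1);
}

int64_t cg_naz(int64_t na, int64_t nonzer)
{
    return na * (nonzer + 1);
}
```

With the class S values from the comment block (na=1400, nonzer=7), `cg_nz` gives 89600 and `cg_naz` gives 11200. With the class E values (na=9000000, nonzer=26), `cg_nz` gives 6561000000, which is larger than a 32-bit signed integer can hold.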
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/DC/ADC.par b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/DC/ADC.par
new file mode 100644
index 0000000..05f9ce7
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/DC/ADC.par
@@ -0,0 +1,5 @@
+attrNum=12
+measuresNum=1
+tuplesNum=100
+INVERSE_ENDIAN=0
+fileName=ADC
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/DC/Makefile b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/DC/Makefile
new file mode 100644
index 0000000..1151606
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/DC/Makefile
@@ -0,0 +1,40 @@
+SHELL=/bin/sh
+BENCHMARK=dc
+BENCHMARKU=DC
+
+include ../config/make.def
+
+include ../sys/make.common
+
+OBJS = adc.o dc.o extbuild.o rbt.o jobcntl.o \
+	${COMMON}/c_print_results.o  \
+	${COMMON}/c_timers.o ${COMMON}/c_wtime.o
+ifeq (${HOOKS}, 1)
+        OBJS += ${COMMON}/hooks.o ${COMMON}/m5op_x86.o
+endif
+
+
+# npbparams.h is provided for backward compatibility with NPB compilation
+# header.h: npbparams.h
+
+${PROGRAM}: config ${OBJS} 
+	${CLINK} ${CLINKFLAGS} -o ${PROGRAM} ${OBJS} ${C_LIB}
+
+.c.o:
+ifeq (${HOOKS}, 1)
+	${CCOMPILE} -DHOOKS $<
+else
+	${CCOMPILE} $<
+endif
+
+adc.o:      adc.c npbparams.h
+dc.o:       dc.c adcc.h adc.h macrodef.h npbparams.h
+extbuild.o: extbuild.c adcc.h adc.h macrodef.h npbparams.h
+rbt.o:      rbt.c adcc.h adc.h rbt.h macrodef.h npbparams.h
+jobcntl.o:  jobcntl.c adcc.h adc.h macrodef.h npbparams.h
+
+clean:
+	- rm -f *.o 
+	- rm -f npbparams.h core
+	- rm -f {../,}ADC.{logf,view,dat,viewsz,groupby,chunks}.* 
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/DC/README b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/DC/README
new file mode 100644
index 0000000..0c895fc
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/DC/README
@@ -0,0 +1,83 @@
+1. Compilation
+   DC benchmark uses the same directory tree as NPB3.0 (and NPB2.x) does.
+   Before compilation, one needs to check the configuration file
+   'make.def' in the config directory and modify the file if necessary
+   (an example of make.def is provided in the DC directory). 
+   Then
+      make dc CLASS=S
+
+   If a compiler complains about type 'int64' already defined, add
+   "-DHAS_INT64" to the CFLAGS list in make.def.
+
+2. The OpenMP environment needs to be set before the program can be executed.
+   First set the number of threads:
+   setenv OMP_NUM_THREADS 4
+   Then, to work around OpenMP implementation issues on some machines:
+   limit stacksize unlimited
+   If running on Altix:
+   setenv KMP_MONITOR_STACKSIZE 50m
+
+3. Run
+   A text file ADC.par is used to set DC parameters when the class 
+   is undefined (U). 
+   The file has 5 lines. The lines with 'key' words attrNum, measuresNum, 
+   and tuplesNum define the number of dimensions, measures,
+   and input tuples respectively. A special parameter, INVERSE_ENDIAN,
+   allows us to create data in non-native endian format (INVERSE_ENDIAN=1). 
+   The last parameter (fileName) specifies a DC file set name, including
+   (optionally) a full path to a directory which will contain all
+   DC related files.
+
+   An example of the DC parameter file is as follows:
+
+   attrNum=9
+   measuresNum=1
+   tuplesNum=125000
+   class=U
+   INVERSE_ENDIAN=0
+   fileName=ADC
+   
+   After the parameters are set, run the benchmark:
+   bin/dc.S 100000000 DC/ADC.par 
+   where 100000000 is the memory size allowed to be allocated for 
+   the in-core data.
+   
+4. DC processing modes
+   The DC benchmark can be run in two modes (in-core and out-of-core).
+   The desired mode should be set before compilation in the file adc.h.
+   If the flag IN_CORE is on, the benchmark will calculate all views in main
+   memory. In this case we can use an additional flag VIEW_FILE_OUTPUT to
+   allow writing all views into disk files.
+                
+   If the flag IN_CORE is off, the DC benchmark will run in a regular mode
+   using disks to store interim and result data which may not fit in main
+   memory.
+
+   _FILE_OFFSET_BITS=64 and _LARGEFILE64_SOURCE are standard compiler flags
+   which allow DC to work with files larger than 2GB. 
+
+   OPTIMIZATION turns on some nonstandard DC optimizations such as obtaining
+   a view by scanning existing views. These optimizations do not always 
+   guarantee reduction in the computing time.
+
+5. Tested architectures:
+   SUN Ultrasparc 60
+   SUNFire 880
+   Origin 2000, 3000, 3800
+   MAC G4 
+   Xeon + Mandrake Linux
+   SGI Altix
+
+6. The setparams utility is used to generate the npbparams.h file only 
+   for compatibility with the existing make facility of NPB. For the same
+   reason, CLASS is appended to the DC executable name. It does not limit 
+   the problem sizes the executable can run. The class is an input value
+   specified in the ADC.par file. Providing ADC.par overrides the compiled 
+   defaults in the npbparams.h file.
+
+7. Known issues
+   If the benchmark runs out of disk space, a message like
+   "Write error from WriteToFile()" may not be printed. Instead,
+   the benchmark returns with UNSUCCESSFUL verification. In this case 
+   users are advised to check whether the file system is full before 
+   reporting a problem with the benchmark.
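Reviewer note (not part of the patch): section 3 of the DC README describes ADC.par as a list of `key=value` lines. The C sketch below shows one way such a line could be parsed. It is not the parser shipped in adc.c; the struct `par_sketch` and the functions `key_is` and `par_sketch_parse_line` are hypothetical names, and the 512-byte file name buffer is an assumption chosen for the example.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the values described in section 3. */
struct par_sketch {
    int attr_num;          /* attrNum: number of dimensions */
    int measures_num;      /* measuresNum: number of measures */
    long long tuples_num;  /* tuplesNum: number of input tuples */
    int inverse_endian;    /* INVERSE_ENDIAN: 1 = non-native byte order */
    char file_name[512];   /* fileName: DC file set name */
    char clss;             /* class: 'U' when undefined */
};

/* True when the first klen characters of line are exactly key. */
static int key_is(const char *line, size_t klen, const char *key)
{
    return strlen(key) == klen && strncmp(line, key, klen) == 0;
}

/* Parse one "key=value" line. Returns 1 if a known key was stored,
 * 0 if the line is a comment, has no '=', or has an unknown key. */
int par_sketch_parse_line(const char *line, struct par_sketch *p)
{
    const char *eq;
    size_t klen;

    if (line[0] == '#') return 0;          /* comment line */
    eq = strchr(line, '=');
    if (eq == NULL) return 0;              /* not a key=value line */
    klen = (size_t)(eq - line);

    if (key_is(line, klen, "attrNum"))
        return sscanf(eq + 1, "%d", &p->attr_num) == 1;
    if (key_is(line, klen, "measuresNum"))
        return sscanf(eq + 1, "%d", &p->measures_num) == 1;
    if (key_is(line, klen, "tuplesNum"))
        return sscanf(eq + 1, "%lld", &p->tuples_num) == 1;
    if (key_is(line, klen, "INVERSE_ENDIAN"))
        return sscanf(eq + 1, "%d", &p->inverse_endian) == 1;
    if (key_is(line, klen, "fileName"))    /* width bounds the copy */
        return sscanf(eq + 1, "%511s", p->file_name) == 1;
    if (key_is(line, klen, "class"))
        return sscanf(eq + 1, " %c", &p->clss) == 1;
    return 0;                              /* unknown key */
}
```

Matching on the full key before the `=` sign, rather than searching for the keyword anywhere in the line, avoids one keyword being mistaken for another when a value happens to contain a keyword string.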
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/DC/adc.c b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/DC/adc.c
new file mode 100644
index 0000000..26f88c4
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/DC/adc.c
@@ -0,0 +1,636 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include "adc.h"
+
+#define BlockSize 1024
+
+void swap4(void * num){
+  char t, *p;
+  p = (char *) num;
+  t = *p; *p = *(p + 3); *(p + 3) = t;
+  t = *(p + 1); *(p + 1) = *(p + 2); *(p + 2) = t;
+}
+void swap8(void * num){
+  char t, *p;
+  p = (char *) num;	  
+  t = *p; *p = *(p + 7); *(p + 7) = t;
+  t = *(p + 1); *(p + 1) = *(p + 6); *(p + 6) = t;
+  t = *(p + 2); *(p + 2) = *(p + 5); *(p + 5) = t;
+  t = *(p + 3); *(p + 3) = *(p + 4); *(p + 4) = t;
+}
+void initADCpar(ADC_PAR *par){
+  par->ndid=0;
+  par->dim=5;
+  par->mnum=1;
+  par->tuplenum=100;
+/*  par->isascii=1; */
+  par->inverse_endian=0;
+  par->filename="ADC";
+  par->clss='U';
+}
+int ParseParFile(char* parfname,ADC_PAR *par);
+int GenerateADC(ADC_PAR *par);
+
+typedef struct Factorization{
+  long int *mlt;
+  long int *exp;
+  long int dim;
+} Factorization;
+
+void ShowFactorization(Factorization *nmbfct){
+  int i=0;
+  for(i=0;i<nmbfct->dim;i++){
+    if(nmbfct->mlt[i]==1){
+      if(i==0) fprintf(stdout,"prime.");
+      break;
+    }
+    if(i>0) fprintf(stdout,"*");
+    if(nmbfct->exp[i]==1)
+      fprintf(stdout,"%ld",nmbfct->mlt[i]);    
+    else 
+      fprintf(stdout,"%ld^%ld",nmbfct->mlt[i],
+                               nmbfct->exp[i]);
+  }
+  fprintf(stdout,"\n");
+}
+
+long int adcprime[]={
+  421,601,631,701,883,
+  419,443,647,21737,31769,
+  1427,18353,22817,34337,98717,
+  3527,8693,9677,11093,18233};
+  
+long int ListFirstPrimes(long int mpr,long int *prlist){
+/*
+  fprintf(stdout,"ListFirstPrimes: listing primes less than %ld...\n",
+                 mpr);
+*/
+  long int prnum=0;
+  int composed=0;
+  long int nmb=0,j=0;
+  prlist[prnum++]=2;
+  prlist[prnum++]=3;
+  prlist[prnum++]=5;
+  prlist[prnum++]=7;
+  for(nmb=8;nmb<mpr;nmb++){
+    composed=0;
+    for(j=0;prlist[j]*prlist[j]<=nmb;j++){
+      if(nmb-prlist[j]*((long int)(nmb/prlist[j]))==0){
+        composed=1;
+	break;
+      }
+    }
+    if(composed==0) prlist[prnum++]=nmb;
+  }
+/*  fprintf(stdout,"ListFirstPrimes: Done.\n"); */
+  return prnum;
+}
+
+long long int LARGE_NUM=0x4FFFFFFFFFFFFFFFLL;
+long long int maxprmfctr=59;
+
+long long int GetLCM(long long int mask,
+                     Factorization **fctlist,
+		     long int *adcexpons){
+  int i=0,j=0,k=0;
+  int* expons=(int*) calloc(maxprmfctr+1,sizeof(int));
+  long long int LCM=1;
+  long int pr=2;
+  int genexp=1,lexp=1,fct=2;
+
+  for(i=0;i<maxprmfctr+1;i++)expons[i]=0;
+  i=0;
+  while(mask>0){
+    if(mask==2*(mask/2)){
+      mask=mask>>1;
+      i++;  
+      continue;
+    }
+    pr=adcprime[i];
+    genexp=adcexpons[i];
+/*
+  fprintf(stdout,"[%ld,%ld]\n",pr,genexp);
+  ShowFactorization(fctlist[genexp]);
+*/
+    for(j=0;j<fctlist[pr-1]->dim;j++){
+      fct=fctlist[pr-1]->mlt[j];
+      lexp=fctlist[pr-1]->exp[j];
+
+      for(k=0;k<fctlist[genexp]->dim;k++){
+        if(fctlist[genexp]->mlt[k]==1) break;
+        if(fct!=fctlist[genexp]->mlt[k]) continue;
+        lexp-=fctlist[genexp]->exp[k];
+	break;
+      }
+      if(expons[fct]<lexp)expons[fct]=lexp;
+    }
+    mask=mask>>1;
+    i++;
+  }
+/*
+for(i=0;i<maxprmfctr;i++){
+  if(expons[i]>0) fprintf(stdout,"*%ld^%ld",i,expons[i]);
+}
+fprintf(stdout,"\n");
+*/
+  for(i=0;i<=maxprmfctr;i++){
+    while(expons[i]>0){
+      LCM*=i;
+      if(LCM>LARGE_NUM/maxprmfctr){ free(expons); return LCM; }
+      expons[i]--;
+    }
+  }
+/*  fprintf(stdout,"==== %lld\n",LCM); */
+  free(expons);
+  return LCM;
+}
+void ExtendFactors(long int nmb,long int firstdiv,
+                   Factorization *nmbfct,Factorization **fctlist){
+  Factorization *divfct=fctlist[nmb/firstdiv];
+  int fdivused=0;
+  int multnum=0;
+  int i=0;
+/*  fprintf(stdout,"==== %lld %ld %ld\n",divfct->dim,nmb,firstdiv); */
+   for(i=0;i<divfct->dim;i++){
+    if(divfct->mlt[i]==1){
+      if(fdivused==0){
+        nmbfct->mlt[multnum]=firstdiv;
+        nmbfct->exp[multnum]=1;   
+      }
+      break;
+    }
+    if(divfct->mlt[i]<firstdiv){
+      nmbfct->mlt[i]=divfct->mlt[i];
+      nmbfct->exp[i]=divfct->exp[i];
+      multnum++;
+    }else if(divfct->mlt[i]==firstdiv){
+      nmbfct->mlt[i]=divfct->mlt[i];
+      nmbfct->exp[i]=divfct->exp[i]+1;   
+      fdivused=1;
+    }else{
+      int j=i;
+      if(fdivused==0) j=i+1;
+      nmbfct->mlt[j]=divfct->mlt[i];
+      nmbfct->exp[j]=divfct->exp[i];    
+    }
+  }
+}
+void GetFactorization(long int prnum,long int *prlist,
+                            Factorization **fctlist){
+/*fprintf(stdout,"GetFactorization: factorizing first %ld numbers.\n",
+                prnum);*/
+  long int i=0,j=0;
+  Factorization *fct=(Factorization*)malloc(2*sizeof(Factorization)); 
+  long int len=0,isft=0,div=1,firstdiv=1;
+
+  fct->dim=2;
+  fct->mlt=(long int*)malloc(2*sizeof(long int));
+  fct->exp=(long int*)malloc(2*sizeof(long int));
+  for(i=0;i<fct->dim;i++){
+    fct->mlt[i]=1;
+    fct->exp[i]=0;
+  }
+  fct->mlt[0]=2;
+  fct->exp[0]=1;
+  fctlist[2]=fct;
+
+  fct=(Factorization*)malloc(2*sizeof(Factorization));
+  fct->dim=2;
+  fct->mlt=(long int*)malloc(2*sizeof(long int));
+  fct->exp=(long int*)malloc(2*sizeof(long int));
+  for(i=0;i<fct->dim;i++){
+    fct->mlt[i]=1;
+    fct->exp[i]=0;
+  }
+  fct->mlt[0]=3;
+  fct->exp[0]=1;
+  fctlist[3]=fct;
+ 
+  for(i=0;i<prlist[prnum-1];i++){
+    len=0;
+    isft=i;
+    while(isft>0){
+      len++;
+      isft=isft>>1;
+    }
+    fct=(Factorization*)malloc(2*sizeof(Factorization));
+    fct->dim=len;
+    if (len==0) len=1;
+    fct->mlt=(long int*)malloc(len*sizeof(long int));
+    fct->exp=(long int*)malloc(len*sizeof(long int));
+    for(j=0;j<fct->dim;j++){
+      fct->mlt[j]=1;
+      fct->exp[j]=0;
+    }
+    div=1;
+    for(j=0;prlist[j]*prlist[j]<=i;j++){
+      firstdiv=prlist[j];
+      if(i-firstdiv*((long int)i/firstdiv)==0){
+        div=firstdiv;
+        if(firstdiv*firstdiv==i){
+          fct->mlt[0]=firstdiv;
+          fct->exp[0]=2;	  
+	}else{
+	  ExtendFactors(i,firstdiv,fct,fctlist);
+        }
+	break;
+      }
+    }
+    if(div==1){
+      fct->mlt[0]=i;
+      fct->exp[0]=1;   
+    }
+    fctlist[i]=fct;
+/*
+     ShowFactorization(fct);
+*/
+  }
+/*  fprintf(stdout,"GetFactorization: Done.\n"); */
+}
+
+long int adcexp[]={
+  11,13,17,19,23,
+  23,29,31,37,41,	     	  
+  41,43,47,53,59,	     	  
+  3,5,7,11,13};
+long int adcexpS[]={
+  11,13,17,19,23};
+long int adcexpW[]={  
+  2*2,2*2*2*5,2*3,2*2*5,2*3*7,
+  23,29,31,2*2,2*2*19};
+long int adcexpA[]={  
+  2*2,2*2*2*5,2*3,2*2*5,2*3*7,
+  2*19,2*13,2*19,2*2*2*13*19,2*2*2*19*19,                    
+  2*23,2*2*2*2,2*2*2*2*2*23,2*2*2*2*2,2*2*23};
+long int adcexpB[]={  
+  2*2*7,2*2*2*5,2*3*7,2*2*5*7,2*3*7*7,
+  2*19,2*13,2*19,2*2*2*13*19,2*2*2*19*19,                      
+  2*31,2*2*2*2*31,2*2*2*2*2*31,2*2*2*2*2*29,2*2*29,
+  2*43,2*2,2*2,2*2*47,2*2*2*43};  
+long int UpPrimeLim=100000;
+
+typedef struct dc_view{
+  long long int vsize;
+  long int vidx;
+} DC_view;
+
+int CompareSizesByValue( const void* sz0, const void* sz1) {
+long long int *size0=(long long int*)sz0,
+              *size1=(long long int*)sz1;
+  int res=0;
+  if(*size0-*size1>0) res=1;
+  else if(*size0-*size1<0) res=-1;
+  return res;
+}
+int CompareViewsBySize( const void* vw0, const void* vw1) {
+DC_view *lvw0=(DC_view *)vw0, *lvw1=(DC_view *)vw1;
+  int res=0;
+  if(lvw0->vsize>lvw1->vsize) res=1;
+  else if(lvw0->vsize<lvw1->vsize) res=-1;
+  else if(lvw0->vidx>lvw1->vidx) res=1;
+  else if(lvw0->vidx<lvw1->vidx) res=-1;
+  return res;
+}
+
+int CalculateVeiwSizes(ADC_PAR *par){
+  unsigned long long totalInBytes = 0;
+  unsigned long long nViewDims, nCubeTuples = 0;
+ 
+  const char *adcfname=par->filename;
+  int NDID=par->ndid;
+  char clss=par->clss;
+  int dcdim=par->dim;
+  long long int tnum=par->tuplenum;
+  long long int i=0,j=0;
+  Factorization  
+    **fctlist=(Factorization **) calloc(UpPrimeLim,sizeof(Factorization *));
+  long int *prlist=(long int *) calloc(UpPrimeLim,sizeof(long int));
+  int prnum=ListFirstPrimes(UpPrimeLim,prlist);
+  DC_view *dcview=(DC_view *)calloc((1<<dcdim),sizeof(DC_view));
+  const char* vszefname0;
+  char *vszefname=NULL;
+  FILE* view=NULL;
+  int minvn=1, maxvn=(1<<dcdim), vinc=1;
+  long idx=0;
+
+  GetFactorization(prnum,prlist,fctlist); 
+  for(i=1;i<(1<<dcdim);i++){   
+    long long int LCM=1;
+    switch(clss){
+      case 'U':
+        LCM=GetLCM(i,fctlist,adcexp);
+      break;
+      case 'S':
+        LCM=GetLCM(i,fctlist,adcexpS);
+      break;
+      case 'W':
+        LCM=GetLCM(i,fctlist,adcexpW);
+      break;
+      case 'A':
+        LCM=GetLCM(i,fctlist,adcexpA);
+      break;
+      case 'B':
+        LCM=GetLCM(i,fctlist,adcexpB);
+      break;
+    }
+    if(LCM>tnum) LCM=tnum;
+    dcview[i].vsize=LCM;
+    dcview[i].vidx=i;
+  }
+  for(i=0;i<UpPrimeLim;i++){
+    if(!fctlist[i]) continue;
+    if(fctlist[i]->mlt) free(fctlist[i]->mlt); 
+    if(fctlist[i]->exp) free(fctlist[i]->exp); 
+    free(fctlist[i]);
+  }
+  free(fctlist);
+  free(prlist);
+   
+  vszefname0="view.sz";
+  vszefname=(char*)calloc(BlockSize,sizeof(char));
+  sprintf(vszefname,"%s.%s.%d",adcfname,vszefname0,NDID);
+  if(!(view = fopen(vszefname, "w+")) ) {
+    fprintf(stderr,"CalculateVeiwSizes: Can't open file: %s\n",vszefname);
+    return 0;
+  }
+  qsort( dcview, (1<<dcdim), sizeof(DC_view),CompareViewsBySize);	
+
+  switch(clss){
+    case 'U':
+      vinc=1<<3;
+    break;
+    case 'S':
+    break;
+    case 'W':
+    break;
+    case 'A':
+      vinc=1<<6;
+    break;
+    case 'B':
+      vinc=1<<14;
+    break;
+  }
+   for(i=minvn;i<maxvn;i+=vinc){   
+    nViewDims = 0;
+    fprintf(view,"Selection:");
+    idx=dcview[i].vidx;
+    for(j=0;j<dcdim;j++) 
+      if(((idx>>j)&0x1)==1) { fprintf(view," %lld",j+1); nViewDims++;}
+    fprintf(view,"\nView Size: %lld\n",dcview[i].vsize);
+
+    totalInBytes += (8+4*nViewDims)*dcview[i].vsize;
+    nCubeTuples += dcview[i].vsize;
+
+  }
+  fprintf(view,"\nTotal in bytes: %llu  Number of tuples: %llu\n", 
+          totalInBytes, nCubeTuples);
+  
+  fclose(view);
+  free(dcview);
+  fprintf(stdout,"View sizes are written into %s\n",vszefname);
+  free(vszefname);
+  return 1;
+}
+
+int ParseParFile(char* parfname,ADC_PAR *par){
+  char line[BlockSize];
+  FILE* parfile=NULL;
+  char* pos=strchr(parfname,'.');
+  int linenum=0,i=0;
+  const char *kwd;
+
+  if(!(parfile = fopen(parfname, "r")) ) {
+    fprintf(stderr,"ParseParFile: Can't open file: %s\n",parfname);
+    return 0;
+  }
+  if(pos) pos=strchr(pos+1,'.');
+  if(pos) sscanf(pos+1,"%d",&(par->ndid));
+  linenum=0;
+  while(fgets(&line[0],BlockSize,parfile)){
+    i=0;
+    kwd=adcKeyword[i];
+    while(kwd){
+      if(strstr(line,"#")) {
+        ;/*comment line, do nothing*/
+      }else if(strstr(line,kwd)){
+        char *pos=line+strlen(kwd)+1;
+        switch(i){
+          case 0:
+            sscanf(pos,"%d",&(par->dim));
+          break;
+          case 1:
+            sscanf(pos,"%d",&(par->mnum));
+          break;
+          case 2:
+            sscanf(pos,"%lld",&(par->tuplenum));
+          break;
+          /* Case indices follow the order of adcKeyword[] in adc.h; */
+          /* the retired isASCII keyword no longer has an entry.      */
+          case 3:
+            sscanf(pos,"%d",&(par->inverse_endian));
+          break;
+          case 4:
+            /* +1 leaves room for the terminating NUL */
+            par->filename=(char*) malloc((strlen(pos)+1)*sizeof(char));
+            sscanf(pos,"%s",(char*)par->filename);
+          break;
+          case 5:
+            sscanf(pos,"%c",&(par->clss));
+          break;
+        }
+        break;        
+      }
+      i++;
+      kwd=adcKeyword[i];
+    }
+    linenum++;
+  }
+  fclose(parfile);
+  switch(par->clss){/* overwriting parameters according to the class */
+    case 'S':
+      par->dim=5;
+      par->mnum=1;
+      par->tuplenum=1000;
+    break;
+    case 'W':
+      par->dim=10;
+      par->mnum=1;
+      par->tuplenum=100000;
+    break;
+    case 'A':
+      par->dim=15;
+      par->mnum=1;
+      par->tuplenum=1000000;
+    break;
+    case 'B':
+      par->dim=20;
+      par->mnum=1;
+      par->tuplenum=10000000;
+    break;
+  }  
+  return 1;
+}
+int WriteADCPar(ADC_PAR *par,char* fname){
+  char *lname=(char*) calloc(BlockSize,sizeof(char));
+  FILE *parfile=NULL;
+
+  sprintf(lname,"%s",fname);
+  parfile=fopen(lname,"w");
+  if(!parfile){
+    fprintf(stderr,"WriteADCPar: can't open file %s\n",lname);
+    return 0;
+  }
+  fprintf(parfile,"attrNum=%d\n",par->dim);
+  fprintf(parfile,"measuresNum=%d\n",par->mnum);
+  fprintf(parfile,"tuplesNum=%lld\n",par->tuplenum);
+  fprintf(parfile,"class=%c\n",par->clss);
+/*  fprintf(parfile,"isASCII=%d\n",par->isascii); */
+  fprintf(parfile,"INVERSE_ENDIAN=%d\n",par->inverse_endian);
+  fprintf(parfile,"fileName=%s\n",par->filename);
+  fclose(parfile);
+  return 1;
+}
+void ShowADCPar(ADC_PAR *par){
+  fprintf(stdout,"******************** ADC parameters\n");
+  fprintf(stdout," id		%d\n",par->ndid);
+  fprintf(stdout," attributes 	%d\n",par->dim);
+  fprintf(stdout," measures   	%d\n",par->mnum);
+  fprintf(stdout," tuples     	%lld\n",par->tuplenum);
+  fprintf(stdout," class	\t%c\n",par->clss);
+  fprintf(stdout," filename       %s\n",par->filename);
+  fprintf(stdout,"***********************************\n");
+}
+
+long int adcgen[]={
+  2,7,3,2,2,
+  2,2,5,31,7,
+  2,3,3,3,2,
+  5,2,2,2,3};
+  
+int GetNextTuple(int dcdim, int measnum,
+                 long long int* attr,long long int* meas,
+		 char clss){
+  static int tuplenum=0;
+  static const int maxdim=20;
+  static int measbound=31415;
+  int i=0,j=0;
+  int maxattr=0;
+  static long int seed[20];
+  long int *locexp=NULL;
+
+  if(dcdim>maxdim){
+    fprintf(stderr,"GetNextTuple: number of dcdim is too large:%d",
+                    dcdim);
+    return 0;
+  }
+  if(measnum>measbound){
+    fprintf(stderr,"GetNextTuple: number of mes is too large:%d",
+                    measnum);
+    return 0;
+  }
+  locexp=adcexp;
+  switch(clss){
+    case 'S':
+    locexp=adcexpS;
+    break;
+    case 'W':
+    locexp=adcexpW;
+    break;
+    case 'A':
+    locexp=adcexpA;
+    break;
+    case 'B':
+    locexp=adcexpB;
+    break;
+  }  
+  if(tuplenum==0){
+    for(i=0;i<dcdim;i++){
+      int tmpgen=adcgen[i];
+      for(j=0;j<locexp[i]-1;j++){
+        tmpgen*=adcgen[i];
+	tmpgen=tmpgen%adcprime[i];
+      }
+      adcgen[i]=tmpgen;
+    }
+    fprintf(stdout,"Prime \tGenerator \tSeed\n");
+    for(i=0;i<dcdim;i++){
+      seed[i]=(adcprime[i]+1)/2;
+      fprintf(stdout," %ld\t %ld\t\t %ld\n",adcprime[i],adcgen[i],seed[i]);
+     }
+  }
+  tuplenum++;
+  maxattr=0;
+  for(i=0;i<dcdim;i++){
+    attr[i]=seed[i]*adcgen[i];
+    attr[i]-=adcprime[i]*((long long int)attr[i]/adcprime[i]); 
+    seed[i]=attr[i];
+    if(seed[i]>maxattr) maxattr=seed[i];
+  }		     	  
+  for(i=0;i<measnum;i++){
+    meas[i]=(long long int)(seed[i]*maxattr);
+    meas[i]-=measbound*(meas[i]/measbound);
+  }		     	  
+  return 1;
+}
+
+int GenerateADC(ADC_PAR *par){
+  int dcdim=par->dim,
+      mesnum=par->mnum,
+      tplnum=par->tuplenum;
+  char *adcfname=(char*)calloc(BlockSize,sizeof(char));
+  
+  FILE *adc;
+  int i=0,j=0;
+  long long int* attr=NULL,*mes=NULL; 
+/*
+   if(par->isascii==1){
+    sprintf(adcfname,"%s.tpl.%d",par->filename,par->ndid);
+    if(!(adc = fopen(adcfname, "w+"))) {
+      fprintf(stderr,"GenerateADC: Can't open file: %s\n",adcfname);
+      return 0;
+    }
+  }else{
+*/
+  sprintf(adcfname,"%s.dat.%d",par->filename,par->ndid);
+    if(!(adc = fopen(adcfname, "wb+"))){
+      fprintf(stderr,"GenerateADC: Can't open file: %s\n",adcfname);
+       return 0;
+    }
+/*  } */
+  attr=(long long int *)malloc(dcdim*sizeof(long long int));
+  mes=(long long int *)malloc(mesnum*sizeof(long long int));
+
+  fprintf(stdout,"\nGenerateADC: writing %d tuples of %d attributes and %d measures to %s\n",
+		  tplnum,dcdim,mesnum,adcfname);
+   for(i=0;i<tplnum;i++){
+    if(!GetNextTuple(dcdim,mesnum,attr,mes,par->clss)) return 0;
+/*
+     if(par->isascii==1){
+      for(int j=0;j<dcdim;j++)fprintf(adc,"%lld ",attr[j]);
+      for(int j=0;j<mesnum;j++)fprintf(adc,"%lld ",mes[j]);
+      fprintf(adc,"\n");
+    }else{
+*/
+      for(j=0;j<mesnum;j++){ 
+    	long long mv =  mes[j];
+	    if(par->inverse_endian==1) swap8(&mv);
+	    fwrite(&mv, 8, 1, adc); 
+      }
+      for(j=0;j<dcdim;j++){ 
+    	int av = attr[j]; 
+	if(par->inverse_endian==1) swap4(&av);
+	fwrite(&av, 4, 1, adc); 
+      }
+    }
+/*  } */
+  fclose(adc);
+  fprintf(stdout,"Binary ADC file %s ",adcfname);
+  fprintf(stdout,"has been generated.\n");
+  free(attr);
+  free(mes);
+  free(adcfname);
+  CalculateVeiwSizes(par);
+  return 1;
+}
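Reviewer note (not part of the patch): `GetNextTuple` in adc.c first raises each generator to a class-specific exponent modulo the matching prime, then produces each attribute as `seed*gen mod prime`, starting from `seed=(prime+1)/2`. The C sketch below isolates those two steps for a single dimension. The function names `adc_sketch_powmod` and `adc_sketch_next` are hypothetical and are not part of the benchmark.

```c
/* Repeated multiplication modulo prime, following the loop in
 * GetNextTuple: returns gen^exp mod prime for exp >= 1. */
long long adc_sketch_powmod(long long gen, long long exp, long long prime)
{
    long long g = gen % prime;
    long long j;
    for (j = 0; j < exp - 1; j++)
        g = (g * gen) % prime;
    return g;
}

/* One step of the multiplicative sequence: next = seed*gen mod prime. */
long long adc_sketch_next(long long seed, long long gen, long long prime)
{
    return (seed * gen) % prime;
}
```

For the first dimension of the default class, the tables give prime 421, generator 2 and exponent 11. `adc_sketch_powmod(2, 11, 421)` is 364, the starting seed is (421+1)/2 = 211, and the first two values produced by `adc_sketch_next` are 182 and 151.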
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/DC/adc.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/DC/adc.h
new file mode 100644
index 0000000..e11f243
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/DC/adc.h
@@ -0,0 +1,167 @@
+#ifndef adc_h
+#define adc_h 1
+
+/* For checking of L2-cache performance influence */ 
+/*#define IN_CORE_*/
+/*#define VIEW_FILE_OUTPUT*/ /* it can be used with IN_CORE only */
+
+/* Optimizations: prefixed views and share-sorted views */
+/*#define OPTIMIZATION*/
+
+#ifdef WINNT
+#ifndef HAS_INT64
+typedef __int64             int64;
+typedef int                 int32;
+#endif
+typedef unsigned __int64   uint64;
+typedef unsigned int       uint32;
+#else
+#ifndef HAS_INT64
+typedef long long           int64;
+typedef int                 int32;
+#endif
+typedef unsigned long long uint64;
+typedef unsigned int       uint32;
+#endif
+
+#include "adcc.h"
+#include "rbt.h"
+
+static int measbound=31415;   /* upper limit on a view measure bound */
+
+enum { smallestParent, prefixedParent, sharedSortParent, noneParent };
+
+static const char* adcKeyword[]={
+  "attrNum",
+  "measuresNum",
+  "tuplesNum",
+  "INVERSE_ENDIAN",
+  "fileName",
+  "class",
+  NULL
+};
+
+typedef struct ADCpar{
+  int ndid;
+  int dim;
+  int mnum;
+  long long int tuplenum;
+  int inverse_endian;
+  const char *filename;
+  char clss;
+} ADC_PAR;
+
+typedef struct {
+    int32 ndid;
+   char   clss;
+   char          adcName[MAX_FILE_FULL_PATH_SIZE];
+   char   adcInpFileName[MAX_FILE_FULL_PATH_SIZE];
+   uint32 nd; 
+   uint32 nm;
+   uint32 nInputRecs;
+   uint32 memoryLimit;
+   uint32 nTasks;
+   /*  FILE *statf; */
+} ADC_VIEW_PARS;
+
+typedef struct job_pool{ 
+   uint32 grpb; 
+   uint32 nv;
+   uint32 nRows; 
+    int64 viewOffset; 
+} JOB_POOL;
+
+typedef struct layer{
+   uint32 layerIndex;
+   uint32 layerQuantityLimit;
+   uint32 layerCurrentPopulation;
+} LAYER;
+
+typedef struct chunks{
+   uint32 curChunkNum;
+    int64 chunkOffset;
+   uint32 posSubChunk;
+   uint32 curSubChunk;
+} CHUNKS;
+
+typedef struct tuplevsize {
+    uint64 viewsize;
+    uint64 tuple;
+} TUPLE_VIEWSIZE;
+
+typedef struct tupleones {
+    uint32 nOnes;
+    uint64 tuple;
+} TUPLE_ONES;
+
+typedef struct {
+   char adcName[MAX_FILE_FULL_PATH_SIZE];
+   uint32 retCode;
+   uint32 verificationFailed;
+   uint32 swapIt;
+   uint32 nTasks;
+   uint32 taskNumber;
+    int32 ndid;
+
+   uint32 nTopDims; /* given number of dimension attributes */
+   uint32 nm;       /* number of measures */ 
+   uint32 nd;       /* number of parent's dimensions */
+   uint32 nv;       /* number of child's dimensions */
+
+   uint32 nInputRecs;
+   uint32 nViewRows; 
+   uint32 totalOfViewRows;
+   uint32 nParentViewRows;
+
+    int64 viewOffset;
+    int64 accViewFileOffset;
+
+   uint32 inpRecSize;
+   uint32 outRecSize;
+
+   uint32 memoryLimit;
+ unsigned char * memPool;
+   uint32 * inpDataBuffer;
+
+   RBTree *tree;
+
+   uint32 numberOfChunks;
+   CHUNKS *chunksParams;
+
+     char       adcLogFileName[MAX_FILE_FULL_PATH_SIZE];
+     char          inpFileName[MAX_FILE_FULL_PATH_SIZE];
+     char         viewFileName[MAX_FILE_FULL_PATH_SIZE];
+     char       chunksFileName[MAX_FILE_FULL_PATH_SIZE];
+     char      groupbyFileName[MAX_FILE_FULL_PATH_SIZE];
+     char adcViewSizesFileName[MAX_FILE_FULL_PATH_SIZE];
+     char    viewSizesFileName[MAX_FILE_FULL_PATH_SIZE];
+
+     FILE *logf;
+     FILE *inpf;
+     FILE *viewFile;   
+     FILE *fileOfChunks;
+     FILE *groupbyFile;
+     FILE *adcViewSizesFile;
+     FILE *viewSizesFile;
+   
+    int64     mSums[MAX_NUM_OF_MEAS];
+   uint32 selection[MAX_NUM_OF_DIMS];
+    int64 checksums[MAX_NUM_OF_MEAS]; /* view checksums */
+    int64 totchs[MAX_NUM_OF_MEAS];    /* checksums of a group of views */
+
+ JOB_POOL *jpp;
+    LAYER *lpp;
+   uint32 nViewLimit;
+   uint32 groupby;
+   uint32 smallestParentLevel;
+   uint32 parBinRepTuple;
+   uint32 nRowsToRead;
+   uint32 fromParent;
+
+   uint64 totalViewFileSize; /* in bytes */
+   uint32 numberOfMadeViews;
+   uint32 numberOfViewsMadeFromInput;
+   uint32 numberOfPrefixedGroupbys;
+   uint32 numberOfSharedSortGroupbys;
+} ADC_VIEW_CNTL;
+#endif /* adc_h */
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/DC/adcc.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/DC/adcc.h
new file mode 100644
index 0000000..fe52718
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/DC/adcc.h
@@ -0,0 +1,82 @@
+/*
+!-------------------------------------------------------------------------!
+!				                                    	                  !
+!		           N A S   G R I D   B E N C H M A R K S                  !
+!									                                      !
+!		                	C + +	V E R S I O N		                  !
+!									                                      !
+!			                       A D C C . H 		                      !
+!									                                      !
+!-------------------------------------------------------------------------!
+!									                                      !
+!    This file contains constant definitions used for                     !
+!    building views.                                                      !
+!									                                      !
+!    Permission to use, copy, distribute and modify this software	      !
+!    for any purpose with or without fee is hereby granted.		          !
+!    We request, however, that all derived work reference the		      !
+!    NAS Grid Benchmarks 3.0 or GridNPB3.0. This software is provided	  !
+!    "as is" without expressed or implied warranty.			              !
+!									                                      !
+!    Information on GridNPB3.0, including the concept of		          !
+!    the NAS Grid Benchmarks, the specifications, source code,  	      !
+!    results and information on how to submit new results,		          !
+!    is available at:							                          !
+!									                                      !
+!	  http://www.nas.nasa.gov/Software/NPB  			                  !
+!									                                      !
+!    Send comments or suggestions to  ngb@nas.nasa.gov  		          !
+!    Send bug reports to	      ngb@nas.nasa.gov  		              !
+!									                                      !
+!	   E-mail:  ngb@nas.nasa.gov					                      !
+!	   Fax:     (650) 604-3957					                          !
+!									                                      !
+!-------------------------------------------------------------------------!
+! GridNPB3.0 C++ version						                          !
+!	  Michael Frumkin, Leonid Shabanov				                      !
+!-------------------------------------------------------------------------!
+*/
+#ifndef _ADCC_CONST_DEFS_H_
+#define _ADCC_CONST_DEFS_H_
+
+/*#define WINNT*/
+#define UNIX
+
+#define ADC_OK                        0
+#define ADC_WRITE_FAILED              1
+#define ADC_INTERNAL_ERROR            2
+#define ADC_TREE_DESTROY_FAILURE      3
+#define ADC_FILE_OPEN_FAILURE         4
+#define ADC_MEMORY_ALLOCATION_FAILURE 5
+#define ADC_FILE_DELETE_FAILURE       6
+#define ADC_VERIFICATION_FAILED       7
+#define ADC_SHMEMORY_FAILURE          8
+
+#define SSA_BUFFER_SIZE     (1024*1024)
+#define MAX_NUMBER_OF_TASKS         256
+
+#define MAX_PAR_FILE_LINE_SIZE      512
+#define MAX_FILE_FULL_PATH_SIZE     512
+#define MAX_ADC_NAME_SIZE            32
+
+#define DIM_FSZ                       4
+#define MSR_FSZ                       8
+
+#define MAX_NUM_OF_DIMS              20
+#define MAX_NUM_OF_MEAS               4
+
+#define MAX_NUM_OF_CHUNKS          1024      
+#define MAX_PARAM_LINE_SIZE        1024
+
+#define OUTPUT_BUFFER_SIZE (MAX_NUM_OF_DIMS + (MSR_FSZ/4)*MAX_NUM_OF_MEAS)
+#define MAX_VIEW_REC_SIZE ((DIM_FSZ*MAX_NUM_OF_DIMS)+(MSR_FSZ*MAX_NUM_OF_MEAS))     
+#define MAX_VIEW_ROW_SIZE_IN_INTS (MAX_NUM_OF_DIMS + 2*MAX_NUM_OF_MEAS)
+#define MLB32  0x80000000
+
+#ifdef WINNT
+#define MLB    0x8000000000000000
+#else
+#define MLB 0x8000000000000000LL
+#endif
+
+#endif /*  _ADCC_CONST_DEFS_H_ */
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/DC/dc.c b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/DC/dc.c
new file mode 100644
index 0000000..2fe6754
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/DC/dc.c
@@ -0,0 +1,338 @@
+/*
+!-------------------------------------------------------------------------!
+!                                                                         !
+!        N  A  S     P A R A L L E L     B E N C H M A R K S  3.3         !
+!                                                                         !
+!                      O p e n M P     V E R S I O N                      !
+!                                                                         !
+!                                   D C                                   !
+!                                                                         !
+!-------------------------------------------------------------------------!
+!                                                                         !
+!    DC creates all specified data-cube views in parallel.                !
+!    Refer to NAS Technical Report 03-005 for details.                    !
+!    It calculates all groupbys in a top down manner using well known     !
+!    heuristics and optimizations.                                        !
+!                                                                         !
+!    Permission to use, copy, distribute and modify this software         !
+!    for any purpose with or without fee is hereby granted.  We           !
+!    request, however, that all derived work reference the NAS            !
+!    Parallel Benchmarks 3.3. This software is provided "as is"           !
+!    without express or implied warranty.                                 !
+!                                                                         !
+!    Information on NPB 3.3, including the technical report, the          !
+!    original specifications, source code, results and information        !
+!    on how to submit new results, is available at:                       !
+!                                                                         !
+!           http://www.nas.nasa.gov/Software/NPB/                         !
+!                                                                         !
+!    Send comments or suggestions to  npb@nas.nasa.gov                    !
+!                                                                         !
+!          NAS Parallel Benchmarks Group                                  !
+!          NASA Ames Research Center                                      !
+!          Mail Stop: T27A-1                                              !
+!          Moffett Field, CA   94035-1000                                 !
+!                                                                         !
+!          E-mail:  npb@nas.nasa.gov                                      !
+!          Fax:     (650) 604-3957                                        !
+!                                                                         !
+!-------------------------------------------------------------------------!
+! Author: Michael Frumkin                                                 !
+!         Leonid Shabanov                                                 !
+!-------------------------------------------------------------------------!
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+#include <ctype.h>
+#include <math.h>
+
+#ifdef _OPENMP
+#include <omp.h>
+#endif
+
+
+#include "adc.h"
+#include "macrodef.h"
+#include "npbparams.h"
+
+#ifdef UNIX
+#include <sys/types.h>
+#include <unistd.h>
+
+#define MAX_TIMERS 64  /* NPB maximum timers */
+  void    timer_clear(int);
+  void    timer_start(int);
+  void    timer_stop(int); 
+  double  timer_read(int);
+#endif
+
+void c_print_results( char   *name,
+                      char   clss,
+                      int    n1, 
+                      int    n2,
+                      int    n3,
+                      int    niter,
+                      double t,
+                      double mops,
+		      char   *optype,
+                      int    passed_verification,
+                      char   *npbversion,
+                      char   *compiletime,
+                      char   *cc,
+                      char   *clink,
+                      char   *c_lib,
+                      char   *c_inc,
+                      char   *cflags,
+                      char   *clinkflags );
+
+void initADCpar(ADC_PAR *par);
+int ParseParFile(char* parfname, ADC_PAR *par); 
+int GenerateADC(ADC_PAR *par);
+void ShowADCPar(ADC_PAR *par);
+int32 DC(ADC_VIEW_PARS *adcpp);
+int Verify(long long int checksum,ADC_VIEW_PARS *adcpp);
+
+void roi_begin_();
+void roi_end_();
+
+#define BlockSize 1024
+
+int main ( int argc, char * argv[] ) 
+{
+  ADC_PAR *parp;
+  ADC_VIEW_PARS *adcpp;
+  int32 retCode;
+
+  fprintf(stdout,"\n\n NAS Parallel Benchmarks (NPB3.3-OMP) - DC Benchmark\n\n" );
+  if(argc!=3){
+    fprintf(stdout," No parameter file. Using compiled defaults\n");
+  }
+  if(argc>3 || (argc>1 && !isdigit((unsigned char)argv[1][0]))){
+    fprintf(stderr,"Usage: <program name> <amount of memory>\n");
+    fprintf(stderr,"       <file of parameters>\n");
+    fprintf(stderr,"Example: bin/dc.S 1000000 DC/ADC.par\n");
+    fprintf(stderr,"The last argument, (a parameter file) can be skipped\n");
+    exit(1);
+  }
+
+  if(  !(parp = (ADC_PAR*) malloc(sizeof(ADC_PAR)))
+     ||!(adcpp = (ADC_VIEW_PARS*) malloc(sizeof(ADC_VIEW_PARS)))){
+     PutErrMsg("main: malloc failed")
+     exit(1);
+  }
+  initADCpar(parp);
+  parp->clss=CLASS;
+  if(argc!=3){
+    parp->dim=attrnum;
+    parp->tuplenum=input_tuples;    
+  }else if( (argc==3)&&(!ParseParFile(argv[2], parp))) {
+    PutErrMsg("main.ParseParFile failed")
+    exit(1);
+  }
+  ShowADCPar(parp); 
+  if(!GenerateADC(parp)) {
+     PutErrMsg("main.GenerateAdc failed")
+     exit(1);
+  }
+
+  adcpp->ndid = parp->ndid;  
+  adcpp->clss = parp->clss;
+  adcpp->nd = parp->dim;
+  adcpp->nm = parp->mnum;
+  adcpp->nTasks = 1;
+
+  if(argc>=2)
+    adcpp->memoryLimit = atoi(argv[1]);
+  else
+    adcpp->memoryLimit = 0;
+  if(adcpp->memoryLimit <= 0){
+    /* size of rb-tree with tuplenum nodes */
+    adcpp->memoryLimit = parp->tuplenum*(50+5*parp->dim); 
+    fprintf(stdout,"Estimated rb-tree size = %d \n", adcpp->memoryLimit);
+  }
+  adcpp->nInputRecs = parp->tuplenum;
+  strcpy(adcpp->adcName, parp->filename);
+  strcpy(adcpp->adcInpFileName, parp->filename);
+
+  if((retCode=DC(adcpp))) {
+     PutErrMsg("main.DC failed")
+     fprintf(stderr, "main.DC failed: retcode = %d\n", retCode);
+     exit(1);
+  }
+
+  if(parp)  { free(parp);   parp = 0; }
+  if(adcpp) { free(adcpp); adcpp = 0; }
+  return 0;
+}
+
+int32		 CloseAdcView(ADC_VIEW_CNTL *adccntl);  
+int32		 PartitionCube(ADC_VIEW_CNTL *avp);				
+ADC_VIEW_CNTL *NewAdcViewCntl(ADC_VIEW_PARS *adcpp, uint32 pnum);
+int32		 ComputeGivenGroupbys(ADC_VIEW_CNTL *adccntl);
+
+int32 DC(ADC_VIEW_PARS *adcpp) {
+   int32 itsk=0;
+   double t_total=0.0;
+   int verified;
+
+   typedef struct { 
+      int    verificationFailed;
+      uint32 totalViewTuples;
+      uint64 totalViewSizesInBytes;
+      uint32 totalNumberOfMadeViews;
+      uint64 checksum;
+      double tm_max;
+   } PAR_VIEW_ST;
+   
+   PAR_VIEW_ST *pvstp;
+
+   pvstp = (PAR_VIEW_ST*) malloc(sizeof(PAR_VIEW_ST));
+   pvstp->verificationFailed = 0;
+   pvstp->totalViewTuples = 0;
+   pvstp->totalViewSizesInBytes = 0;
+   pvstp->totalNumberOfMadeViews = 0;
+   pvstp->checksum = 0;
+   pvstp->tm_max = 0.0;  /* malloc() does not zero this; it is compared below */
+#ifdef _OPENMP    
+   adcpp->nTasks=omp_get_max_threads();
+   fprintf(stdout,"\nNumber of available threads:  %d\n", adcpp->nTasks);
+   if (adcpp->nTasks > MAX_NUMBER_OF_TASKS) {
+      adcpp->nTasks = MAX_NUMBER_OF_TASKS;
+      fprintf(stdout,"Warning: Maximum number of tasks reached: %d\n",
+              adcpp->nTasks);
+   }
+
+#ifdef HOOKS
+   roi_begin_();
+#endif
+
+#pragma omp parallel shared(pvstp) private(itsk)
+#endif
+  {
+   double tm0=0;
+   int itimer=0;
+   ADC_VIEW_CNTL *adccntlp;
+#ifdef _OPENMP
+   itsk=omp_get_thread_num();
+#endif
+   adccntlp = NewAdcViewCntl(adcpp, itsk);
+
+   if (!adccntlp) { 
+      PutErrMsg("ParRun.NewAdcViewCntl: returned NULL")
+      exit(1);  /* adccntlp is NULL here and must not be dereferenced */
+   }else{
+     adccntlp->verificationFailed = 0;
+     if (adccntlp->retCode!=0) {
+   	fprintf(stderr, 
+   		 "DC.NewAdcViewCntl: return code = %d\n",
+   						adccntlp->retCode); 
+     }
+   }
+
+   if (!adccntlp->verificationFailed) {
+     if( PartitionCube(adccntlp) ) {
+        PutErrMsg("DC.PartitionCube failed");
+     }
+     timer_clear(itimer);
+     timer_start(itimer);
+     if( ComputeGivenGroupbys(adccntlp) ) {
+        PutErrMsg("DC.ComputeGivenGroupbys failed");
+     }
+     timer_stop(itimer);
+     tm0 = timer_read(itimer);
+   }
+#ifdef _OPENMP    
+#pragma omp critical
+#endif
+   {
+     if(pvstp->tm_max<tm0) pvstp->tm_max=tm0;
+     pvstp->verificationFailed += adccntlp->verificationFailed;
+     if (!adccntlp->verificationFailed) {
+       pvstp->totalNumberOfMadeViews += adccntlp->numberOfMadeViews;
+       pvstp->totalViewSizesInBytes += adccntlp->totalViewFileSize;
+       pvstp->totalViewTuples += adccntlp->totalOfViewRows;
+       pvstp->checksum += adccntlp->totchs[0];
+     }   
+   }
+   if(CloseAdcView(adccntlp)) {
+     PutErrMsg("ParRun.CloseAdcView failed");
+     adccntlp->verificationFailed = 1;
+   }
+ } /* omp parallel */
+
+#ifdef HOOKS
+   roi_end_();
+#endif
+
+   t_total=pvstp->tm_max; 
+ 
+   pvstp->verificationFailed=Verify(pvstp->checksum,adcpp);
+   verified = (pvstp->verificationFailed == -1)? -1 :
+              (pvstp->verificationFailed ==  0)?  1 : 0;
+
+   fprintf(stdout,"\n*** DC Benchmark Results:\n");
+   fprintf(stdout," Benchmark Time   = %20.3f\n", t_total);
+   fprintf(stdout," Input Tuples     =         %12d\n", (int) adcpp->nInputRecs);
+   fprintf(stdout," Number of Views  =         %12d\n",
+           (int) pvstp->totalNumberOfMadeViews);
+   fprintf(stdout," Number of Tasks  =         %12d\n", (int) adcpp->nTasks);
+   fprintf(stdout," Tuples Generated = %20.0f\n",
+           (double) pvstp->totalViewTuples);
+   fprintf(stdout," Tuples/s         = %20.2f\n", 
+           (double) pvstp->totalViewTuples / t_total);
+   fprintf(stdout," Checksum         = %20.12e\n", (double) pvstp->checksum);
+   if (pvstp->verificationFailed)
+      fprintf(stdout, " Verification failed\n");
+
+   c_print_results("DC",
+  		   adcpp->clss,
+  		   (int)adcpp->nInputRecs,
+                   0,
+                   0,
+                   1,
+  		   t_total,
+  		   (double) pvstp->totalViewTuples * 1.e-6 / t_total, 
+  		   "Tuples generated", 
+  		   verified,
+  		   NPBVERSION,
+  		   COMPILETIME,
+  		   CC,
+  		   CLINK,
+  		   C_LIB,
+  		   C_INC,
+  		   CFLAGS,
+  		   CLINKFLAGS); 
+   return ADC_OK;
+}
+
+long long checksumS=464620213;
+long long checksumWlo=434318;
+long long checksumWhi=1401796;
+long long checksumAlo=178042;
+long long checksumAhi=7141688;
+long long checksumBlo=700453;
+long long checksumBhi=9348365;
+
+int Verify(long long int checksum,ADC_VIEW_PARS *adcpp){
+  switch(adcpp->clss){
+    case 'S':
+      if(checksum==checksumS) return 0;
+      break;
+    case 'W':
+      if(checksum==checksumWlo+1000000*checksumWhi) return 0;
+      break;
+    case 'A':
+      if(checksum==checksumAlo+1000000*checksumAhi) return 0;
+      break;
+    case 'B':
+      if(checksum==checksumBlo+1000000*checksumBhi) return 0;
+      break;
+    default:
+      return -1; /* CLASS U */
+  }
+  return 1;
+}
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/DC/extbuild.c b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/DC/extbuild.c
new file mode 100644
index 0000000..3550537
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/DC/extbuild.c
@@ -0,0 +1,988 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include "adc.h"
+#include "macrodef.h"
+#include "protots.h"
+
+#ifdef UNIX
+#include <errno.h>
+#endif
+
+extern int32 computeChecksum(ADC_VIEW_CNTL *avp,treeNode *t,uint64 *ordern);
+extern int32 WriteViewToDiskCS(ADC_VIEW_CNTL *avp,treeNode *t,uint64 *ordern);
+
+int32 ReadWholeInputData(ADC_VIEW_CNTL *avp, FILE *inpf){
+  uint32 iRec = 0;
+  uint32 inpBufferLineSize, inpBufferPace, inpRecSize, ib = 0;
+
+  FSEEK(inpf, 0L, SEEK_SET);
+  inpRecSize = 8*avp->nm+4*avp->nTopDims;
+  inpBufferLineSize = inpRecSize;
+  if (inpBufferLineSize%8) inpBufferLineSize += 4;
+  inpBufferPace = inpBufferLineSize/4;
+
+  while(fread(&avp->inpDataBuffer[ib], inpRecSize, 1, inpf)){
+     iRec++;
+     ib += inpBufferPace;      
+  }
+  avp->nRowsToRead = iRec;
+  FSEEK(inpf, 0L, SEEK_SET);
+  
+  if(avp->nInputRecs != iRec){
+     fprintf(stderr, " ReadWholeInputData(): wrong input data reading.\n");
+     return ADC_INTERNAL_ERROR;
+  }  
+  return ADC_OK;
+}
+int32 ComputeMemoryFittedView (ADC_VIEW_CNTL *avp){
+  uint32 iRec = 0;
+  uint32 viewBuf[MAX_VIEW_ROW_SIZE_IN_INTS];
+  uint32 inpBufferLineSize, inpBufferPace, inpRecSize,ib;
+  uint64 ordern=0;
+#ifdef VIEW_FILE_OUTPUT
+  uint32 retCode;
+#endif
+
+  FSEEK(avp->viewFile, 0L, SEEK_END);
+  inpRecSize = 8*avp->nm+4*avp->nTopDims;
+  inpBufferLineSize = inpRecSize;
+  if (inpBufferLineSize%8) inpBufferLineSize += 4;
+  inpBufferPace = inpBufferLineSize/4;
+
+  InitializeTree(avp->tree, avp->nv, avp->nm);
+
+  ib=0;
+  for ( iRec = 1; iRec <= avp->nRowsToRead; iRec++ ){ 
+      SelectToView( &avp->inpDataBuffer[ib], avp->selection, viewBuf, 
+  		             avp->nd, avp->nm, avp->nv );
+      ib += inpBufferPace;
+      TreeInsert(avp->tree, viewBuf);
+      if(avp->tree->memoryIsFull){
+  	fprintf(stderr, "ComputeMemoryFittedView(): Not enough memory.\n");
+  	return 1; 
+      }
+  }
+
+#ifdef VIEW_FILE_OUTPUT
+  if( retCode = WriteViewToDiskCS(avp, avp->tree->root.left,&ordern) ){ 
+    fprintf(stderr, "ComputeMemoryFittedView(): write error occurred.\n");
+    return retCode;
+  }
+#else
+  computeChecksum(avp,avp->tree->root.left,&ordern);
+#endif
+ 
+  avp->nViewRows = avp->tree->count;
+  avp->totalOfViewRows += avp->nViewRows; 			      
+  InitializeTree(avp->tree, avp->nv, avp->nm);
+  return ADC_OK;
+}
+
+int32 SharedSortAggregate(ADC_VIEW_CNTL *avp){
+   int32 retCode;
+  uint32 iRec = 0;
+  uint32   attrs[MAX_VIEW_ROW_SIZE_IN_INTS]; 
+  uint32 currBuf[MAX_VIEW_ROW_SIZE_IN_INTS]; 
+   int64 chunkOffset = 0;
+   int64 inpfOffset;
+  uint32 nPart = 0;
+  uint32 prevV;
+  uint32 currV;
+  uint32 total = 0;
+  unsigned char *ib;
+  uint32 ibsize = SSA_BUFFER_SIZE;
+  uint32 nib;
+  uint32 iib;
+  uint32 nreg;
+  uint32 nlst;
+  uint32 nsgs;
+  uint32 ncur;
+  uint32 ibOffset = 0;
+  uint64 ordern=0;
+   
+  ib = (unsigned char*) malloc(ibsize); 
+  if (!ib){ 
+    fprintf(stderr,"SharedSortAggregate: memory allocation failed\n"); 
+    return ADC_MEMORY_ALLOCATION_FAILURE; 
+  }
+  
+  nib = ibsize/avp->inpRecSize;
+  nsgs = avp->nRowsToRead/nib;
+  
+  if (nsgs == 0){
+      nreg = avp->nRowsToRead; 
+      nlst = nreg; 
+      nsgs = 1; 
+  }else{
+     nreg = nib;
+     if (avp->nRowsToRead%nib) {
+       nsgs++; 
+       nlst = avp->nRowsToRead%nib;
+     }else{
+       nlst = nreg;			   
+     }
+  }
+  
+  avp->nViewRows = 0; 
+  for( iib = 1; iib <= nsgs; iib++ ){ 
+    if(iib > 1) FSEEK(avp->viewFile, inpfOffset, SEEK_SET);
+    if( iib == nsgs ) ncur = nlst; else ncur = nreg;
+    	  
+    fread(ib, ncur*avp->inpRecSize, 1, avp->viewFile);
+    inpfOffset = ftell(avp->viewFile);
+
+    for( ibOffset = 0, iRec = 1; iRec <= ncur; iRec++ ){
+      memcpy(attrs, &ib[ibOffset], avp->inpRecSize);
+      ibOffset += avp->inpRecSize;
+      SelectToView(attrs, avp->selection, currBuf, avp->nd, avp->nm, avp->nv); 
+      currV = currBuf[2*avp->nm];
+
+      if(iib == 1 && iRec == 1){ 
+        prevV = currV; 
+        nPart = 1;
+        InitializeTree(avp->tree, avp->nv, avp->nm);
+        TreeInsert(avp->tree, currBuf);
+      }else{
+         if (currV == prevV){
+            nPart++;
+	    TreeInsert (avp->tree, currBuf);
+            if (avp->tree->memoryIsFull){
+	      avp->chunksParams[avp->numberOfChunks].curChunkNum =
+	                                             avp->tree->count;
+	      avp->chunksParams[avp->numberOfChunks].chunkOffset = chunkOffset;
+              (avp->numberOfChunks)++;
+	      if(avp->numberOfChunks >= MAX_NUM_OF_CHUNKS){
+                fprintf(stderr,"Too many chunks were created.\n"); 
+		exit(1);
+              }
+              chunkOffset += (uint64)(avp->tree->count*avp->outRecSize);
+              retCode=WriteChunkToDisk(avp->outRecSize, avp->fileOfChunks,
+	                               avp->tree->root.left, avp->logf);                                       
+              if(retCode!=ADC_OK){
+		fprintf(stderr,"SharedSortAggregate: Write error occurred.\n"); 
+		return retCode;
+	      }
+              InitializeTree(avp->tree, avp->nv, avp->nm);
+	    } /* memoryIsFull */
+         }else{
+	   if(avp->numberOfChunks && avp->tree->count!=0){ 
+	     avp->chunksParams[avp->numberOfChunks].curChunkNum =
+	        				     avp->tree->count;
+	     avp->chunksParams[avp->numberOfChunks].chunkOffset = chunkOffset;
+             (avp->numberOfChunks)++;
+             chunkOffset += 
+	    	      (uint64)(avp->tree->count*(4*avp->nv + 8*avp->nm));
+	     retCode=WriteChunkToDisk( avp->outRecSize, avp->fileOfChunks,
+	   				 avp->tree->root.left, avp->logf);
+             if(retCode!=ADC_OK){
+	       fprintf(stderr,"SharedSortAggregate: Write error occurred.\n");
+	       return retCode;    
+	      }
+	    }
+            FSEEK(avp->viewFile, 0L, SEEK_END);
+            if(!avp->numberOfChunks){
+               avp->nViewRows += avp->tree->count;
+	       retCode = WriteViewToDiskCS(avp, avp->tree->root.left,&ordern);
+	       if(retCode!=ADC_OK){ 
+	          fprintf(stderr, 
+	        	 "SharedSortAggregate: Write error occurred.\n");
+	          return retCode;
+	       }
+ 	     }else{
+	       retCode=MultiWayMerge(avp);
+	       if(retCode!=ADC_OK) {
+	         fprintf(stderr,"SharedSortAggregate.MultiWayMerge: failed.\n");
+	         return retCode;
+	       } 
+	     }
+             InitializeTree(avp->tree, avp->nv, avp->nm);
+             TreeInsert(avp->tree, currBuf);
+             total += nPart;
+             nPart = 1;
+          }
+       }
+       prevV = currV;
+    } /* iRec */
+  } /* iib */
+
+  if(avp->numberOfChunks && avp->tree->count!=0) { 
+    avp->chunksParams[avp->numberOfChunks].curChunkNum = avp->tree->count;
+    avp->chunksParams[avp->numberOfChunks].chunkOffset = chunkOffset;
+    (avp->numberOfChunks)++;
+    chunkOffset += (uint64)(avp->tree->count*(4*avp->nv + 8*avp->nm));
+    retCode=WriteChunkToDisk(avp->outRecSize, avp->fileOfChunks,
+    			     avp->tree->root.left, avp->logf);
+    if(retCode!=ADC_OK){
+      fprintf(stderr,"SharedSortAggregate: Write error occurred.\n");
+      return retCode;	 
+    }
+  }
+  FSEEK(avp->viewFile, 0L, SEEK_END);
+  if(!avp->numberOfChunks){
+    avp->nViewRows += avp->tree->count;
+    if( retCode = WriteViewToDiskCS(avp, avp->tree->root.left,&ordern)){ 
+      fprintf(stderr, "SharedSortAggregate: Write error occurred.\n");
+      return retCode;
+    }	 
+  }else{
+     retCode=MultiWayMerge(avp);
+     if(retCode!=ADC_OK) {
+       fprintf(stderr,"SharedSortAggregate.MultiWayMerge failed.\n");
+       return retCode;
+     } 
+  }
+  FSEEK(avp->fileOfChunks, 0L, SEEK_SET);
+  
+  total += nPart;
+  avp->totalOfViewRows += avp->nViewRows;
+  if(ib) free(ib);
+  return  ADC_OK;
+}
+int32 PrefixedAggregate(ADC_VIEW_CNTL *avp, FILE *iof){
+   uint32 i;
+   uint32 iRec = 0;
+   uint32   attrs[MAX_VIEW_ROW_SIZE_IN_INTS]; 
+   uint32 aggrBuf[MAX_VIEW_ROW_SIZE_IN_INTS];
+   uint32 currBuf[MAX_VIEW_ROW_SIZE_IN_INTS]; 
+   uint32 prevBuf[MAX_VIEW_ROW_SIZE_IN_INTS];
+    int64 *aggrmp;
+    int64 *currmp;
+    int32 compRes;
+   uint32 nOut = 0; 
+   uint32 mpOffset = 0;
+   uint32 nOutBufRecs;
+   uint32 nViewRows = 0;
+    int64 inpfOffset;
+
+    aggrmp = (int64*) &aggrBuf[0];
+    currmp = (int64*) &currBuf[0];
+    
+    for(i = 0; i < 2*avp->nm+avp->nv; i++){prevBuf[i] = 0; aggrBuf[i] = 0;}
+    nOutBufRecs = avp->memoryLimit/avp->outRecSize;
+
+    for(iRec = 1; iRec <= avp->nRowsToRead; iRec++ ){ 
+      fread(attrs, avp->inpRecSize, 1, iof);
+      SelectToView(attrs, avp->selection, currBuf, avp->nd, avp->nm, avp->nv);
+      if (iRec == 1) memcpy(aggrBuf, currBuf, avp->outRecSize);
+      else{
+       compRes = KeyComp( &currBuf[2*avp->nm], &prevBuf[2*avp->nm], avp->nv);
+
+       switch(compRes){
+	  case  1: 
+	    memcpy(&avp->memPool[mpOffset], aggrBuf, avp->outRecSize);
+	    mpOffset += avp->outRecSize;
+	    nOut++;
+	    for ( i = 0; i < avp->nm; i++ ){
+	      avp->mSums[i] += aggrmp[i];
+	      avp->checksums[i] += nOut*aggrmp[i]%measbound;
+	    }    
+	    memcpy(aggrBuf, currBuf, avp->outRecSize);
+	    break;
+	  case  0: 
+	    for ( i = 0; i < avp->nm; i++ ) aggrmp[i] += currmp[i];
+	    break;
+	  case -1: 
+	    fprintf(stderr,"PrefixedAggregate: wrong parent view order.\n"); 
+	    exit(1);
+	    break; 
+	  default: 
+	    fprintf(stderr,"PrefixedAggregate: wrong KeyComp() result.\n"); 
+	    exit(1);
+	    break;
+       }     
+    
+       if (nOut == nOutBufRecs){
+	     inpfOffset = ftell(iof);
+	     FSEEK(iof, 0L, SEEK_END);
+	     WriteToFile(avp->memPool, nOut*avp->outRecSize, 1, iof, stderr);
+	     FSEEK(iof, inpfOffset, SEEK_SET);
+	     mpOffset = 0;
+	     nViewRows += nOut;
+	     nOut = 0; 
+       }
+     }
+     memcpy(prevBuf, currBuf, avp->outRecSize);
+   }
+   memcpy(&avp->memPool[mpOffset], aggrBuf, avp->outRecSize);
+   nOut++;
+   for ( i = 0; i < avp->nm; i++ ){
+     avp->mSums[i] += aggrmp[i];
+     avp->checksums[i] += nOut*aggrmp[i]%measbound;
+   }
+   FSEEK(iof, 0L, SEEK_END);
+   WriteToFile(avp->memPool, nOut*avp->outRecSize, 1, iof, stderr);
+   avp->nViewRows	 = nViewRows+nOut;
+   avp->totalOfViewRows += avp->nViewRows;
+   return ADC_OK;
+}
+int32 RunFormation (ADC_VIEW_CNTL *avp, FILE *inpf){
+   uint32 iRec = 0;
+   uint32 viewBuf[MAX_VIEW_ROW_SIZE_IN_INTS];
+   uint32   attrs[MAX_VIEW_ROW_SIZE_IN_INTS]; 
+    int64 chunkOffset = 0;
+
+   InitializeTree(avp->tree, avp->nv, avp->nm);
+
+   for(iRec = 1; iRec <= avp->nRowsToRead; iRec++ ){ 
+     fread(attrs, avp->inpRecSize, 1, inpf);
+     SelectToView(attrs, avp->selection, viewBuf, avp->nd, avp->nm, avp->nv); 
+     TreeInsert(avp->tree, viewBuf);
+
+     if(avp->tree->memoryIsFull) {
+        avp->chunksParams[avp->numberOfChunks].curChunkNum = avp->tree->count;
+	    avp->chunksParams[avp->numberOfChunks].chunkOffset  = chunkOffset;		 
+        (avp->numberOfChunks)++;
+	    if (avp->numberOfChunks >= MAX_NUM_OF_CHUNKS) {
+          fprintf(stderr, "RunFormation: Too many chunks were created.\n"); 
+          return ADC_INTERNAL_ERROR;
+        }
+        chunkOffset += (uint64)(avp->tree->count*avp->outRecSize);
+        if(WriteChunkToDisk( avp->outRecSize, avp->fileOfChunks,
+	                         avp->tree->root.left, avp->logf )){
+	       fprintf(stderr, 
+	         "RunFormation.WriteChunkToDisk: write error occurred.\n");
+	       return ADC_WRITE_FAILED;
+	    }
+        InitializeTree(avp->tree, avp->nv, avp->nm);
+       }
+   } /* Insertion ... */
+   if(avp->numberOfChunks && avp->tree->count!=0) { 
+     avp->chunksParams[avp->numberOfChunks].curChunkNum = avp->tree->count;
+     avp->chunksParams[avp->numberOfChunks].chunkOffset  = chunkOffset;
+     (avp->numberOfChunks)++;
+     chunkOffset += (uint64)(avp->tree->count*(4*avp->nv + 8*avp->nm));
+     if(WriteChunkToDisk(avp->outRecSize, avp->fileOfChunks,
+                         avp->tree->root.left, avp->logf)){
+       fprintf(stderr, 
+            "RunFormation.WriteChunkToDisk: write error occurred.\n");
+       return ADC_WRITE_FAILED;  
+     }
+   }
+   FSEEK(avp->viewFile, 0L, SEEK_END);
+   return ADC_OK;
+}
+void SeekAndReadNextSubChunk( uint32 multiChunkBuffer[], 
+                              uint32 k,
+                              FILE *inFile,
+		              uint32 chunkRecSize, 
+		              uint64 inFileOffs,
+		              uint32 subChunkNum){
+   int64 ret;
+  
+   ret = FSEEK(inFile, inFileOffs, SEEK_SET);
+   if (ret < 0){
+      fprintf(stderr,"SeekAndReadNextSubChunk.fseek() < 0\n"); 
+      exit(1); 
+   }
+   fread(&multiChunkBuffer[k], chunkRecSize*subChunkNum, 1, inFile);
+}
+void ReadSubChunk(
+            uint32 chunkRecSize,
+            uint32 *multiChunkBuffer,
+            uint32 mwBufRecSizeInInt,
+            uint32 iChunk,
+            uint32 regSubChunkSize,
+            CHUNKS *chunks,  
+              FILE *fileOfChunks
+            ){
+   if (chunks[iChunk].curChunkNum > 0){
+      if(chunks[iChunk].curChunkNum < regSubChunkSize){
+	SeekAndReadNextSubChunk(multiChunkBuffer,
+	   			(iChunk*regSubChunkSize +
+	   			(regSubChunkSize-chunks[iChunk].curChunkNum))*
+	   			mwBufRecSizeInInt,
+	   			fileOfChunks,
+	   			chunkRecSize,
+	   			chunks[iChunk].chunkOffset,
+	   			chunks[iChunk].curChunkNum);
+	chunks[iChunk].posSubChunk=regSubChunkSize-chunks[iChunk].curChunkNum;
+	chunks[iChunk].curSubChunk=chunks[iChunk].curChunkNum;
+	chunks[iChunk].curChunkNum=0;
+	chunks[iChunk].chunkOffset=-1;
+      }else{
+	SeekAndReadNextSubChunk(multiChunkBuffer,
+	   			iChunk*regSubChunkSize*mwBufRecSizeInInt,
+	   			fileOfChunks,
+	   			chunkRecSize,
+	   			chunks[iChunk].chunkOffset,
+	   			regSubChunkSize);
+	chunks[iChunk].posSubChunk = 0;
+	chunks[iChunk].curSubChunk = regSubChunkSize;
+	chunks[iChunk].curChunkNum -= regSubChunkSize;
+	chunks[iChunk].chunkOffset += regSubChunkSize * chunkRecSize;
+      }
+   }
+}
+int32 MultiWayMerge(ADC_VIEW_CNTL *avp){
+   uint32 outputBuffer[OUTPUT_BUFFER_SIZE];
+   uint32 r_buf       [OUTPUT_BUFFER_SIZE];
+   uint32 min_r_buf   [OUTPUT_BUFFER_SIZE];
+   uint32 first_one;
+   uint32 i;
+   uint32 iChunk;
+   uint32 min_r_chunk;
+   uint32 sPos;
+   uint32 iPos;
+   uint32 numEmptyBufs;
+   uint32 numEmptyRuns;
+   uint32 mwBufRecSizeInInt;
+   uint32 chunkRecSize;
+   uint32 *multiChunkBuffer;
+   uint32   regSubChunkSize;
+    int32 compRes;
+    int64 *m_min_r_buf;
+    int64 *m_outputBuffer;
+
+   FSEEK(avp->fileOfChunks, 0L, SEEK_SET);
+
+   multiChunkBuffer = (uint32*) &avp->memPool[0];
+   first_one = 1;
+   avp->nViewRows  = 0; 
+
+   chunkRecSize = avp->outRecSize;
+   mwBufRecSizeInInt = chunkRecSize/4;
+   m_min_r_buf = (int64*)&min_r_buf[0];
+   m_outputBuffer = (int64*)&outputBuffer[0];
+
+   mwBufRecSizeInInt = chunkRecSize/4;
+   regSubChunkSize = (avp->memoryLimit/avp->numberOfChunks)/chunkRecSize;
+	 
+   if (regSubChunkSize==0) {
+     fprintf(stderr,
+             "MultiWayMerge: Not enough memory to run the external sort\n");
+     return ADC_INTERNAL_ERROR;
+   }
+   multiChunkBuffer = (uint32*) &avp->memPool[0];
+
+   for(i = 0; i < avp->numberOfChunks; i++ ){
+      ReadSubChunk( 
+                   chunkRecSize,
+                   multiChunkBuffer,
+                   mwBufRecSizeInInt,
+                   i,
+                   regSubChunkSize,
+                   avp->chunksParams,  
+                   avp->fileOfChunks
+      );
+   }
+   while(1){
+     for(iChunk = 0;iChunk<avp->numberOfChunks;iChunk++){
+       if (avp->chunksParams[iChunk].curSubChunk > 0){
+     	sPos = iChunk*regSubChunkSize*mwBufRecSizeInInt;
+    	iPos = sPos+mwBufRecSizeInInt*avp->chunksParams[iChunk].posSubChunk;
+     	memcpy(&min_r_buf[0], &multiChunkBuffer[iPos], avp->outRecSize);
+	    min_r_chunk = iChunk;
+     	break;
+       }
+     }
+     for ( iChunk = min_r_chunk; iChunk < avp->numberOfChunks; iChunk++ ){
+       uint32 iPos;
+
+       if (avp->chunksParams[iChunk].curSubChunk > 0){
+          iPos = mwBufRecSizeInInt*(iChunk*regSubChunkSize+
+                                   avp->chunksParams[iChunk].posSubChunk);
+          memcpy(&r_buf[0],&multiChunkBuffer[iPos],avp->outRecSize);
+
+          compRes=KeyComp(&r_buf[2*avp->nm],&min_r_buf[2*avp->nm],avp->nv);	
+          if(compRes < 0) {
+     	      memcpy(&min_r_buf[0], &r_buf[0], avp->outRecSize);
+	          min_r_chunk = iChunk;
+          }
+       }
+     }
+     /* Step forward */
+     if(avp->chunksParams[min_r_chunk].curSubChunk != 0){
+       avp->chunksParams[min_r_chunk].curSubChunk--;
+       avp->chunksParams[min_r_chunk].posSubChunk++;
+     }
+
+       /* Aggregation if a duplicate is encountered */
+       if(first_one){
+         memcpy( &outputBuffer[0], &min_r_buf[0], avp->outRecSize);
+         first_one = 0;
+       }else{
+         compRes = KeyComp( &outputBuffer[2*avp->nm], 
+        		    &min_r_buf[2*avp->nm], avp->nv );
+         if(!compRes){
+           for(i = 0; i < avp->nm; i++ ){ 
+             m_outputBuffer[i] += m_min_r_buf[i]; 
+           }
+         }else{
+           WriteToFile(outputBuffer,avp->outRecSize,1,avp->viewFile,stderr);
+           avp->nViewRows++;
+           for(i=0;i<avp->nm;i++){
+	     avp->mSums[i]+=m_outputBuffer[i];
+	     avp->checksums[i] += avp->nViewRows*m_outputBuffer[i]%measbound;
+	   }
+           memcpy( &outputBuffer[0], &min_r_buf[0], avp->outRecSize );
+        }
+      }
+
+      for(numEmptyBufs = 0, 
+          numEmptyRuns = 0, i = 0; i < avp->numberOfChunks; i++ ){
+	     if (avp->chunksParams[i].curSubChunk == 0) numEmptyBufs++;
+         if (avp->chunksParams[i].curChunkNum == 0) numEmptyRuns++;
+      }
+      if(   numEmptyBufs == avp->numberOfChunks 
+          &&numEmptyRuns == avp->numberOfChunks) break;
+
+      if(avp->chunksParams[min_r_chunk].curSubChunk == 0) {
+        ReadSubChunk( 
+        	 chunkRecSize,
+        	 multiChunkBuffer,
+        	 mwBufRecSizeInInt,
+        	 min_r_chunk,
+        	 regSubChunkSize,
+        	 avp->chunksParams,
+        	 avp->fileOfChunks);
+      }
+   } /* while(1) */
+
+   WriteToFile( outputBuffer, avp->outRecSize, 1, avp->viewFile, stderr);	  
+   avp->nViewRows++;
+   for(i = 0; i < avp->nm; i++ ){ 
+     avp->mSums[i] += m_outputBuffer[i]; 
+     avp->checksums[i] += avp->nViewRows*m_outputBuffer[i]%measbound;
+   }
+
+   avp->totalOfViewRows += avp->nViewRows;
+   return ADC_OK;
+}
+void SelectToView( uint32 * ib, uint32 *ix, uint32 *viewBuf, 
+                   uint32 nd, uint32 nm, uint32 nv ){
+   uint32 i, j;
+   for ( j = 0, i = 0; i < nv; i++ ) viewBuf[2*nm+j++] = ib[2*nm+ix[i]-1];
+   memcpy(&viewBuf[0], &ib[0], MSR_FSZ*nm);
+}
+FILE * AdcFileOpen(const char *fileName, const char *mode){
+   FILE *fr;
+   if ((fr = (FILE*) fopen(fileName, mode))==NULL)
+      fprintf(stderr, "AdcFileOpen: Cannot open the file %s errno = %d\n",  
+                       fileName, errno);
+   return fr;
+}
+void AdcFileName(char *adcFileName, const char *adcName, 
+		 const char *fileName, uint32 taskNumber){
+  sprintf(adcFileName, "%s.%s.%d",adcName,fileName,taskNumber);
+}
+ADC_VIEW_CNTL * NewAdcViewCntl(ADC_VIEW_PARS *adcpp, uint32 pnum){
+   ADC_VIEW_CNTL *adccntl;
+   uint32 i, j, k;
+#ifdef IN_CORE
+   uint32 ux;
+#endif
+   char id[8+1];
+   
+   adccntl = (ADC_VIEW_CNTL *) malloc(sizeof(ADC_VIEW_CNTL));
+   if (adccntl==NULL) return NULL;
+   
+   adccntl->ndid = adcpp->ndid;
+   adccntl->taskNumber = pnum;
+   adccntl->retCode = 0;
+   adccntl->swapIt = 0;
+   strcpy(adccntl->adcName, adcpp->adcName);
+   adccntl->nTopDims = adcpp->nd;
+   adccntl->nd = adcpp->nd;
+   adccntl->nm = adcpp->nm;
+   adccntl->nInputRecs = adcpp->nInputRecs;
+   adccntl->inpRecSize = GetRecSize(adccntl->nd,adccntl->nm);
+   adccntl->outRecSize = GetRecSize(adccntl->nd,adccntl->nm); /* nv is not set yet; it defaults to nd */
+   adccntl->accViewFileOffset = 0;
+   adccntl->totalViewFileSize = 0;
+   adccntl->numberOfMadeViews = 0;
+   adccntl->numberOfViewsMadeFromInput = 0;
+   adccntl->numberOfPrefixedGroupbys = 0;
+   adccntl->numberOfSharedSortGroupbys = 0;
+   adccntl->totalOfViewRows = 0;
+   adccntl->memoryLimit = adcpp->memoryLimit;
+   adccntl->nTasks = adcpp->nTasks;
+   strcpy(adccntl->inpFileName, adcpp->adcInpFileName);
+   sprintf(id, ".%d", adcpp->ndid);
+   
+   AdcFileName(adccntl->adcLogFileName, 
+               adccntl->adcName, "logf", adccntl->taskNumber);
+   strcat(adccntl->adcLogFileName, id);            
+   adccntl->logf = AdcFileOpen(adccntl->adcLogFileName, "w");
+
+   AdcFileName(adccntl->inpFileName, adccntl->adcName, "dat", adcpp->ndid);
+   adccntl->inpf = AdcFileOpen(adccntl->inpFileName, "rb");
+   if(!adccntl->inpf){ 
+     adccntl->retCode = ADC_FILE_OPEN_FAILURE; 
+     return(adccntl);
+   } 
+
+   AdcFileName(adccntl->viewFileName, adccntl->adcName, 
+               "view.dat", adccntl->taskNumber);
+   strcat(adccntl->viewFileName, id);            
+   adccntl->viewFile = AdcFileOpen(adccntl->viewFileName, "wb+");
+
+   AdcFileName(adccntl->chunksFileName, adccntl->adcName, 
+               "chunks.dat", adccntl->taskNumber);
+   strcat(adccntl->chunksFileName, id);            
+   adccntl->fileOfChunks = AdcFileOpen(adccntl->chunksFileName,"wb+");
+
+   AdcFileName(adccntl->groupbyFileName, adccntl->adcName, 
+               "groupby.dat", adccntl->taskNumber);
+   strcat(adccntl->groupbyFileName, id);
+   adccntl->groupbyFile = AdcFileOpen(adccntl->groupbyFileName,"wb+");
+
+   AdcFileName(adccntl->adcViewSizesFileName, adccntl->adcName, 
+               "view.sz", adcpp->ndid);
+   adccntl->adcViewSizesFile = AdcFileOpen(adccntl->adcViewSizesFileName,"r");
+   if(!adccntl->adcViewSizesFile){
+     adccntl->retCode = ADC_FILE_OPEN_FAILURE;
+     return(adccntl);
+   }
+
+   AdcFileName(adccntl->viewSizesFileName, adccntl->adcName, 
+               "viewsz.dat", adccntl->taskNumber);
+   strcat(adccntl->viewSizesFileName, id);            
+   adccntl->viewSizesFile = AdcFileOpen(adccntl->viewSizesFileName, "wb+");
+   
+   adccntl->chunksParams = (CHUNKS*) malloc(MAX_NUM_OF_CHUNKS*sizeof(CHUNKS));
+   if(adccntl->chunksParams==NULL){ 
+     fprintf(adccntl->logf,"NewAdcViewCntl: Cannot allocate 'chunksParams'\n");
+     adccntl->retCode = ADC_MEMORY_ALLOCATION_FAILURE;
+     return(adccntl);
+   }
+   adccntl->memPool = (unsigned char*) malloc(adccntl->memoryLimit);
+   if(adccntl->memPool == NULL ){
+      fprintf(adccntl->logf, 
+              "NewAdcViewCntl: Cannot allocate 'main memory pool'\n"); 
+      adccntl->retCode = ADC_MEMORY_ALLOCATION_FAILURE;
+      return(adccntl);
+   }
+   
+#ifdef IN_CORE   
+   /* add a condition to allocate this memory buffer, THIS is IMPORTANT */
+   ux = 4*adccntl->nTopDims + 8*adccntl->nm;
+   if (adccntl->nTopDims%8) ux += 4;
+   adccntl->inpDataBuffer = (uint32*) malloc(adccntl->nInputRecs*ux);
+   if(adccntl->inpDataBuffer == NULL ){
+      fprintf(adccntl->logf,
+              "NewAdcViewCntl: Cannot allocate 'input data buffer'\n"); 
+      adccntl->retCode = ADC_MEMORY_ALLOCATION_FAILURE;
+      return(adccntl);
+   }
+#endif
+   adccntl->numberOfChunks = 0;
+
+   for ( i = 0; i < adccntl->nm; i++ ){
+     adccntl->mSums[i] = 0;
+     adccntl->checksums[i] = 0;
+     adccntl->totchs[i] = 0;
+  }
+   adccntl->tree = CreateEmptyTree(adccntl->nd, adccntl->nm, 
+                                   adccntl->memoryLimit, adccntl->memPool);
+   if(!adccntl->tree){
+      fprintf(adccntl->logf,"\nNewAdcViewCntl.CreateEmptyTree failed.\n");
+      adccntl->retCode = ADC_MEMORY_ALLOCATION_FAILURE;
+      return(adccntl);
+   }
+
+   adccntl->nv = adcpp->nd; /* default */
+   for ( i = 0; i < adccntl->nv; i++ ) adccntl->selection[i]=i+1;
+   
+   adccntl->nViewLimit = (1<<adcpp->nd)-1;
+   adccntl->jpp=(JOB_POOL *) malloc((adccntl->nViewLimit+1)*sizeof(JOB_POOL));
+   if ( adccntl->jpp == NULL){
+      fprintf(adccntl->logf,
+        "\n Not enough space to allocate %ld bytes for a job pool.",
+        (long)((adccntl->nViewLimit+1)*sizeof(JOB_POOL)));
+      adccntl->retCode = ADC_MEMORY_ALLOCATION_FAILURE; 
+      return(adccntl);
+   }
+   adccntl->lpp = (LAYER * ) malloc( (adcpp->nd+1)*sizeof(LAYER));
+   if ( adccntl->lpp == NULL){
+      fprintf(adccntl->logf,
+        "\n Not enough space to allocate %ld bytes for a layer reference array.",
+        (long)((adcpp->nd+1)*sizeof(LAYER)));
+      adccntl->retCode = ADC_MEMORY_ALLOCATION_FAILURE;
+      return(adccntl);
+   }
+
+   for ( j = 1, i = 1; i <= adcpp->nd; i++ ) {
+      k =  NumOfCombsFromNbyK ( adcpp->nd, i );
+      adccntl->lpp[i].layerIndex = j;
+      j += k;
+      adccntl->lpp[i].layerQuantityLimit = k;
+      adccntl->lpp[i].layerCurrentPopulation = 0;
+   }    
+      
+   JobPoolInit ( adccntl->jpp, (adccntl->nViewLimit+1), adcpp->nd );
+
+   fprintf(adccntl->logf,"\nMeaning of the log file columns is as follows:\n");
+   fprintf(adccntl->logf,
+     "Row Number | Groupby | View Size | Measure Sums | Number of Chunks\n");
+
+   adccntl->verificationFailed = 1;
+   return adccntl;
+}
+void InitAdcViewCntl(ADC_VIEW_CNTL *adccntl, 
+		     uint32 nSelectedDims, 
+		     uint32 *selection, 
+		     uint32 fromParent ){
+   uint32 i;
+   
+   adccntl->nv = nSelectedDims;
+   
+   for (i = 0; i < adccntl->nm; i++ ) adccntl->mSums[i] = 0;
+   for (i = 0; i < adccntl->nv; i++ ) adccntl->selection[i] = selection[i];
+
+   adccntl->outRecSize = GetRecSize(adccntl->nv,adccntl->nm);
+   adccntl->numberOfChunks = 0;
+   adccntl->fromParent = fromParent;
+   adccntl->nViewRows = 0;
+
+   if(fromParent){
+     adccntl->nd = adccntl->smallestParentLevel;
+     FSEEK(adccntl->viewFile, adccntl->viewOffset, SEEK_SET);
+     adccntl->nRowsToRead = adccntl->nParentViewRows;
+   }else{
+     adccntl->nd = adccntl->nTopDims;
+     adccntl->nRowsToRead = adccntl->nInputRecs;
+   }
+   adccntl->inpRecSize = GetRecSize(adccntl->nd,adccntl->nm);
+   adccntl->outRecSize = GetRecSize(adccntl->nv,adccntl->nm);
+}
+int32 CloseAdcView(ADC_VIEW_CNTL *adccntl){
+   if (adccntl->inpf) fclose(adccntl->inpf);
+   if (adccntl->viewFile) fclose(adccntl->viewFile);
+   if (adccntl->fileOfChunks) fclose(adccntl->fileOfChunks);
+   if (adccntl->groupbyFile) fclose(adccntl->groupbyFile);
+   if (adccntl->adcViewSizesFile) fclose(adccntl->adcViewSizesFile);
+   if (adccntl->viewSizesFile) fclose(adccntl->viewSizesFile);
+   
+   if (DeleteOneFile(adccntl->chunksFileName))       
+      return ADC_FILE_DELETE_FAILURE;
+   if (DeleteOneFile(adccntl->viewSizesFileName))    
+      return ADC_FILE_DELETE_FAILURE;
+
+   if (DeleteOneFile(adccntl->groupbyFileName))      
+      return ADC_FILE_DELETE_FAILURE;
+
+   if (adccntl->chunksParams){ 
+     free(adccntl->chunksParams); 
+     adccntl->chunksParams=NULL; 
+   }  
+   if (adccntl->memPool){ free(adccntl->memPool); adccntl->memPool=NULL;} 
+   if (adccntl->jpp){ free(adccntl->jpp); adccntl->jpp=NULL; } 
+   if (adccntl->lpp){ free(adccntl->lpp); adccntl->lpp=NULL; } 
+
+   if (adccntl->logf) fclose(adccntl->logf);
+   free(adccntl);
+   return ADC_OK;
+}
+void AdcCntlLog(ADC_VIEW_CNTL *adccntlp){
+  fprintf(adccntlp->logf,"    memoryLimit = %20d\n",
+    adccntlp->memoryLimit);
+  fprintf(adccntlp->logf,"    treeNodeSize = %20d\n",
+    adccntlp->tree->treeNodeSize);
+  fprintf(adccntlp->logf," treeMemoryLimit = %20d\n",
+    adccntlp->tree->memoryLimit);
+  fprintf(adccntlp->logf,"    nNodesLimit = %20d\n",
+    adccntlp->tree->nNodesLimit);
+  fprintf(adccntlp->logf,"freeNodeCounter = %20d\n",
+    adccntlp->tree->freeNodeCounter);
+  fprintf(adccntlp->logf,"	nViewRows = %20d\n",
+    adccntlp->nViewRows);
+}
+int32 ViewSizesVerification(ADC_VIEW_CNTL *adccntlp){
+     char inps[MAX_PARAM_LINE_SIZE];
+     char msg[64];
+     uint32 *viewCounts;
+     uint32 selection_viewSize[2];
+     uint32 sz;
+     uint32 sel[64];
+     uint32 i;
+     uint32 k;
+     uint64 tx;
+     uint32 iTx; 
+   
+     viewCounts = (uint32 *) &adccntlp->memPool[0];
+     for ( i = 0; i <= adccntlp->nViewLimit; i++) viewCounts[i] = 0;
+     
+     FSEEK(adccntlp->viewSizesFile, 0L, SEEK_SET);
+     FSEEK(adccntlp->adcViewSizesFile, 0L, SEEK_SET);     
+
+     while(fread(selection_viewSize, 8, 1, adccntlp->viewSizesFile)){
+        viewCounts[selection_viewSize[0]] = selection_viewSize[1];
+     }
+     k = 0;
+     while ( fscanf(adccntlp->adcViewSizesFile, "%s", inps) != EOF ){
+        if ( strcmp(inps, "Selection:") == 0 ) {
+           while ( fscanf(adccntlp->adcViewSizesFile, "%s", inps) == 1 ) {
+             if ( strcmp(inps, "View") == 0 ) break; 
+             sel[k++] = atoi(inps);	  
+           }
+        }
+        
+        if ( strcmp(inps, "Size:") == 0 ) {
+           fscanf(adccntlp->adcViewSizesFile, "%s", inps);
+           sz = atoi(inps);
+           CreateBinTuple(&tx, sel, k);
+           iTx = (int32)(tx>>(64-adccntlp->nTopDims)); 
+           adccntlp->verificationFailed = 0;
+           if (!adccntlp->numberOfMadeViews) adccntlp->verificationFailed = 1;
+
+           if ( viewCounts[iTx] != 0){
+              if (viewCounts[iTx] != sz) {
+                 if (viewCounts[iTx] != adccntlp->nInputRecs){
+                   fprintf(adccntlp->logf, 
+                           "A view size is wrong: genSz=%d calcSz=%d\n",
+                   	                               sz, viewCounts[iTx]);
+                   adccntlp->verificationFailed = 1;
+                   return ADC_VERIFICATION_FAILED;
+                 }
+              }               
+           }
+           k = 0;
+        }  
+     } /* of while() */
+
+     fprintf(adccntlp->logf,
+       "\n\nMeaning of the log file columns is as follows:\n");
+     fprintf(adccntlp->logf, 
+       "Row Number | Groupby | View Size | Measure Sums | Number of Chunks\n");
+
+     if (!adccntlp->verificationFailed) 
+          strcpy(msg, "Verification=passed");
+     else strcpy(msg, "Verification=failed");
+     FSEEK(adccntlp->logf, 0L, SEEK_SET);
+     fprintf(adccntlp->logf, "%s", msg);
+     FSEEK(adccntlp->logf, 0L, SEEK_END);
+     FSEEK(adccntlp->viewSizesFile, 0L, SEEK_SET);
+     return ADC_OK;
+}
+int32 ComputeGivenGroupbys(ADC_VIEW_CNTL *adccntlp){
+    int32 retCode;
+   uint32 i;
+   uint64 binRepTuple;
+   uint32 ut32;
+   uint32 nViews = 0;
+   uint32 nSelectedDims;
+   uint32 smp;
+#ifdef IN_CORE
+   uint32 firstView = 1;
+#endif
+   uint32 selection_viewsize[2];
+   char ttout[16];
+
+   while (fread(&binRepTuple, 8, 1, adccntlp->groupbyFile )){
+     for(i = 0; i < adccntlp->nm; i++) adccntlp->checksums[i]=0;
+     nViews++;
+     swap8(&binRepTuple);
+
+     GetRegTupleFromBin64(binRepTuple, adccntlp->selection,
+                          adccntlp->nTopDims, &nSelectedDims);
+     ut32 = (uint32)(binRepTuple>>(64-adccntlp->nTopDims));
+     selection_viewsize[0] = ut32;
+     ut32 <<= (32-adccntlp->nTopDims);
+     adccntlp->groupby = ut32;
+#ifndef IN_CORE
+     smp = GetParent(adccntlp, ut32);
+#endif
+#ifdef IN_CORE
+     if (firstView) {
+       firstView = 0;
+       if(ReadWholeInputData(adccntlp, adccntlp->inpf)) {
+          fprintf(stderr, "ReadWholeInputData failed.\n");
+          return ADC_INTERNAL_ERROR;   
+       }
+     }
+     smp = noneParent;
+#endif
+
+     if (smp != noneParent)
+       GetRegTupleFromParent(binRepTuple,
+                             adccntlp->parBinRepTuple,
+                             adccntlp->selection,
+                             adccntlp->nTopDims);
+     InitAdcViewCntl(adccntlp, nSelectedDims, 
+                     adccntlp->selection, (smp == noneParent)?0:1);
+#ifdef IN_CORE
+      if((retCode = ComputeMemoryFittedView(adccntlp)) != 0) {
+         fprintf(stderr, "ComputeMemoryFittedView failed.\n");
+         return retCode;
+      }
+#else
+#ifdef OPTIMIZATION
+     if (smp == prefixedParent){
+        if ((retCode = PrefixedAggregate(adccntlp, adccntlp->viewFile)) != 0) {
+           fprintf(stderr, 
+	     "ComputeGivenGroupbys.PrefixedAggregate failed.\n");
+           return retCode;
+        }
+        adccntlp->numberOfPrefixedGroupbys++;
+     }else if (smp == sharedSortParent) {
+        if ((retCode = SharedSortAggregate(adccntlp)) != 0) {
+           fprintf(stderr, 
+	     "ComputeGivenGroupbys.SharedSortAggregate failed.\n");
+           return retCode;
+        }
+        adccntlp->numberOfSharedSortGroupbys++;
+     }else
+#endif /* OPTIMIZATION */     
+     { 
+        if( smp != noneParent ) {
+	  retCode = RunFormation(adccntlp, adccntlp->viewFile);
+          if(retCode!=ADC_OK){
+              fprintf(stderr,
+                  "ComputeGivenGroupbys.RunFormation failed.\n");
+              return retCode; 
+            }
+	  }else{
+	    if ((retCode=RunFormation (adccntlp, adccntlp->inpf)) != ADC_OK){
+              fprintf(stderr,
+                  "ComputeGivenGroupbys.RunFormation failed.\n");
+              return retCode;
+            }
+	    adccntlp->numberOfViewsMadeFromInput++;
+	  }
+        if(!adccntlp->numberOfChunks){
+          uint64 ordern=0;
+          adccntlp->nViewRows        = adccntlp->tree->count;
+          adccntlp->totalOfViewRows += adccntlp->nViewRows;
+	  retCode=WriteViewToDiskCS(adccntlp,adccntlp->tree->root.left,&ordern);
+	  if(retCode!=ADC_OK){
+            fprintf(stderr,
+	            "ComputeGivenGroupbys.WriteViewToDisk: Write error.\n");
+	    return ADC_WRITE_FAILED;
+	  }
+        }else { 
+          retCode=MultiWayMerge(adccntlp);
+          if(retCode!=ADC_OK) {
+	     fprintf(stderr,"ComputeGivenGroupbys.MultiWayMerge failed.\n");
+	     return retCode;
+	  } 
+        } 
+      }
+     
+     JobPoolUpdate(adccntlp);
+
+     adccntlp->accViewFileOffset += 
+       (int64)(adccntlp->nViewRows*adccntlp->outRecSize);
+     FSEEK(adccntlp->fileOfChunks, 0L, SEEK_SET);
+     FSEEK(adccntlp->inpf, 0L, SEEK_SET);
+#endif /* IN_CORE */
+     for( i = 0; i < adccntlp->nm; i++) 
+       adccntlp->totchs[i]+=adccntlp->checksums[i];
+     selection_viewsize[1] = adccntlp->nViewRows;
+     fwrite(selection_viewsize, 8, 1, adccntlp->viewSizesFile);
+     adccntlp->totalViewFileSize += 
+                            adccntlp->outRecSize*adccntlp->nViewRows;
+     sprintf(ttout, "%7d ", nViews);
+     WriteOne32Tuple(ttout, adccntlp->groupby, 
+                     adccntlp->nTopDims, adccntlp->logf);
+     fprintf(adccntlp->logf, " |  %15d | ", adccntlp->nViewRows); 
+     for ( i = 0; i < adccntlp->nm; i++ ){ 
+        fprintf(adccntlp->logf, " %20lld", adccntlp->checksums[i]);
+     }
+     fprintf(adccntlp->logf, " | %5d", adccntlp->numberOfChunks);
+   }
+   adccntlp->numberOfMadeViews = nViews;  
+   if(ViewSizesVerification(adccntlp)) return ADC_VERIFICATION_FAILED;
+   return ADC_OK;
+}
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/DC/jobcntl.c b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/DC/jobcntl.c
new file mode 100644
index 0000000..8d2e276
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/DC/jobcntl.c
@@ -0,0 +1,562 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include "adc.h"
+#include "macrodef.h"
+
+#ifdef UNIX
+#include <fcntl.h>
+#include <sys/file.h>
+#include <unistd.h>
+#endif
+
+uint32 NumberOfOnes(uint64 s);
+void swap8(void *a);
+void SetOneBit(uint64 *s, int32 pos){ uint64 ob = MLB; ob >>= pos; *s |= ob;}
+void SetOneBit32(uint32 *s, uint32 pos){ 
+   uint32 ob = 0x80000000;
+   ob >>= pos; 
+   *s |= ob;
+}
+uint32 Mlo32(uint32 x){
+   uint32 om = 0x80000000;
+   uint32 i;
+   uint32 k;
+              
+   for ( k = 0, i = 0; i < 32; i++ ) {
+       if (om&x) break;
+       om >>= 1;
+       k++;
+   } 
+   return(k);   
+}
+int32 mro32(uint32 x){
+   uint32 om = 0x00000001;
+   uint32 i;
+   uint32 k;
+              
+   for ( k = 32, i = 0; i < 32; i++ ) {
+       if (om&x) break;
+       om <<= 1;
+       k--;
+   } 
+   return(k);   
+}
+uint32 setLeadingOnes32(uint32 n){
+   uint32 om = 0x80000000; /* unsigned: avoids right-shifting a negative value */
+   uint32 x;
+   uint32 i;
+         
+   for ( x = 0, i = 0; i < n; i++ ) {
+         x |= om;
+         om >>= 1;
+   } 
+   return (x);
+}
+int32 DeleteOneFile(const char * file_name) {
+#  ifdef WINNT
+      return(remove(file_name));
+#  else
+      return(unlink(file_name));
+#  endif
+}
+void WriteOne32Tuple(char * t, uint32 s, uint32 l, FILE * logf) {
+  uint64 ob = MLB32;
+  uint32 i;
+            
+  fprintf(logf, "\n %s", t);
+  for ( i = 0; i < l; i++ ) {
+    if (s&ob) fprintf(logf, "1"); else fprintf(logf, "0");
+    ob >>= 1;
+  }
+}
+uint32 NumOfCombsFromNbyK( uint32 n, uint32 k ){
+  uint32 l, combsNbyK;
+  if ( k > n ) return 0;
+  for(combsNbyK=1, l=1;l<=k;l++)combsNbyK = combsNbyK*(n-l+1)/l;
+  return  combsNbyK;
+}
+void JobPoolUpdate(ADC_VIEW_CNTL *avp){
+   uint32 l = avp->nv;
+   uint32 k;
+  
+   k = avp->lpp[l].layerIndex + avp->lpp[l].layerCurrentPopulation;
+   avp->jpp[k].grpb = avp->groupby;
+   avp->jpp[k].nv = l;
+   avp->jpp[k].nRows = avp->nViewRows;
+   avp->jpp[k].viewOffset = avp->accViewFileOffset;
+   avp->lpp[l].layerCurrentPopulation++;
+} 
+int32 GetParent(ADC_VIEW_CNTL *avp, uint32 binRepTuple){
+   uint32 level, levelPop, i;
+   uint32 ig;
+   uint32 igOfSmallestParent;
+   uint32 igOfPrefixedParent;
+   uint32 igOfSharedSortParent;
+   uint32 spMinNumOfRows;
+   uint32 pfMinNumOfRows;
+   uint32 ssMinNumOfRows;
+   uint32 tgrpb;
+   uint32 pg;
+   uint32 pfm;
+   uint32 mlo = 0;
+   uint32 lom;
+   uint32 l = NumberOfOnes(binRepTuple);
+   uint32 spFound;
+   uint32 pfFound;
+   uint32 ssFound;
+   uint32 found;
+   uint32 spFt;
+   uint32 pfFt;   
+   uint32 ssFt;
+
+   found = noneParent;
+   pfm = setLeadingOnes32(mro32(avp->groupby));
+   SetOneBit32(&mlo, Mlo32(avp->groupby));
+   lom = setLeadingOnes32(Mlo32(avp->groupby)); 
+
+   for(spFound=pfFound=ssFound=0, level=l;level<=avp->nTopDims;level++){
+      levelPop = avp->lpp[level].layerCurrentPopulation;
+      
+      if(levelPop != 0)
+      {
+           for ( spFt = pfFt = ssFt = 1, ig = avp->lpp[level].layerIndex,
+                 i = 0; i < levelPop; i++ )
+           {
+               tgrpb = avp->jpp[ig].grpb;
+               if ( (avp->groupby & tgrpb) == avp->groupby ) { 
+                  spFound = 1;
+                  if (spFt) { spMinNumOfRows = avp->jpp[ig].nRows; 
+                              igOfSmallestParent = ig; spFt = 0; }
+                  else   if ( spMinNumOfRows > avp->jpp[ig].nRows ) 
+                            { spMinNumOfRows = avp->jpp[ig].nRows; 
+                              igOfSmallestParent = ig; }
+
+				  pg = tgrpb & pfm;
+				  if (pg == binRepTuple) {
+                     pfFound = 1;
+                     if (pfFt) { pfMinNumOfRows = avp->jpp[ig].nRows; 
+                                 igOfPrefixedParent = ig; pfFt = 0; }
+                     else   if ( pfMinNumOfRows > avp->jpp[ig].nRows) 
+                               { pfMinNumOfRows = avp->jpp[ig].nRows; 
+                                 igOfPrefixedParent = ig; }
+				  }
+
+				  if ( (tgrpb & mlo) && !(tgrpb & lom)) {
+                     ssFound = 1;
+                     if (ssFt) { ssMinNumOfRows = avp->jpp[ig].nRows; 
+                                 igOfSharedSortParent = ig; ssFt = 0; }
+                     else   if ( ssMinNumOfRows > avp->jpp[ig].nRows) 
+                               { ssMinNumOfRows = avp->jpp[ig].nRows; 
+                                 igOfSharedSortParent = ig; }
+				  }
+               }
+               ig++;
+           }
+      }
+      if (pfFound) found = prefixedParent;
+      else if (ssFound) found = sharedSortParent;
+           else if (spFound) found = smallestParent;
+
+      switch(found){
+         case prefixedParent:
+           avp->smallestParentLevel = level;
+           avp->viewOffset      = avp->jpp[igOfPrefixedParent].viewOffset;
+           avp->nParentViewRows = avp->jpp[igOfPrefixedParent].nRows;
+           avp->parBinRepTuple  = avp->jpp[igOfPrefixedParent].grpb;
+           break;
+         case sharedSortParent:
+           avp->smallestParentLevel = level;
+           avp->viewOffset	    = avp->jpp[igOfSharedSortParent].viewOffset;
+           avp->nParentViewRows = avp->jpp[igOfSharedSortParent].nRows;
+           avp->parBinRepTuple  = avp->jpp[igOfSharedSortParent].grpb;
+           break;
+         case smallestParent:
+           avp->smallestParentLevel = level;
+           avp->viewOffset	    = avp->jpp[igOfSmallestParent].viewOffset;
+           avp->nParentViewRows = avp->jpp[igOfSmallestParent].nRows;
+           avp->parBinRepTuple  = avp->jpp[igOfSmallestParent].grpb;
+           break;
+         default: break;
+      }
+      if(   found == prefixedParent 
+         || found == sharedSortParent 
+	 || found == smallestParent) break;
+   }
+  return found;
+} 
+uint32 GetSmallestParent(ADC_VIEW_CNTL *avp, uint32 binRepTuple){
+   uint32 found, level, levelPop, i, ig, igOfSmallestParent;
+   uint32 minNumOfRows;
+   uint32 tgrpb;
+   uint32 ft;
+   uint32 l = NumberOfOnes(binRepTuple);
+  
+   for(found=0, level=l; level<=avp->nTopDims;level++){
+      levelPop = avp->lpp[level].layerCurrentPopulation;
+      if(levelPop){
+        for(ft=1, ig=avp->lpp[level].layerIndex, i=0;i<levelPop;i++){
+          tgrpb = avp->jpp[ig].grpb;
+          if ( (avp->groupby & tgrpb) == avp->groupby ) { 
+            found = 1;
+            if(ft){
+	      minNumOfRows=avp->jpp[ig].nRows;
+	      igOfSmallestParent = ig; 
+	      ft = 0;
+	    }else if(minNumOfRows > avp->jpp[ig].nRows){ 
+	      minNumOfRows = avp->jpp[ig].nRows;
+	      igOfSmallestParent = ig;
+	    }
+          }
+          ig++;
+        }
+      }
+      if( found ){      
+         avp->smallestParentLevel = level;
+         avp->viewOffset = avp->jpp[igOfSmallestParent].viewOffset;
+         avp->nParentViewRows = avp->jpp[igOfSmallestParent].nRows;
+         avp->parBinRepTuple = avp->jpp[igOfSmallestParent].grpb;
+         break;
+      }
+   }
+   return found;
+} 
+int32 GetPrefixedParent(ADC_VIEW_CNTL *avp, uint32 binRepTuple){
+   uint32 found, level, levelPop, i, ig, igOfSmallestParent;
+   uint32 minNumOfRows;
+   uint32 tgrpb;
+   uint32 ft;
+   uint32 pg, tm;
+   uint32 l = NumberOfOnes(binRepTuple);
+   
+   tm = setLeadingOnes32(mro32(avp->groupby));
+
+   for(found=0, level=l; level<=avp->nTopDims; level++){
+      levelPop = avp->lpp[level].layerCurrentPopulation;
+  
+      if (levelPop != 0)
+      {
+           for(ft = 1, ig = avp->lpp[level].layerIndex, 
+                i = 0; i < levelPop; i++ ) {
+               tgrpb = avp->jpp[ig].grpb;
+               if ( (avp->groupby & tgrpb) == avp->groupby ) { 
+				  pg = tgrpb & tm;
+				  if (pg == binRepTuple) {
+                     found = 1;
+                     if (ft) { minNumOfRows = avp->jpp[ig].nRows; 
+                               igOfSmallestParent = ig; ft = 0; }
+                     else if ( minNumOfRows > avp->jpp[ig].nRows) 
+                             { minNumOfRows = avp->jpp[ig].nRows; 
+                               igOfSmallestParent = ig; }
+				  }
+               }
+               ig++;
+           }
+      }
+      if ( found ) {      
+         avp->smallestParentLevel = level;
+         avp->viewOffset = avp->jpp[igOfSmallestParent].viewOffset;
+         avp->nParentViewRows = avp->jpp[igOfSmallestParent].nRows;
+         avp->parBinRepTuple = avp->jpp[igOfSmallestParent].grpb;
+         break;
+      }
+   }
+  return found;
+} 
+void JobPoolInit(JOB_POOL *jpp, uint32 n, uint32 nd){
+  uint32 i;
+
+  for ( i = 0; i < n; i++ ) {
+      jpp[i].grpb = 0;
+	  jpp[i].nv = 0;  
+      jpp[i].nRows = 0;
+      jpp[i].viewOffset = 0;
+  }    
+}
+void WriteOne64Tuple(char * t, uint64 s, uint32 l, FILE * logf){
+   uint64 ob = MLB;
+   uint32 i;
+            
+   fprintf(logf, "\n %s", t);
+   for ( i = 0; i < l; i++ ) {
+      if (s&ob) fprintf(logf, "1"); else fprintf(logf, "0");
+      ob >>= 1;
+   }
+}
+uint32 NumberOfOnes(uint64 s){
+   uint64 ob = MLB;
+   uint32 i;
+   uint32 nOnes;
+
+   for ( nOnes = 0, i = 0; i < 64; i++ ) {
+      if (s&ob) nOnes++;
+      ob >>= 1;
+   }
+   return nOnes;
+}
+void GetRegTupleFromBin64(
+           uint64 binRepTuple, 
+	       uint32 *selTuple,
+	       uint32 numDims, 
+	       uint32 *numOfUnits){
+   uint64 oc = MLB;
+   uint32 i;
+   uint32 j;
+  
+   *numOfUnits = 0;  
+   for( j = 0, i = 0; i < numDims; i++ ) {
+     if (binRepTuple & oc) { selTuple[j++] = i+1; (*numOfUnits)++;}  
+     oc >>= 1;
+   }    
+}
+void getRegTupleFromBin32(
+           uint32 binRepTuple, 
+	       uint32 *selTuple,
+	       uint32 numDims, 
+	       uint32 *numOfUnits){
+   uint32 oc = MLB32;
+   uint32 i;
+   uint32 j;
+  
+   *numOfUnits = 0;
+   for( j = 0, i = 0; i < numDims; i++ ) {
+     if (binRepTuple & oc) { selTuple[j++] = i+1; (*numOfUnits)++;}  
+     oc >>= 1;
+   }    
+}
+void GetRegTupleFromParent(
+               uint64 bin64RepTuple,
+               uint32 bin32RepTuple, 
+	       uint32 *selTuple,
+	       uint32 nd){
+   uint32 oc = MLB32;
+   uint32 i, j, k;
+   uint32 ut32; 
+  
+   ut32 = (uint32)(bin64RepTuple>>(64-nd)); 
+   ut32 <<= (32-nd);
+   
+   for ( j = 0, k = 0, i = 0; i < nd; i++ ) {
+     if (bin32RepTuple & oc) k++;
+     if (bin32RepTuple & oc && ut32 & oc) selTuple[j++] = k; 
+     oc >>= 1;
+   }    
+}
+void CreateBinTuple(uint64 *binRepTuple, uint32 *selTuple, uint32 numDims){
+   uint32 i;
+
+   *binRepTuple = 0;
+   for(i = 0; i < numDims; i++ ){
+     SetOneBit( binRepTuple, selTuple[i]-1 );
+   }    
+}
+void d32v( char * t, uint32 *v, uint32 n){
+   uint32 i;
+   
+   fprintf(stderr,"\n%s ", t);
+   for ( i = 0; i < n; i++ ) fprintf(stderr," %d", v[i]);
+}
+void WriteOne64Tuple(char * t, uint64 s, uint32 l, FILE * logf);
+int32 Comp8gbuf(const void *a, const void *b){
+   uint64 x = *(const uint64*)a, y = *(const uint64*)b; /* compare values, not addresses */
+   if ( x < y ) return -1;
+   return (x > y) ? 1 : 0;
+}
+void restore(TUPLE_VIEWSIZE x[], uint32 f, uint32 l ){ 
+   uint32 j, m, tj, mm1, jm1, hl;
+   uint64 iW;
+   uint64 iW64;
+
+   j = f;
+   hl = l>>1;
+   while( j <= hl ) {
+      tj = j*2;
+      if (tj < l && x[tj-1].viewsize < x[tj].viewsize) m = tj+1;
+      else m = tj;
+      mm1 = m - 1;
+      jm1 = j - 1;
+      if ( x[mm1].viewsize > x[jm1].viewsize ) {
+         iW = x[mm1].viewsize; 
+	 x[mm1].viewsize = x[jm1].viewsize; 
+	 x[jm1].viewsize = iW;  
+         iW64 = x[mm1].tuple; 
+	 x[mm1].tuple = x[jm1].tuple; 
+	 x[jm1].tuple = iW64;  
+         j = m;
+      }else j = l;
+   }
+}
+void vszsort( TUPLE_VIEWSIZE x[], uint32 n){
+  int32 i, im1;
+  uint64 iW;
+  uint64 iW64;
+  
+  for ( i = n>>1; i >= 1; i-- ) restore( x, i, n );
+  for ( i = n; i >= 2; i-- ) {
+     im1 = i - 1;
+     iW = x[0].viewsize; x[0].viewsize = x[im1].viewsize; x[im1].viewsize = iW;  
+     iW64 = x[0].tuple; x[0].tuple = x[im1].tuple; x[im1].tuple = iW64;  
+     restore( x, 1, im1);
+  }
+}
+uint32 countTupleOnes(uint64 binRepTuple, uint32 numDims){
+  uint32 i, cnt = 0;
+  uint64 ob = 0x0000000000000001; 
+
+  for(i = 0; i < numDims; i++ ){
+    if ( binRepTuple&ob) cnt++;
+    ob <<= 1;
+  }    
+  return cnt;
+}
+void restoreo( TUPLE_ONES x[], uint32 f, uint32 l ){ 
+   uint32 j, m, tj, mm1, jm1, hl;
+   uint32 iW;
+   uint64 iW64;
+
+   j = f;
+   hl = l>>1;
+   while( j <= hl ) {
+      tj = j*2;
+      if (tj < l && x[tj-1].nOnes < x[tj].nOnes) m = tj+1;
+      else m = tj;
+      mm1 = m - 1; jm1 = j - 1;
+      if ( x[mm1].nOnes > x[jm1].nOnes ){
+         iW = x[mm1].nOnes;
+	     x[mm1].nOnes = x[jm1].nOnes; 
+	     x[jm1].nOnes = iW;  
+         iW64 = x[mm1].tuple; 
+	     x[mm1].tuple = x[jm1].tuple; 
+	     x[jm1].tuple = iW64;  
+         j = m;
+      }else j = l;
+   }
+}
+void onessort( TUPLE_ONES x[], uint32 n){
+   int32 i, im1;
+  uint32 iW;
+  uint64 iW64;
+  
+  for ( i = n>>1; i >= 1; i-- ) restoreo( x, i, n );
+  for ( i = n; i >= 2; i-- ) {
+     im1 = i - 1;
+     iW = x[0].nOnes; 
+     x[0].nOnes = x[im1].nOnes; 
+     x[im1].nOnes = iW;  
+     iW64 = x[0].tuple; 
+     x[0].tuple = x[im1].tuple; 
+     x[im1].tuple = iW64;  
+     restoreo( x, 1, im1);
+  }
+}
+uint32 MultiFileProcJobs( TUPLE_VIEWSIZE *tuplesAndSizes, 
+		                          uint32 nViews, 
+                           ADC_VIEW_CNTL *avp ){
+   uint32 i;
+    int32 ii; /* it should be int */
+   uint32 j;
+   uint32 pn;
+   uint32 direction = 0;
+   uint32 dChange = 0;
+   uint32 gbi;
+   uint32 maxn;
+   uint64 *gbuf;
+   uint64      vszs[MAX_NUMBER_OF_TASKS];
+   uint32 nGroupbys[MAX_NUMBER_OF_TASKS];
+   TUPLE_ONES *toptr;
+
+   gbuf = (uint64*) &avp->memPool[0];
+
+   for(i = 0; i < avp->nTasks; i++ ){ nGroupbys[i] = 0; vszs[i] = 0; }
+
+   for(pn = 0, gbi = 0, ii = nViews-1; ii >= 0; ii-- ){
+     if(pn == avp->taskNumber) gbuf[gbi++]=tuplesAndSizes[ii].tuple;
+     nGroupbys[pn]++;
+     vszs[pn] += tuplesAndSizes[ii].viewsize; 
+     if(direction == 0 && pn == avp->nTasks-1 ) { 
+       direction = 1; 
+       dChange = 1; 
+     }
+     if(direction == 1 && pn == 0 ){ 
+       direction = 0; 
+       dChange = 1; 
+     }
+     if (!dChange){ if (direction) pn--; else pn++;}
+     dChange = 0;
+   }
+   for(maxn = 0, i = 0; i < avp->nTasks; i++) 
+     if (nGroupbys[i] > maxn) maxn = nGroupbys[i];
+
+   toptr = (TUPLE_ONES*) malloc(sizeof(TUPLE_ONES)*maxn);
+   if(!toptr) return 1; 
+
+   for(i = 0; i < avp->nTasks; i++ ){
+     if(i == avp->taskNumber){
+       for(j = 0; j < nGroupbys[i]; j++ ){
+         toptr[j].tuple = gbuf[j];
+         toptr[j].nOnes  = countTupleOnes(gbuf[j], avp->nTopDims);
+       }
+       qsort((void*)gbuf,  nGroupbys[i], 8, Comp8gbuf );
+       onessort(toptr, nGroupbys[i]);
+
+       for(j = 0; j < nGroupbys[i]; j++){
+         toptr[nGroupbys[i]-1-j].tuple <<= (64-avp->nTopDims);
+         swap8(&toptr[nGroupbys[i]-1-j].tuple);
+         fwrite(&toptr[nGroupbys[i]-1-j].tuple, 8, 1, avp->groupbyFile);
+       }
+     }
+   }
+   FSEEK(avp->groupbyFile, 0L, SEEK_SET);
+   if (toptr) free(toptr);
+   return 0;
+}
+int32 PartitionCube(ADC_VIEW_CNTL *avp){
+    TUPLE_VIEWSIZE *tuplesAndSizes;
+    uint32 it = 0;
+    uint64 sz;
+    uint32 sel[64];
+    uint32 k;
+    uint64 tx;
+    uint32 i;
+      char inps[256];
+      
+    tuplesAndSizes = 
+       (TUPLE_VIEWSIZE*) malloc(avp->nViewLimit*sizeof(TUPLE_VIEWSIZE));
+    if(tuplesAndSizes == NULL){
+       fprintf(stderr," PartitionCube(): memory allocation failure\n");
+       return ADC_MEMORY_ALLOCATION_FAILURE;
+    }
+    k = 0;
+    while( fscanf(avp->adcViewSizesFile, "%s", inps) != EOF ){
+       if( strcmp(inps, "Selection:") == 0 ) {
+         while ( fscanf(avp->adcViewSizesFile, "%s", inps) == 1 ) {
+           if ( strcmp(inps, "View") == 0 ) break; 
+           sel[k++] = atoi(inps);	
+         }
+       }
+       if( strcmp(inps, "Size:") == 0 ){
+         fscanf(avp->adcViewSizesFile, "%s", inps);
+         sz = atoi(inps);
+         CreateBinTuple(&tx, sel, k);
+         if (sz > avp->nInputRecs) sz = avp->nInputRecs;
+         tuplesAndSizes[it].viewsize = sz;
+         tuplesAndSizes[it].tuple = tx; 
+         it++;
+         k = 0;
+       }  
+    }
+    vszsort(tuplesAndSizes, it);
+    for( i = 0; i < it; i++){
+        tuplesAndSizes[i].tuple >>= (64-avp->nTopDims);
+    }
+    if(MultiFileProcJobs( tuplesAndSizes, it, avp )){
+       fprintf(stderr, "MultiFileProcJobs() failed.\n");
+       fprintf(avp->logf, "MultiFileProcJobs() failed.\n");
+       fflush(avp->logf);
+       free(tuplesAndSizes); return 1; /* release the buffer on the error path */
+    }
+    FSEEK(avp->adcViewSizesFile, 0L, SEEK_SET);
+    free(tuplesAndSizes);
+    return 0;
+}
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/DC/macrodef.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/DC/macrodef.h
new file mode 100644
index 0000000..ce67695
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/DC/macrodef.h
@@ -0,0 +1,14 @@
+#define PutErrMsg(msg) {fprintf(stderr," %s, errno = %d\n", msg, errno);}
+
+#define WriteToFile(ptr,size,nitems,stream,logf) if( fwrite(ptr,size,nitems,stream) != nitems )\
+       {\
+        fprintf(stderr,"\n Write error from WriteToFile()\n"); return ADC_WRITE_FAILED; \
+       }
+
+#ifdef WINNT
+#define FSEEK(stream,offset,whence)  fseek(stream, (long)offset,whence);
+#else
+#define FSEEK(stream,offset,whence)  fseek(stream,offset,whence); 
+#endif
+
+#define GetRecSize(nd,nm) (DIM_FSZ*nd+MSR_FSZ*nm)
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/DC/protots.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/DC/protots.h
new file mode 100644
index 0000000..6ff92a7
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/DC/protots.h
@@ -0,0 +1,100 @@
+ int32 ReadWholeInputData(ADC_VIEW_CNTL *avp, FILE *inpf);
+ 
+ int32 ComputeMemoryFittedView (ADC_VIEW_CNTL *avp);
+
+ int32 MultiWayMerge(ADC_VIEW_CNTL *avp);
+
+ int32 GetPrefixedParent(ADC_VIEW_CNTL *avp, uint32 binRepTuple);
+
+ int32 WriteChunkToDisk(
+       uint32     recordSize, 
+       FILE      *fileOfChunks, 
+       treeNode  *t, 
+       FILE      *logFile);
+
+ int32 DeleteOneFile(const char * file_name);
+
+  void WriteOne64Tuple(char * t, uint64 s, uint32 l, FILE * logf);
+
+ int32 ViewSizesVerification(ADC_VIEW_CNTL *adccntlp);
+
+  void CreateBinTuple(
+       uint64  *binRepTuple, 
+       uint32  *selTuple, 
+       uint32   numDims);
+
+  void AdcCntlLog(ADC_VIEW_CNTL *adccntlp);
+
+  void swap8(void *a);
+
+  void WriteOne32Tuple(char * t, uint32 s, uint32 l, FILE * logf);
+
+  void JobPoolUpdate(ADC_VIEW_CNTL *avp);
+
+ int32 WriteViewToDisk(ADC_VIEW_CNTL *avp, treeNode *t);
+
+uint32 GetSmallestParent(ADC_VIEW_CNTL *avp, uint32 binRepTuple);
+
+ int32 GetParent(ADC_VIEW_CNTL *avp, uint32 binRepTuple);
+
+  void GetRegTupleFromBin64(
+       uint64   binRepTuple, 
+       uint32  *selTuple, 
+       uint32   numDims, 
+       uint32  *numOfUnits); 
+
+  void GetRegTupleFromParent(
+       uint64   bin64RepTuple,
+       uint32   bin32RepTuple,
+       uint32  *selTuple,
+       uint32   nd);
+
+  void JobPoolInit(JOB_POOL *jpp, uint32 n, uint32 nd);
+
+uint32 NumOfCombsFromNbyK (uint32 n, uint32 k);
+
+  void InitializeTree(RBTree *tree, uint32 nd, uint32 nm);
+
+ int32 CheckTree(
+       treeNode  *t , 
+       uint32    *px, 
+       uint32     nv, 
+       uint32     nm, 
+       FILE      *logFile);
+
+ int32 KeyComp(const uint32 *a, const uint32 *b, uint32 n);
+
+ int32 TreeInsert(RBTree *tree, uint32 *attrs);
+
+  void InitializeTree(RBTree *tree, uint32 nd, uint32 nm);
+
+ int32 WriteChunkToDisk(
+       uint32     recordSize, 
+       FILE      *fileOfChunks, 
+       treeNode  *t, 
+       FILE      *logFile);
+
+  void SelectToView(
+       uint32  *ib, 
+       uint32  *ix, 
+       uint32  *viewBuf, 
+       uint32   nd, 
+       uint32   nm, 
+       uint32   nv);
+
+ int32 MultiWayBufferSnap(
+       uint32   nv, 
+       uint32   nm,  
+       uint32  *multiChunkBuffer, 
+       uint32	numberOfChunks, 
+       uint32	regSubChunkSize, 
+       uint32	nRecords);
+
+ RBTree *CreateEmptyTree(
+       uint32          nd, 
+       uint32          nm, 
+       uint32          memoryLimit, 
+       unsigned char  *memPool);
+
+int32 PrefixedAggregate(ADC_VIEW_CNTL *avp, FILE *iof);
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/DC/rbt.c b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/DC/rbt.c
new file mode 100644
index 0000000..ae96e45
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/DC/rbt.c
@@ -0,0 +1,240 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include "adc.h"
+#include "macrodef.h"
+
+int32 KeyComp( const uint32 *a, const uint32 *b, uint32 n ) {
+  uint32 i;
+  for ( i = 0; i < n; i++ ) {
+    if (a[i] < b[i]) return(-1);
+    else if (a[i] > b[i]) return(1);
+  }
+  return(0);
+}
+int32 TreeInsert(RBTree *tree, uint32 *attrs){
+   uint32  sl = 1;			    	
+   uint32 *attrsP;
+    int32  cmpres;
+ treeNode *xNd, *yNd, *tmp;
+
+  tmp = &tree->root;
+  xNd = tmp->left;
+
+  if (xNd == NULL){
+    tree->count++;
+    NEW_TREE_NODE(tree->mp,tree->memPool,
+        	      tree->memaddr,tree->treeNodeSize,
+        	      tree->freeNodeCounter,tree->memoryIsFull)
+    xNd = tmp->left = tree->mp;
+    memcpy(&(xNd->nodeMemPool[0]), &attrs[0], tree->nodeDataSize);
+    xNd->left = xNd->right = NULL;
+    xNd->clr = BLACK;
+    return 0;
+  }
+
+  tree->drcts[0] = 0;
+  tree->nodes[0] = &tree->root;
+
+  while(1){
+    attrsP = (uint32*) &(xNd->nodeMemPool[tree->nm]);
+    cmpres = KeyComp( &attrs[tree->nm<<1], attrsP, tree->nd );
+
+    if (cmpres < 0){
+      tree->nodes[sl] = xNd;
+      tree->drcts[sl++] = 0;
+      yNd = xNd->left;
+
+      if(yNd == NULL){
+	    NEW_TREE_NODE(tree->mp,tree->memPool,
+	  	              tree->memaddr,tree->treeNodeSize,
+	  	              tree->freeNodeCounter,tree->memoryIsFull)
+        xNd = xNd->left = tree->mp;
+        break;
+      }
+    }else if (cmpres > 0){
+      tree->nodes[sl] = xNd;
+      tree->drcts[sl++] = 1;
+      yNd = xNd->right;
+      if(yNd == NULL){
+        NEW_TREE_NODE(tree->mp,tree->memPool,
+		              tree->memaddr,tree->treeNodeSize,
+		              tree->freeNodeCounter,tree->memoryIsFull)
+        xNd = xNd->right = tree->mp; 
+        break;
+      }
+    }else{  
+      uint64 ii; 
+      int64 *mx;
+      mx = (int64*) &attrs[0];
+      for ( ii = 0; ii < tree->nm; ii++ ) xNd->nodeMemPool[ii] += mx[ii];
+      return 0; 
+    }
+    xNd = yNd;
+  }
+  tree->count++;
+  memcpy(&(xNd->nodeMemPool[0]), &attrs[0], tree->nodeDataSize);
+  xNd->left = xNd->right = NULL;
+  xNd->clr  = RED;
+
+  while(1){
+    if ( tree->nodes[sl-1]->clr != RED || sl<3 ) break;
+      
+    if (tree->drcts[sl-2] == 0){
+      yNd = tree->nodes[sl-2]->right;
+      if (yNd != NULL && yNd->clr == RED){
+        tree->nodes[sl-1]->clr = BLACK;
+        yNd->clr = BLACK;
+        tree->nodes[sl-2]->clr = RED;
+        sl -= 2;
+      }else{
+        if (tree->drcts[sl-1] == 1){
+	      xNd = tree->nodes[sl-1];
+	      yNd = xNd->right;
+	      xNd->right = yNd->left;
+	      yNd->left  = xNd;
+	      tree->nodes[sl-2]->left = yNd;
+        }else
+          yNd = tree->nodes[sl-1];
+	  
+        xNd = tree->nodes[sl-2];
+        xNd->clr = RED;
+        yNd->clr = BLACK;
+
+        xNd->left  = yNd->right;
+        yNd->right = xNd;
+
+        if(tree->drcts[sl-3])
+          tree->nodes[sl-3]->right = yNd;
+	    else  
+          tree->nodes[sl-3]->left = yNd;
+        break;
+      }
+    }else{
+      yNd = tree->nodes[sl-2]->left;
+      if (yNd != NULL && yNd->clr == RED){
+         tree->nodes[sl-1]->clr = BLACK;
+         yNd->clr = BLACK;
+         tree->nodes[sl-2]->clr = RED;
+         sl -= 2;
+      }else{
+    	if(tree->drcts[sl-1] == 0){
+          xNd = tree->nodes[sl-1];
+          yNd = xNd->left;
+          xNd->left  = yNd->right;
+          yNd->right = xNd;
+          tree->nodes[sl-2]->right = yNd;
+   	    }else
+          yNd = tree->nodes[sl-1];
+
+   	    xNd = tree->nodes[sl-2];
+     	xNd->clr = RED;
+    	yNd->clr = BLACK;
+
+    	xNd->right = yNd->left;
+    	yNd->left  = xNd;
+
+   	    if (tree->drcts[sl-3])
+   	      tree->nodes[sl-3]->right = yNd;
+     	else  
+   	      tree->nodes[sl-3]->left  = yNd;
+   	    break;
+      }
+    }
+  }
+  tree->root.left->clr = BLACK;
+  return 0;
+}
+int32 WriteViewToDisk(ADC_VIEW_CNTL *avp, treeNode *t){
+  uint32 i;
+  if(!t) return ADC_OK;
+  if(WriteViewToDisk( avp, t->left)) return ADC_WRITE_FAILED;
+  for(i=0;i<avp->nm;i++){
+    avp->mSums[i] += t->nodeMemPool[i];  
+  }	   
+  WriteToFile(t->nodeMemPool,avp->outRecSize,1,avp->viewFile,avp->logf);
+  if(WriteViewToDisk( avp, t->right)) return ADC_WRITE_FAILED;
+  return ADC_OK;
+}
+int32 WriteViewToDiskCS(ADC_VIEW_CNTL *avp, treeNode *t,uint64 *ordern){
+  uint32 i;
+  if(!t) return ADC_OK;
+  if(WriteViewToDiskCS( avp, t->left,ordern)) return ADC_WRITE_FAILED;
+  for(i=0;i<avp->nm;i++){
+    avp->mSums[i] += t->nodeMemPool[i];  
+    avp->checksums[i] += (++(*ordern))*t->nodeMemPool[i]%measbound;
+  }	   
+  WriteToFile(t->nodeMemPool,avp->outRecSize,1,avp->viewFile,avp->logf);
+  if(WriteViewToDiskCS( avp, t->right,ordern)) return ADC_WRITE_FAILED;
+  return ADC_OK;
+}
+int32 computeChecksum(ADC_VIEW_CNTL *avp, treeNode *t,uint64 *ordern){
+  uint32 i;
+  if(!t) return ADC_OK;
+  if(computeChecksum(avp,t->left,ordern)) return ADC_WRITE_FAILED;
+  for(i=0;i<avp->nm;i++){
+    avp->checksums[i] += (++(*ordern))*t->nodeMemPool[i]%measbound;
+  }	   
+  if(computeChecksum(avp,t->right,ordern)) return ADC_WRITE_FAILED;
+  return ADC_OK;
+}
+int32 WriteChunkToDisk(uint32 recordSize,FILE *fileOfChunks,
+		       treeNode *t, FILE *logFile){   
+  if(!t) return ADC_OK;
+  if(WriteChunkToDisk( recordSize, fileOfChunks, t->left, logFile)) 
+    return ADC_WRITE_FAILED; 
+  WriteToFile( t->nodeMemPool, recordSize, 1, fileOfChunks, logFile);
+  if(WriteChunkToDisk( recordSize, fileOfChunks, t->right, logFile)) 
+    return ADC_WRITE_FAILED;
+  return ADC_OK;
+}
+RBTree * CreateEmptyTree(uint32 nd, uint32 nm, 
+                         uint32 memoryLimit, unsigned char * memPool){
+  RBTree *tree = (RBTree*)  malloc(sizeof(RBTree));
+  if (!tree) return NULL;
+
+  tree->root.left = NULL;    
+  tree->root.right = NULL;     
+  tree->count = 0;
+  tree->memaddr = 0;
+  tree->treeNodeSize = sizeof(struct treeNode) + DIM_FSZ*(nd-1)+MSR_FSZ*nm;
+  if (tree->treeNodeSize%8 != 0) tree->treeNodeSize += 4;
+  tree->memoryLimit = memoryLimit;
+  tree->memoryIsFull = 0;
+  tree->nodeDataSize = DIM_FSZ*nd + MSR_FSZ*nm;
+  tree->mp = NULL;
+  tree->nNodesLimit = tree->memoryLimit/tree->treeNodeSize;
+  tree->freeNodeCounter = tree->nNodesLimit;
+  tree->nd = nd;
+  tree->nm = nm;
+  tree->memPool = memPool;
+  tree->nodes = (treeNode**) malloc(sizeof(treeNode*)*MAX_TREE_HEIGHT);
+  if (!(tree->nodes)) { free(tree); return NULL; }
+  tree->drcts = (uint32*) malloc( sizeof(uint32)*MAX_TREE_HEIGHT);
+  if (!(tree->drcts)) { free(tree->nodes); free(tree); return NULL; }
+  return tree;
+}
+void InitializeTree(RBTree *tree, uint32 nd, uint32 nm){
+  tree->root.left = NULL;    
+  tree->root.right = NULL;     
+  tree->count = 0;
+  tree->memaddr = 0;
+  tree->treeNodeSize = sizeof(struct treeNode) + DIM_FSZ*(nd-1)+MSR_FSZ*nm;
+  if (tree->treeNodeSize%8 != 0) tree->treeNodeSize += 4;
+  tree->memoryIsFull = 0;
+  tree->nodeDataSize = DIM_FSZ*nd + MSR_FSZ*nm;
+  tree->mp = NULL;
+  tree->nNodesLimit = tree->memoryLimit/tree->treeNodeSize;
+  tree->freeNodeCounter = tree->nNodesLimit;
+  tree->nd = nd;
+  tree->nm = nm;
+}
+int32 DestroyTree(RBTree *tree) {
+  if (tree==NULL) return ADC_TREE_DESTROY_FAILURE;
+  if (tree->memPool!=NULL) free(tree->memPool);
+  if (tree->nodes) free(tree->nodes);
+  if (tree->drcts) free(tree->drcts);
+  free(tree);
+  return ADC_OK;
+}
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/DC/rbt.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/DC/rbt.h
new file mode 100644
index 0000000..de4f997
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/DC/rbt.h
@@ -0,0 +1,43 @@
+#ifndef _ADC_PARVIEW_TREE_DEF_H_
+#define _ADC_PARVIEW_TREE_DEF_H_
+
+#define MAX_TREE_HEIGHT	64
+enum{BLACK,RED};
+
+typedef struct treeNode{
+  struct treeNode *left;
+  struct treeNode *right;
+  uint32 clr;
+  int64 nodeMemPool[1];
+} treeNode;
+
+typedef struct RBTree{
+  treeNode root;	
+  treeNode * mp;
+  uint32 count;       
+  uint32 treeNodeSize;
+  uint32 nodeDataSize;
+  uint32 memoryLimit; 
+  uint32 memaddr;
+  uint32 memoryIsFull;
+  uint32 freeNodeCounter;
+  uint32 nNodesLimit;
+  uint32 nd;
+  uint32 nm;
+  uint32   *drcts;
+  treeNode **nodes;
+  unsigned char * memPool;
+} RBTree;
+
+#define NEW_TREE_NODE(node_ptr,memPool,memaddr,treeNodeSize, \
+ freeNodeCounter,memoryIsFull) \
+ node_ptr=(struct treeNode*)(memPool+memaddr); \
+ memaddr+=treeNodeSize; \
+ (freeNodeCounter)--; \
+ if( freeNodeCounter == 0 ) { \
+     memoryIsFull = 1; \
+ }
+
+int32 TreeInsert(RBTree *tree, uint32 *attrs);
+
+#endif /* _ADC_PARVIEW_TREE_DEF_H_ */
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/EP/Makefile b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/EP/Makefile
new file mode 100644
index 0000000..ea73a16
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/EP/Makefile
@@ -0,0 +1,29 @@
+SHELL=/bin/sh
+BENCHMARK=ep
+BENCHMARKU=EP
+
+include ../config/make.def
+
+OBJS = ep.o ${COMMON}/print_results.o ${COMMON}/${RAND}.o \
+       ${COMMON}/timers.o ${COMMON}/wtime.o
+ifeq (${HOOKS}, 1)
+        OBJS += ${COMMON}/hooks.o ${COMMON}/m5op_x86.o ${COMMON}/m5_mmap.o
+endif
+
+include ../sys/make.common
+
+${PROGRAM}: config ${OBJS}
+	${FLINK} ${FLINKFLAGS} -no-pie -o ${PROGRAM} ${OBJS} ${F_LIB}
+
+
+ep.o:		ep.f npbparams.h
+ifeq (${HOOKS}, 1)
+		${FCOMPILE} -DHOOKS ep.f
+else
+		${FCOMPILE} ep.f
+endif
+
+clean:
+	- rm -f *.o *~
+	- rm -f npbparams.h core
+	- if [ -d rii_files ]; then rm -r rii_files; fi
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/EP/README b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/EP/README
new file mode 100644
index 0000000..0ca487c
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/EP/README
@@ -0,0 +1,4 @@
+This code implements the random-number generator described in the
+NAS Parallel Benchmark document RNR Technical Report RNR-94-007.
+The code is "embarrassingly" parallel in that no communication is
+required for the generation of the random numbers itself. 
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/EP/ep.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/EP/ep.f
new file mode 100644
index 0000000..7638b5f
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/EP/ep.f
@@ -0,0 +1,304 @@
+!-------------------------------------------------------------------------!
+!                                                                         !
+!        N  A  S     P A R A L L E L     B E N C H M A R K S  3.3         !
+!                                                                         !
+!                       O p e n M P     V E R S I O N                     !
+!                                                                         !
+!                                   E P                                   !
+!                                                                         !
+!-------------------------------------------------------------------------!
+!                                                                         !
+!    This benchmark is an OpenMP version of the NPB EP code.              !
+!    It is described in NAS Technical Report 99-011.                      !
+!                                                                         !
+!    Permission to use, copy, distribute and modify this software         !
+!    for any purpose with or without fee is hereby granted.  We           !
+!    request, however, that all derived work reference the NAS            !
+!    Parallel Benchmarks 3.3. This software is provided "as is"           !
+!    without express or implied warranty.                                 !
+!                                                                         !
+!    Information on NPB 3.3, including the technical report, the          !
+!    original specifications, source code, results and information        !
+!    on how to submit new results, is available at:                       !
+!                                                                         !
+!           http://www.nas.nasa.gov/Software/NPB/                         !
+!                                                                         !
+!    Send comments or suggestions to  npb@nas.nasa.gov                    !
+!                                                                         !
+!          NAS Parallel Benchmarks Group                                  !
+!          NASA Ames Research Center                                      !
+!          Mail Stop: T27A-1                                              !
+!          Moffett Field, CA   94035-1000                                 !
+!                                                                         !
+!          E-mail:  npb@nas.nasa.gov                                      !
+!          Fax:     (650) 604-3957                                        !
+!                                                                         !
+!-------------------------------------------------------------------------!
+
+
+c---------------------------------------------------------------------
+c
+c Author: P. O. Frederickson 
+c         D. H. Bailey
+c         A. C. Woo
+c         H. Jin
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+      program EMBAR
+c---------------------------------------------------------------------
+C
+c   This is the OpenMP version of the APP Benchmark 1,
+c   the "embarrassingly parallel" benchmark.
+c
+c
+c   M is the Log_2 of the number of complex pairs of uniform (0, 1) random
+c   numbers.  MK is the Log_2 of the size of each batch of uniform random
+c   numbers.  MK can be set for convenience on a given system, since it does
+c   not affect the results.
+
+      implicit none
+
+      include 'npbparams.h'
+
+      double precision Mops, epsilon, a, s, t1, t2, t3, t4, x, x1, 
+     >                 x2, q, sx, sy, tm, an, tt, gc, dum(3), qq
+      double precision sx_verify_value, sy_verify_value, sx_err, sy_err
+      integer          mk, mm, nn, nk, nq, np, 
+     >                 i, ik, kk, l, k, nit, 
+     >                 k_offset, j, fstatus
+      logical          verified, timers_enabled
+      external         randlc, timer_read
+      double precision randlc, timer_read
+      character*15     size
+
+      parameter (mk = 16, mm = m - mk, nn = 2 ** mm,
+     >           nk = 2 ** mk, nq = 10, epsilon=1.d-8,
+     >           a = 1220703125.d0, s = 271828183.d0)
+
+      common/storage/ x(2*nk), qq(0:nq-1)
+!$omp threadprivate(/storage/)
+      common/sharedq/ q(0:nq-1)
+
+!$    integer  omp_get_max_threads
+!$    external omp_get_max_threads
+      data             dum /1.d0, 1.d0, 1.d0/
+
+      open(unit=2, file='timer.flag', status='old', iostat=fstatus)
+      if (fstatus .eq. 0) then
+         timers_enabled = .true.
+         close(2)
+      else
+         timers_enabled = .false.
+      endif
+
+c   Because the size of the problem is too large to store in a 32-bit
+c   integer for some classes, we put it into a string (for printing).
+c   Have to strip off the decimal point put in there by the floating
+c   point print statement (internal file)
+
+      write(*, 1000)
+      write(size, '(f15.0)' ) 2.d0**(m+1)
+      j = 15
+      if (size(j:j) .eq. '.') j = j - 1
+      write (*,1001) size(1:j)
+!$    write (*,1003) omp_get_max_threads()
+      write (*,*)
+
+ 1000 format(//,' NAS Parallel Benchmarks (NPB3.3-OMP)',
+     >          ' - EP Benchmark', /)
+ 1001 format(' Number of random numbers generated: ', a15)
+ 1003 format(' Number of available threads:        ', 2x,i13)
+
+      verified = .false.
+
+c   Compute the number of "batches" of random number pairs generated 
+c   per processor. Adjust if the number of processors does not evenly 
+c   divide the total number
+
+      np = nn 
+
+
+c   Call the random number generator functions and initialize
+c   the x-array to reduce the effects of paging on the timings.
+c   Also, call all mathematical functions that are used. Make
+c   sure these initializations cannot be eliminated as dead code.
+
+      call vranlc(0, dum(1), dum(2), dum(3))
+      dum(1) = randlc(dum(2), dum(3))
+!$omp parallel default(shared) private(i)
+      do 5    i = 1, 2*nk
+         x(i) = -1.d99
+ 5    continue
+!$omp end parallel
+      Mops = log(sqrt(abs(max(1.d0,1.d0))))
+
+
+!$omp parallel
+      call timer_clear(1)
+      if (timers_enabled) call timer_clear(2)
+      if (timers_enabled) call timer_clear(3)
+!$omp end parallel
+
+#ifdef HOOKS
+      call roi_begin
+#endif
+      call timer_start(1)
+
+      t1 = a
+      call vranlc(0, t1, a, x)
+
+c   Compute AN = A ^ (2 * NK) (mod 2^46).
+
+      t1 = a
+
+      do 100 i = 1, mk + 1
+         t2 = randlc(t1, t1)
+ 100  continue
+
+      an = t1
+      tt = s
+      gc = 0.d0
+      sx = 0.d0
+      sy = 0.d0
+
+      do 110 i = 0, nq - 1
+         q(i) = 0.d0
+ 110  continue
+
+c   Each instance of this loop may be performed independently. We compute
+c   the k offsets separately to take into account the fact that some nodes
+c   have more numbers to generate than others
+
+      k_offset = -1
+
+!$omp parallel default(shared)
+!$omp& private(k,kk,t1,t2,t3,t4,i,ik,x1,x2,l)
+      do 115 i = 0, nq - 1
+         qq(i) = 0.d0
+ 115  continue
+
+!$omp do reduction(+:sx,sy)
+      do 150 k = 1, np
+         kk = k_offset + k 
+         t1 = s
+         t2 = an
+
+c        Find starting seed t1 for this kk.
+
+         do 120 i = 1, 100
+            ik = kk / 2
+            if (2 * ik .ne. kk) t3 = randlc(t1, t2)
+            if (ik .eq. 0) goto 130
+            t3 = randlc(t2, t2)
+            kk = ik
+ 120     continue
+
+c        Compute uniform pseudorandom numbers.
+ 130     continue
+
+         if (timers_enabled) call timer_start(3)
+         call vranlc(2 * nk, t1, a, x)
+         if (timers_enabled) call timer_stop(3)
+
+c        Compute Gaussian deviates by acceptance-rejection method and 
+c        tally counts in concentric square annuli.  This loop is not 
+c        vectorizable. 
+
+         if (timers_enabled) call timer_start(2)
+
+         do 140 i = 1, nk
+            x1 = 2.d0 * x(2*i-1) - 1.d0
+            x2 = 2.d0 * x(2*i) - 1.d0
+            t1 = x1 ** 2 + x2 ** 2
+            if (t1 .le. 1.d0) then
+               t2   = sqrt(-2.d0 * log(t1) / t1)
+               t3   = (x1 * t2)
+               t4   = (x2 * t2)
+               l    = max(abs(t3), abs(t4))
+               qq(l) = qq(l) + 1.d0
+               sx   = sx + t3
+               sy   = sy + t4
+            endif
+ 140     continue
+
+         if (timers_enabled) call timer_stop(2)
+
+ 150  continue
+!$omp end do nowait
+
+      do 155 i = 0, nq - 1
+!$omp atomic
+         q(i) = q(i) + qq(i)
+ 155  continue
+!$omp end parallel
+
+      do 160 i = 0, nq - 1
+         gc = gc + q(i)
+ 160  continue
+
+      call timer_stop(1)
+      tm  = timer_read(1)
+
+#ifdef HOOKS
+      call roi_end
+#endif
+
+      nit=0
+      verified = .true.
+      if (m.eq.24) then
+         sx_verify_value = -3.247834652034740D+3
+         sy_verify_value = -6.958407078382297D+3
+      elseif (m.eq.25) then
+         sx_verify_value = -2.863319731645753D+3
+         sy_verify_value = -6.320053679109499D+3
+      elseif (m.eq.28) then
+         sx_verify_value = -4.295875165629892D+3
+         sy_verify_value = -1.580732573678431D+4
+      elseif (m.eq.30) then
+         sx_verify_value =  4.033815542441498D+4
+         sy_verify_value = -2.660669192809235D+4
+      elseif (m.eq.32) then
+         sx_verify_value =  4.764367927995374D+4
+         sy_verify_value = -8.084072988043731D+4
+      elseif (m.eq.36) then
+         sx_verify_value =  1.982481200946593D+5
+         sy_verify_value = -1.020596636361769D+5
+      elseif (m.eq.40) then
+         sx_verify_value = -5.319717441530D+05
+         sy_verify_value = -3.688834557731D+05
+      else
+         verified = .false.
+      endif
+      if (verified) then
+         sx_err = abs((sx - sx_verify_value)/sx_verify_value)
+         sy_err = abs((sy - sy_verify_value)/sy_verify_value)
+         verified = ((sx_err.le.epsilon) .and. (sy_err.le.epsilon))
+      endif
+      Mops = 2.d0**(m+1)/tm/1000000.d0
+
+      write (6,11) tm, m, gc, sx, sy, (i, q(i), i = 0, nq - 1)
+ 11   format ('EP Benchmark Results:'//'CPU Time =',f10.4/'N = 2^',
+     >        i5/'No. Gaussian Pairs =',f15.0/'Sums = ',1p,2d25.15/
+     >        'Counts:'/(i3,0p,f15.0))
+
+      call print_results('EP', class, m+1, 0, 0, nit,
+     >                   tm, Mops, 
+     >                   'Random numbers generated', 
+     >                   verified, npbversion, compiletime, cs1,
+     >                   cs2, cs3, cs4, cs5, cs6, cs7)
+
+
+      if (timers_enabled) then
+         if (tm .le. 0.d0) tm = 1.0
+         tt = timer_read(1)
+         print 810, 'Total time:    ', tt, tt*100./tm
+         tt = timer_read(2)
+         print 810, 'Gaussian pairs:', tt, tt*100./tm
+         tt = timer_read(3)
+         print 810, 'Random numbers:', tt, tt*100./tm
+810      format(1x,a,f9.3,' (',f6.2,'%)')
+      endif
+
+
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/FT/Makefile b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/FT/Makefile
new file mode 100644
index 0000000..996049a
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/FT/Makefile
@@ -0,0 +1,32 @@
+SHELL=/bin/sh
+BENCHMARK=ft
+BENCHMARKU=FT
+
+include ../config/make.def
+
+include ../sys/make.common
+
+OBJS = ft.o ${COMMON}/${RAND}.o ${COMMON}/print_results.o \
+       ${COMMON}/timers.o ${COMMON}/wtime.o
+ifeq (${HOOKS}, 1)
+        OBJS += ${COMMON}/hooks.o ${COMMON}/m5op_x86.o ${COMMON}/m5_mmap.o
+endif
+
+${PROGRAM}: config ${OBJS}
+	${FLINK} ${FLINKFLAGS} -no-pie -o ${PROGRAM} ${OBJS} ${F_LIB}
+
+
+
+.f.o:
+ifeq (${HOOKS}, 1)
+	${FCOMPILE} -DHOOKS $<
+else
+	${FCOMPILE} $<
+endif
+
+ft.o:             ft.f  global.h npbparams.h
+
+clean:
+	- rm -f *.o *~ mputil*
+	- rm -f ft npbparams.h core
+	- if [ -d rii_files ]; then rm -r rii_files; fi
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/FT/README b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/FT/README
new file mode 100644
index 0000000..ab08b36
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/FT/README
@@ -0,0 +1,5 @@
+This code implements the time integration of a three-dimensional
+partial differential equation using the Fast Fourier Transform.
+Some of the dimension statements are not F77 conforming and will
+not work using the g77 compiler. All dimension statements,
+however, are legal F90.
\ No newline at end of file
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/FT/ft.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/FT/ft.f
new file mode 100644
index 0000000..0e92077
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/FT/ft.f
@@ -0,0 +1,1113 @@
+!-------------------------------------------------------------------------!
+!                                                                         !
+!        N  A  S     P A R A L L E L     B E N C H M A R K S  3.3         !
+!                                                                         !
+!                       O p e n M P     V E R S I O N                     !
+!                                                                         !
+!                                   F T                                   !
+!                                                                         !
+!-------------------------------------------------------------------------!
+!                                                                         !
+!    This benchmark is an OpenMP version of the NPB FT code.              !
+!    It is described in NAS Technical Report 99-011.                      !
+!                                                                         !
+!    Permission to use, copy, distribute and modify this software         !
+!    for any purpose with or without fee is hereby granted.  We           !
+!    request, however, that all derived work reference the NAS            !
+!    Parallel Benchmarks 3.3. This software is provided "as is"           !
+!    without express or implied warranty.                                 !
+!                                                                         !
+!    Information on NPB 3.3, including the technical report, the          !
+!    original specifications, source code, results and information        !
+!    on how to submit new results, is available at:                       !
+!                                                                         !
+!           http://www.nas.nasa.gov/Software/NPB/                         !
+!                                                                         !
+!    Send comments or suggestions to  npb@nas.nasa.gov                    !
+!                                                                         !
+!          NAS Parallel Benchmarks Group                                  !
+!          NASA Ames Research Center                                      !
+!          Mail Stop: T27A-1                                              !
+!          Moffett Field, CA   94035-1000                                 !
+!                                                                         !
+!          E-mail:  npb@nas.nasa.gov                                      !
+!          Fax:     (650) 604-3957                                        !
+!                                                                         !
+!-------------------------------------------------------------------------!
+
+c---------------------------------------------------------------------
+c
+c Authors: D. Bailey
+c          W. Saphir
+c          H. Jin
+c
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c FT benchmark
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      program ft
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+
+      implicit none
+
+      include 'global.h'
+      integer i
+      
+c---------------------------------------------------------------------
+c u0, u1, u2 are the main arrays in the problem. 
+c Depending on the decomposition, these arrays will have different 
+c dimensions. To accommodate all possibilities, we allocate them as 
+c one-dimensional arrays and pass them to subroutines for different 
+c views
+c  - u0 contains the (transformed) initial condition
+c  - u1 and u2 are working arrays
+c  - twiddle contains exponents for the time evolution operator. 
+c---------------------------------------------------------------------
+
+      double complex   u0(ntotalp), 
+     >                 u1(ntotalp)
+c     >                 u2(ntotalp)
+      double precision twiddle(ntotalp)
+c---------------------------------------------------------------------
+c Large arrays are in common so that they are allocated on the
+c heap rather than the stack. This common block is not
+c referenced directly anywhere else. Padding is to avoid accidental 
+c cache problems, since all array sizes are powers of two.
+c---------------------------------------------------------------------
+
+c      double complex pad1(3), pad2(3), pad3(3)
+c      common /bigarrays/ u0, pad1, u1, pad2, u2, pad3, twiddle
+      double complex pad1(3), pad2(3)
+      common /bigarrays/ u0, pad1, u1, pad2, twiddle
+
+      integer iter
+      double precision total_time, mflops
+      logical verified
+      character class
+
+c---------------------------------------------------------------------
+c Run the entire problem once to make sure all data is touched. 
+c This reduces variable startup costs, which is important for such a 
+c short benchmark. The other NPB 2 implementations are similar. 
+c---------------------------------------------------------------------
+      do i = 1, t_max
+         call timer_clear(i)
+      end do
+      call setup()
+      call init_ui(u0, u1, twiddle, dims(1), dims(2), dims(3))
+      call compute_indexmap(twiddle, dims(1), dims(2), dims(3))
+      call compute_initial_conditions(u1, dims(1), dims(2), dims(3))
+      call fft_init (dims(1))
+      call fft(1, u1, u0)
+
+c---------------------------------------------------------------------
+c Start over from the beginning. Note that all operations must
+c be timed, in contrast to other benchmarks. 
+c---------------------------------------------------------------------
+      do i = 1, t_max
+         call timer_clear(i)
+      end do
+#ifdef HOOKS
+      call roi_begin
+#endif
+
+      call timer_start(T_total)
+      if (timers_enabled) call timer_start(T_setup)
+
+      call compute_indexmap(twiddle, dims(1), dims(2), dims(3))
+
+      call compute_initial_conditions(u1, dims(1), dims(2), dims(3))
+
+      call fft_init (dims(1))
+
+      if (timers_enabled) call timer_stop(T_setup)
+      if (timers_enabled) call timer_start(T_fft)
+      call fft(1, u1, u0)
+      if (timers_enabled) call timer_stop(T_fft)
+
+      do iter = 1, niter
+         if (timers_enabled) call timer_start(T_evolve)
+         call evolve(u0, u1, twiddle, dims(1), dims(2), dims(3))
+         if (timers_enabled) call timer_stop(T_evolve)
+         if (timers_enabled) call timer_start(T_fft)
+c         call fft(-1, u1, u2)
+         call fft(-1, u1, u1)
+         if (timers_enabled) call timer_stop(T_fft)
+         if (timers_enabled) call timer_start(T_checksum)
+c         call checksum(iter, u2, dims(1), dims(2), dims(3))
+         call checksum(iter, u1, dims(1), dims(2), dims(3))
+         if (timers_enabled) call timer_stop(T_checksum)
+      end do
+
+      call verify(nx, ny, nz, niter, verified, class)
+
+      call timer_stop(t_total)
+      total_time = timer_read(t_total)
+
+#ifdef HOOKS
+      call roi_end
+#endif
+
+      if( total_time .ne. 0. ) then
+         mflops = 1.0d-6*float(ntotal) *
+     >             (14.8157+7.19641*log(float(ntotal))
+     >          +  (5.23518+7.21113*log(float(ntotal)))*niter)
+     >                 /total_time
+      else
+         mflops = 0.0
+      endif
+      call print_results('FT', class, nx, ny, nz, niter,
+     >  total_time, mflops, '          floating point', verified, 
+     >  npbversion, compiletime, cs1, cs2, cs3, cs4, cs5, cs6, cs7)
+      if (timers_enabled) call print_timers()
+
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine init_ui(u0, u1, twiddle, d1, d2, d3)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c touch all the big data
+c---------------------------------------------------------------------
+
+      implicit none
+      integer d1, d2, d3
+      double complex   u0(d1+1,d2,d3)
+      double complex   u1(d1+1,d2,d3)
+      double precision twiddle(d1+1,d2,d3)
+      integer i, j, k
+
+!$omp parallel do default(shared) private(i,j,k)
+      do k = 1, d3
+         do j = 1, d2
+            do i = 1, d1
+               u0(i,j,k) = 0.d0
+               u1(i,j,k) = 0.d0
+               twiddle(i,j,k) = 0.d0
+            end do
+         end do
+      end do
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine evolve(u0, u1, twiddle, d1, d2, d3)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c evolve u0 -> u1 (t time steps) in Fourier space
+c---------------------------------------------------------------------
+
+      implicit none
+      include 'global.h'
+      integer d1, d2, d3
+      double complex   u0(d1+1,d2,d3)
+      double complex   u1(d1+1,d2,d3)
+      double precision twiddle(d1+1,d2,d3)
+      integer i, j, k
+
+!$omp parallel do default(shared) private(i,j,k)
+      do k = 1, d3
+         do j = 1, d2
+            do i = 1, d1
+               u0(i,j,k) = u0(i,j,k) * twiddle(i,j,k)
+               u1(i,j,k) = u0(i,j,k)
+            end do
+         end do
+      end do
+
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine compute_initial_conditions(u0, d1, d2, d3)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c Fill in array u0 with initial conditions from 
+c random number generator 
+c---------------------------------------------------------------------
+      implicit none
+      include 'global.h'
+      integer d1, d2, d3
+      double complex u0(d1+1, d2, d3)
+      integer k, j
+      double precision x0, start, an, dummy, starts(nz)
+      
+
+      start = seed
+c---------------------------------------------------------------------
+c Jump to the starting element for our first plane.
+c---------------------------------------------------------------------
+      call ipow46(a, 0, an)
+      dummy = randlc(start, an)
+      call ipow46(a, 2*nx*ny, an)
+
+      starts(1) = start
+      do k = 2, dims(3)
+         dummy = randlc(start, an)
+         starts(k) = start
+      end do
+      
+c---------------------------------------------------------------------
+c Go through by z planes filling in one square at a time.
+c---------------------------------------------------------------------
+!$omp parallel do default(shared) private(k,j,x0)
+      do k = 1, dims(3) 
+         x0 = starts(k)
+         do j = 1, dims(2) 
+            call vranlc(2*nx, x0, a, u0(1, j, k))
+         end do
+      end do
+
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine ipow46(a, exponent, result)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c compute a^exponent mod 2^46
+c---------------------------------------------------------------------
+
+      implicit none
+      double precision a, result, dummy, q, r
+      integer exponent, n, n2
+      external randlc
+      double precision randlc
+c---------------------------------------------------------------------
+c Use
+c   a^n = a^(n/2)*a^(n/2) if n even else
+c   a^n = a*a^(n-1)       if n odd
+c---------------------------------------------------------------------
+      result = 1
+      if (exponent .eq. 0) return
+      q = a
+      r = 1
+      n = exponent
+
+
+      do while (n .gt. 1)
+         n2 = n/2
+         if (n2 * 2 .eq. n) then
+            dummy = randlc(q, q) 
+            n = n2
+         else
+            dummy = randlc(r, q)
+            n = n-1
+         endif
+      end do
+      dummy = randlc(r, q)
+      result = r
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine setup
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+      include 'global.h'
+
+      integer fstatus
+!$    integer  omp_get_max_threads
+!$    external omp_get_max_threads
+      debug = .FALSE.
+
+      open (unit=2,file='timer.flag',status='old',iostat=fstatus)
+      if (fstatus .eq. 0) then
+         timers_enabled = .true.
+         close(2)
+      else
+         timers_enabled = .false.
+      endif
+
+      write(*, 1000)
+
+      niter = niter_default
+
+      write(*, 1001) nx, ny, nz
+      write(*, 1002) niter
+!$    write(*, 1003) omp_get_max_threads()
+      write(*, *)
+
+
+ 1000 format(//,' NAS Parallel Benchmarks (NPB3.3-OMP)',
+     >          ' - FT Benchmark', /)
+ 1001 format(' Size                : ', i4, 'x', i4, 'x', i4)
+ 1002 format(' Iterations                  :', i7)
+ 1003 format(' Number of available threads :', i7)
+
+      dims(1) = nx
+      dims(2) = ny
+      dims(3) = nz
+
+
+c---------------------------------------------------------------------
+c Set up info for blocking of ffts and transposes.  This improves
+c performance on cache-based systems. Blocking involves
+c working on a chunk of the problem at a time, taking chunks
+c along the first, second, or third dimension. 
+c
+c - In cffts1 blocking is on 2nd dimension (with fft on 1st dim)
+c - In cffts2/3 blocking is on 1st dimension (with fft on 2nd and 3rd dims)
+
+c Since 1st dim is always in processor, we'll assume it's long enough 
+c (default blocking factor is 16 so min size for 1st dim is 16).
+c The only case we have to worry about is cffts1 in a 2d decomposition,
+c so the blocking factor should not be larger than the 2nd dimension. 
+c---------------------------------------------------------------------
+
+      fftblock = fftblock_default
+      fftblockpad = fftblockpad_default
+
+      if (fftblock .ne. fftblock_default) fftblockpad = fftblock+3
+
+      return
+      end
+
+      
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine compute_indexmap(twiddle, d1, d2, d3)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c compute function from local (i,j,k) to ibar^2+jbar^2+kbar^2 
+c for time evolution exponent. 
+c---------------------------------------------------------------------
+
+      implicit none
+      include 'global.h'
+      integer d1, d2, d3
+      double precision twiddle(d1+1, d2, d3)
+      integer i, j, k, kk, kk2, jj, kj2, ii
+      double precision ap
+
+c---------------------------------------------------------------------
+c basically we want to convert the fortran indices 
+c   1 2 3 4 5 6 7 8 
+c to 
+c   0 1 2 3 -4 -3 -2 -1
+c The following magic formula does the trick:
+c mod(i-1+n/2, n) - n/2
+c---------------------------------------------------------------------
+
+      ap = - 4.d0 * alpha * pi *pi
+
+!$omp parallel do default(shared) private(i,j,k,kk,kk2,jj,kj2,ii)
+      do k = 1, dims(3)
+         kk =  mod(k-1+nz/2, nz) - nz/2
+         kk2 = kk*kk
+         do j = 1, dims(2)
+            jj = mod(j-1+ny/2, ny) - ny/2
+            kj2 = jj*jj+kk2
+            do i = 1, dims(1)
+               ii = mod(i-1+nx/2, nx) - nx/2
+               twiddle(i,j,k) = dexp(ap*dble(ii*ii+kj2))
+            end do
+         end do
+      end do
+
+      return
+      end
+
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine print_timers()
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+      integer i
+      include 'global.h'
+      double precision t, t_m
+      character*25 tstrings(T_max)
+      data tstrings / '          total ', 
+     >                '          setup ', 
+     >                '            fft ', 
+     >                '         evolve ', 
+     >                '       checksum ', 
+     >                '           fftx ', 
+     >                '           ffty ', 
+     >                '           fftz ' /
+
+      t_m = timer_read(T_total)
+      if (t_m .le. 0.0d0) t_m = 1.0d0
+      do i = 1, t_max
+         t = timer_read(i)
+         write(*, 100) i, tstrings(i), t, t*100.0/t_m
+      end do
+ 100  format(' timer ', i2, '(', A16,  ') :', F9.4, ' (',F6.2,'%)')
+      return
+      end
+
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine fft(dir, x1, x2)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+      include 'global.h'
+      integer dir
+      double complex x1(ntotalp), x2(ntotalp)
+
+      double complex y1(fftblockpad_default*maxdim),
+     >               y2(fftblockpad_default*maxdim)
+
+c---------------------------------------------------------------------
+c note: x1 and x2 need not be distinct: the main loop calls
+c       fft(-1, u1, u1) to transform u1 in place
+c note: args for cfftsx are (direction, d1, d2, d3, xin, xout, y1, y2)
+c       xin/xout may be the same, which can be somewhat faster
+c---------------------------------------------------------------------
+
+      if (dir .eq. 1) then
+         call cffts1(1, dims(1), dims(2), dims(3), x1, x1, y1, y2)
+         call cffts2(1, dims(1), dims(2), dims(3), x1, x1, y1, y2)
+         call cffts3(1, dims(1), dims(2), dims(3), x1, x2, y1, y2)
+      else
+         call cffts3(-1, dims(1), dims(2), dims(3), x1, x1, y1, y2)
+         call cffts2(-1, dims(1), dims(2), dims(3), x1, x1, y1, y2)
+         call cffts1(-1, dims(1), dims(2), dims(3), x1, x2, y1, y2)
+      endif
+      return
+      end
+
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine cffts1(is, d1, d2, d3, x, xout, y1, y2)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'global.h'
+      integer is, d1, d2, d3, logd1
+      double complex x(d1+1,d2,d3)
+      double complex xout(d1+1,d2,d3)
+      double complex y1(fftblockpad, d1), y2(fftblockpad, d1)
+      integer i, j, k, jj
+
+      logd1 = ilog2(d1)
+
+      if (timers_enabled) call timer_start(T_fftx)
+!$omp parallel do default(shared) private(i,j,k,jj,y1,y2)
+!$omp&  shared(is,logd1,d1)
+      do k = 1, d3
+         do jj = 0, d2 - fftblock, fftblock
+            do j = 1, fftblock
+               do i = 1, d1
+                  y1(j,i) = x(i,j+jj,k)
+               enddo
+            enddo
+            
+            call cfftz (is, logd1, d1, y1, y2)
+
+
+            do j = 1, fftblock
+               do i = 1, d1
+                  xout(i,j+jj,k) = y1(j,i)
+               enddo
+            enddo
+         enddo
+      enddo
+      if (timers_enabled) call timer_stop(T_fftx)
+
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine cffts2(is, d1, d2, d3, x, xout, y1, y2)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'global.h'
+      integer is, d1, d2, d3, logd2
+      double complex x(d1+1,d2,d3)
+      double complex xout(d1+1,d2,d3)
+      double complex y1(fftblockpad, d2), y2(fftblockpad, d2)
+      integer i, j, k, ii
+
+      logd2 = ilog2(d2)
+
+      if (timers_enabled) call timer_start(T_ffty)
+!$omp parallel do default(shared) private(i,j,k,ii,y1,y2)
+!$omp&  shared(is,logd2,d2)
+      do k = 1, d3
+        do ii = 0, d1 - fftblock, fftblock
+           do j = 1, d2
+              do i = 1, fftblock
+                 y1(i,j) = x(i+ii,j,k)
+              enddo
+           enddo
+
+           call cfftz (is, logd2, d2, y1, y2)
+           
+           do j = 1, d2
+              do i = 1, fftblock
+                 xout(i+ii,j,k) = y1(i,j)
+              enddo
+           enddo
+        enddo
+      enddo
+      if (timers_enabled) call timer_stop(T_ffty)
+
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine cffts3(is, d1, d2, d3, x, xout, y1, y2)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'global.h'
+      integer is, d1, d2, d3, logd3
+      double complex x(d1+1,d2,d3)
+      double complex xout(d1+1,d2,d3)
+      double complex y1(fftblockpad, d3), y2(fftblockpad, d3)
+      integer i, j, k, ii
+
+      logd3 = ilog2(d3)
+
+      if (timers_enabled) call timer_start(T_fftz)
+!$omp parallel do default(shared) private(i,j,k,ii,y1,y2)
+!$omp&  shared(is)
+      do j = 1, d2
+        do ii = 0, d1 - fftblock, fftblock
+           do k = 1, d3
+              do i = 1, fftblock
+                 y1(i,k) = x(i+ii,j,k)
+              enddo
+           enddo
+
+           call cfftz (is, logd3, d3, y1, y2)
+
+           do k = 1, d3
+              do i = 1, fftblock
+                 xout(i+ii,j,k) = y1(i,k)
+              enddo
+           enddo
+        enddo
+      enddo
+      if (timers_enabled) call timer_stop(T_fftz)
+
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine fft_init (n)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c compute the roots-of-unity array that will be used for subsequent FFTs. 
+c---------------------------------------------------------------------
+
+      implicit none
+      include 'global.h'
+
+      integer m,n,nu,ku,i,j,ln
+      double precision t, ti
+
+
+c---------------------------------------------------------------------
+c   Initialize the U array with sines and cosines in a manner that permits
+c   stride one access at each FFT iteration.
+c---------------------------------------------------------------------
+      nu = n
+      m = ilog2(n)
+      u(1) = m
+      ku = 2
+      ln = 1
+
+      do j = 1, m
+         t = pi / ln
+         
+         do i = 0, ln - 1
+            ti = i * t
+            u(i+ku) = dcmplx (cos (ti), sin(ti))
+         enddo
+         
+         ku = ku + ln
+         ln = 2 * ln
+      enddo
+      
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine cfftz (is, m, n, x, y)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   Computes NY N-point complex-to-complex FFTs of X using an algorithm due
+c   to Swarztrauber.  X is both the input and the output array, while Y is a 
+c   scratch array.  It is assumed that N = 2^M.  Before calling CFFTZ to 
+c   perform FFTs, the array U must be initialized by calling FFT_INIT with
+c   N = 2^MX, where MX is the maximum value of M for any 
+c   subsequent call.
+c---------------------------------------------------------------------
+
+      implicit none
+      include 'global.h'
+
+      integer is,m,n,i,j,l,mx
+      double complex x, y
+
+      dimension x(fftblockpad,n), y(fftblockpad,n)
+
+c---------------------------------------------------------------------
+c   Check if input parameters are invalid.
+c---------------------------------------------------------------------
+      mx = u(1)
+      if ((is .ne. 1 .and. is .ne. -1) .or. m .lt. 1 .or. m .gt. mx)    
+     >  then
+        write (*, 1)  is, m, mx
+ 1      format ('CFFTZ: Either U has not been initialized, or else'/    
+     >    'one of the input parameters is invalid', 3I5)
+        stop
+      endif
+
+c---------------------------------------------------------------------
+c   Perform one variant of the Stockham FFT.
+c---------------------------------------------------------------------
+      do l = 1, m, 2
+        call fftz2 (is, l, m, n, fftblock, fftblockpad, u, x, y)
+        if (l .eq. m) goto 160
+        call fftz2 (is, l + 1, m, n, fftblock, fftblockpad, u, y, x)
+      enddo
+
+      goto 180
+
+c---------------------------------------------------------------------
+c   Copy Y to X.
+c---------------------------------------------------------------------
+ 160  do j = 1, n
+        do i = 1, fftblock
+          x(i,j) = y(i,j)
+        enddo
+      enddo
+
+ 180  continue
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine fftz2 (is, l, m, n, ny, ny1, u, x, y)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   Performs the L-th iteration of the second variant of the Stockham FFT.
+c---------------------------------------------------------------------
+
+      implicit none
+
+      integer is,k,l,m,n,ny,ny1,n1,li,lj,lk,ku,i,j,i11,i12,i21,i22
+      double complex u,x,y,u1,x11,x21
+      dimension u(n), x(ny1,n), y(ny1,n)
+
+
+c---------------------------------------------------------------------
+c   Set initial parameters.
+c---------------------------------------------------------------------
+
+      n1 = n / 2
+      lk = 2 ** (l - 1)
+      li = 2 ** (m - l)
+      lj = 2 * lk
+      ku = li + 1
+
+      do i = 0, li - 1
+        i11 = i * lk + 1
+        i12 = i11 + n1
+        i21 = i * lj + 1
+        i22 = i21 + lk
+        if (is .ge. 1) then
+          u1 = u(ku+i)
+        else
+          u1 = dconjg (u(ku+i))
+        endif
+
+c---------------------------------------------------------------------
+c   This loop is vectorizable.
+c---------------------------------------------------------------------
+        do k = 0, lk - 1
+          do j = 1, ny
+            x11 = x(j,i11+k)
+            x21 = x(j,i12+k)
+            y(j,i21+k) = x11 + x21
+            y(j,i22+k) = u1 * (x11 - x21)
+          enddo
+        enddo
+      enddo
+
+      return
+      end
+
+c---------------------------------------------------------------------
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      integer function ilog2(n)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+      integer n, nn, lg
+      if (n .eq. 1) then
+         ilog2=0
+         return
+      endif
+      lg = 1
+      nn = 2
+      do while (nn .lt. n)
+         nn = nn*2
+         lg = lg+1
+      end do
+      ilog2 = lg
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine checksum(i, u1, d1, d2, d3)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+      include 'global.h'
+      integer i, d1, d2, d3
+      double complex u1(d1+1,d2,d3)
+      integer j, q,r,s
+      double complex chk
+      chk = (0.0,0.0)
+
+!$omp parallel do default(shared) private(j,q,r,s) reduction(+:chk)
+      do j=1,1024
+         q = mod(j, nx)+1
+         r = mod(3*j,ny)+1
+         s = mod(5*j,nz)+1
+         chk=chk+u1(q,r,s)
+      end do
+
+      chk = chk/dble(ntotal)
+      
+      write (*, 30) i, chk
+ 30   format (' T =',I5,5X,'Checksum =',1P2D22.12)
+      sums(i) = chk
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine verify (d1, d2, d3, nt, verified, class)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+      include 'global.h'
+      integer d1, d2, d3, nt
+      character class
+      logical verified
+      integer i
+      double precision err, epsilon
+
+c---------------------------------------------------------------------
+c   Reference checksums
+c---------------------------------------------------------------------
+      double complex csum_ref(25)
+
+
+      class = 'U'
+
+      epsilon = 1.0d-12
+      verified = .FALSE.
+
+      if (d1 .eq. 64 .and.
+     >    d2 .eq. 64 .and.
+     >    d3 .eq. 64 .and.
+     >    nt .eq. 6) then
+c---------------------------------------------------------------------
+c   Sample size reference checksums
+c---------------------------------------------------------------------
+         class = 'S'
+         csum_ref(1) = dcmplx(5.546087004964D+02, 4.845363331978D+02)
+         csum_ref(2) = dcmplx(5.546385409189D+02, 4.865304269511D+02)
+         csum_ref(3) = dcmplx(5.546148406171D+02, 4.883910722336D+02)
+         csum_ref(4) = dcmplx(5.545423607415D+02, 4.901273169046D+02)
+         csum_ref(5) = dcmplx(5.544255039624D+02, 4.917475857993D+02)
+         csum_ref(6) = dcmplx(5.542683411902D+02, 4.932597244941D+02)
+
+      else if (d1 .eq. 128 .and.
+     >    d2 .eq. 128 .and.
+     >    d3 .eq. 32 .and.
+     >    nt .eq. 6) then
+c---------------------------------------------------------------------
+c   Class W size reference checksums
+c---------------------------------------------------------------------
+         class = 'W'
+         csum_ref(1) = dcmplx(5.673612178944D+02, 5.293246849175D+02)
+         csum_ref(2) = dcmplx(5.631436885271D+02, 5.282149986629D+02)
+         csum_ref(3) = dcmplx(5.594024089970D+02, 5.270996558037D+02)
+         csum_ref(4) = dcmplx(5.560698047020D+02, 5.260027904925D+02)
+         csum_ref(5) = dcmplx(5.530898991250D+02, 5.249400845633D+02)
+         csum_ref(6) = dcmplx(5.504159734538D+02, 5.239212247086D+02)
+
+      else if (d1 .eq. 256 .and.
+     >    d2 .eq. 256 .and.
+     >    d3 .eq. 128 .and.
+     >    nt .eq. 6) then
+c---------------------------------------------------------------------
+c   Class A size reference checksums
+c---------------------------------------------------------------------
+         class = 'A'
+         csum_ref(1) = dcmplx(5.046735008193D+02, 5.114047905510D+02)
+         csum_ref(2) = dcmplx(5.059412319734D+02, 5.098809666433D+02)
+         csum_ref(3) = dcmplx(5.069376896287D+02, 5.098144042213D+02)
+         csum_ref(4) = dcmplx(5.077892868474D+02, 5.101336130759D+02)
+         csum_ref(5) = dcmplx(5.085233095391D+02, 5.104914655194D+02)
+         csum_ref(6) = dcmplx(5.091487099959D+02, 5.107917842803D+02)
+      
+      else if (d1 .eq. 512 .and.
+     >    d2 .eq. 256 .and.
+     >    d3 .eq. 256 .and.
+     >    nt .eq. 20) then
+c---------------------------------------------------------------------
+c   Class B size reference checksums
+c---------------------------------------------------------------------
+         class = 'B'
+         csum_ref(1)  = dcmplx(5.177643571579D+02, 5.077803458597D+02)
+         csum_ref(2)  = dcmplx(5.154521291263D+02, 5.088249431599D+02)
+         csum_ref(3)  = dcmplx(5.146409228649D+02, 5.096208912659D+02)
+         csum_ref(4)  = dcmplx(5.142378756213D+02, 5.101023387619D+02)
+         csum_ref(5)  = dcmplx(5.139626667737D+02, 5.103976610617D+02)
+         csum_ref(6)  = dcmplx(5.137423460082D+02, 5.105948019802D+02)
+         csum_ref(7)  = dcmplx(5.135547056878D+02, 5.107404165783D+02)
+         csum_ref(8)  = dcmplx(5.133910925466D+02, 5.108576573661D+02)
+         csum_ref(9)  = dcmplx(5.132470705390D+02, 5.109577278523D+02)
+         csum_ref(10) = dcmplx(5.131197729984D+02, 5.110460304483D+02)
+         csum_ref(11) = dcmplx(5.130070319283D+02, 5.111252433800D+02)
+         csum_ref(12) = dcmplx(5.129070537032D+02, 5.111968077718D+02)
+         csum_ref(13) = dcmplx(5.128182883502D+02, 5.112616233064D+02)
+         csum_ref(14) = dcmplx(5.127393733383D+02, 5.113203605551D+02)
+         csum_ref(15) = dcmplx(5.126691062020D+02, 5.113735928093D+02)
+         csum_ref(16) = dcmplx(5.126064276004D+02, 5.114218460548D+02)
+         csum_ref(17) = dcmplx(5.125504076570D+02, 5.114656139760D+02)
+         csum_ref(18) = dcmplx(5.125002331720D+02, 5.115053595966D+02)
+         csum_ref(19) = dcmplx(5.124551951846D+02, 5.115415130407D+02)
+         csum_ref(20) = dcmplx(5.124146770029D+02, 5.115744692211D+02)
+
+      else if (d1 .eq. 512 .and.
+     >    d2 .eq. 512 .and.
+     >    d3 .eq. 512 .and.
+     >    nt .eq. 20) then
+c---------------------------------------------------------------------
+c   Class C size reference checksums
+c---------------------------------------------------------------------
+         class = 'C'
+         csum_ref(1)  = dcmplx(5.195078707457D+02, 5.149019699238D+02)
+         csum_ref(2)  = dcmplx(5.155422171134D+02, 5.127578201997D+02)
+         csum_ref(3)  = dcmplx(5.144678022222D+02, 5.122251847514D+02)
+         csum_ref(4)  = dcmplx(5.140150594328D+02, 5.121090289018D+02)
+         csum_ref(5)  = dcmplx(5.137550426810D+02, 5.121143685824D+02)
+         csum_ref(6)  = dcmplx(5.135811056728D+02, 5.121496764568D+02)
+         csum_ref(7)  = dcmplx(5.134569343165D+02, 5.121870921893D+02)
+         csum_ref(8)  = dcmplx(5.133651975661D+02, 5.122193250322D+02)
+         csum_ref(9)  = dcmplx(5.132955192805D+02, 5.122454735794D+02)
+         csum_ref(10) = dcmplx(5.132410471738D+02, 5.122663649603D+02)
+         csum_ref(11) = dcmplx(5.131971141679D+02, 5.122830879827D+02)
+         csum_ref(12) = dcmplx(5.131605205716D+02, 5.122965869718D+02)
+         csum_ref(13) = dcmplx(5.131290734194D+02, 5.123075927445D+02)
+         csum_ref(14) = dcmplx(5.131012720314D+02, 5.123166486553D+02)
+         csum_ref(15) = dcmplx(5.130760908195D+02, 5.123241541685D+02)
+         csum_ref(16) = dcmplx(5.130528295923D+02, 5.123304037599D+02)
+         csum_ref(17) = dcmplx(5.130310107773D+02, 5.123356167976D+02)
+         csum_ref(18) = dcmplx(5.130103090133D+02, 5.123399592211D+02)
+         csum_ref(19) = dcmplx(5.129905029333D+02, 5.123435588985D+02)
+         csum_ref(20) = dcmplx(5.129714421109D+02, 5.123465164008D+02)
+
+      else if (d1 .eq. 2048 .and.
+     >    d2 .eq. 1024 .and.
+     >    d3 .eq. 1024 .and.
+     >    nt .eq. 25) then
+c---------------------------------------------------------------------
+c   Class D size reference checksums
+c---------------------------------------------------------------------
+         class = 'D'
+         csum_ref(1)  = dcmplx(5.122230065252D+02, 5.118534037109D+02)
+         csum_ref(2)  = dcmplx(5.120463975765D+02, 5.117061181082D+02)
+         csum_ref(3)  = dcmplx(5.119865766760D+02, 5.117096364601D+02)
+         csum_ref(4)  = dcmplx(5.119518799488D+02, 5.117373863950D+02)
+         csum_ref(5)  = dcmplx(5.119269088223D+02, 5.117680347632D+02)
+         csum_ref(6)  = dcmplx(5.119082416858D+02, 5.117967875532D+02)
+         csum_ref(7)  = dcmplx(5.118943814638D+02, 5.118225281841D+02)
+         csum_ref(8)  = dcmplx(5.118842385057D+02, 5.118451629348D+02)
+         csum_ref(9)  = dcmplx(5.118769435632D+02, 5.118649119387D+02)
+         csum_ref(10) = dcmplx(5.118718203448D+02, 5.118820803844D+02)
+         csum_ref(11) = dcmplx(5.118683569061D+02, 5.118969781011D+02)
+         csum_ref(12) = dcmplx(5.118661708593D+02, 5.119098918835D+02)
+         csum_ref(13) = dcmplx(5.118649768950D+02, 5.119210777066D+02)
+         csum_ref(14) = dcmplx(5.118645605626D+02, 5.119307604484D+02)
+         csum_ref(15) = dcmplx(5.118647586618D+02, 5.119391362671D+02)
+         csum_ref(16) = dcmplx(5.118654451572D+02, 5.119463757241D+02)
+         csum_ref(17) = dcmplx(5.118665212451D+02, 5.119526269238D+02)
+         csum_ref(18) = dcmplx(5.118679083821D+02, 5.119580184108D+02)
+         csum_ref(19) = dcmplx(5.118695433664D+02, 5.119626617538D+02)
+         csum_ref(20) = dcmplx(5.118713748264D+02, 5.119666538138D+02)
+         csum_ref(21) = dcmplx(5.118733606701D+02, 5.119700787219D+02)
+         csum_ref(22) = dcmplx(5.118754661974D+02, 5.119730095953D+02)
+         csum_ref(23) = dcmplx(5.118776626738D+02, 5.119755100241D+02)
+         csum_ref(24) = dcmplx(5.118799262314D+02, 5.119776353561D+02)
+         csum_ref(25) = dcmplx(5.118822370068D+02, 5.119794338060D+02)
+
+      else if (d1 .eq. 4096 .and.
+     >    d2 .eq. 2048 .and.
+     >    d3 .eq. 2048 .and.
+     >    nt .eq. 25) then
+c---------------------------------------------------------------------
+c   Class E size reference checksums
+c---------------------------------------------------------------------
+         class = 'E'
+         csum_ref(1)  = dcmplx(5.121601045346D+02, 5.117395998266D+02)
+         csum_ref(2)  = dcmplx(5.120905403678D+02, 5.118614716182D+02)
+         csum_ref(3)  = dcmplx(5.120623229306D+02, 5.119074203747D+02)
+         csum_ref(4)  = dcmplx(5.120438418997D+02, 5.119345900733D+02)
+         csum_ref(5)  = dcmplx(5.120311521872D+02, 5.119551325550D+02)
+         csum_ref(6)  = dcmplx(5.120226088809D+02, 5.119720179919D+02)
+         csum_ref(7)  = dcmplx(5.120169296534D+02, 5.119861371665D+02)
+         csum_ref(8)  = dcmplx(5.120131225172D+02, 5.119979364402D+02)
+         csum_ref(9)  = dcmplx(5.120104767108D+02, 5.120077674092D+02)
+         csum_ref(10) = dcmplx(5.120085127969D+02, 5.120159443121D+02)
+         csum_ref(11) = dcmplx(5.120069224127D+02, 5.120227453670D+02)
+         csum_ref(12) = dcmplx(5.120055158164D+02, 5.120284096041D+02)
+         csum_ref(13) = dcmplx(5.120041820159D+02, 5.120331373793D+02)
+         csum_ref(14) = dcmplx(5.120028605402D+02, 5.120370938679D+02)
+         csum_ref(15) = dcmplx(5.120015223011D+02, 5.120404138831D+02)
+         csum_ref(16) = dcmplx(5.120001570022D+02, 5.120432068837D+02)
+         csum_ref(17) = dcmplx(5.119987650555D+02, 5.120455615860D+02)
+         csum_ref(18) = dcmplx(5.119973525091D+02, 5.120475499442D+02)
+         csum_ref(19) = dcmplx(5.119959279472D+02, 5.120492304629D+02)
+         csum_ref(20) = dcmplx(5.119945006558D+02, 5.120506508902D+02)
+         csum_ref(21) = dcmplx(5.119930795911D+02, 5.120518503782D+02)
+         csum_ref(22) = dcmplx(5.119916728462D+02, 5.120528612016D+02)
+         csum_ref(23) = dcmplx(5.119902874185D+02, 5.120537101195D+02)
+         csum_ref(24) = dcmplx(5.119889291565D+02, 5.120544194514D+02)
+         csum_ref(25) = dcmplx(5.119876028049D+02, 5.120550079284D+02)
+
+      endif
+
+
+      if (class .ne. 'U') then
+
+         do i = 1, nt
+            err = abs( (sums(i) - csum_ref(i)) / csum_ref(i) )
+            if (.not.(err .le. epsilon)) goto 100
+         end do
+         verified = .TRUE.
+ 100     continue
+
+      endif
+
+         
+      if (class .ne. 'U') then
+         if (verified) then
+            write(*,2000)
+ 2000       format(' Result verification successful')
+         else
+            write(*,2001)
+ 2001       format(' Result verification failed')
+         endif
+      endif
+      print *, 'class = ', class
+
+      return
+      end
+
+
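Review note: the verification loop at the end of this routine accepts a run only if every checksum is within a relative tolerance of its reference value, and it writes the test as `.not.(err .le. epsilon)` so that a NaN error also counts as a failure. The following is a minimal C sketch of that check, not the benchmark's own code. The `cplx` type and the `checksums_match` name are hypothetical, the tolerance is passed in as a parameter, and the comparison is done on squared magnitudes to avoid a square root (equivalent to the relative-error test when the reference is nonzero).

```c
#include <stdbool.h>

/* Hypothetical stand-in for Fortran's double complex. */
typedef struct { double re, im; } cplx;

/* Returns true when every sums[i] lies within relative tolerance eps
 * of ref[i].  Arrays are 0-based here, unlike the Fortran original. */
bool checksums_match(const cplx *sums, const cplx *ref, int nt, double eps)
{
    for (int i = 0; i < nt; i++) {
        double dre   = sums[i].re - ref[i].re;
        double dim   = sums[i].im - ref[i].im;
        double diff2 = dre * dre + dim * dim;
        double ref2  = ref[i].re * ref[i].re + ref[i].im * ref[i].im;

        /* Negated "<=" on purpose: any comparison with NaN is false,
         * so a NaN checksum fails verification instead of passing. */
        if (!(diff2 <= eps * eps * ref2))
            return false;
    }
    return true;
}
```

A caller would fill `ref` with the class-specific reference values and pass the iteration count as `nt`; a `false` result corresponds to the "Result verification failed" branch.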
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/FT/global.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/FT/global.h
new file mode 100644
index 0000000..5142859
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/FT/global.h
@@ -0,0 +1,100 @@
+      include 'npbparams.h'
+
+
+c If processor array is 1x1 -> 0D grid decomposition
+
+
+c Cache blocking params. These values are good for most
+c RISC processors.  
+c FFT parameters:
+c  fftblock controls how many ffts are done at a time. 
+c  The default is appropriate for most cache-based machines
+c  On vector machines, the FFT can be vectorized with vector
+c  length equal to the block size, so the block size should
+c  be as large as possible. This is the size of the smallest
+c  dimension of the problem: 128 for class A, 256 for class B and
+c  512 for class C.
+
+      integer fftblock_default, fftblockpad_default
+c      parameter (fftblock_default=16, fftblockpad_default=18)
+      parameter (fftblock_default=32, fftblockpad_default=33)
+      
+      integer fftblock, fftblockpad
+      common /blockinfo/ fftblock, fftblockpad
+
+c we need a bunch of logic to keep track of how
+c arrays are laid out. 
+
+
+c Note: this serial version is derived from the parallel 0D case
+c of the ft NPB.
+c The computation proceeds logically as
+
+c set up initial conditions
+c fftx(1)
+c transpose (1->2)
+c ffty(2)
+c transpose (2->3)
+c fftz(3)
+c time evolution
+c fftz(3)
+c transpose (3->2)
+c ffty(2)
+c transpose (2->1)
+c fftx(1)
+c compute residual(1)
+
+c for the 0D, 1D, 2D strategies, the layouts look like xxx
+c        
+c            0D        1D        2D
+c 1:        xyz       xyz       xyz
+
+c the array dimensions are stored in dims(coord, phase)
+      integer dims(3)
+      common /layout/ dims
+
+      integer T_total, T_setup, T_fft, T_evolve, T_checksum, 
+     >        T_fftx, T_ffty,
+     >        T_fftz, T_max
+      parameter (T_total = 1, T_setup = 2, T_fft = 3, 
+     >           T_evolve = 4, T_checksum = 5, 
+     >           T_fftx = 6,
+     >           T_ffty = 7,
+     >           T_fftz = 8, T_max = 8)
+
+
+
+      logical timers_enabled
+
+
+      external timer_read
+      double precision timer_read
+      external ilog2
+      integer ilog2
+
+      external randlc
+      double precision randlc
+
+
+c other stuff
+      logical debug, debugsynch
+      common /dbg/ debug, debugsynch, timers_enabled
+
+      double precision seed, a, pi, alpha
+      parameter (seed = 314159265.d0, a = 1220703125.d0, 
+     >  pi = 3.141592653589793238d0, alpha=1.0d-6)
+
+
+c roots of unity array
+c relies on x being largest dimension?
+      double complex u(nxp)
+      common /ucomm/ u
+
+
+c for checksum data
+      double complex sums(0:niter_default)
+      common /sumcomm/ sums
+
+c number of iterations
+      integer niter
+      common /iter/ niter
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/FT/inputft.data.sample b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/FT/inputft.data.sample
new file mode 100644
index 0000000..448ac42
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/FT/inputft.data.sample
@@ -0,0 +1,3 @@
+6   ! number of iterations
+2   ! layout type. 0 = 0d, 1 = 1d, 2 = 2d
+2 4 ! processor layout. 0d must be "1 1"; 1d must be "1 N"
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/IS/Makefile b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/IS/Makefile
new file mode 100644
index 0000000..66de367
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/IS/Makefile
@@ -0,0 +1,34 @@
+SHELL=/bin/sh
+BENCHMARK=is
+BENCHMARKU=IS
+
+include ../config/make.def
+
+include ../sys/make.common
+
+OBJS = is.o \
+       ${COMMON}/c_print_results.o \
+       ${COMMON}/c_timers.o \
+       ${COMMON}/c_wtime.o
+ifeq (${HOOKS}, 1)
+        OBJS += ${COMMON}/hooks.o ${COMMON}/m5op_x86.o ${COMMON}/m5_mmap.o
+endif
+
+
+${PROGRAM}: config ${OBJS}
+	${CLINK} ${CLINKFLAGS} -no-pie -o ${PROGRAM} ${OBJS} ${C_LIB}
+
+.c.o:
+ifeq (${HOOKS}, 1)
+	${CCOMPILE} -DHOOKS $<
+else
+	${CCOMPILE} $<
+endif
+
+is.o:             is.c  npbparams.h
+
+
+clean:
+	- rm -f *.o *~ mputil*
+	- rm -f npbparams.h core
+	- if [ -d rii_files ]; then rm -r rii_files; fi
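Review note: when `HOOKS` is 1, this Makefile compiles C sources with `-DHOOKS` and links `hooks.o`, `m5op_x86.o`, and `m5_mmap.o`. A common way to consume such a flag is to wrap the region-of-interest calls in macros that vanish when the flag is absent. The sketch below assumes that pattern; the `ROI_BEGIN`/`ROI_END` macros and `timed_sum` are hypothetical names, and only `roi_begin_`/`roi_end_` come from the prototypes declared in `is.c`.

```c
/* With -DHOOKS the ROI calls resolve to the functions provided by the
 * hooks object file; without it they compile to nothing. */
#ifdef HOOKS
void roi_begin_(void);
void roi_end_(void);
#define ROI_BEGIN() roi_begin_()
#define ROI_END()   roi_end_()
#else
#define ROI_BEGIN() ((void)0)
#define ROI_END()   ((void)0)
#endif

/* Hypothetical workload: sums 0..n-1 inside the region of interest. */
long timed_sum(long n)
{
    long total = 0;

    ROI_BEGIN();
    for (long i = 0; i < n; i++)
        total += i;
    ROI_END();

    return total;
}
```

Built without `-DHOOKS`, this file needs no extra objects. Built with it, the link step must supply definitions of `roi_begin_` and `roi_end_`, which is what the extra `OBJS` entries above are for.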
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/IS/README.carefully b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/IS/README.carefully
new file mode 100644
index 0000000..f7dc8f2
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/IS/README.carefully
@@ -0,0 +1,49 @@
+Please note:  The IS code in this directory known as is.c is derived
+from the serial version of the NPB2.3 parallel IS.  Although for
+the serial version it is completely unnecessary to have any notion 
+of buckets at all in order to correctly solve the specified NPB1 IS 
+benchmark problem, the buckets seem to be very beneficial in
+parallel versions, including the OpenMP version.
+
+Default setting is
+
+    #define USE_BUCKETS
+
+i.e., buckets turned on!  To switch it off, simply comment out
+the line.
+
+The OpenMP version uses the "dynamic" schedule to improve load
+balance during key sorting.  Sometimes, the use of the "static,1"
+(or cyclic) schedule may yield better performance.  Both options
+are acceptable.  The default setting is "dynamic".  To choose
+the cyclic option, define the line:
+
+    #define SCHED_CYCLIC
+
+
+Here are some notes inherited from NPB2.3-serial:
+Nevertheless, it is possible to turn on bucketing via #ifdef'ed code.
+Then, the sort first rearranges the keys into buckets by range (the
+bucket's ranges evenly subdivide the total key range), and then
+ranks the contents of each bucket.  This results in key transfers
+first into contiguous elements of buckets.  This is relatively
+cache efficient, since there are a relatively small number of buckets.
+Then the key counting that occurs accesses contiguous array elements.
+Once again, accesses reuse cache lines efficiently.  Finally, the 
+accumulation of key multiplicities (the key count) which gives the key
+ranks also reuses cache lines efficiently.
+
+But using the buckets more than doubles the amount of computational
+work that must be performed.  On machines with very large caches, the 
+aforementioned benefits may not exist, and the extra processing looks
+expensive. These examples apply to both CLASS A and B problems:
+
+    SP2-66MhzWN:  50% speedup with buckets                          
+    SGI Indy5000: 50% slowdown with buckets             
+    SGI O2000:   400% slowdown with buckets (Wow!)                
+
+It is a conjecture that cache access is the underlying mechanism 
+causing these variations.
+
+Note: If reporting timing results, either of these modes may be used 
+      without penalty.
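Review note: the bucketed ranking described in this README can be sketched in serial C as follows. This is an illustration under stated assumptions, not the code in `is.c`: the sizes are tiny, the names `bucket_rank`, `KEY_LOG2`, and `BUCKET_LOG2` are hypothetical, and the OpenMP per-thread bookkeeping is left out. Keys are first scattered into buckets by range, then the multiplicity of each key is counted bucket by bucket, and a running sum turns the counts into ranks.

```c
#include <stddef.h>

enum {
    KEY_LOG2    = 4,                 /* keys lie in [0, 16)        */
    BUCKET_LOG2 = 2,                 /* four buckets of four keys  */
    MAX_KEY     = 1 << KEY_LOG2,
    NUM_BUCKETS = 1 << BUCKET_LOG2
};

/* On return, rank_out[v] is the number of keys <= v, and scratch holds
 * the keys grouped by bucket.  scratch must have room for n keys and
 * rank_out for MAX_KEY counters. */
void bucket_rank(const int *keys, size_t n, int *scratch, int *rank_out)
{
    const int shift = KEY_LOG2 - BUCKET_LOG2;
    size_t start[NUM_BUCKETS + 1] = {0};
    size_t next[NUM_BUCKETS];

    /* Count the keys that fall into each bucket. */
    for (size_t i = 0; i < n; i++)
        start[(keys[i] >> shift) + 1]++;

    /* A prefix sum turns the counts into bucket start offsets. */
    for (int b = 0; b < NUM_BUCKETS; b++) {
        start[b + 1] += start[b];
        next[b] = start[b];
    }

    /* Scatter keys into contiguous bucket ranges. */
    for (size_t i = 0; i < n; i++)
        scratch[next[keys[i] >> shift]++] = keys[i];

    /* Count key multiplicities one bucket at a time, so each pass
     * touches only a small, contiguous slice of rank_out. */
    for (int v = 0; v < MAX_KEY; v++)
        rank_out[v] = 0;
    for (int b = 0; b < NUM_BUCKETS; b++)
        for (size_t i = start[b]; i < start[b + 1]; i++)
            rank_out[scratch[i]]++;

    /* Accumulate the multiplicities into ranks. */
    for (int v = 1; v < MAX_KEY; v++)
        rank_out[v] += rank_out[v - 1];
}
```

The scatter pass is the extra work the README refers to: it adds a full pass over the keys in exchange for better locality in the counting passes, which is why the net effect depends on cache size.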
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/IS/is.c b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/IS/is.c
new file mode 100644
index 0000000..617bda3
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/IS/is.c
@@ -0,0 +1,1066 @@
+/*************************************************************************
+ *                                                                       * 
+ *       N  A  S     P A R A L L E L     B E N C H M A R K S  3.3        *
+ *                                                                       *
+ *                      O p e n M P     V E R S I O N                    *
+ *                                                                       * 
+ *                                  I S                                  * 
+ *                                                                       * 
+ ************************************************************************* 
+ *                                                                       * 
+ *   This benchmark is an OpenMP version of the NPB IS code.             *
+ *   It is described in NAS Technical Report 99-011.                     *
+ *                                                                       *
+ *   Permission to use, copy, distribute and modify this software        *
+ *   for any purpose with or without fee is hereby granted.  We          *
+ *   request, however, that all derived work reference the NAS           *
+ *   Parallel Benchmarks 3.3. This software is provided "as is"          *
+ *   without express or implied warranty.                                *
+ *                                                                       *
+ *   Information on NPB 3.3, including the technical report, the         *
+ *   original specifications, source code, results and information       *
+ *   on how to submit new results, is available at:                      *
+ *                                                                       *
+ *          http://www.nas.nasa.gov/Software/NPB/                        *
+ *                                                                       *
+ *   Send comments or suggestions to  npb@nas.nasa.gov                   *
+ *                                                                       *
+ *         NAS Parallel Benchmarks Group                                 *
+ *         NASA Ames Research Center                                     *
+ *         Mail Stop: T27A-1                                             *
+ *         Moffett Field, CA   94035-1000                                *
+ *                                                                       *
+ *         E-mail:  npb@nas.nasa.gov                                     *
+ *         Fax:     (650) 604-3957                                       *
+ *                                                                       *
+ ************************************************************************* 
+ *                                                                       * 
+ *   Author: M. Yarrow                                                   * 
+ *           H. Jin                                                      * 
+ *                                                                       * 
+ *************************************************************************/
+
+#include "npbparams.h"
+#include <stdlib.h>
+#include <stdio.h>
+#ifdef _OPENMP
+#include <omp.h>
+#endif
+
+
+/*****************************************************************/
+/* For serial IS, buckets are not really req'd to solve NPB1 IS  */
+/* spec, but their use on some machines improves performance, on */
+/* other machines the use of buckets compromises performance,    */
+/* probably because it is extra computation which is not req'd.  */
+/* (Note: Mechanism not understood, probably cache related)      */
+/* Example:  SP2-66MhzWN:  50% speedup with buckets              */
+/* Example:  SGI Indy5000: 50% slowdown with buckets             */
+/* Example:  SGI O2000:   400% slowdown with buckets (Wow!)      */
+/*****************************************************************/
+/* To disable the use of buckets, comment out the following line */
+#define USE_BUCKETS
+
+/* Uncomment below for cyclic schedule */
+/*#define SCHED_CYCLIC*/
+
+
+/******************/
+/* default values */
+/******************/
+#ifndef CLASS
+#define CLASS 'S'
+#endif
+
+
+/*************/
+/*  CLASS S  */
+/*************/
+#if CLASS == 'S'
+#define  TOTAL_KEYS_LOG_2    16
+#define  MAX_KEY_LOG_2       11
+#define  NUM_BUCKETS_LOG_2   9
+#endif
+
+
+/*************/
+/*  CLASS W  */
+/*************/
+#if CLASS == 'W'
+#define  TOTAL_KEYS_LOG_2    20
+#define  MAX_KEY_LOG_2       16
+#define  NUM_BUCKETS_LOG_2   10
+#endif
+
+/*************/
+/*  CLASS A  */
+/*************/
+#if CLASS == 'A'
+#define  TOTAL_KEYS_LOG_2    23
+#define  MAX_KEY_LOG_2       19
+#define  NUM_BUCKETS_LOG_2   10
+#endif
+
+
+/*************/
+/*  CLASS B  */
+/*************/
+#if CLASS == 'B'
+#define  TOTAL_KEYS_LOG_2    25
+#define  MAX_KEY_LOG_2       21
+#define  NUM_BUCKETS_LOG_2   10
+#endif
+
+
+/*************/
+/*  CLASS C  */
+/*************/
+#if CLASS == 'C'
+#define  TOTAL_KEYS_LOG_2    27
+#define  MAX_KEY_LOG_2       23
+#define  NUM_BUCKETS_LOG_2   10
+#endif
+
+
+/*************/
+/*  CLASS D  */
+/*************/
+#if CLASS == 'D'
+#define  TOTAL_KEYS_LOG_2    31
+#define  MAX_KEY_LOG_2       27
+#define  NUM_BUCKETS_LOG_2   10
+#endif
+
+
+#if CLASS == 'D'
+#define  TOTAL_KEYS          (1L << TOTAL_KEYS_LOG_2)
+#else
+#define  TOTAL_KEYS          (1 << TOTAL_KEYS_LOG_2)
+#endif
+#define  MAX_KEY             (1 << MAX_KEY_LOG_2)
+#define  NUM_BUCKETS         (1 << NUM_BUCKETS_LOG_2)
+#define  NUM_KEYS            TOTAL_KEYS
+#define  SIZE_OF_BUFFERS     NUM_KEYS  
+                                           
+
+#define  MAX_ITERATIONS      10
+#define  TEST_ARRAY_SIZE     5
+
+
+/*************************************/
+/* Typedef: if necessary, change the */
+/* size of int here by changing the  */
+/* int type to, say, long            */
+/*************************************/
+#if CLASS == 'D'
+typedef  long INT_TYPE;
+#else
+typedef  int  INT_TYPE;
+#endif
+
+
+/********************/
+/* Some global info */
+/********************/
+INT_TYPE *key_buff_ptr_global;         /* used by full_verify to get */
+                                       /* copies of rank info        */
+
+int      passed_verification;
+                                 
+
+/************************************/
+/* These are the three main arrays. */
+/* See SIZE_OF_BUFFERS def above    */
+/************************************/
+INT_TYPE key_array[SIZE_OF_BUFFERS],    
+         key_buff1[MAX_KEY],
+         key_buff2[SIZE_OF_BUFFERS],
+         partial_verify_vals[TEST_ARRAY_SIZE],
+         **key_buff1_aptr = NULL;
+
+#ifdef USE_BUCKETS
+INT_TYPE **bucket_size, 
+         bucket_ptrs[NUM_BUCKETS];
+#pragma omp threadprivate(bucket_ptrs)
+#endif
+
+
+/**********************/
+/* Partial verif info */
+/**********************/
+INT_TYPE test_index_array[TEST_ARRAY_SIZE],
+         test_rank_array[TEST_ARRAY_SIZE],
+
+         S_test_index_array[TEST_ARRAY_SIZE] = 
+                             {48427,17148,23627,62548,4431},
+         S_test_rank_array[TEST_ARRAY_SIZE] = 
+                             {0,18,346,64917,65463},
+
+         W_test_index_array[TEST_ARRAY_SIZE] = 
+                             {357773,934767,875723,898999,404505},
+         W_test_rank_array[TEST_ARRAY_SIZE] = 
+                             {1249,11698,1039987,1043896,1048018},
+
+         A_test_index_array[TEST_ARRAY_SIZE] = 
+                             {2112377,662041,5336171,3642833,4250760},
+         A_test_rank_array[TEST_ARRAY_SIZE] = 
+                             {104,17523,123928,8288932,8388264},
+
+         B_test_index_array[TEST_ARRAY_SIZE] = 
+                             {41869,812306,5102857,18232239,26860214},
+         B_test_rank_array[TEST_ARRAY_SIZE] = 
+                             {33422937,10244,59149,33135281,99}, 
+
+         C_test_index_array[TEST_ARRAY_SIZE] = 
+                             {44172927,72999161,74326391,129606274,21736814},
+         C_test_rank_array[TEST_ARRAY_SIZE] = 
+                             {61147,882988,266290,133997595,133525895},
+
+         D_test_index_array[TEST_ARRAY_SIZE] = 
+                             {1317351170,995930646,1157283250,1503301535,1453734525},
+         D_test_rank_array[TEST_ARRAY_SIZE] = 
+                             {1,36538729,1978098519,2145192618,2147425337};
+
+
+/***********************/
+/* function prototypes */
+/***********************/
+double	randlc( double *X, double *A );
+
+void full_verify( void );
+
+void c_print_results( char   *name,
+                      char   class,
+                      int    n1, 
+                      int    n2,
+                      int    n3,
+                      int    niter,
+                      double t,
+                      double mops,
+		      char   *optype,
+                      int    passed_verification,
+                      char   *npbversion,
+                      char   *compiletime,
+                      char   *cc,
+                      char   *clink,
+                      char   *c_lib,
+                      char   *c_inc,
+                      char   *cflags,
+                      char   *clinkflags );
+
+
+void    timer_clear( int n );
+void    timer_start( int n );
+void    timer_stop( int n );
+double  timer_read( int n );
+
+void roi_begin_();
+void roi_end_();
+
+/*
+ *    FUNCTION RANDLC (X, A)
+ *
+ *  This routine returns a uniform pseudorandom double precision number in the
+ *  range (0, 1) by using the linear congruential generator
+ *
+ *  x_{k+1} = a x_k  (mod 2^46)
+ *
+ *  where 0 < x_k < 2^46 and 0 < a < 2^46.  This scheme generates 2^44 numbers
+ *  before repeating.  The argument A is the same as 'a' in the above formula,
+ *  and X is the same as x_0.  A and X must be odd double precision integers
+ *  in the range (1, 2^46).  The returned value RANDLC is normalized to be
+ *  between 0 and 1, i.e. RANDLC = 2^(-46) * x_1.  X is updated to contain
+ *  the new seed x_1, so that subsequent calls to RANDLC using the same
+ *  arguments will generate a continuous sequence.
+ *
+ *  This routine should produce the same results on any computer with at least
+ *  48 mantissa bits in double precision floating point data.  On Cray systems,
+ *  double precision should be disabled.
+ *
+ *  David H. Bailey     October 26, 1990
+ *
+ *     IMPLICIT DOUBLE PRECISION (A-H, O-Z)
+ *     SAVE KS, R23, R46, T23, T46
+ *     DATA KS/0/
+ *
+ *  If this is the first call to RANDLC, compute R23 = 2 ^ -23, R46 = 2 ^ -46,
+ *  T23 = 2 ^ 23, and T46 = 2 ^ 46.  These are computed in loops, rather than
+ *  by merely using the ** operator, in order to ensure that the results are
+ *  exact on all systems.  This code assumes that 0.5D0 is represented exactly.
+ */
+
+/*****************************************************************/
+/*************           R  A  N  D  L  C             ************/
+/*************                                        ************/
+/*************    portable random number generator    ************/
+/*****************************************************************/
+
+static int      KS=0;
+static double	R23, R46, T23, T46;
+#pragma omp threadprivate(KS, R23, R46, T23, T46)
+
+double	randlc( double *X, double *A )
+{
+      double		T1, T2, T3, T4;
+      double		A1;
+      double		A2;
+      double		X1;
+      double		X2;
+      double		Z;
+      int     		i, j;
+
+      if (KS == 0) 
+      {
+        R23 = 1.0;
+        R46 = 1.0;
+        T23 = 1.0;
+        T46 = 1.0;
+    
+        for (i=1; i<=23; i++)
+        {
+          R23 = 0.50 * R23;
+          T23 = 2.0 * T23;
+        }
+        for (i=1; i<=46; i++)
+        {
+          R46 = 0.50 * R46;
+          T46 = 2.0 * T46;
+        }
+        KS = 1;
+      }
+
+/*  Break A into two parts such that A = 2^23 * A1 + A2 and set X = N.  */
+
+      T1 = R23 * *A;
+      j  = T1;
+      A1 = j;
+      A2 = *A - T23 * A1;
+
+/*  Break X into two parts such that X = 2^23 * X1 + X2, compute
+    Z = A1 * X2 + A2 * X1  (mod 2^23), and then
+    X = 2^23 * Z + A2 * X2  (mod 2^46).                            */
+
+      T1 = R23 * *X;
+      j  = T1;
+      X1 = j;
+      X2 = *X - T23 * X1;
+      T1 = A1 * X2 + A2 * X1;
+      
+      j  = R23 * T1;
+      T2 = j;
+      Z = T1 - T23 * T2;
+      T3 = T23 * Z + A2 * X2;
+      j  = R46 * T3;
+      T4 = j;
+      *X = T3 - T46 * T4;
+      return(R46 * *X);
+} 
+
+
+
+
+/*****************************************************************/
+/************   F  I  N  D  _  M  Y  _  S  E  E  D    ************/
+/************                                         ************/
+/************ returns parallel random number seq seed ************/
+/*****************************************************************/
+
+/*
+ * Create a random number sequence of total length nn residing
+ * on np number of processors.  Each processor will therefore have a
+ * subsequence of length nn/np.  This routine returns that random
+ * number which is the first random number for the subsequence belonging
+ * to processor rank kn, and which is used as seed for proc kn ran # gen.
+ */
+
+double   find_my_seed( int kn,        /* my processor rank, 0<=kn<=num procs */
+                       int np,        /* np = num procs                      */
+                       long nn,       /* total num of ran numbers, all procs */
+                       double s,      /* Ran num seed, for ex.: 314159265.00 */
+                       double a )     /* Ran num gen mult, try 1220703125.00 */
+{
+
+      double t1,t2;
+      long   mq,nq,kk,ik;
+
+      if ( kn == 0 ) return s;
+
+      mq = (nn/4 + np - 1) / np;
+      nq = mq * 4 * kn;               /* number of rans to be skipped */
+
+      t1 = s;
+      t2 = a;
+      kk = nq;
+      while ( kk > 1 ) {
+      	 ik = kk / 2;
+         if( 2 * ik ==  kk ) {
+            (void)randlc( &t2, &t2 );
+	    kk = ik;
+	 }
+	 else {
+            (void)randlc( &t1, &t2 );
+	    kk = kk - 1;
+	 }
+      }
+      (void)randlc( &t1, &t2 );
+
+      return( t1 );
+
+}
+
+
+
+/*****************************************************************/
+/*************      C  R  E  A  T  E  _  S  E  Q      ************/
+/*****************************************************************/
+
+void	create_seq( double seed, double a )
+{
+	double x, s;
+	INT_TYPE i, k;
+
+#pragma omp parallel private(x,s,i,k)
+    {
+	INT_TYPE k1, k2;
+	double an = a;
+	int myid, num_procs;
+        INT_TYPE mq;
+
+#ifdef _OPENMP
+	myid = omp_get_thread_num();
+	num_procs = omp_get_num_threads();
+#else
+	myid = 0;
+	num_procs = 1;
+#endif
+
+	mq = (NUM_KEYS + num_procs - 1) / num_procs;
+	k1 = mq * myid;
+	k2 = k1 + mq;
+	if ( k2 > NUM_KEYS ) k2 = NUM_KEYS;
+
+	KS = 0;
+	s = find_my_seed( myid, num_procs,
+			  (long)4*NUM_KEYS, seed, an );
+
+        k = MAX_KEY/4;
+
+	for (i=k1; i<k2; i++)
+	{
+	    x = randlc(&s, &an);
+	    x += randlc(&s, &an);
+    	    x += randlc(&s, &an);
+	    x += randlc(&s, &an);  
+
+            key_array[i] = k*x;
+	}
+    } /*omp parallel*/
+}
+
+
+
+/*****************************************************************/
+/*****************    Allocate Working Buffer     ****************/
+/*****************************************************************/
+void *alloc_mem( size_t size )
+{
+    void *p;
+
+    p = (void *)malloc(size);
+    if (!p) {
+        perror("Memory allocation error");
+        exit(1);
+    }
+    return p;
+}
+
+void alloc_key_buff( void )
+{
+    INT_TYPE i;
+    int      num_procs;
+
+
+#ifdef _OPENMP
+    num_procs = omp_get_max_threads();
+#else
+    num_procs = 1;
+#endif
+
+#ifdef USE_BUCKETS
+    bucket_size = (INT_TYPE **)alloc_mem(sizeof(INT_TYPE *) * num_procs);
+
+    for (i = 0; i < num_procs; i++) {
+        bucket_size[i] = (INT_TYPE *)alloc_mem(sizeof(INT_TYPE) * NUM_BUCKETS);
+    }
+
+    #pragma omp parallel for
+    for( i=0; i<NUM_KEYS; i++ )
+        key_buff2[i] = 0;
+
+#else /*USE_BUCKETS*/
+
+    key_buff1_aptr = (INT_TYPE **)alloc_mem(sizeof(INT_TYPE *) * num_procs);
+
+    key_buff1_aptr[0] = key_buff1;
+    for (i = 1; i < num_procs; i++) {
+        key_buff1_aptr[i] = (INT_TYPE *)alloc_mem(sizeof(INT_TYPE) * MAX_KEY);
+    }
+
+#endif /*USE_BUCKETS*/
+}
+
+
+
+/*****************************************************************/
+/*************    F  U  L  L  _  V  E  R  I  F  Y     ************/
+/*****************************************************************/
+
+
+void full_verify( void )
+{
+    INT_TYPE   i, j;
+    INT_TYPE   k, k1, k2;
+
+
+/*  Now, finally, sort the keys:  */
+
+/*  Copy keys into work array; keys in key_array will be reassigned. */
+
+#ifdef USE_BUCKETS
+
+    /* Buckets are already sorted.  Sorting keys within each bucket */
+#ifdef SCHED_CYCLIC
+    #pragma omp parallel for private(i,j,k,k1) schedule(static,1)
+#else
+    #pragma omp parallel for private(i,j,k,k1) schedule(dynamic)
+#endif
+    for( j=0; j< NUM_BUCKETS; j++ ) {
+
+        k1 = (j > 0)? bucket_ptrs[j-1] : 0;
+        for ( i = k1; i < bucket_ptrs[j]; i++ ) {
+            k = --key_buff_ptr_global[key_buff2[i]];
+            key_array[k] = key_buff2[i];
+        }
+    }
+
+#else
+
+#pragma omp parallel private(i,j,k,k1,k2)
+  {
+    #pragma omp for
+    for( i=0; i<NUM_KEYS; i++ )
+        key_buff2[i] = key_array[i];
+
+    /* This is actual sorting. Each thread is responsible for 
+       a subset of key values */
+    j = omp_get_num_threads();
+    j = (MAX_KEY + j - 1) / j;
+    k1 = j * omp_get_thread_num();
+    k2 = k1 + j;
+    if (k2 > MAX_KEY) k2 = MAX_KEY;
+
+    for( i=0; i<NUM_KEYS; i++ ) {
+        if (key_buff2[i] >= k1 && key_buff2[i] < k2) {
+            k = --key_buff_ptr_global[key_buff2[i]];
+            key_array[k] = key_buff2[i];
+        }
+    }
+  } /*omp parallel*/
+
+#endif
+
+
+/*  Confirm keys correctly sorted: count incorrectly sorted keys, if any */
+
+    j = 0;
+    #pragma omp parallel for reduction(+:j)
+    for( i=1; i<NUM_KEYS; i++ )
+        if( key_array[i-1] > key_array[i] )
+            j++;
+
+    if( j != 0 )
+        printf( "Full_verify: number of keys out of sort: %ld\n", (long)j );
+    else
+        passed_verification++;
+
+}
+
+
+
+
+/*****************************************************************/
+/*************             R  A  N  K             ****************/
+/*****************************************************************/
+
+
+void rank( int iteration )
+{
+
+    INT_TYPE    i, k;
+    INT_TYPE    *key_buff_ptr, *key_buff_ptr2;
+
+#ifdef USE_BUCKETS
+    int shift = MAX_KEY_LOG_2 - NUM_BUCKETS_LOG_2;
+    INT_TYPE num_bucket_keys = (1L << shift);
+#endif
+
+
+    key_array[iteration] = iteration;
+    key_array[iteration+MAX_ITERATIONS] = MAX_KEY - iteration;
+
+
+/*  Determine where the partial verify test keys are, load into  */
+/*  the partial_verify_vals array                                */
+    for( i=0; i<TEST_ARRAY_SIZE; i++ )
+        partial_verify_vals[i] = key_array[test_index_array[i]];
+
+
+/*  Setup pointers to key buffers  */
+#ifdef USE_BUCKETS
+    key_buff_ptr2 = key_buff2;
+#else
+    key_buff_ptr2 = key_array;
+#endif
+    key_buff_ptr = key_buff1;
+
+
+#pragma omp parallel private(i, k)
+  {
+    INT_TYPE *work_buff, m, k1, k2;
+    int myid = 0, num_procs = 1;
+
+#ifdef _OPENMP
+    myid = omp_get_thread_num();
+    num_procs = omp_get_num_threads();
+#endif
+
+
+/*  Bucket sort is known to improve cache performance on some   */
+/*  cache based systems.  But the actual performance may depend */
+/*  on cache size, problem size. */
+#ifdef USE_BUCKETS
+
+    work_buff = bucket_size[myid];
+
+/*  Initialize */
+    for( i=0; i<NUM_BUCKETS; i++ )  
+        work_buff[i] = 0;
+
+/*  Determine the number of keys in each bucket */
+    #pragma omp for schedule(static)
+    for( i=0; i<NUM_KEYS; i++ )
+        work_buff[key_array[i] >> shift]++;
+
+/*  Cumulative bucket sizes are the bucket pointers.
+    These are global sizes accumulated up to each bucket */
+    bucket_ptrs[0] = 0;
+    for( k=0; k< myid; k++ )  
+        bucket_ptrs[0] += bucket_size[k][0];
+
+    for( i=1; i< NUM_BUCKETS; i++ ) { 
+        bucket_ptrs[i] = bucket_ptrs[i-1];
+        for( k=0; k< myid; k++ )
+            bucket_ptrs[i] += bucket_size[k][i];
+        for( k=myid; k< num_procs; k++ )
+            bucket_ptrs[i] += bucket_size[k][i-1];
+    }
+
+
+/*  Sort into appropriate bucket */
+    #pragma omp for schedule(static)
+    for( i=0; i<NUM_KEYS; i++ )  
+    {
+        k = key_array[i];
+        key_buff2[bucket_ptrs[k >> shift]++] = k;
+    }
+
+/*  The bucket pointers now point to the final accumulated sizes */
+    if (myid < num_procs-1) {
+        for( i=0; i< NUM_BUCKETS; i++ )
+            for( k=myid+1; k< num_procs; k++ )
+                bucket_ptrs[i] += bucket_size[k][i];
+    }
+
+
+/*  Now, buckets are sorted.  We only need to sort keys inside
+    each bucket, which can be done in parallel.  Because the distribution
+    of the number of keys in the buckets is Gaussian, the use of
+    a dynamic schedule should improve load balance and thus performance   */
+
+#ifdef SCHED_CYCLIC
+    #pragma omp for schedule(static,1)
+#else
+    #pragma omp for schedule(dynamic)
+#endif
+    for( i=0; i< NUM_BUCKETS; i++ ) {
+
+/*  Clear the work array section associated with each bucket */
+        k1 = i * num_bucket_keys;
+        k2 = k1 + num_bucket_keys;
+        for ( k = k1; k < k2; k++ )
+            key_buff_ptr[k] = 0;
+
+/*  Ranking of all keys occurs in this section:                 */
+
+/*  In this section, the keys themselves are used as their 
+    own indexes to determine how many of each there are: their
+    individual population                                       */
+        m = (i > 0)? bucket_ptrs[i-1] : 0;
+        for ( k = m; k < bucket_ptrs[i]; k++ )
+            key_buff_ptr[key_buff_ptr2[k]]++;  /* Now they have individual key   */
+                                       /* population                     */
+
+/*  To obtain ranks of each key, successively add the individual key
+    population, not forgetting to add m, the total of lesser keys,
+    to the first key population                                          */
+        key_buff_ptr[k1] += m;
+        for ( k = k1+1; k < k2; k++ )
+            key_buff_ptr[k] += key_buff_ptr[k-1];
+
+    }
+
+#else /*USE_BUCKETS*/
+
+
+    work_buff = key_buff1_aptr[myid];
+
+
+/*  Clear the work array */
+    for( i=0; i<MAX_KEY; i++ )
+        work_buff[i] = 0;
+
+
+/*  Ranking of all keys occurs in this section:                 */
+
+/*  In this section, the keys themselves are used as their 
+    own indexes to determine how many of each there are: their
+    individual population                                       */
+
+    #pragma omp for nowait schedule(static)
+    for( i=0; i<NUM_KEYS; i++ )
+        work_buff[key_buff_ptr2[i]]++;  /* Now they have individual key   */
+                                       /* population                     */
+
+/*  To obtain ranks of each key, successively add the individual key
+    population                                          */
+
+    for( i=0; i<MAX_KEY-1; i++ )   
+        work_buff[i+1] += work_buff[i];
+
+    #pragma omp barrier
+
+/*  Accumulate the global key population */
+    for( k=1; k<num_procs; k++ ) {
+        #pragma omp for nowait schedule(static)
+        for( i=0; i<MAX_KEY; i++ )
+            key_buff_ptr[i] += key_buff1_aptr[k][i];
+    }
+
+#endif /*USE_BUCKETS*/
+
+  } /*omp parallel*/
+
+/* This is the partial verify test section */
+/* Observe that test_rank_array vals are   */
+/* shifted differently for different cases */
+    for( i=0; i<TEST_ARRAY_SIZE; i++ )
+    {                                             
+        k = partial_verify_vals[i];          /* test vals were put here */
+        if( 0 < k  &&  k <= NUM_KEYS-1 )
+        {
+            INT_TYPE key_rank = key_buff_ptr[k-1];
+            int failed = 0;
+
+            switch( CLASS )
+            {
+                case 'S':
+                    if( i <= 2 )
+                    {
+                        if( key_rank != test_rank_array[i]+iteration )
+                            failed = 1;
+                        else
+                            passed_verification++;
+                    }
+                    else
+                    {
+                        if( key_rank != test_rank_array[i]-iteration )
+                            failed = 1;
+                        else
+                            passed_verification++;
+                    }
+                    break;
+                case 'W':
+                    if( i < 2 )
+                    {
+                        if( key_rank != test_rank_array[i]+(iteration-2) )
+                            failed = 1;
+                        else
+                            passed_verification++;
+                    }
+                    else
+                    {
+                        if( key_rank != test_rank_array[i]-iteration )
+                            failed = 1;
+                        else
+                            passed_verification++;
+                    }
+                    break;
+                case 'A':
+                    if( i <= 2 )
+                    {
+                        if( key_rank != test_rank_array[i]+(iteration-1) )
+                            failed = 1;
+                        else
+                            passed_verification++;
+                    }
+                    else
+                    {
+                        if( key_rank != test_rank_array[i]-(iteration-1) )
+                            failed = 1;
+                        else
+                            passed_verification++;
+                    }
+                    break;
+                case 'B':
+                    if( i == 1 || i == 2 || i == 4 )
+                    {
+                        if( key_rank != test_rank_array[i]+iteration )
+                            failed = 1;
+                        else
+                            passed_verification++;
+                    }
+                    else
+                    {
+                        if( key_rank != test_rank_array[i]-iteration )
+                            failed = 1;
+                        else
+                            passed_verification++;
+                    }
+                    break;
+                case 'C':
+                    if( i <= 2 )
+                    {
+                        if( key_rank != test_rank_array[i]+iteration )
+                            failed = 1;
+                        else
+                            passed_verification++;
+                    }
+                    else
+                    {
+                        if( key_rank != test_rank_array[i]-iteration )
+                            failed = 1;
+                        else
+                            passed_verification++;
+                    }
+                    break;
+                case 'D':
+                    if( i < 2 )
+                    {
+                        if( key_rank != test_rank_array[i]+iteration )
+                            failed = 1;
+                        else
+                            passed_verification++;
+                    }
+                    else
+                    {
+                        if( key_rank != test_rank_array[i]-iteration )
+                            failed = 1;
+                        else
+                            passed_verification++;
+                    }
+                    break;
+            }
+            if( failed == 1 )
+                printf( "Failed partial verification: "
+                        "iteration %d, test key %d\n", 
+                         iteration, (int)i );
+        }
+    }
+
+
+
+
+/*  Make copies of rank info for use by full_verify: these variables
+    in rank are local; making them global slows down the code, probably
+    because the compiler can then no longer keep them in registers        */
+
+    if( iteration == MAX_ITERATIONS ) 
+        key_buff_ptr_global = key_buff_ptr;
+
+}      
+
+
+/*****************************************************************/
+/*************             M  A  I  N             ****************/
+/*****************************************************************/
+
+int main( int argc, char **argv )
+{
+
+    int             i, iteration, timer_on;
+
+    double          timecounter;
+
+    FILE            *fp;
+
+
+/*  Initialize timers  */
+    timer_on = 0;            
+    if ((fp = fopen("timer.flag", "r")) != NULL) {
+        fclose(fp);
+        timer_on = 1;
+    }
+    timer_clear( 0 );
+    if (timer_on) {
+        timer_clear( 1 );
+        timer_clear( 2 );
+        timer_clear( 3 );
+    }
+
+    if (timer_on) timer_start( 3 );
+
+
+/*  Initialize the verification arrays if a valid class */
+    for( i=0; i<TEST_ARRAY_SIZE; i++ )
+        switch( CLASS )
+        {
+            case 'S':
+                test_index_array[i] = S_test_index_array[i];
+                test_rank_array[i]  = S_test_rank_array[i];
+                break;
+            case 'A':
+                test_index_array[i] = A_test_index_array[i];
+                test_rank_array[i]  = A_test_rank_array[i];
+                break;
+            case 'W':
+                test_index_array[i] = W_test_index_array[i];
+                test_rank_array[i]  = W_test_rank_array[i];
+                break;
+            case 'B':
+                test_index_array[i] = B_test_index_array[i];
+                test_rank_array[i]  = B_test_rank_array[i];
+                break;
+            case 'C':
+                test_index_array[i] = C_test_index_array[i];
+                test_rank_array[i]  = C_test_rank_array[i];
+                break;
+            case 'D':
+                test_index_array[i] = D_test_index_array[i];
+                test_rank_array[i]  = D_test_rank_array[i];
+                break;
+        };
+
+        
+
+/*  Printout initial NPB info */
+    printf
+      ( "\n\n NAS Parallel Benchmarks (NPB3.3-OMP) - IS Benchmark\n\n" );
+    printf( " Size:  %ld  (class %c)\n", (long)TOTAL_KEYS, CLASS );
+    printf( " Iterations:  %d\n", MAX_ITERATIONS );
+#ifdef _OPENMP
+    printf( " Number of available threads:  %d\n", omp_get_max_threads() );
+#endif
+    printf( "\n" );
+
+    if (timer_on) timer_start( 1 );
+
+/*  Generate random number sequence and subsequent keys on all procs */
+    create_seq( 314159265.00,                    /* Random number gen seed */
+                1220703125.00 );                 /* Random number gen mult */
+
+    alloc_key_buff();
+    if (timer_on) timer_stop( 1 );
+
+
+/*  Do one iteration for free (i.e., untimed) to guarantee initialization of
+    all data and code pages and respective tables */
+    rank( 1 );  
+
+/*  Start verification counter */
+    passed_verification = 0;
+
+    if( CLASS != 'S' ) printf( "\n   iteration\n" );
+
+/*  Start timer  */             
+    timer_start( 0 );
+
+#ifdef HOOKS
+       roi_begin_();
+#endif
+
+/*  This is the main iteration */
+    for( iteration=1; iteration<=MAX_ITERATIONS; iteration++ )
+    {
+        if( CLASS != 'S' ) printf( "        %d\n", iteration );
+        rank( iteration );
+    }
+
+
+/*  End of timing, obtain maximum time of all processors */
+    timer_stop( 0 );
+    timecounter = timer_read( 0 );
+
+#ifdef HOOKS
+       roi_end_();
+#endif
+
+/*  This tests that keys are in sequence: sorting of last ranked key seq
+    occurs here, but is an untimed operation                             */
+    if (timer_on) timer_start( 2 );
+    full_verify();
+    if (timer_on) timer_stop( 2 );
+
+    if (timer_on) timer_stop( 3 );
+
+
+/*  The final printout  */
+    if( passed_verification != 5*MAX_ITERATIONS + 1 )
+        passed_verification = 0;
+    c_print_results( "IS",
+                     CLASS,
+                     (int)(TOTAL_KEYS/64),
+                     64,
+                     0,
+                     MAX_ITERATIONS,
+                     timecounter,
+                     ((double) (MAX_ITERATIONS*TOTAL_KEYS))
+                                                  /timecounter/1000000.,
+                     "keys ranked", 
+                     passed_verification,
+                     NPBVERSION,
+                     COMPILETIME,
+                     CC,
+                     CLINK,
+                     C_LIB,
+                     C_INC,
+                     CFLAGS,
+                     CLINKFLAGS );
+
+
+/*  Print additional timers  */
+    if (timer_on) {
+       double t_total, t_percent;
+
+       t_total = timer_read( 3 );
+       printf("\nAdditional timers -\n");
+       printf(" Total execution: %8.3f\n", t_total);
+       if (t_total == 0.0) t_total = 1.0;
+       timecounter = timer_read(1);
+       t_percent = timecounter/t_total * 100.;
+       printf(" Initialization : %8.3f (%5.2f%%)\n", timecounter, t_percent);
+       timecounter = timer_read(0);
+       t_percent = timecounter/t_total * 100.;
+       printf(" Benchmarking   : %8.3f (%5.2f%%)\n", timecounter, t_percent);
+       timecounter = timer_read(2);
+       t_percent = timecounter/t_total * 100.;
+       printf(" Sorting        : %8.3f (%5.2f%%)\n", timecounter, t_percent);
+    }
+
+    return 0;
+         /**************************/
+}        /*  E N D  P R O G R A M  */
+         /**************************/
+
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/Makefile b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/Makefile
new file mode 100644
index 0000000..80862f9
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/Makefile
@@ -0,0 +1,73 @@
+SHELL=/bin/sh
+BENCHMARK=lu
+BENCHMARKU=LU
+VEC=
+
+include ../config/make.def
+
+OBJS = lu.o read_input.o \
+       domain.o setcoeff.o setbv.o exact.o setiv.o \
+       erhs.o ssor$(VEC).o rhs$(VEC).o l2norm.o \
+       jacld.o blts$(VEC).o jacu.o buts$(VEC).o error.o syncs.o \
+       pintgr.o verify.o ${COMMON}/print_results.o \
+       ${COMMON}/timers.o ${COMMON}/wtime.o
+ifeq (${HOOKS}, 1)
+        OBJS += ${COMMON}/hooks.o ${COMMON}/m5op_x86.o ${COMMON}/m5_mmap.o
+endif
+
+include ../sys/make.common
+
+
+# npbparams.h is included by applu.incl
+# The following rule should do the trick but many make programs (not gmake)
+# will do the wrong thing and rebuild the world every time (because the
+# mod time on header.h is not changed). One solution would be to
+# touch header.h but this might cause confusion if someone has
+# accidentally deleted it. Instead, make the dependency on npbparams.h
+# explicit in all the lines below (even though dependence is indirect).
+
+# applu.incl: npbparams.h
+
+${PROGRAM}: config
+	@if [ x$(VERSION) = xvec ] ; then	\
+		${MAKE} VEC=_vec exec;		\
+	elif [ x$(VERSION) = xVEC ] ; then	\
+		${MAKE} VEC=_vec exec;		\
+	else					\
+		${MAKE} exec;			\
+	fi
+
+exec: $(OBJS)
+	${FLINK} ${FLINKFLAGS} -no-pie -o ${PROGRAM} ${OBJS} ${F_LIB}
+
+.f.o :
+ifeq (${HOOKS}, 1)
+	${FCOMPILE} -DHOOKS $<
+else
+	${FCOMPILE} $<
+endif
+
+lu.o:		lu.f applu.incl npbparams.h
+blts$(VEC).o:	blts$(VEC).f
+buts$(VEC).o:	buts$(VEC).f
+erhs.o:		erhs.f applu.incl npbparams.h
+error.o:	error.f applu.incl npbparams.h
+exact.o:	exact.f applu.incl npbparams.h
+jacld.o:	jacld.f applu.incl npbparams.h
+jacu.o:		jacu.f applu.incl npbparams.h
+l2norm.o:	l2norm.f
+pintgr.o:	pintgr.f applu.incl npbparams.h
+read_input.o:	read_input.f applu.incl npbparams.h
+rhs$(VEC).o:	rhs$(VEC).f applu.incl npbparams.h
+setbv.o:	setbv.f applu.incl npbparams.h
+setiv.o:	setiv.f applu.incl npbparams.h
+setcoeff.o:	setcoeff.f applu.incl npbparams.h
+ssor$(VEC).o:	ssor$(VEC).f applu.incl npbparams.h
+domain.o:	domain.f applu.incl npbparams.h
+verify.o:	verify.f applu.incl npbparams.h
+syncs.o:	syncs.f npbparams.h
+
+clean:
+	- /bin/rm -f npbparams.h
+	- /bin/rm -f *.o *~
+	- if [ -d rii_files ]; then rm -r rii_files; fi
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/applu.incl b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/applu.incl
new file mode 100644
index 0000000..732791c
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/applu.incl
@@ -0,0 +1,163 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+c---  applu.incl   
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   npbparams.h defines parameters that depend on the class and 
+c   number of nodes
+c---------------------------------------------------------------------
+
+      include 'npbparams.h'
+
+c---------------------------------------------------------------------
+c   parameters which can be overridden in runtime config file
+c   isiz1,isiz2,isiz3 give the maximum size
+c   ipr = 1 to print out verbose information
+c   omega is the relaxation factor; the default set below is 1.2d0
+c   tolrsd is tolerance levels for steady state residuals
+c---------------------------------------------------------------------
+      integer ipr_default
+      parameter (ipr_default = 1)
+      double precision omega_default
+      parameter (omega_default = 1.2d0)
+      double precision tolrsd1_def, tolrsd2_def, tolrsd3_def, 
+     >                 tolrsd4_def, tolrsd5_def
+      parameter (tolrsd1_def=1.0e-08, 
+     >          tolrsd2_def=1.0e-08, tolrsd3_def=1.0e-08, 
+     >          tolrsd4_def=1.0e-08, tolrsd5_def=1.0e-08)
+
+      double precision c1, c2, c3, c4, c5
+      parameter( c1 = 1.40d+00, c2 = 0.40d+00,
+     >           c3 = 1.00d-01, c4 = 1.00d+00,
+     >           c5 = 1.40d+00 )
+
+c---------------------------------------------------------------------
+c   grid
+c---------------------------------------------------------------------
+      integer nx, ny, nz
+      integer nx0, ny0, nz0
+      integer ist, iend
+      integer jst, jend
+      integer ii1, ii2
+      integer ji1, ji2
+      integer ki1, ki2
+      double precision  dxi, deta, dzeta
+      double precision  tx1, tx2, tx3
+      double precision  ty1, ty2, ty3
+      double precision  tz1, tz2, tz3
+
+      common/cgcon/ dxi, deta, dzeta,
+     >              tx1, tx2, tx3,
+     >              ty1, ty2, ty3,
+     >              tz1, tz2, tz3,
+     >              nx, ny, nz, 
+     >              nx0, ny0, nz0,
+     >              ist, iend,
+     >              jst, jend,
+     >              ii1, ii2, 
+     >              ji1, ji2, 
+     >              ki1, ki2
+
+c---------------------------------------------------------------------
+c   dissipation
+c---------------------------------------------------------------------
+      double precision dx1, dx2, dx3, dx4, dx5
+      double precision dy1, dy2, dy3, dy4, dy5
+      double precision dz1, dz2, dz3, dz4, dz5
+      double precision dssp
+
+      common/disp/ dx1,dx2,dx3,dx4,dx5,
+     >             dy1,dy2,dy3,dy4,dy5,
+     >             dz1,dz2,dz3,dz4,dz5,
+     >             dssp
+
+c---------------------------------------------------------------------
+c   field variables and residuals
+c   to improve cache performance, second two dimensions padded by 1 
+c   for even number sizes only.
+c   Note: corresponding array (called "v") in routines blts, buts, 
+c   and l2norm are similarly padded
+c---------------------------------------------------------------------
+      double precision u(5,isiz1/2*2+1,
+     >                     isiz2/2*2+1,
+     >                     isiz3),
+     >                 rsd(5,isiz1/2*2+1,
+     >                       isiz2/2*2+1,
+     >                       isiz3),
+     >                 frct(5,isiz1/2*2+1,
+     >                        isiz2/2*2+1,
+     >                        isiz3),
+     >                 flux(5,isiz1),
+     >                 qs(isiz1/2*2+1,isiz2/2*2+1,isiz3),
+     >                 rho_i(isiz1/2*2+1,isiz2/2*2+1,isiz3)
+
+      common/cvar/ u, rsd, frct, flux,
+     >             qs, rho_i
+
+
+c---------------------------------------------------------------------
+c   output control parameters
+c---------------------------------------------------------------------
+      integer ipr, inorm
+
+      common/cprcon/ ipr, inorm
+
+c---------------------------------------------------------------------
+c   newton-raphson iteration control parameters
+c---------------------------------------------------------------------
+      integer itmax, invert
+      double precision  dt, omega, tolrsd(5),
+     >        rsdnm(5), errnm(5), frc, ttotal
+
+      common/ctscon/ dt, omega, tolrsd,
+     >               rsdnm, errnm, frc, ttotal,
+     >               itmax, invert
+
+      double precision a(5,5,isiz1/2*2+1,isiz2),
+     >                 b(5,5,isiz1/2*2+1,isiz2),
+     >                 c(5,5,isiz1/2*2+1,isiz2),
+     >                 d(5,5,isiz1/2*2+1,isiz2)
+      double precision au(5,5,isiz1/2*2+1,isiz2),
+     >                 bu(5,5,isiz1/2*2+1,isiz2),
+     >                 cu(5,5,isiz1/2*2+1,isiz2),
+     >                 du(5,5,isiz1/2*2+1,isiz2)
+
+      common/cjac/ a, b, c, d
+      common/cjacu/ au, bu, cu, du
+
+c---------------------------------------------------------------------
+c   coefficients of the exact solution
+c---------------------------------------------------------------------
+      double precision ce(5,13)
+
+      common/cexact/ ce
+
+c---------------------------------------------------------------------
+c   timers
+c---------------------------------------------------------------------
+      integer t_rhsx,t_rhsy,t_rhsz,t_rhs,t_jacld,t_blts,
+     >        t_jacu,t_buts,t_add,t_l2norm,t_last,t_total
+      parameter (t_total = 1)
+      parameter (t_rhsx = 2)
+      parameter (t_rhsy = 3)
+      parameter (t_rhsz = 4)
+      parameter (t_rhs = 5)
+      parameter (t_jacld = 6)
+      parameter (t_blts = 7)
+      parameter (t_jacu = 8)
+      parameter (t_buts = 9)
+      parameter (t_add = 10)
+      parameter (t_l2norm = 11)
+      parameter (t_last = 11)
+      logical timeron
+      double precision maxtime
+
+      common/timer/maxtime,timeron
+
+
+c---------------------------------------------------------------------
+c   end of include file
+c---------------------------------------------------------------------
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/blts.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/blts.f
new file mode 100644
index 0000000..1548fa7
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/blts.f
@@ -0,0 +1,258 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine blts ( ldmx, ldmy, ldmz,
+     >                  nx, ny, nz, k,
+     >                  omega,
+     >                  v, 
+     >                  ldz, ldy, ldx, d,
+     >                  ist, iend, jst, jend,
+     >                  nx0, ny0 )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c
+c   compute the regular-sparse, block lower triangular solution:
+c
+c                     v <-- ( L-inv ) * v
+c
+c---------------------------------------------------------------------
+
+      implicit none
+
+c---------------------------------------------------------------------
+c  input parameters
+c---------------------------------------------------------------------
+      integer ldmx, ldmy, ldmz
+      integer nx, ny, nz
+      integer k
+      double precision  omega
+c---------------------------------------------------------------------
+c   To improve cache performance, second two dimensions padded by 1 
+c   for even number sizes only.  Only needed in v.
+c---------------------------------------------------------------------
+      double precision  v( 5, ldmx/2*2+1, ldmy/2*2+1, ldmz),
+     >        ldz( 5, 5, ldmx/2*2+1, ldmy),
+     >        ldy( 5, 5, ldmx/2*2+1, ldmy),
+     >        ldx( 5, 5, ldmx/2*2+1, ldmy),
+     >        d( 5, 5, ldmx/2*2+1, ldmy)
+      integer ist, iend
+      integer jst, jend
+      integer nx0, ny0
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j, m
+      double precision  tmp, tmp1
+      double precision  tmat(5,5), tv(5)
+
+
+      call sync_left( ldmx, ldmy, ldmz, v )
+
+!$omp do schedule(static)
+      do j = jst, jend
+         do i = ist, iend
+            do m = 1, 5
+
+                  v( m, i, j, k ) =  v( m, i, j, k )
+     >    - omega * (  ldz( m, 1, i, j ) * v( 1, i, j, k-1 )
+     >               + ldz( m, 2, i, j ) * v( 2, i, j, k-1 )
+     >               + ldz( m, 3, i, j ) * v( 3, i, j, k-1 )
+     >               + ldz( m, 4, i, j ) * v( 4, i, j, k-1 )
+     >               + ldz( m, 5, i, j ) * v( 5, i, j, k-1 )  )
+
+            end do
+         end do
+      end do
+!$omp end do nowait
+
+
+!$omp do schedule(static)
+      do j = jst, jend
+         do i = ist, iend
+            do m = 1, 5
+
+                  tv( m ) =  v( m, i, j, k )
+     > - omega * ( ldy( m, 1, i, j ) * v( 1, i, j-1, k )
+     >           + ldx( m, 1, i, j ) * v( 1, i-1, j, k )
+     >           + ldy( m, 2, i, j ) * v( 2, i, j-1, k )
+     >           + ldx( m, 2, i, j ) * v( 2, i-1, j, k )
+     >           + ldy( m, 3, i, j ) * v( 3, i, j-1, k )
+     >           + ldx( m, 3, i, j ) * v( 3, i-1, j, k )
+     >           + ldy( m, 4, i, j ) * v( 4, i, j-1, k )
+     >           + ldx( m, 4, i, j ) * v( 4, i-1, j, k )
+     >           + ldy( m, 5, i, j ) * v( 5, i, j-1, k )
+     >           + ldx( m, 5, i, j ) * v( 5, i-1, j, k ) )
+
+            end do
+       
+c---------------------------------------------------------------------
+c   diagonal block inversion
+c
+c   forward elimination
+c---------------------------------------------------------------------
+            do m = 1, 5
+               tmat( m, 1 ) = d( m, 1, i, j )
+               tmat( m, 2 ) = d( m, 2, i, j )
+               tmat( m, 3 ) = d( m, 3, i, j )
+               tmat( m, 4 ) = d( m, 4, i, j )
+               tmat( m, 5 ) = d( m, 5, i, j )
+            end do
+
+            tmp1 = 1.0d+00 / tmat( 1, 1 )
+            tmp = tmp1 * tmat( 2, 1 )
+            tmat( 2, 2 ) =  tmat( 2, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 2, 3 ) =  tmat( 2, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 2, 4 ) =  tmat( 2, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 2, 5 ) =  tmat( 2, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 2 ) = tv( 2 )
+     >        - tv( 1 ) * tmp
+
+            tmp = tmp1 * tmat( 3, 1 )
+            tmat( 3, 2 ) =  tmat( 3, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 3, 3 ) =  tmat( 3, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 3, 4 ) =  tmat( 3, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 3, 5 ) =  tmat( 3, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 3 ) = tv( 3 )
+     >        - tv( 1 ) * tmp
+
+            tmp = tmp1 * tmat( 4, 1 )
+            tmat( 4, 2 ) =  tmat( 4, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 4, 3 ) =  tmat( 4, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 4 ) = tv( 4 )
+     >        - tv( 1 ) * tmp
+
+            tmp = tmp1 * tmat( 5, 1 )
+            tmat( 5, 2 ) =  tmat( 5, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 5, 3 ) =  tmat( 5, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 5 ) = tv( 5 )
+     >        - tv( 1 ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 2, 2 )
+            tmp = tmp1 * tmat( 3, 2 )
+            tmat( 3, 3 ) =  tmat( 3, 3 )
+     >           - tmp * tmat( 2, 3 )
+            tmat( 3, 4 ) =  tmat( 3, 4 )
+     >           - tmp * tmat( 2, 4 )
+            tmat( 3, 5 ) =  tmat( 3, 5 )
+     >           - tmp * tmat( 2, 5 )
+            tv( 3 ) = tv( 3 )
+     >        - tv( 2 ) * tmp
+
+            tmp = tmp1 * tmat( 4, 2 )
+            tmat( 4, 3 ) =  tmat( 4, 3 )
+     >           - tmp * tmat( 2, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )
+     >           - tmp * tmat( 2, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )
+     >           - tmp * tmat( 2, 5 )
+            tv( 4 ) = tv( 4 )
+     >        - tv( 2 ) * tmp
+
+            tmp = tmp1 * tmat( 5, 2 )
+            tmat( 5, 3 ) =  tmat( 5, 3 )
+     >           - tmp * tmat( 2, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )
+     >           - tmp * tmat( 2, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 2, 5 )
+            tv( 5 ) = tv( 5 )
+     >        - tv( 2 ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 3, 3 )
+            tmp = tmp1 * tmat( 4, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )
+     >           - tmp * tmat( 3, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )
+     >           - tmp * tmat( 3, 5 )
+            tv( 4 ) = tv( 4 )
+     >        - tv( 3 ) * tmp
+
+            tmp = tmp1 * tmat( 5, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )
+     >           - tmp * tmat( 3, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 3, 5 )
+            tv( 5 ) = tv( 5 )
+     >        - tv( 3 ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 4, 4 )
+            tmp = tmp1 * tmat( 5, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 4, 5 )
+            tv( 5 ) = tv( 5 )
+     >        - tv( 4 ) * tmp
+
+c---------------------------------------------------------------------
+c   back substitution
+c---------------------------------------------------------------------
+            v( 5, i, j, k ) = tv( 5 )
+     >                      / tmat( 5, 5 )
+
+            tv( 4 ) = tv( 4 )
+     >           - tmat( 4, 5 ) * v( 5, i, j, k )
+            v( 4, i, j, k ) = tv( 4 )
+     >                      / tmat( 4, 4 )
+
+            tv( 3 ) = tv( 3 )
+     >           - tmat( 3, 4 ) * v( 4, i, j, k )
+     >           - tmat( 3, 5 ) * v( 5, i, j, k )
+            v( 3, i, j, k ) = tv( 3 )
+     >                      / tmat( 3, 3 )
+
+            tv( 2 ) = tv( 2 )
+     >           - tmat( 2, 3 ) * v( 3, i, j, k )
+     >           - tmat( 2, 4 ) * v( 4, i, j, k )
+     >           - tmat( 2, 5 ) * v( 5, i, j, k )
+            v( 2, i, j, k ) = tv( 2 )
+     >                      / tmat( 2, 2 )
+
+            tv( 1 ) = tv( 1 )
+     >           - tmat( 1, 2 ) * v( 2, i, j, k )
+     >           - tmat( 1, 3 ) * v( 3, i, j, k )
+     >           - tmat( 1, 4 ) * v( 4, i, j, k )
+     >           - tmat( 1, 5 ) * v( 5, i, j, k )
+            v( 1, i, j, k ) = tv( 1 )
+     >                      / tmat( 1, 1 )
+
+
+        enddo
+      enddo
+!$omp end do nowait
+
+      call sync_right( ldmx, ldmy, ldmz, v )
+
+
+      return
+      end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/blts_vec.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/blts_vec.f
new file mode 100644
index 0000000..e4ee5f4
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/blts_vec.f
@@ -0,0 +1,326 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine blts ( ldmx, ldmy, ldmz,
+     >                  nx, ny, nz, k,
+     >                  omega,
+     >                  v, 
+     >                  ldz, ldy, ldx, d,
+     >                  ist, iend, jst, jend,
+     >                  lst, lend )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c
+c   compute the regular-sparse, block lower triangular solution:
+c
+c                     v <-- ( L-inv ) * v
+c
+c---------------------------------------------------------------------
+
+      implicit none
+
+c---------------------------------------------------------------------
+c  input parameters
+c---------------------------------------------------------------------
+      integer ldmx, ldmy, ldmz
+      integer nx, ny, nz
+      integer k
+      double precision  omega
+c---------------------------------------------------------------------
+c   To improve cache performance, the second and third dimensions
+c   are padded by 1 when their size is even.  Only needed in v.
+c---------------------------------------------------------------------
+      double precision  v( 5, ldmx/2*2+1, ldmy/2*2+1, ldmz),
+     >        ldz( 5, 5, ldmx/2*2+1, ldmy),
+     >        ldy( 5, 5, ldmx/2*2+1, ldmy),
+     >        ldx( 5, 5, ldmx/2*2+1, ldmy),
+     >        d( 5, 5, ldmx/2*2+1, ldmy)
+      integer ist, iend
+      integer jst, jend
+      integer lst, lend
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j, m, l, istp, iendp
+      double precision  tmp, tmp1
+      double precision  tmat(5,5), tv(5)
+
+
+!$omp do schedule(static)
+      do j = jst, jend
+         do i = ist, iend
+            do m = 1, 5
+
+                  v( m, i, j, k ) =  v( m, i, j, k )
+     >    - omega * (  ldz( m, 1, i, j ) * v( 1, i, j, k-1 )
+     >               + ldz( m, 2, i, j ) * v( 2, i, j, k-1 )
+     >               + ldz( m, 3, i, j ) * v( 3, i, j, k-1 )
+     >               + ldz( m, 4, i, j ) * v( 4, i, j, k-1 )
+     >               + ldz( m, 5, i, j ) * v( 5, i, j, k-1 )  )
+
+            end do
+         enddo
+      enddo
+
+
+      do l = lst, lend
+         istp  = max(l - jend, ist)
+         iendp = min(l - jst, iend)
+
+!dir$ ivdep
+!$omp do schedule(static)
+         do i = istp, iendp
+            j = l - i
+
+!!dir$ unroll 5
+!   manually unroll the loop
+!            do m = 1, 5
+
+                  tv( 1 ) =  v( 1, i, j, k )
+     > - omega * ( ldy( 1, 1, i, j ) * v( 1, i, j-1, k )
+     >           + ldx( 1, 1, i, j ) * v( 1, i-1, j, k )
+     >           + ldy( 1, 2, i, j ) * v( 2, i, j-1, k )
+     >           + ldx( 1, 2, i, j ) * v( 2, i-1, j, k )
+     >           + ldy( 1, 3, i, j ) * v( 3, i, j-1, k )
+     >           + ldx( 1, 3, i, j ) * v( 3, i-1, j, k )
+     >           + ldy( 1, 4, i, j ) * v( 4, i, j-1, k )
+     >           + ldx( 1, 4, i, j ) * v( 4, i-1, j, k )
+     >           + ldy( 1, 5, i, j ) * v( 5, i, j-1, k )
+     >           + ldx( 1, 5, i, j ) * v( 5, i-1, j, k ) )
+                  tv( 2 ) =  v( 2, i, j, k )
+     > - omega * ( ldy( 2, 1, i, j ) * v( 1, i, j-1, k )
+     >           + ldx( 2, 1, i, j ) * v( 1, i-1, j, k )
+     >           + ldy( 2, 2, i, j ) * v( 2, i, j-1, k )
+     >           + ldx( 2, 2, i, j ) * v( 2, i-1, j, k )
+     >           + ldy( 2, 3, i, j ) * v( 3, i, j-1, k )
+     >           + ldx( 2, 3, i, j ) * v( 3, i-1, j, k )
+     >           + ldy( 2, 4, i, j ) * v( 4, i, j-1, k )
+     >           + ldx( 2, 4, i, j ) * v( 4, i-1, j, k )
+     >           + ldy( 2, 5, i, j ) * v( 5, i, j-1, k )
+     >           + ldx( 2, 5, i, j ) * v( 5, i-1, j, k ) )
+                  tv( 3 ) =  v( 3, i, j, k )
+     > - omega * ( ldy( 3, 1, i, j ) * v( 1, i, j-1, k )
+     >           + ldx( 3, 1, i, j ) * v( 1, i-1, j, k )
+     >           + ldy( 3, 2, i, j ) * v( 2, i, j-1, k )
+     >           + ldx( 3, 2, i, j ) * v( 2, i-1, j, k )
+     >           + ldy( 3, 3, i, j ) * v( 3, i, j-1, k )
+     >           + ldx( 3, 3, i, j ) * v( 3, i-1, j, k )
+     >           + ldy( 3, 4, i, j ) * v( 4, i, j-1, k )
+     >           + ldx( 3, 4, i, j ) * v( 4, i-1, j, k )
+     >           + ldy( 3, 5, i, j ) * v( 5, i, j-1, k )
+     >           + ldx( 3, 5, i, j ) * v( 5, i-1, j, k ) )
+                  tv( 4 ) =  v( 4, i, j, k )
+     > - omega * ( ldy( 4, 1, i, j ) * v( 1, i, j-1, k )
+     >           + ldx( 4, 1, i, j ) * v( 1, i-1, j, k )
+     >           + ldy( 4, 2, i, j ) * v( 2, i, j-1, k )
+     >           + ldx( 4, 2, i, j ) * v( 2, i-1, j, k )
+     >           + ldy( 4, 3, i, j ) * v( 3, i, j-1, k )
+     >           + ldx( 4, 3, i, j ) * v( 3, i-1, j, k )
+     >           + ldy( 4, 4, i, j ) * v( 4, i, j-1, k )
+     >           + ldx( 4, 4, i, j ) * v( 4, i-1, j, k )
+     >           + ldy( 4, 5, i, j ) * v( 5, i, j-1, k )
+     >           + ldx( 4, 5, i, j ) * v( 5, i-1, j, k ) )
+                  tv( 5 ) =  v( 5, i, j, k )
+     > - omega * ( ldy( 5, 1, i, j ) * v( 1, i, j-1, k )
+     >           + ldx( 5, 1, i, j ) * v( 1, i-1, j, k )
+     >           + ldy( 5, 2, i, j ) * v( 2, i, j-1, k )
+     >           + ldx( 5, 2, i, j ) * v( 2, i-1, j, k )
+     >           + ldy( 5, 3, i, j ) * v( 3, i, j-1, k )
+     >           + ldx( 5, 3, i, j ) * v( 3, i-1, j, k )
+     >           + ldy( 5, 4, i, j ) * v( 4, i, j-1, k )
+     >           + ldx( 5, 4, i, j ) * v( 4, i-1, j, k )
+     >           + ldy( 5, 5, i, j ) * v( 5, i, j-1, k )
+     >           + ldx( 5, 5, i, j ) * v( 5, i-1, j, k ) )
+
+!            end do
+
+c---------------------------------------------------------------------
+c   diagonal block inversion
+c
+c   forward elimination
+c---------------------------------------------------------------------
+!!dir$ unroll 5
+!   manually unroll the loop
+!            do m = 1, 5
+               tmat( 1, 1 ) = d( 1, 1, i, j )
+               tmat( 1, 2 ) = d( 1, 2, i, j )
+               tmat( 1, 3 ) = d( 1, 3, i, j )
+               tmat( 1, 4 ) = d( 1, 4, i, j )
+               tmat( 1, 5 ) = d( 1, 5, i, j )
+               tmat( 2, 1 ) = d( 2, 1, i, j )
+               tmat( 2, 2 ) = d( 2, 2, i, j )
+               tmat( 2, 3 ) = d( 2, 3, i, j )
+               tmat( 2, 4 ) = d( 2, 4, i, j )
+               tmat( 2, 5 ) = d( 2, 5, i, j )
+               tmat( 3, 1 ) = d( 3, 1, i, j )
+               tmat( 3, 2 ) = d( 3, 2, i, j )
+               tmat( 3, 3 ) = d( 3, 3, i, j )
+               tmat( 3, 4 ) = d( 3, 4, i, j )
+               tmat( 3, 5 ) = d( 3, 5, i, j )
+               tmat( 4, 1 ) = d( 4, 1, i, j )
+               tmat( 4, 2 ) = d( 4, 2, i, j )
+               tmat( 4, 3 ) = d( 4, 3, i, j )
+               tmat( 4, 4 ) = d( 4, 4, i, j )
+               tmat( 4, 5 ) = d( 4, 5, i, j )
+               tmat( 5, 1 ) = d( 5, 1, i, j )
+               tmat( 5, 2 ) = d( 5, 2, i, j )
+               tmat( 5, 3 ) = d( 5, 3, i, j )
+               tmat( 5, 4 ) = d( 5, 4, i, j )
+               tmat( 5, 5 ) = d( 5, 5, i, j )
+!            end do
+
+            tmp1 = 1.0d+00 / tmat( 1, 1 )
+            tmp = tmp1 * tmat( 2, 1 )
+            tmat( 2, 2 ) =  tmat( 2, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 2, 3 ) =  tmat( 2, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 2, 4 ) =  tmat( 2, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 2, 5 ) =  tmat( 2, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 2 ) = tv( 2 )
+     >        - tv( 1 ) * tmp
+
+            tmp = tmp1 * tmat( 3, 1 )
+            tmat( 3, 2 ) =  tmat( 3, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 3, 3 ) =  tmat( 3, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 3, 4 ) =  tmat( 3, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 3, 5 ) =  tmat( 3, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 3 ) = tv( 3 )
+     >        - tv( 1 ) * tmp
+
+            tmp = tmp1 * tmat( 4, 1 )
+            tmat( 4, 2 ) =  tmat( 4, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 4, 3 ) =  tmat( 4, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 4 ) = tv( 4 )
+     >        - tv( 1 ) * tmp
+
+            tmp = tmp1 * tmat( 5, 1 )
+            tmat( 5, 2 ) =  tmat( 5, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 5, 3 ) =  tmat( 5, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 5 ) = tv( 5 )
+     >        - tv( 1 ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 2, 2 )
+            tmp = tmp1 * tmat( 3, 2 )
+            tmat( 3, 3 ) =  tmat( 3, 3 )
+     >           - tmp * tmat( 2, 3 )
+            tmat( 3, 4 ) =  tmat( 3, 4 )
+     >           - tmp * tmat( 2, 4 )
+            tmat( 3, 5 ) =  tmat( 3, 5 )
+     >           - tmp * tmat( 2, 5 )
+            tv( 3 ) = tv( 3 )
+     >        - tv( 2 ) * tmp
+
+            tmp = tmp1 * tmat( 4, 2 )
+            tmat( 4, 3 ) =  tmat( 4, 3 )
+     >           - tmp * tmat( 2, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )
+     >           - tmp * tmat( 2, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )
+     >           - tmp * tmat( 2, 5 )
+            tv( 4 ) = tv( 4 )
+     >        - tv( 2 ) * tmp
+
+            tmp = tmp1 * tmat( 5, 2 )
+            tmat( 5, 3 ) =  tmat( 5, 3 )
+     >           - tmp * tmat( 2, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )
+     >           - tmp * tmat( 2, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 2, 5 )
+            tv( 5 ) = tv( 5 )
+     >        - tv( 2 ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 3, 3 )
+            tmp = tmp1 * tmat( 4, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )
+     >           - tmp * tmat( 3, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )
+     >           - tmp * tmat( 3, 5 )
+            tv( 4 ) = tv( 4 )
+     >        - tv( 3 ) * tmp
+
+            tmp = tmp1 * tmat( 5, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )
+     >           - tmp * tmat( 3, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 3, 5 )
+            tv( 5 ) = tv( 5 )
+     >        - tv( 3 ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 4, 4 )
+            tmp = tmp1 * tmat( 5, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 4, 5 )
+            tv( 5 ) = tv( 5 )
+     >        - tv( 4 ) * tmp
+
+c---------------------------------------------------------------------
+c   back substitution
+c---------------------------------------------------------------------
+            v( 5, i, j, k ) = tv( 5 )
+     >                      / tmat( 5, 5 )
+
+            tv( 4 ) = tv( 4 )
+     >           - tmat( 4, 5 ) * v( 5, i, j, k )
+            v( 4, i, j, k ) = tv( 4 )
+     >                      / tmat( 4, 4 )
+
+            tv( 3 ) = tv( 3 )
+     >           - tmat( 3, 4 ) * v( 4, i, j, k )
+     >           - tmat( 3, 5 ) * v( 5, i, j, k )
+            v( 3, i, j, k ) = tv( 3 )
+     >                      / tmat( 3, 3 )
+
+            tv( 2 ) = tv( 2 )
+     >           - tmat( 2, 3 ) * v( 3, i, j, k )
+     >           - tmat( 2, 4 ) * v( 4, i, j, k )
+     >           - tmat( 2, 5 ) * v( 5, i, j, k )
+            v( 2, i, j, k ) = tv( 2 )
+     >                      / tmat( 2, 2 )
+
+            tv( 1 ) = tv( 1 )
+     >           - tmat( 1, 2 ) * v( 2, i, j, k )
+     >           - tmat( 1, 3 ) * v( 3, i, j, k )
+     >           - tmat( 1, 4 ) * v( 4, i, j, k )
+     >           - tmat( 1, 5 ) * v( 5, i, j, k )
+            v( 1, i, j, k ) = tv( 1 )
+     >                      / tmat( 1, 1 )
+
+
+        enddo
+      enddo
+
+
+      return
+      end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/buts.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/buts.f
new file mode 100644
index 0000000..1b8c7e6
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/buts.f
@@ -0,0 +1,256 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine buts( ldmx, ldmy, ldmz,
+     >                 nx, ny, nz, k,
+     >                 omega,
+     >                 v, tv,
+     >                 d, udx, udy, udz,
+     >                 ist, iend, jst, jend,
+     >                 nx0, ny0 )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c
+c   compute the regular-sparse, block upper triangular solution:
+c
+c                     v <-- ( U-inv ) * v
+c
+c---------------------------------------------------------------------
+
+      implicit none
+
+c---------------------------------------------------------------------
+c  input parameters
+c---------------------------------------------------------------------
+      integer ldmx, ldmy, ldmz
+      integer nx, ny, nz
+      integer k
+      double precision  omega
+c---------------------------------------------------------------------
+c   To improve cache performance, the second and third dimensions
+c   are padded by 1 when their size is even.  Only needed in v.
+c---------------------------------------------------------------------
+      double precision  v( 5,ldmx/2*2+1, ldmy/2*2+1, ldmz),
+     >        tv( 5, ldmx/2*2+1, ldmy),
+     >        d( 5, 5, ldmx/2*2+1, ldmy),
+     >        udx( 5, 5, ldmx/2*2+1, ldmy),
+     >        udy( 5, 5, ldmx/2*2+1, ldmy),
+     >        udz( 5, 5, ldmx/2*2+1, ldmy )
+      integer ist, iend
+      integer jst, jend
+      integer nx0, ny0
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j, m
+      double precision  tmp, tmp1
+      double precision  tmat(5,5)
+
+
+      call sync_left( ldmx, ldmy, ldmz, v )
+
+!$omp do schedule(static)
+      do j = jend, jst, -1
+         do i = iend, ist, -1
+            do m = 1, 5
+                  tv( m, i, j ) =
+     >      omega * (  udz( m, 1, i, j ) * v( 1, i, j, k+1 )
+     >               + udz( m, 2, i, j ) * v( 2, i, j, k+1 )
+     >               + udz( m, 3, i, j ) * v( 3, i, j, k+1 )
+     >               + udz( m, 4, i, j ) * v( 4, i, j, k+1 )
+     >               + udz( m, 5, i, j ) * v( 5, i, j, k+1 ) )
+            end do
+         end do
+      end do
+!$omp end do nowait
+
+
+!$omp do schedule(static)
+      do j = jend, jst, -1
+         do i = iend, ist, -1
+            do m = 1, 5
+                  tv( m, i, j ) = tv( m, i, j )
+     > + omega * ( udy( m, 1, i, j ) * v( 1, i, j+1, k )
+     >           + udx( m, 1, i, j ) * v( 1, i+1, j, k )
+     >           + udy( m, 2, i, j ) * v( 2, i, j+1, k )
+     >           + udx( m, 2, i, j ) * v( 2, i+1, j, k )
+     >           + udy( m, 3, i, j ) * v( 3, i, j+1, k )
+     >           + udx( m, 3, i, j ) * v( 3, i+1, j, k )
+     >           + udy( m, 4, i, j ) * v( 4, i, j+1, k )
+     >           + udx( m, 4, i, j ) * v( 4, i+1, j, k )
+     >           + udy( m, 5, i, j ) * v( 5, i, j+1, k )
+     >           + udx( m, 5, i, j ) * v( 5, i+1, j, k ) )
+            end do
+
+c---------------------------------------------------------------------
+c   diagonal block inversion
+c---------------------------------------------------------------------
+            do m = 1, 5
+               tmat( m, 1 ) = d( m, 1, i, j )
+               tmat( m, 2 ) = d( m, 2, i, j )
+               tmat( m, 3 ) = d( m, 3, i, j )
+               tmat( m, 4 ) = d( m, 4, i, j )
+               tmat( m, 5 ) = d( m, 5, i, j )
+            end do
+
+            tmp1 = 1.0d+00 / tmat( 1, 1 )
+            tmp = tmp1 * tmat( 2, 1 )
+            tmat( 2, 2 ) =  tmat( 2, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 2, 3 ) =  tmat( 2, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 2, 4 ) =  tmat( 2, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 2, 5 ) =  tmat( 2, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 2, i, j ) = tv( 2, i, j )
+     >        - tv( 1, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 3, 1 )
+            tmat( 3, 2 ) =  tmat( 3, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 3, 3 ) =  tmat( 3, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 3, 4 ) =  tmat( 3, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 3, 5 ) =  tmat( 3, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 3, i, j ) = tv( 3, i, j )
+     >        - tv( 1, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 4, 1 )
+            tmat( 4, 2 ) =  tmat( 4, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 4, 3 ) =  tmat( 4, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 4, i, j ) = tv( 4, i, j )
+     >        - tv( 1, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 5, 1 )
+            tmat( 5, 2 ) =  tmat( 5, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 5, 3 ) =  tmat( 5, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 5, i, j ) = tv( 5, i, j )
+     >        - tv( 1, i, j ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 2, 2 )
+            tmp = tmp1 * tmat( 3, 2 )
+            tmat( 3, 3 ) =  tmat( 3, 3 )
+     >           - tmp * tmat( 2, 3 )
+            tmat( 3, 4 ) =  tmat( 3, 4 )
+     >           - tmp * tmat( 2, 4 )
+            tmat( 3, 5 ) =  tmat( 3, 5 )
+     >           - tmp * tmat( 2, 5 )
+            tv( 3, i, j ) = tv( 3, i, j )
+     >        - tv( 2, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 4, 2 )
+            tmat( 4, 3 ) =  tmat( 4, 3 )
+     >           - tmp * tmat( 2, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )
+     >           - tmp * tmat( 2, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )
+     >           - tmp * tmat( 2, 5 )
+            tv( 4, i, j ) = tv( 4, i, j )
+     >        - tv( 2, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 5, 2 )
+            tmat( 5, 3 ) =  tmat( 5, 3 )
+     >           - tmp * tmat( 2, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )
+     >           - tmp * tmat( 2, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 2, 5 )
+            tv( 5, i, j ) = tv( 5, i, j )
+     >        - tv( 2, i, j ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 3, 3 )
+            tmp = tmp1 * tmat( 4, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )
+     >           - tmp * tmat( 3, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )
+     >           - tmp * tmat( 3, 5 )
+            tv( 4, i, j ) = tv( 4, i, j )
+     >        - tv( 3, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 5, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )
+     >           - tmp * tmat( 3, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 3, 5 )
+            tv( 5, i, j ) = tv( 5, i, j )
+     >        - tv( 3, i, j ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 4, 4 )
+            tmp = tmp1 * tmat( 5, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 4, 5 )
+            tv( 5, i, j ) = tv( 5, i, j )
+     >        - tv( 4, i, j ) * tmp
+
+c---------------------------------------------------------------------
+c   back substitution
+c---------------------------------------------------------------------
+            tv( 5, i, j ) = tv( 5, i, j )
+     >                      / tmat( 5, 5 )
+
+            tv( 4, i, j ) = tv( 4, i, j )
+     >           - tmat( 4, 5 ) * tv( 5, i, j )
+            tv( 4, i, j ) = tv( 4, i, j )
+     >                      / tmat( 4, 4 )
+
+            tv( 3, i, j ) = tv( 3, i, j )
+     >           - tmat( 3, 4 ) * tv( 4, i, j )
+     >           - tmat( 3, 5 ) * tv( 5, i, j )
+            tv( 3, i, j ) = tv( 3, i, j )
+     >                      / tmat( 3, 3 )
+
+            tv( 2, i, j ) = tv( 2, i, j )
+     >           - tmat( 2, 3 ) * tv( 3, i, j )
+     >           - tmat( 2, 4 ) * tv( 4, i, j )
+     >           - tmat( 2, 5 ) * tv( 5, i, j )
+            tv( 2, i, j ) = tv( 2, i, j )
+     >                      / tmat( 2, 2 )
+
+            tv( 1, i, j ) = tv( 1, i, j )
+     >           - tmat( 1, 2 ) * tv( 2, i, j )
+     >           - tmat( 1, 3 ) * tv( 3, i, j )
+     >           - tmat( 1, 4 ) * tv( 4, i, j )
+     >           - tmat( 1, 5 ) * tv( 5, i, j )
+            tv( 1, i, j ) = tv( 1, i, j )
+     >                      / tmat( 1, 1 )
+
+            v( 1, i, j, k ) = v( 1, i, j, k ) - tv( 1, i, j )
+            v( 2, i, j, k ) = v( 2, i, j, k ) - tv( 2, i, j )
+            v( 3, i, j, k ) = v( 3, i, j, k ) - tv( 3, i, j )
+            v( 4, i, j, k ) = v( 4, i, j, k ) - tv( 4, i, j )
+            v( 5, i, j, k ) = v( 5, i, j, k ) - tv( 5, i, j )
+
+        enddo
+      end do
+!$omp end do nowait
+
+      call sync_right( ldmx, ldmy, ldmz, v )
+
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/buts_vec.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/buts_vec.f
new file mode 100644
index 0000000..42e56e4
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/buts_vec.f
@@ -0,0 +1,324 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine buts( ldmx, ldmy, ldmz,
+     >                 nx, ny, nz, k,
+     >                 omega,
+     >                 v, tv,
+     >                 d, udx, udy, udz,
+     >                 ist, iend, jst, jend,
+     >                 lst, lend )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c
+c   compute the regular-sparse, block upper triangular solution:
+c
+c                     v <-- ( U-inv ) * v
+c
+c---------------------------------------------------------------------
+
+      implicit none
+
+c---------------------------------------------------------------------
+c  input parameters
+c---------------------------------------------------------------------
+      integer ldmx, ldmy, ldmz
+      integer nx, ny, nz
+      integer k
+      double precision  omega
+c---------------------------------------------------------------------
+c   To improve cache performance, the second and third dimensions
+c   are padded by 1 when their size is even.  Only needed in v.
+c---------------------------------------------------------------------
+      double precision  v( 5,ldmx/2*2+1, ldmy/2*2+1, ldmz),
+     >        tv( 5, ldmx/2*2+1, ldmy),
+     >        d( 5, 5, ldmx/2*2+1, ldmy),
+     >        udx( 5, 5, ldmx/2*2+1, ldmy),
+     >        udy( 5, 5, ldmx/2*2+1, ldmy),
+     >        udz( 5, 5, ldmx/2*2+1, ldmy )
+      integer ist, iend
+      integer jst, jend
+      integer lst, lend
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j, m, l, istp, iendp
+      double precision  tmp, tmp1
+      double precision  tmat(5,5)
+
+
+!$omp do schedule(static)
+      do j = jend, jst, -1
+         do i = iend, ist, -1
+            do m = 1, 5
+                  tv( m, i, j ) =
+     >      omega * (  udz( m, 1, i, j ) * v( 1, i, j, k+1 )
+     >               + udz( m, 2, i, j ) * v( 2, i, j, k+1 )
+     >               + udz( m, 3, i, j ) * v( 3, i, j, k+1 )
+     >               + udz( m, 4, i, j ) * v( 4, i, j, k+1 )
+     >               + udz( m, 5, i, j ) * v( 5, i, j, k+1 ) )
+            end do
+         end do
+      end do
+
+
+      do l = lend, lst, -1
+         istp  = max(l - jend, ist)
+         iendp = min(l - jst, iend)
+
+!dir$ ivdep
+!$omp do schedule(static)
+         do i = istp, iendp
+            j = l - i
+
+!!dir$ unroll 5
+!   manually unroll the loop
+!            do m = 1, 5
+                  tv( 1, i, j ) = tv( 1, i, j )
+     > + omega * ( udy( 1, 1, i, j ) * v( 1, i, j+1, k )
+     >           + udx( 1, 1, i, j ) * v( 1, i+1, j, k )
+     >           + udy( 1, 2, i, j ) * v( 2, i, j+1, k )
+     >           + udx( 1, 2, i, j ) * v( 2, i+1, j, k )
+     >           + udy( 1, 3, i, j ) * v( 3, i, j+1, k )
+     >           + udx( 1, 3, i, j ) * v( 3, i+1, j, k )
+     >           + udy( 1, 4, i, j ) * v( 4, i, j+1, k )
+     >           + udx( 1, 4, i, j ) * v( 4, i+1, j, k )
+     >           + udy( 1, 5, i, j ) * v( 5, i, j+1, k )
+     >           + udx( 1, 5, i, j ) * v( 5, i+1, j, k ) )
+                  tv( 2, i, j ) = tv( 2, i, j )
+     > + omega * ( udy( 2, 1, i, j ) * v( 1, i, j+1, k )
+     >           + udx( 2, 1, i, j ) * v( 1, i+1, j, k )
+     >           + udy( 2, 2, i, j ) * v( 2, i, j+1, k )
+     >           + udx( 2, 2, i, j ) * v( 2, i+1, j, k )
+     >           + udy( 2, 3, i, j ) * v( 3, i, j+1, k )
+     >           + udx( 2, 3, i, j ) * v( 3, i+1, j, k )
+     >           + udy( 2, 4, i, j ) * v( 4, i, j+1, k )
+     >           + udx( 2, 4, i, j ) * v( 4, i+1, j, k )
+     >           + udy( 2, 5, i, j ) * v( 5, i, j+1, k )
+     >           + udx( 2, 5, i, j ) * v( 5, i+1, j, k ) )
+                  tv( 3, i, j ) = tv( 3, i, j )
+     > + omega * ( udy( 3, 1, i, j ) * v( 1, i, j+1, k )
+     >           + udx( 3, 1, i, j ) * v( 1, i+1, j, k )
+     >           + udy( 3, 2, i, j ) * v( 2, i, j+1, k )
+     >           + udx( 3, 2, i, j ) * v( 2, i+1, j, k )
+     >           + udy( 3, 3, i, j ) * v( 3, i, j+1, k )
+     >           + udx( 3, 3, i, j ) * v( 3, i+1, j, k )
+     >           + udy( 3, 4, i, j ) * v( 4, i, j+1, k )
+     >           + udx( 3, 4, i, j ) * v( 4, i+1, j, k )
+     >           + udy( 3, 5, i, j ) * v( 5, i, j+1, k )
+     >           + udx( 3, 5, i, j ) * v( 5, i+1, j, k ) )
+                  tv( 4, i, j ) = tv( 4, i, j )
+     > + omega * ( udy( 4, 1, i, j ) * v( 1, i, j+1, k )
+     >           + udx( 4, 1, i, j ) * v( 1, i+1, j, k )
+     >           + udy( 4, 2, i, j ) * v( 2, i, j+1, k )
+     >           + udx( 4, 2, i, j ) * v( 2, i+1, j, k )
+     >           + udy( 4, 3, i, j ) * v( 3, i, j+1, k )
+     >           + udx( 4, 3, i, j ) * v( 3, i+1, j, k )
+     >           + udy( 4, 4, i, j ) * v( 4, i, j+1, k )
+     >           + udx( 4, 4, i, j ) * v( 4, i+1, j, k )
+     >           + udy( 4, 5, i, j ) * v( 5, i, j+1, k )
+     >           + udx( 4, 5, i, j ) * v( 5, i+1, j, k ) )
+                  tv( 5, i, j ) = tv( 5, i, j )
+     > + omega * ( udy( 5, 1, i, j ) * v( 1, i, j+1, k )
+     >           + udx( 5, 1, i, j ) * v( 1, i+1, j, k )
+     >           + udy( 5, 2, i, j ) * v( 2, i, j+1, k )
+     >           + udx( 5, 2, i, j ) * v( 2, i+1, j, k )
+     >           + udy( 5, 3, i, j ) * v( 3, i, j+1, k )
+     >           + udx( 5, 3, i, j ) * v( 3, i+1, j, k )
+     >           + udy( 5, 4, i, j ) * v( 4, i, j+1, k )
+     >           + udx( 5, 4, i, j ) * v( 4, i+1, j, k )
+     >           + udy( 5, 5, i, j ) * v( 5, i, j+1, k )
+     >           + udx( 5, 5, i, j ) * v( 5, i+1, j, k ) )
+!            end do
+
+c---------------------------------------------------------------------
+c   diagonal block inversion
+c---------------------------------------------------------------------
+!!dir$ unroll 5
+!   manually unroll the loop
+!            do m = 1, 5
+               tmat( 1, 1 ) = d( 1, 1, i, j )
+               tmat( 1, 2 ) = d( 1, 2, i, j )
+               tmat( 1, 3 ) = d( 1, 3, i, j )
+               tmat( 1, 4 ) = d( 1, 4, i, j )
+               tmat( 1, 5 ) = d( 1, 5, i, j )
+               tmat( 2, 1 ) = d( 2, 1, i, j )
+               tmat( 2, 2 ) = d( 2, 2, i, j )
+               tmat( 2, 3 ) = d( 2, 3, i, j )
+               tmat( 2, 4 ) = d( 2, 4, i, j )
+               tmat( 2, 5 ) = d( 2, 5, i, j )
+               tmat( 3, 1 ) = d( 3, 1, i, j )
+               tmat( 3, 2 ) = d( 3, 2, i, j )
+               tmat( 3, 3 ) = d( 3, 3, i, j )
+               tmat( 3, 4 ) = d( 3, 4, i, j )
+               tmat( 3, 5 ) = d( 3, 5, i, j )
+               tmat( 4, 1 ) = d( 4, 1, i, j )
+               tmat( 4, 2 ) = d( 4, 2, i, j )
+               tmat( 4, 3 ) = d( 4, 3, i, j )
+               tmat( 4, 4 ) = d( 4, 4, i, j )
+               tmat( 4, 5 ) = d( 4, 5, i, j )
+               tmat( 5, 1 ) = d( 5, 1, i, j )
+               tmat( 5, 2 ) = d( 5, 2, i, j )
+               tmat( 5, 3 ) = d( 5, 3, i, j )
+               tmat( 5, 4 ) = d( 5, 4, i, j )
+               tmat( 5, 5 ) = d( 5, 5, i, j )
+!            end do
+
+            tmp1 = 1.0d+00 / tmat( 1, 1 )
+            tmp = tmp1 * tmat( 2, 1 )
+            tmat( 2, 2 ) =  tmat( 2, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 2, 3 ) =  tmat( 2, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 2, 4 ) =  tmat( 2, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 2, 5 ) =  tmat( 2, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 2, i, j ) = tv( 2, i, j )
+     >        - tv( 1, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 3, 1 )
+            tmat( 3, 2 ) =  tmat( 3, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 3, 3 ) =  tmat( 3, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 3, 4 ) =  tmat( 3, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 3, 5 ) =  tmat( 3, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 3, i, j ) = tv( 3, i, j )
+     >        - tv( 1, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 4, 1 )
+            tmat( 4, 2 ) =  tmat( 4, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 4, 3 ) =  tmat( 4, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 4, i, j ) = tv( 4, i, j )
+     >        - tv( 1, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 5, 1 )
+            tmat( 5, 2 ) =  tmat( 5, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 5, 3 ) =  tmat( 5, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 5, i, j ) = tv( 5, i, j )
+     >        - tv( 1, i, j ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 2, 2 )
+            tmp = tmp1 * tmat( 3, 2 )
+            tmat( 3, 3 ) =  tmat( 3, 3 )
+     >           - tmp * tmat( 2, 3 )
+            tmat( 3, 4 ) =  tmat( 3, 4 )
+     >           - tmp * tmat( 2, 4 )
+            tmat( 3, 5 ) =  tmat( 3, 5 )
+     >           - tmp * tmat( 2, 5 )
+            tv( 3, i, j ) = tv( 3, i, j )
+     >        - tv( 2, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 4, 2 )
+            tmat( 4, 3 ) =  tmat( 4, 3 )
+     >           - tmp * tmat( 2, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )
+     >           - tmp * tmat( 2, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )
+     >           - tmp * tmat( 2, 5 )
+            tv( 4, i, j ) = tv( 4, i, j )
+     >        - tv( 2, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 5, 2 )
+            tmat( 5, 3 ) =  tmat( 5, 3 )
+     >           - tmp * tmat( 2, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )
+     >           - tmp * tmat( 2, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 2, 5 )
+            tv( 5, i, j ) = tv( 5, i, j )
+     >        - tv( 2, i, j ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 3, 3 )
+            tmp = tmp1 * tmat( 4, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )
+     >           - tmp * tmat( 3, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )
+     >           - tmp * tmat( 3, 5 )
+            tv( 4, i, j ) = tv( 4, i, j )
+     >        - tv( 3, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 5, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )
+     >           - tmp * tmat( 3, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 3, 5 )
+            tv( 5, i, j ) = tv( 5, i, j )
+     >        - tv( 3, i, j ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 4, 4 )
+            tmp = tmp1 * tmat( 5, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 4, 5 )
+            tv( 5, i, j ) = tv( 5, i, j )
+     >        - tv( 4, i, j ) * tmp
+
+c---------------------------------------------------------------------
+c   back substitution
+c---------------------------------------------------------------------
+            tv( 5, i, j ) = tv( 5, i, j )
+     >                      / tmat( 5, 5 )
+
+            tv( 4, i, j ) = tv( 4, i, j )
+     >           - tmat( 4, 5 ) * tv( 5, i, j )
+            tv( 4, i, j ) = tv( 4, i, j )
+     >                      / tmat( 4, 4 )
+
+            tv( 3, i, j ) = tv( 3, i, j )
+     >           - tmat( 3, 4 ) * tv( 4, i, j )
+     >           - tmat( 3, 5 ) * tv( 5, i, j )
+            tv( 3, i, j ) = tv( 3, i, j )
+     >                      / tmat( 3, 3 )
+
+            tv( 2, i, j ) = tv( 2, i, j )
+     >           - tmat( 2, 3 ) * tv( 3, i, j )
+     >           - tmat( 2, 4 ) * tv( 4, i, j )
+     >           - tmat( 2, 5 ) * tv( 5, i, j )
+            tv( 2, i, j ) = tv( 2, i, j )
+     >                      / tmat( 2, 2 )
+
+            tv( 1, i, j ) = tv( 1, i, j )
+     >           - tmat( 1, 2 ) * tv( 2, i, j )
+     >           - tmat( 1, 3 ) * tv( 3, i, j )
+     >           - tmat( 1, 4 ) * tv( 4, i, j )
+     >           - tmat( 1, 5 ) * tv( 5, i, j )
+            tv( 1, i, j ) = tv( 1, i, j )
+     >                      / tmat( 1, 1 )
+
+            v( 1, i, j, k ) = v( 1, i, j, k ) - tv( 1, i, j )
+            v( 2, i, j, k ) = v( 2, i, j, k ) - tv( 2, i, j )
+            v( 3, i, j, k ) = v( 3, i, j, k ) - tv( 3, i, j )
+            v( 4, i, j, k ) = v( 4, i, j, k ) - tv( 4, i, j )
+            v( 5, i, j, k ) = v( 5, i, j, k ) - tv( 5, i, j )
+
+        enddo
+      end do
+
+ 
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/domain.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/domain.f
new file mode 100644
index 0000000..679ac61
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/domain.f
@@ -0,0 +1,68 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine domain
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+
+
+      nx = nx0
+      ny = ny0
+      nz = nz0
+
+c---------------------------------------------------------------------
+c   check the sub-domain size
+c---------------------------------------------------------------------
+      if ( ( nx .lt. 4 ) .or.
+     >     ( ny .lt. 4 ) .or.
+     >     ( nz .lt. 4 ) ) then
+         write (*,2001) nx, ny, nz
+ 2001    format (5x,'SUBDOMAIN SIZE IS TOO SMALL - ',
+     >        /5x,'ADJUST PROBLEM SIZE OR NUMBER OF PROCESSORS',
+     >        /5x,'SO THAT NX, NY AND NZ ARE GREATER THAN OR EQUAL',
+     >        /5x,'TO 4 THEY ARE CURRENTLY', 3I3)
+         stop
+      end if
+
+      if ( ( nx .gt. isiz1 ) .or.
+     >     ( ny .gt. isiz2 ) .or.
+     >     ( nz .gt. isiz3 ) ) then
+         write (*,2002) nx, ny, nz
+ 2002    format (5x,'SUBDOMAIN SIZE IS TOO LARGE - ',
+     >        /5x,'ADJUST PROBLEM SIZE OR NUMBER OF PROCESSORS',
+     >        /5x,'SO THAT NX, NY AND NZ ARE LESS THAN OR EQUAL TO ',
+     >        /5x,'ISIZ1, ISIZ2 AND ISIZ3 RESPECTIVELY.  THEY ARE',
+     >        /5x,'CURRENTLY', 3I4)
+         stop
+      end if
+
+c---------------------------------------------------------------------
+c   set up the start and end in i and j extents for all processors
+c---------------------------------------------------------------------
+      ist = 2
+      iend = nx - 1
+
+      jst = 2
+      jend = ny - 1
+
+      ii1 = 2
+      ii2 = nx0 - 1
+      ji1 = 2
+      ji2 = ny0 - 2
+      ki1 = 3
+      ki2 = nz0 - 1
+
+      return
+      end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/erhs.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/erhs.f
new file mode 100644
index 0000000..7053ab2
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/erhs.f
@@ -0,0 +1,449 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine erhs
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c
+c   compute the right hand side based on exact solution
+c
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j, k, m
+      double precision  xi, eta, zeta
+      double precision  q
+      double precision  u21, u31, u41
+      double precision  tmp
+      double precision  u21i, u31i, u41i, u51i
+      double precision  u21j, u31j, u41j, u51j
+      double precision  u21k, u31k, u41k, u51k
+      double precision  u21im1, u31im1, u41im1, u51im1
+      double precision  u21jm1, u31jm1, u41jm1, u51jm1
+      double precision  u21km1, u31km1, u41km1, u51km1
+
+
+!$omp parallel default(shared) private(i,j,k,m,xi,eta,zeta,tmp,q,flux,
+!$omp&   u51im1,u41im1,u31im1,u21im1,u51i,u41i,u31i,u21i,u21,
+!$omp&   u51jm1,u41jm1,u31jm1,u21jm1,u51j,u41j,u31j,u21j,u31,
+!$omp&   u51km1,u41km1,u31km1,u21km1,u51k,u41k,u31k,u21k,u41)
+!$omp do schedule(static)
+      do k = 1, nz
+         do j = 1, ny
+            do i = 1, nx
+               do m = 1, 5
+                  frct( m, i, j, k ) = 0.0d+00
+               end do
+            end do
+         end do
+      end do
+!$omp end do nowait
+
+!$omp do schedule(static)
+      do k = 1, nz
+         zeta = ( dble(k-1) ) / ( nz - 1 )
+         do j = 1, ny
+            eta = ( dble(j-1) ) / ( ny0 - 1 )
+            do i = 1, nx
+               xi = ( dble(i-1) ) / ( nx0 - 1 )
+               do m = 1, 5
+                  rsd(m,i,j,k) =  ce(m,1)
+     >                 + (ce(m,2)
+     >                 + (ce(m,5)
+     >                 + (ce(m,8)
+     >                 +  ce(m,11) * xi) * xi) * xi) * xi
+     >                 + (ce(m,3)
+     >                 + (ce(m,6)
+     >                 + (ce(m,9)
+     >                 +  ce(m,12) * eta) * eta) * eta) * eta
+     >                 + (ce(m,4)
+     >                 + (ce(m,7)
+     >                 + (ce(m,10)
+     >                 +  ce(m,13) * zeta) * zeta) * zeta) * zeta
+               end do
+            end do
+         end do
+      end do
+!$omp end do
+
+c---------------------------------------------------------------------
+c   xi-direction flux differences
+c---------------------------------------------------------------------
+
+!$omp do schedule(static)
+      do k = 2, nz - 1
+         do j = jst, jend
+            do i = 1, nx
+               flux(1,i) = rsd(2,i,j,k)
+               u21 = rsd(2,i,j,k) / rsd(1,i,j,k)
+               q = 0.50d+00 * (  rsd(2,i,j,k) * rsd(2,i,j,k)
+     >                         + rsd(3,i,j,k) * rsd(3,i,j,k)
+     >                         + rsd(4,i,j,k) * rsd(4,i,j,k) )
+     >                      / rsd(1,i,j,k)
+               flux(2,i) = rsd(2,i,j,k) * u21 + c2 * 
+     >                         ( rsd(5,i,j,k) - q )
+               flux(3,i) = rsd(3,i,j,k) * u21
+               flux(4,i) = rsd(4,i,j,k) * u21
+               flux(5,i) = ( c1 * rsd(5,i,j,k) - c2 * q ) * u21
+            end do
+
+            do i = ist, iend
+               do m = 1, 5
+                  frct(m,i,j,k) =  frct(m,i,j,k)
+     >                   - tx2 * ( flux(m,i+1) - flux(m,i-1) )
+               end do
+            end do
+            do i = ist, nx
+               tmp = 1.0d+00 / rsd(1,i,j,k)
+
+               u21i = tmp * rsd(2,i,j,k)
+               u31i = tmp * rsd(3,i,j,k)
+               u41i = tmp * rsd(4,i,j,k)
+               u51i = tmp * rsd(5,i,j,k)
+
+               tmp = 1.0d+00 / rsd(1,i-1,j,k)
+
+               u21im1 = tmp * rsd(2,i-1,j,k)
+               u31im1 = tmp * rsd(3,i-1,j,k)
+               u41im1 = tmp * rsd(4,i-1,j,k)
+               u51im1 = tmp * rsd(5,i-1,j,k)
+
+               flux(2,i) = (4.0d+00/3.0d+00) * tx3 * 
+     >                        ( u21i - u21im1 )
+               flux(3,i) = tx3 * ( u31i - u31im1 )
+               flux(4,i) = tx3 * ( u41i - u41im1 )
+               flux(5,i) = 0.50d+00 * ( 1.0d+00 - c1*c5 )
+     >              * tx3 * ( ( u21i  **2 + u31i  **2 + u41i  **2 )
+     >                      - ( u21im1**2 + u31im1**2 + u41im1**2 ) )
+     >              + (1.0d+00/6.0d+00)
+     >              * tx3 * ( u21i**2 - u21im1**2 )
+     >              + c1 * c5 * tx3 * ( u51i - u51im1 )
+            end do
+
+            do i = ist, iend
+               frct(1,i,j,k) = frct(1,i,j,k)
+     >              + dx1 * tx1 * (            rsd(1,i-1,j,k)
+     >                             - 2.0d+00 * rsd(1,i,j,k)
+     >                             +           rsd(1,i+1,j,k) )
+               frct(2,i,j,k) = frct(2,i,j,k)
+     >           + tx3 * c3 * c4 * ( flux(2,i+1) - flux(2,i) )
+     >              + dx2 * tx1 * (            rsd(2,i-1,j,k)
+     >                             - 2.0d+00 * rsd(2,i,j,k)
+     >                             +           rsd(2,i+1,j,k) )
+               frct(3,i,j,k) = frct(3,i,j,k)
+     >           + tx3 * c3 * c4 * ( flux(3,i+1) - flux(3,i) )
+     >              + dx3 * tx1 * (            rsd(3,i-1,j,k)
+     >                             - 2.0d+00 * rsd(3,i,j,k)
+     >                             +           rsd(3,i+1,j,k) )
+               frct(4,i,j,k) = frct(4,i,j,k)
+     >            + tx3 * c3 * c4 * ( flux(4,i+1) - flux(4,i) )
+     >              + dx4 * tx1 * (            rsd(4,i-1,j,k)
+     >                             - 2.0d+00 * rsd(4,i,j,k)
+     >                             +           rsd(4,i+1,j,k) )
+               frct(5,i,j,k) = frct(5,i,j,k)
+     >           + tx3 * c3 * c4 * ( flux(5,i+1) - flux(5,i) )
+     >              + dx5 * tx1 * (            rsd(5,i-1,j,k)
+     >                             - 2.0d+00 * rsd(5,i,j,k)
+     >                             +           rsd(5,i+1,j,k) )
+            end do
+
+c---------------------------------------------------------------------
+c   fourth-order dissipation
+c---------------------------------------------------------------------
+            do m = 1, 5
+               frct(m,2,j,k) = frct(m,2,j,k)
+     >           - dssp * ( + 5.0d+00 * rsd(m,2,j,k)
+     >                       - 4.0d+00 * rsd(m,3,j,k)
+     >                       +           rsd(m,4,j,k) )
+               frct(m,3,j,k) = frct(m,3,j,k)
+     >           - dssp * ( - 4.0d+00 * rsd(m,2,j,k)
+     >                       + 6.0d+00 * rsd(m,3,j,k)
+     >                       - 4.0d+00 * rsd(m,4,j,k)
+     >                       +           rsd(m,5,j,k) )
+            end do
+
+            do i = 4, nx - 3
+               do m = 1, 5
+                  frct(m,i,j,k) = frct(m,i,j,k)
+     >              - dssp * (            rsd(m,i-2,j,k)
+     >                         - 4.0d+00 * rsd(m,i-1,j,k)
+     >                         + 6.0d+00 * rsd(m,i,j,k)
+     >                         - 4.0d+00 * rsd(m,i+1,j,k)
+     >                         +           rsd(m,i+2,j,k) )
+               end do
+            end do
+
+            do m = 1, 5
+               frct(m,nx-2,j,k) = frct(m,nx-2,j,k)
+     >           - dssp * (             rsd(m,nx-4,j,k)
+     >                       - 4.0d+00 * rsd(m,nx-3,j,k)
+     >                       + 6.0d+00 * rsd(m,nx-2,j,k)
+     >                       - 4.0d+00 * rsd(m,nx-1,j,k)  )
+               frct(m,nx-1,j,k) = frct(m,nx-1,j,k)
+     >           - dssp * (             rsd(m,nx-3,j,k)
+     >                       - 4.0d+00 * rsd(m,nx-2,j,k)
+     >                       + 5.0d+00 * rsd(m,nx-1,j,k) )
+            end do
+
+         end do
+      end do
+!$omp end do nowait
+
+c---------------------------------------------------------------------
+c   eta-direction flux differences
+c---------------------------------------------------------------------
+
+!$omp do schedule(static)
+      do k = 2, nz - 1
+         do i = ist, iend
+            do j = 1, ny
+               flux(1,j) = rsd(3,i,j,k)
+               u31 = rsd(3,i,j,k) / rsd(1,i,j,k)
+               q = 0.50d+00 * (  rsd(2,i,j,k) * rsd(2,i,j,k)
+     >                         + rsd(3,i,j,k) * rsd(3,i,j,k)
+     >                         + rsd(4,i,j,k) * rsd(4,i,j,k) )
+     >                      / rsd(1,i,j,k)
+               flux(2,j) = rsd(2,i,j,k) * u31 
+               flux(3,j) = rsd(3,i,j,k) * u31 + c2 * 
+     >                       ( rsd(5,i,j,k) - q )
+               flux(4,j) = rsd(4,i,j,k) * u31
+               flux(5,j) = ( c1 * rsd(5,i,j,k) - c2 * q ) * u31
+            end do
+
+            do j = jst, jend
+               do m = 1, 5
+                  frct(m,i,j,k) =  frct(m,i,j,k)
+     >                 - ty2 * ( flux(m,j+1) - flux(m,j-1) )
+               end do
+            end do
+
+            do j = jst, ny
+               tmp = 1.0d+00 / rsd(1,i,j,k)
+
+               u21j = tmp * rsd(2,i,j,k)
+               u31j = tmp * rsd(3,i,j,k)
+               u41j = tmp * rsd(4,i,j,k)
+               u51j = tmp * rsd(5,i,j,k)
+
+               tmp = 1.0d+00 / rsd(1,i,j-1,k)
+
+               u21jm1 = tmp * rsd(2,i,j-1,k)
+               u31jm1 = tmp * rsd(3,i,j-1,k)
+               u41jm1 = tmp * rsd(4,i,j-1,k)
+               u51jm1 = tmp * rsd(5,i,j-1,k)
+
+               flux(2,j) = ty3 * ( u21j - u21jm1 )
+               flux(3,j) = (4.0d+00/3.0d+00) * ty3 * 
+     >                       ( u31j - u31jm1 )
+               flux(4,j) = ty3 * ( u41j - u41jm1 )
+               flux(5,j) = 0.50d+00 * ( 1.0d+00 - c1*c5 )
+     >              * ty3 * ( ( u21j  **2 + u31j  **2 + u41j  **2 )
+     >                      - ( u21jm1**2 + u31jm1**2 + u41jm1**2 ) )
+     >              + (1.0d+00/6.0d+00)
+     >              * ty3 * ( u31j**2 - u31jm1**2 )
+     >              + c1 * c5 * ty3 * ( u51j - u51jm1 )
+            end do
+
+            do j = jst, jend
+               frct(1,i,j,k) = frct(1,i,j,k)
+     >              + dy1 * ty1 * (            rsd(1,i,j-1,k)
+     >                             - 2.0d+00 * rsd(1,i,j,k)
+     >                             +           rsd(1,i,j+1,k) )
+               frct(2,i,j,k) = frct(2,i,j,k)
+     >          + ty3 * c3 * c4 * ( flux(2,j+1) - flux(2,j) )
+     >              + dy2 * ty1 * (            rsd(2,i,j-1,k)
+     >                             - 2.0d+00 * rsd(2,i,j,k)
+     >                             +           rsd(2,i,j+1,k) )
+               frct(3,i,j,k) = frct(3,i,j,k)
+     >          + ty3 * c3 * c4 * ( flux(3,j+1) - flux(3,j) )
+     >              + dy3 * ty1 * (            rsd(3,i,j-1,k)
+     >                             - 2.0d+00 * rsd(3,i,j,k)
+     >                             +           rsd(3,i,j+1,k) )
+               frct(4,i,j,k) = frct(4,i,j,k)
+     >          + ty3 * c3 * c4 * ( flux(4,j+1) - flux(4,j) )
+     >              + dy4 * ty1 * (            rsd(4,i,j-1,k)
+     >                             - 2.0d+00 * rsd(4,i,j,k)
+     >                             +           rsd(4,i,j+1,k) )
+               frct(5,i,j,k) = frct(5,i,j,k)
+     >          + ty3 * c3 * c4 * ( flux(5,j+1) - flux(5,j) )
+     >              + dy5 * ty1 * (            rsd(5,i,j-1,k)
+     >                             - 2.0d+00 * rsd(5,i,j,k)
+     >                             +           rsd(5,i,j+1,k) )
+            end do
+
+c---------------------------------------------------------------------
+c   fourth-order dissipation
+c---------------------------------------------------------------------
+            do m = 1, 5
+               frct(m,i,2,k) = frct(m,i,2,k)
+     >           - dssp * ( + 5.0d+00 * rsd(m,i,2,k)
+     >                       - 4.0d+00 * rsd(m,i,3,k)
+     >                       +           rsd(m,i,4,k) )
+               frct(m,i,3,k) = frct(m,i,3,k)
+     >           - dssp * ( - 4.0d+00 * rsd(m,i,2,k)
+     >                       + 6.0d+00 * rsd(m,i,3,k)
+     >                       - 4.0d+00 * rsd(m,i,4,k)
+     >                       +           rsd(m,i,5,k) )
+            end do
+
+            do j = 4, ny - 3
+               do m = 1, 5
+                  frct(m,i,j,k) = frct(m,i,j,k)
+     >              - dssp * (            rsd(m,i,j-2,k)
+     >                        - 4.0d+00 * rsd(m,i,j-1,k)
+     >                        + 6.0d+00 * rsd(m,i,j,k)
+     >                        - 4.0d+00 * rsd(m,i,j+1,k)
+     >                        +           rsd(m,i,j+2,k) )
+               end do
+            end do
+
+            do m = 1, 5
+               frct(m,i,ny-2,k) = frct(m,i,ny-2,k)
+     >           - dssp * (             rsd(m,i,ny-4,k)
+     >                       - 4.0d+00 * rsd(m,i,ny-3,k)
+     >                       + 6.0d+00 * rsd(m,i,ny-2,k)
+     >                       - 4.0d+00 * rsd(m,i,ny-1,k)  )
+               frct(m,i,ny-1,k) = frct(m,i,ny-1,k)
+     >           - dssp * (             rsd(m,i,ny-3,k)
+     >                       - 4.0d+00 * rsd(m,i,ny-2,k)
+     >                       + 5.0d+00 * rsd(m,i,ny-1,k)  )
+            end do
+
+         end do
+      end do
+!$omp end do
+
+c---------------------------------------------------------------------
+c   zeta-direction flux differences
+c---------------------------------------------------------------------
+!$omp do schedule(static)
+      do j = jst, jend
+         do i = ist, iend
+            do k = 1, nz
+               flux(1,k) = rsd(4,i,j,k)
+               u41 = rsd(4,i,j,k) / rsd(1,i,j,k)
+               q = 0.50d+00 * (  rsd(2,i,j,k) * rsd(2,i,j,k)
+     >                         + rsd(3,i,j,k) * rsd(3,i,j,k)
+     >                         + rsd(4,i,j,k) * rsd(4,i,j,k) )
+     >                      / rsd(1,i,j,k)
+               flux(2,k) = rsd(2,i,j,k) * u41 
+               flux(3,k) = rsd(3,i,j,k) * u41 
+               flux(4,k) = rsd(4,i,j,k) * u41 + c2 * 
+     >                         ( rsd(5,i,j,k) - q )
+               flux(5,k) = ( c1 * rsd(5,i,j,k) - c2 * q ) * u41
+            end do
+
+            do k = 2, nz - 1
+               do m = 1, 5
+                  frct(m,i,j,k) =  frct(m,i,j,k)
+     >                  - tz2 * ( flux(m,k+1) - flux(m,k-1) )
+               end do
+            end do
+
+            do k = 2, nz
+               tmp = 1.0d+00 / rsd(1,i,j,k)
+
+               u21k = tmp * rsd(2,i,j,k)
+               u31k = tmp * rsd(3,i,j,k)
+               u41k = tmp * rsd(4,i,j,k)
+               u51k = tmp * rsd(5,i,j,k)
+
+               tmp = 1.0d+00 / rsd(1,i,j,k-1)
+
+               u21km1 = tmp * rsd(2,i,j,k-1)
+               u31km1 = tmp * rsd(3,i,j,k-1)
+               u41km1 = tmp * rsd(4,i,j,k-1)
+               u51km1 = tmp * rsd(5,i,j,k-1)
+
+               flux(2,k) = tz3 * ( u21k - u21km1 )
+               flux(3,k) = tz3 * ( u31k - u31km1 )
+               flux(4,k) = (4.0d+00/3.0d+00) * tz3 * ( u41k 
+     >                       - u41km1 )
+               flux(5,k) = 0.50d+00 * ( 1.0d+00 - c1*c5 )
+     >              * tz3 * ( ( u21k  **2 + u31k  **2 + u41k  **2 )
+     >                      - ( u21km1**2 + u31km1**2 + u41km1**2 ) )
+     >              + (1.0d+00/6.0d+00)
+     >              * tz3 * ( u41k**2 - u41km1**2 )
+     >              + c1 * c5 * tz3 * ( u51k - u51km1 )
+            end do
+
+            do k = 2, nz - 1
+               frct(1,i,j,k) = frct(1,i,j,k)
+     >              + dz1 * tz1 * (            rsd(1,i,j,k+1)
+     >                             - 2.0d+00 * rsd(1,i,j,k)
+     >                             +           rsd(1,i,j,k-1) )
+               frct(2,i,j,k) = frct(2,i,j,k)
+     >          + tz3 * c3 * c4 * ( flux(2,k+1) - flux(2,k) )
+     >              + dz2 * tz1 * (            rsd(2,i,j,k+1)
+     >                             - 2.0d+00 * rsd(2,i,j,k)
+     >                             +           rsd(2,i,j,k-1) )
+               frct(3,i,j,k) = frct(3,i,j,k)
+     >          + tz3 * c3 * c4 * ( flux(3,k+1) - flux(3,k) )
+     >              + dz3 * tz1 * (            rsd(3,i,j,k+1)
+     >                             - 2.0d+00 * rsd(3,i,j,k)
+     >                             +           rsd(3,i,j,k-1) )
+               frct(4,i,j,k) = frct(4,i,j,k)
+     >          + tz3 * c3 * c4 * ( flux(4,k+1) - flux(4,k) )
+     >              + dz4 * tz1 * (            rsd(4,i,j,k+1)
+     >                             - 2.0d+00 * rsd(4,i,j,k)
+     >                             +           rsd(4,i,j,k-1) )
+               frct(5,i,j,k) = frct(5,i,j,k)
+     >          + tz3 * c3 * c4 * ( flux(5,k+1) - flux(5,k) )
+     >              + dz5 * tz1 * (            rsd(5,i,j,k+1)
+     >                             - 2.0d+00 * rsd(5,i,j,k)
+     >                             +           rsd(5,i,j,k-1) )
+            end do
+
+c---------------------------------------------------------------------
+c   fourth-order dissipation
+c---------------------------------------------------------------------
+            do m = 1, 5
+               frct(m,i,j,2) = frct(m,i,j,2)
+     >           - dssp * ( + 5.0d+00 * rsd(m,i,j,2)
+     >                       - 4.0d+00 * rsd(m,i,j,3)
+     >                       +           rsd(m,i,j,4) )
+               frct(m,i,j,3) = frct(m,i,j,3)
+     >           - dssp * (- 4.0d+00 * rsd(m,i,j,2)
+     >                      + 6.0d+00 * rsd(m,i,j,3)
+     >                      - 4.0d+00 * rsd(m,i,j,4)
+     >                      +           rsd(m,i,j,5) )
+            end do
+
+            do k = 4, nz - 3
+               do m = 1, 5
+                  frct(m,i,j,k) = frct(m,i,j,k)
+     >              - dssp * (           rsd(m,i,j,k-2)
+     >                        - 4.0d+00 * rsd(m,i,j,k-1)
+     >                        + 6.0d+00 * rsd(m,i,j,k)
+     >                        - 4.0d+00 * rsd(m,i,j,k+1)
+     >                        +           rsd(m,i,j,k+2) )
+               end do
+            end do
+
+            do m = 1, 5
+               frct(m,i,j,nz-2) = frct(m,i,j,nz-2)
+     >           - dssp * (            rsd(m,i,j,nz-4)
+     >                      - 4.0d+00 * rsd(m,i,j,nz-3)
+     >                      + 6.0d+00 * rsd(m,i,j,nz-2)
+     >                      - 4.0d+00 * rsd(m,i,j,nz-1)  )
+               frct(m,i,j,nz-1) = frct(m,i,j,nz-1)
+     >           - dssp * (             rsd(m,i,j,nz-3)
+     >                       - 4.0d+00 * rsd(m,i,j,nz-2)
+     >                       + 5.0d+00 * rsd(m,i,j,nz-1)  )
+            end do
+         end do
+      end do
+!$omp end do nowait
+!$omp end parallel
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/error.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/error.f
new file mode 100644
index 0000000..a0b713e
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/error.f
@@ -0,0 +1,73 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine error
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c
+c   compute the solution error
+c
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j, k, m
+      double precision  tmp
+      double precision  u000ijk(5)
+      double precision  errnm_local(5)
+
+
+      do m = 1, 5
+         errnm(m) = 0.0d+00
+      end do
+
+!$omp parallel default(shared) private(i,j,k,m,tmp,u000ijk,errnm_local)
+      do m = 1, 5
+         errnm_local(m) = 0.0d+00
+      end do
+!$omp do
+      do k = 2, nz-1
+         do j = jst, jend
+            do i = ist, iend
+               call exact( i, j, k, u000ijk )
+               do m = 1, 5
+                  tmp = ( u000ijk(m) - u(m,i,j,k) )
+                  errnm_local(m) = errnm_local(m) + tmp * tmp
+               end do
+            end do
+         end do
+      end do
+!$omp end do nowait
+      do m = 1, 5
+!$omp atomic
+         errnm(m) = errnm(m) + errnm_local(m)
+      end do
+!$omp end parallel
+
+      do m = 1, 5
+         errnm(m) = sqrt ( errnm(m) / ( (nx0-2)*(ny0-2)*(nz0-2) ) )
+      end do
+
+c        write (*,1002) ( errnm(m), m = 1, 5 )
+
+ 1002 format (1x/1x,'RMS-norm of error in soln. to ',
+     > 'first pde  = ',1pe12.5/,
+     > 1x,'RMS-norm of error in soln. to ',
+     > 'second pde = ',1pe12.5/,
+     > 1x,'RMS-norm of error in soln. to ',
+     > 'third pde  = ',1pe12.5/,
+     > 1x,'RMS-norm of error in soln. to ',
+     > 'fourth pde = ',1pe12.5/,
+     > 1x,'RMS-norm of error in soln. to ',
+     > 'fifth pde  = ',1pe12.5)
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/exact.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/exact.f
new file mode 100644
index 0000000..5a7c958
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/exact.f
@@ -0,0 +1,53 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine exact( i, j, k, u000ijk )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c
+c   compute the exact solution at (i,j,k)
+c
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  input parameters
+c---------------------------------------------------------------------
+      integer i, j, k
+      double precision u000ijk(*)
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer m
+      double precision xi, eta, zeta
+
+      xi  = ( dble ( i - 1 ) ) / ( nx0 - 1 )
+      eta  = ( dble ( j - 1 ) ) / ( ny0 - 1 )
+      zeta = ( dble ( k - 1 ) ) / ( nz - 1 )
+
+
+      do m = 1, 5
+         u000ijk(m) =  ce(m,1)
+     >        + (ce(m,2)
+     >        + (ce(m,5)
+     >        + (ce(m,8)
+     >        +  ce(m,11) * xi) * xi) * xi) * xi
+     >        + (ce(m,3)
+     >        + (ce(m,6)
+     >        + (ce(m,9)
+     >        +  ce(m,12) * eta) * eta) * eta) * eta
+     >        + (ce(m,4)
+     >        + (ce(m,7)
+     >        + (ce(m,10)
+     >        +  ce(m,13) * zeta) * zeta) * zeta) * zeta
+      end do
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/inputlu.data.sample b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/inputlu.data.sample
new file mode 100644
index 0000000..9ef5a7b
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/inputlu.data.sample
@@ -0,0 +1,24 @@
+c
+c***controls printing of the progress of iterations: ipr    inorm
+                                                      1      250
+c
+c***the maximum no. of pseudo-time steps to be performed: nitmax
+                                                             250
+c
+c***magnitude of the time step: dt 
+                               2.0e+00
+c
+c***relaxation factor for SSOR iterations: omega
+                                            1.2
+c
+c***tolerance levels for steady-state residuals: tolnwt(m),m=1,5
+                             1.0e-08   1.0e-08   1.0e-08  1.0e-08  1.0e-08 
+c
+c***number of grid points in xi and eta and zeta directions: nx   ny   nz
+                                                            64  64  64
+c
+
+
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/jacld.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/jacld.f
new file mode 100644
index 0000000..b7c735f
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/jacld.f
@@ -0,0 +1,358 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine jacld(k)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+
+c---------------------------------------------------------------------
+c   compute the lower triangular part of the jacobian matrix
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  input parameters
+c---------------------------------------------------------------------
+      integer k
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j
+      double precision  r43
+      double precision  c1345
+      double precision  c34
+      double precision  tmp1, tmp2, tmp3
+
+
+
+      r43 = ( 4.0d+00 / 3.0d+00 )
+      c1345 = c1 * c3 * c4 * c5
+      c34 = c3 * c4
+
+!$omp do schedule(static)
+         do j = jst, jend
+            do i = ist, iend
+
+c---------------------------------------------------------------------
+c   form the block diagonal
+c---------------------------------------------------------------------
+               tmp1 = rho_i(i,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               d(1,1,i,j) =  1.0d+00
+     >                       + dt * 2.0d+00 * (   tx1 * dx1
+     >                                          + ty1 * dy1
+     >                                          + tz1 * dz1 )
+               d(1,2,i,j) =  0.0d+00
+               d(1,3,i,j) =  0.0d+00
+               d(1,4,i,j) =  0.0d+00
+               d(1,5,i,j) =  0.0d+00
+
+               d(2,1,i,j) = -dt * 2.0d+00
+     >          * (  tx1 * r43 + ty1 + tz1  )
+     >          * c34 * tmp2 * u(2,i,j,k)
+               d(2,2,i,j) =  1.0d+00
+     >          + dt * 2.0d+00 * c34 * tmp1 
+     >          * (  tx1 * r43 + ty1 + tz1 )
+     >          + dt * 2.0d+00 * (   tx1 * dx2
+     >                             + ty1 * dy2
+     >                             + tz1 * dz2  )
+               d(2,3,i,j) = 0.0d+00
+               d(2,4,i,j) = 0.0d+00
+               d(2,5,i,j) = 0.0d+00
+
+               d(3,1,i,j) = -dt * 2.0d+00
+     >           * (  tx1 + ty1 * r43 + tz1  )
+     >           * c34 * tmp2 * u(3,i,j,k)
+               d(3,2,i,j) = 0.0d+00
+               d(3,3,i,j) = 1.0d+00
+     >         + dt * 2.0d+00 * c34 * tmp1
+     >              * (  tx1 + ty1 * r43 + tz1 )
+     >         + dt * 2.0d+00 * (  tx1 * dx3
+     >                           + ty1 * dy3
+     >                           + tz1 * dz3 )
+               d(3,4,i,j) = 0.0d+00
+               d(3,5,i,j) = 0.0d+00
+
+               d(4,1,i,j) = -dt * 2.0d+00
+     >           * (  tx1 + ty1 + tz1 * r43  )
+     >           * c34 * tmp2 * u(4,i,j,k)
+               d(4,2,i,j) = 0.0d+00
+               d(4,3,i,j) = 0.0d+00
+               d(4,4,i,j) = 1.0d+00
+     >         + dt * 2.0d+00 * c34 * tmp1
+     >              * (  tx1 + ty1 + tz1 * r43 )
+     >         + dt * 2.0d+00 * (  tx1 * dx4
+     >                           + ty1 * dy4
+     >                           + tz1 * dz4 )
+               d(4,5,i,j) = 0.0d+00
+
+               d(5,1,i,j) = -dt * 2.0d+00
+     >  * ( ( ( tx1 * ( r43*c34 - c1345 )
+     >     + ty1 * ( c34 - c1345 )
+     >     + tz1 * ( c34 - c1345 ) ) * ( u(2,i,j,k) ** 2 )
+     >   + ( tx1 * ( c34 - c1345 )
+     >     + ty1 * ( r43*c34 - c1345 )
+     >     + tz1 * ( c34 - c1345 ) ) * ( u(3,i,j,k) ** 2 )
+     >   + ( tx1 * ( c34 - c1345 )
+     >     + ty1 * ( c34 - c1345 )
+     >     + tz1 * ( r43*c34 - c1345 ) ) * ( u(4,i,j,k) ** 2 )
+     >      ) * tmp3
+     >   + ( tx1 + ty1 + tz1 ) * c1345 * tmp2 * u(5,i,j,k) )
+
+               d(5,2,i,j) = dt * 2.0d+00 * tmp2 * u(2,i,j,k)
+     > * ( tx1 * ( r43*c34 - c1345 )
+     >   + ty1 * (     c34 - c1345 )
+     >   + tz1 * (     c34 - c1345 ) )
+               d(5,3,i,j) = dt * 2.0d+00 * tmp2 * u(3,i,j,k)
+     > * ( tx1 * ( c34 - c1345 )
+     >   + ty1 * ( r43*c34 -c1345 )
+     >   + tz1 * ( c34 - c1345 ) )
+               d(5,4,i,j) = dt * 2.0d+00 * tmp2 * u(4,i,j,k)
+     > * ( tx1 * ( c34 - c1345 )
+     >   + ty1 * ( c34 - c1345 )
+     >   + tz1 * ( r43*c34 - c1345 ) )
+               d(5,5,i,j) = 1.0d+00
+     >   + dt * 2.0d+00 * ( tx1  + ty1 + tz1 ) * c1345 * tmp1
+     >   + dt * 2.0d+00 * (  tx1 * dx5
+     >                    +  ty1 * dy5
+     >                    +  tz1 * dz5 )
+
+c---------------------------------------------------------------------
+c   form the first block sub-diagonal
+c---------------------------------------------------------------------
+               tmp1 = rho_i(i,j,k-1)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               a(1,1,i,j) = - dt * tz1 * dz1
+               a(1,2,i,j) =   0.0d+00
+               a(1,3,i,j) =   0.0d+00
+               a(1,4,i,j) = - dt * tz2
+               a(1,5,i,j) =   0.0d+00
+
+               a(2,1,i,j) = - dt * tz2
+     >           * ( - ( u(2,i,j,k-1)*u(4,i,j,k-1) ) * tmp2 )
+     >           - dt * tz1 * ( - c34 * tmp2 * u(2,i,j,k-1) )
+               a(2,2,i,j) = - dt * tz2 * ( u(4,i,j,k-1) * tmp1 )
+     >           - dt * tz1 * c34 * tmp1
+     >           - dt * tz1 * dz2 
+               a(2,3,i,j) = 0.0d+00
+               a(2,4,i,j) = - dt * tz2 * ( u(2,i,j,k-1) * tmp1 )
+               a(2,5,i,j) = 0.0d+00
+
+               a(3,1,i,j) = - dt * tz2
+     >           * ( - ( u(3,i,j,k-1)*u(4,i,j,k-1) ) * tmp2 )
+     >           - dt * tz1 * ( - c34 * tmp2 * u(3,i,j,k-1) )
+               a(3,2,i,j) = 0.0d+00
+               a(3,3,i,j) = - dt * tz2 * ( u(4,i,j,k-1) * tmp1 )
+     >           - dt * tz1 * ( c34 * tmp1 )
+     >           - dt * tz1 * dz3
+               a(3,4,i,j) = - dt * tz2 * ( u(3,i,j,k-1) * tmp1 )
+               a(3,5,i,j) = 0.0d+00
+
+               a(4,1,i,j) = - dt * tz2
+     >        * ( - ( u(4,i,j,k-1) * tmp1 ) ** 2
+     >            + c2 * qs(i,j,k-1) * tmp1 )
+     >        - dt * tz1 * ( - r43 * c34 * tmp2 * u(4,i,j,k-1) )
+               a(4,2,i,j) = - dt * tz2
+     >             * ( - c2 * ( u(2,i,j,k-1) * tmp1 ) )
+               a(4,3,i,j) = - dt * tz2
+     >             * ( - c2 * ( u(3,i,j,k-1) * tmp1 ) )
+               a(4,4,i,j) = - dt * tz2 * ( 2.0d+00 - c2 )
+     >             * ( u(4,i,j,k-1) * tmp1 )
+     >             - dt * tz1 * ( r43 * c34 * tmp1 )
+     >             - dt * tz1 * dz4
+               a(4,5,i,j) = - dt * tz2 * c2
+
+               a(5,1,i,j) = - dt * tz2
+     >       * ( ( c2 * 2.0d0 * qs(i,j,k-1)
+     >       - c1 * u(5,i,j,k-1) )
+     >            * u(4,i,j,k-1) * tmp2 )
+     >       - dt * tz1
+     >       * ( - ( c34 - c1345 ) * tmp3 * (u(2,i,j,k-1)**2)
+     >           - ( c34 - c1345 ) * tmp3 * (u(3,i,j,k-1)**2)
+     >           - ( r43*c34 - c1345 )* tmp3 * (u(4,i,j,k-1)**2)
+     >          - c1345 * tmp2 * u(5,i,j,k-1) )
+               a(5,2,i,j) = - dt * tz2
+     >       * ( - c2 * ( u(2,i,j,k-1)*u(4,i,j,k-1) ) * tmp2 )
+     >       - dt * tz1 * ( c34 - c1345 ) * tmp2 * u(2,i,j,k-1)
+               a(5,3,i,j) = - dt * tz2
+     >       * ( - c2 * ( u(3,i,j,k-1)*u(4,i,j,k-1) ) * tmp2 )
+     >       - dt * tz1 * ( c34 - c1345 ) * tmp2 * u(3,i,j,k-1)
+               a(5,4,i,j) = - dt * tz2
+     >       * ( c1 * ( u(5,i,j,k-1) * tmp1 )
+     >       - c2
+     >       * ( qs(i,j,k-1) * tmp1
+     >            + u(4,i,j,k-1)*u(4,i,j,k-1) * tmp2 ) )
+     >       - dt * tz1 * ( r43*c34 - c1345 ) * tmp2 * u(4,i,j,k-1)
+               a(5,5,i,j) = - dt * tz2
+     >       * ( c1 * ( u(4,i,j,k-1) * tmp1 ) )
+     >       - dt * tz1 * c1345 * tmp1
+     >       - dt * tz1 * dz5
+
+c---------------------------------------------------------------------
+c   form the second block sub-diagonal
+c---------------------------------------------------------------------
+               tmp1 = rho_i(i,j-1,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               b(1,1,i,j) = - dt * ty1 * dy1
+               b(1,2,i,j) =   0.0d+00
+               b(1,3,i,j) = - dt * ty2
+               b(1,4,i,j) =   0.0d+00
+               b(1,5,i,j) =   0.0d+00
+
+               b(2,1,i,j) = - dt * ty2
+     >           * ( - ( u(2,i,j-1,k)*u(3,i,j-1,k) ) * tmp2 )
+     >           - dt * ty1 * ( - c34 * tmp2 * u(2,i,j-1,k) )
+               b(2,2,i,j) = - dt * ty2 * ( u(3,i,j-1,k) * tmp1 )
+     >          - dt * ty1 * ( c34 * tmp1 )
+     >          - dt * ty1 * dy2
+               b(2,3,i,j) = - dt * ty2 * ( u(2,i,j-1,k) * tmp1 )
+               b(2,4,i,j) = 0.0d+00
+               b(2,5,i,j) = 0.0d+00
+
+               b(3,1,i,j) = - dt * ty2
+     >           * ( - ( u(3,i,j-1,k) * tmp1 ) ** 2
+     >       + c2 * ( qs(i,j-1,k) * tmp1 ) )
+     >       - dt * ty1 * ( - r43 * c34 * tmp2 * u(3,i,j-1,k) )
+               b(3,2,i,j) = - dt * ty2
+     >                   * ( - c2 * ( u(2,i,j-1,k) * tmp1 ) )
+               b(3,3,i,j) = - dt * ty2 * ( ( 2.0d+00 - c2 )
+     >                   * ( u(3,i,j-1,k) * tmp1 ) )
+     >       - dt * ty1 * ( r43 * c34 * tmp1 )
+     >       - dt * ty1 * dy3
+               b(3,4,i,j) = - dt * ty2
+     >                   * ( - c2 * ( u(4,i,j-1,k) * tmp1 ) )
+               b(3,5,i,j) = - dt * ty2 * c2
+
+               b(4,1,i,j) = - dt * ty2
+     >              * ( - ( u(3,i,j-1,k)*u(4,i,j-1,k) ) * tmp2 )
+     >       - dt * ty1 * ( - c34 * tmp2 * u(4,i,j-1,k) )
+               b(4,2,i,j) = 0.0d+00
+               b(4,3,i,j) = - dt * ty2 * ( u(4,i,j-1,k) * tmp1 )
+               b(4,4,i,j) = - dt * ty2 * ( u(3,i,j-1,k) * tmp1 )
+     >                        - dt * ty1 * ( c34 * tmp1 )
+     >                        - dt * ty1 * dy4
+               b(4,5,i,j) = 0.0d+00
+
+               b(5,1,i,j) = - dt * ty2
+     >          * ( ( c2 * 2.0d0 * qs(i,j-1,k)
+     >               - c1 * u(5,i,j-1,k) )
+     >          * ( u(3,i,j-1,k) * tmp2 ) )
+     >          - dt * ty1
+     >          * ( - (     c34 - c1345 )*tmp3*(u(2,i,j-1,k)**2)
+     >              - ( r43*c34 - c1345 )*tmp3*(u(3,i,j-1,k)**2)
+     >              - (     c34 - c1345 )*tmp3*(u(4,i,j-1,k)**2)
+     >              - c1345*tmp2*u(5,i,j-1,k) )
+               b(5,2,i,j) = - dt * ty2
+     >          * ( - c2 * ( u(2,i,j-1,k)*u(3,i,j-1,k) ) * tmp2 )
+     >          - dt * ty1
+     >          * ( c34 - c1345 ) * tmp2 * u(2,i,j-1,k)
+               b(5,3,i,j) = - dt * ty2
+     >          * ( c1 * ( u(5,i,j-1,k) * tmp1 )
+     >          - c2 
+     >          * ( qs(i,j-1,k) * tmp1
+     >               + u(3,i,j-1,k)*u(3,i,j-1,k) * tmp2 ) )
+     >          - dt * ty1
+     >          * ( r43*c34 - c1345 ) * tmp2 * u(3,i,j-1,k)
+               b(5,4,i,j) = - dt * ty2
+     >          * ( - c2 * ( u(3,i,j-1,k)*u(4,i,j-1,k) ) * tmp2 )
+     >          - dt * ty1 * ( c34 - c1345 ) * tmp2 * u(4,i,j-1,k)
+               b(5,5,i,j) = - dt * ty2
+     >          * ( c1 * ( u(3,i,j-1,k) * tmp1 ) )
+     >          - dt * ty1 * c1345 * tmp1
+     >          - dt * ty1 * dy5
+
+c---------------------------------------------------------------------
+c   form the third block sub-diagonal
+c---------------------------------------------------------------------
+               tmp1 = rho_i(i-1,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               c(1,1,i,j) = - dt * tx1 * dx1
+               c(1,2,i,j) = - dt * tx2
+               c(1,3,i,j) =   0.0d+00
+               c(1,4,i,j) =   0.0d+00
+               c(1,5,i,j) =   0.0d+00
+
+               c(2,1,i,j) = - dt * tx2
+     >          * ( - ( u(2,i-1,j,k) * tmp1 ) ** 2
+     >       + c2 * qs(i-1,j,k) * tmp1 )
+     >          - dt * tx1 * ( - r43 * c34 * tmp2 * u(2,i-1,j,k) )
+               c(2,2,i,j) = - dt * tx2
+     >          * ( ( 2.0d+00 - c2 ) * ( u(2,i-1,j,k) * tmp1 ) )
+     >          - dt * tx1 * ( r43 * c34 * tmp1 )
+     >          - dt * tx1 * dx2
+               c(2,3,i,j) = - dt * tx2
+     >              * ( - c2 * ( u(3,i-1,j,k) * tmp1 ) )
+               c(2,4,i,j) = - dt * tx2
+     >              * ( - c2 * ( u(4,i-1,j,k) * tmp1 ) )
+               c(2,5,i,j) = - dt * tx2 * c2 
+
+               c(3,1,i,j) = - dt * tx2
+     >              * ( - ( u(2,i-1,j,k) * u(3,i-1,j,k) ) * tmp2 )
+     >         - dt * tx1 * ( - c34 * tmp2 * u(3,i-1,j,k) )
+               c(3,2,i,j) = - dt * tx2 * ( u(3,i-1,j,k) * tmp1 )
+               c(3,3,i,j) = - dt * tx2 * ( u(2,i-1,j,k) * tmp1 )
+     >          - dt * tx1 * ( c34 * tmp1 )
+     >          - dt * tx1 * dx3
+               c(3,4,i,j) = 0.0d+00
+               c(3,5,i,j) = 0.0d+00
+
+               c(4,1,i,j) = - dt * tx2
+     >          * ( - ( u(2,i-1,j,k)*u(4,i-1,j,k) ) * tmp2 )
+     >          - dt * tx1 * ( - c34 * tmp2 * u(4,i-1,j,k) )
+               c(4,2,i,j) = - dt * tx2 * ( u(4,i-1,j,k) * tmp1 )
+               c(4,3,i,j) = 0.0d+00
+               c(4,4,i,j) = - dt * tx2 * ( u(2,i-1,j,k) * tmp1 )
+     >          - dt * tx1 * ( c34 * tmp1 )
+     >          - dt * tx1 * dx4
+               c(4,5,i,j) = 0.0d+00
+
+               c(5,1,i,j) = - dt * tx2
+     >          * ( ( c2 * 2.0d0 * qs(i-1,j,k)
+     >              - c1 * u(5,i-1,j,k) )
+     >          * u(2,i-1,j,k) * tmp2 )
+     >          - dt * tx1
+     >          * ( - ( r43*c34 - c1345 ) * tmp3 * ( u(2,i-1,j,k)**2 )
+     >              - (     c34 - c1345 ) * tmp3 * ( u(3,i-1,j,k)**2 )
+     >              - (     c34 - c1345 ) * tmp3 * ( u(4,i-1,j,k)**2 )
+     >              - c1345 * tmp2 * u(5,i-1,j,k) )
+               c(5,2,i,j) = - dt * tx2
+     >          * ( c1 * ( u(5,i-1,j,k) * tmp1 )
+     >             - c2
+     >             * ( u(2,i-1,j,k)*u(2,i-1,j,k) * tmp2
+     >                  + qs(i-1,j,k) * tmp1 ) )
+     >           - dt * tx1
+     >           * ( r43*c34 - c1345 ) * tmp2 * u(2,i-1,j,k)
+               c(5,3,i,j) = - dt * tx2
+     >           * ( - c2 * ( u(3,i-1,j,k)*u(2,i-1,j,k) ) * tmp2 )
+     >           - dt * tx1
+     >           * (  c34 - c1345 ) * tmp2 * u(3,i-1,j,k)
+               c(5,4,i,j) = - dt * tx2
+     >           * ( - c2 * ( u(4,i-1,j,k)*u(2,i-1,j,k) ) * tmp2 )
+     >           - dt * tx1
+     >           * (  c34 - c1345 ) * tmp2 * u(4,i-1,j,k)
+               c(5,5,i,j) = - dt * tx2
+     >           * ( c1 * ( u(2,i-1,j,k) * tmp1 ) )
+     >           - dt * tx1 * c1345 * tmp1
+     >           - dt * tx1 * dx5
+
+            end do
+         end do
+!$omp end do nowait
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/jacu.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/jacu.f
new file mode 100644
index 0000000..0264579
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/jacu.f
@@ -0,0 +1,358 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine jacu(k)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   compute the upper triangular part of the jacobian matrix
+c---------------------------------------------------------------------
+
+
+      implicit none
+
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  input parameters
+c---------------------------------------------------------------------
+      integer k
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j
+      double precision  r43
+      double precision  c1345
+      double precision  c34
+      double precision  tmp1, tmp2, tmp3
+
+
+
+      r43 = ( 4.0d+00 / 3.0d+00 )
+      c1345 = c1 * c3 * c4 * c5
+      c34 = c3 * c4
+
+!$omp do schedule(static)
+         do j = jend, jst, -1
+            do i = iend, ist, -1
+
+c---------------------------------------------------------------------
+c   form the block diagonal
+c---------------------------------------------------------------------
+               tmp1 = rho_i(i,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               du(1,1,i,j) =  1.0d+00
+     >                       + dt * 2.0d+00 * (   tx1 * dx1
+     >                                          + ty1 * dy1
+     >                                          + tz1 * dz1 )
+               du(1,2,i,j) =  0.0d+00
+               du(1,3,i,j) =  0.0d+00
+               du(1,4,i,j) =  0.0d+00
+               du(1,5,i,j) =  0.0d+00
+
+               du(2,1,i,j) =  dt * 2.0d+00
+     >           * ( - tx1 * r43 - ty1 - tz1 )
+     >           * ( c34 * tmp2 * u(2,i,j,k) )
+               du(2,2,i,j) =  1.0d+00
+     >          + dt * 2.0d+00 * c34 * tmp1 
+     >          * (  tx1 * r43 + ty1 + tz1 )
+     >          + dt * 2.0d+00 * (   tx1 * dx2
+     >                             + ty1 * dy2
+     >                             + tz1 * dz2  )
+               du(2,3,i,j) = 0.0d+00
+               du(2,4,i,j) = 0.0d+00
+               du(2,5,i,j) = 0.0d+00
+
+               du(3,1,i,j) = dt * 2.0d+00
+     >           * ( - tx1 - ty1 * r43 - tz1 )
+     >           * ( c34 * tmp2 * u(3,i,j,k) )
+               du(3,2,i,j) = 0.0d+00
+               du(3,3,i,j) = 1.0d+00
+     >         + dt * 2.0d+00 * c34 * tmp1
+     >              * (  tx1 + ty1 * r43 + tz1 )
+     >         + dt * 2.0d+00 * (  tx1 * dx3
+     >                           + ty1 * dy3
+     >                           + tz1 * dz3 )
+               du(3,4,i,j) = 0.0d+00
+               du(3,5,i,j) = 0.0d+00
+
+               du(4,1,i,j) = dt * 2.0d+00
+     >           * ( - tx1 - ty1 - tz1 * r43 )
+     >           * ( c34 * tmp2 * u(4,i,j,k) )
+               du(4,2,i,j) = 0.0d+00
+               du(4,3,i,j) = 0.0d+00
+               du(4,4,i,j) = 1.0d+00
+     >         + dt * 2.0d+00 * c34 * tmp1
+     >              * (  tx1 + ty1 + tz1 * r43 )
+     >         + dt * 2.0d+00 * (  tx1 * dx4
+     >                           + ty1 * dy4
+     >                           + tz1 * dz4 )
+               du(4,5,i,j) = 0.0d+00
+
+               du(5,1,i,j) = -dt * 2.0d+00
+     >  * ( ( ( tx1 * ( r43*c34 - c1345 )
+     >     + ty1 * ( c34 - c1345 )
+     >     + tz1 * ( c34 - c1345 ) ) * ( u(2,i,j,k) ** 2 )
+     >   + ( tx1 * ( c34 - c1345 )
+     >     + ty1 * ( r43*c34 - c1345 )
+     >     + tz1 * ( c34 - c1345 ) ) * ( u(3,i,j,k) ** 2 )
+     >   + ( tx1 * ( c34 - c1345 )
+     >     + ty1 * ( c34 - c1345 )
+     >     + tz1 * ( r43*c34 - c1345 ) ) * ( u(4,i,j,k) ** 2 )
+     >      ) * tmp3
+     >   + ( tx1 + ty1 + tz1 ) * c1345 * tmp2 * u(5,i,j,k) )
+
+               du(5,2,i,j) = dt * 2.0d+00
+     > * ( tx1 * ( r43*c34 - c1345 )
+     >   + ty1 * (     c34 - c1345 )
+     >   + tz1 * (     c34 - c1345 ) ) * tmp2 * u(2,i,j,k)
+               du(5,3,i,j) = dt * 2.0d+00
+     > * ( tx1 * ( c34 - c1345 )
+     >   + ty1 * ( r43*c34 -c1345 )
+     >   + tz1 * ( c34 - c1345 ) ) * tmp2 * u(3,i,j,k)
+               du(5,4,i,j) = dt * 2.0d+00
+     > * ( tx1 * ( c34 - c1345 )
+     >   + ty1 * ( c34 - c1345 )
+     >   + tz1 * ( r43*c34 - c1345 ) ) * tmp2 * u(4,i,j,k)
+               du(5,5,i,j) = 1.0d+00
+     >   + dt * 2.0d+00 * ( tx1 + ty1 + tz1 ) * c1345 * tmp1
+     >   + dt * 2.0d+00 * (  tx1 * dx5
+     >                    +  ty1 * dy5
+     >                    +  tz1 * dz5 )
+
+c---------------------------------------------------------------------
+c   form the first block super-diagonal
+c---------------------------------------------------------------------
+               tmp1 = rho_i(i+1,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               au(1,1,i,j) = - dt * tx1 * dx1
+               au(1,2,i,j) =   dt * tx2
+               au(1,3,i,j) =   0.0d+00
+               au(1,4,i,j) =   0.0d+00
+               au(1,5,i,j) =   0.0d+00
+
+               au(2,1,i,j) =  dt * tx2
+     >          * ( - ( u(2,i+1,j,k) * tmp1 ) ** 2
+     >     + c2 * qs(i+1,j,k) * tmp1 )
+     >          - dt * tx1 * ( - r43 * c34 * tmp2 * u(2,i+1,j,k) )
+               au(2,2,i,j) =  dt * tx2
+     >          * ( ( 2.0d+00 - c2 ) * ( u(2,i+1,j,k) * tmp1 ) )
+     >          - dt * tx1 * ( r43 * c34 * tmp1 )
+     >          - dt * tx1 * dx2
+               au(2,3,i,j) =  dt * tx2
+     >              * ( - c2 * ( u(3,i+1,j,k) * tmp1 ) )
+               au(2,4,i,j) =  dt * tx2
+     >              * ( - c2 * ( u(4,i+1,j,k) * tmp1 ) )
+               au(2,5,i,j) =  dt * tx2 * c2 
+
+               au(3,1,i,j) =  dt * tx2
+     >              * ( - ( u(2,i+1,j,k) * u(3,i+1,j,k) ) * tmp2 )
+     >         - dt * tx1 * ( - c34 * tmp2 * u(3,i+1,j,k) )
+               au(3,2,i,j) =  dt * tx2 * ( u(3,i+1,j,k) * tmp1 )
+               au(3,3,i,j) =  dt * tx2 * ( u(2,i+1,j,k) * tmp1 )
+     >          - dt * tx1 * ( c34 * tmp1 )
+     >          - dt * tx1 * dx3
+               au(3,4,i,j) = 0.0d+00
+               au(3,5,i,j) = 0.0d+00
+
+               au(4,1,i,j) = dt * tx2
+     >          * ( - ( u(2,i+1,j,k)*u(4,i+1,j,k) ) * tmp2 )
+     >          - dt * tx1 * ( - c34 * tmp2 * u(4,i+1,j,k) )
+               au(4,2,i,j) = dt * tx2 * ( u(4,i+1,j,k) * tmp1 )
+               au(4,3,i,j) = 0.0d+00
+               au(4,4,i,j) = dt * tx2 * ( u(2,i+1,j,k) * tmp1 )
+     >          - dt * tx1 * ( c34 * tmp1 )
+     >          - dt * tx1 * dx4
+               au(4,5,i,j) = 0.0d+00
+
+               au(5,1,i,j) = dt * tx2
+     >          * ( ( c2 * 2.0d0 * qs(i+1,j,k)
+     >              - c1 * u(5,i+1,j,k) )
+     >          * ( u(2,i+1,j,k) * tmp2 ) )
+     >          - dt * tx1
+     >          * ( - ( r43*c34 - c1345 ) * tmp3 * ( u(2,i+1,j,k)**2 )
+     >              - (     c34 - c1345 ) * tmp3 * ( u(3,i+1,j,k)**2 )
+     >              - (     c34 - c1345 ) * tmp3 * ( u(4,i+1,j,k)**2 )
+     >              - c1345 * tmp2 * u(5,i+1,j,k) )
+               au(5,2,i,j) = dt * tx2
+     >          * ( c1 * ( u(5,i+1,j,k) * tmp1 )
+     >             - c2
+     >             * (  u(2,i+1,j,k)*u(2,i+1,j,k) * tmp2
+     >                  + qs(i+1,j,k) * tmp1 ) )
+     >           - dt * tx1
+     >           * ( r43*c34 - c1345 ) * tmp2 * u(2,i+1,j,k)
+               au(5,3,i,j) = dt * tx2
+     >           * ( - c2 * ( u(3,i+1,j,k)*u(2,i+1,j,k) ) * tmp2 )
+     >           - dt * tx1
+     >           * (  c34 - c1345 ) * tmp2 * u(3,i+1,j,k)
+               au(5,4,i,j) = dt * tx2
+     >           * ( - c2 * ( u(4,i+1,j,k)*u(2,i+1,j,k) ) * tmp2 )
+     >           - dt * tx1
+     >           * (  c34 - c1345 ) * tmp2 * u(4,i+1,j,k)
+               au(5,5,i,j) = dt * tx2
+     >           * ( c1 * ( u(2,i+1,j,k) * tmp1 ) )
+     >           - dt * tx1 * c1345 * tmp1
+     >           - dt * tx1 * dx5
+
+c---------------------------------------------------------------------
+c   form the second block super-diagonal
+c---------------------------------------------------------------------
+               tmp1 = rho_i(i,j+1,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               bu(1,1,i,j) = - dt * ty1 * dy1
+               bu(1,2,i,j) =   0.0d+00
+               bu(1,3,i,j) =  dt * ty2
+               bu(1,4,i,j) =   0.0d+00
+               bu(1,5,i,j) =   0.0d+00
+
+               bu(2,1,i,j) =  dt * ty2
+     >           * ( - ( u(2,i,j+1,k)*u(3,i,j+1,k) ) * tmp2 )
+     >           - dt * ty1 * ( - c34 * tmp2 * u(2,i,j+1,k) )
+               bu(2,2,i,j) =  dt * ty2 * ( u(3,i,j+1,k) * tmp1 )
+     >          - dt * ty1 * ( c34 * tmp1 )
+     >          - dt * ty1 * dy2
+               bu(2,3,i,j) =  dt * ty2 * ( u(2,i,j+1,k) * tmp1 )
+               bu(2,4,i,j) = 0.0d+00
+               bu(2,5,i,j) = 0.0d+00
+
+               bu(3,1,i,j) =  dt * ty2
+     >           * ( - ( u(3,i,j+1,k) * tmp1 ) ** 2
+     >      + c2 * ( qs(i,j+1,k) * tmp1 ) )
+     >       - dt * ty1 * ( - r43 * c34 * tmp2 * u(3,i,j+1,k) )
+               bu(3,2,i,j) =  dt * ty2
+     >                   * ( - c2 * ( u(2,i,j+1,k) * tmp1 ) )
+               bu(3,3,i,j) =  dt * ty2 * ( ( 2.0d+00 - c2 )
+     >                   * ( u(3,i,j+1,k) * tmp1 ) )
+     >       - dt * ty1 * ( r43 * c34 * tmp1 )
+     >       - dt * ty1 * dy3
+               bu(3,4,i,j) =  dt * ty2
+     >                   * ( - c2 * ( u(4,i,j+1,k) * tmp1 ) )
+               bu(3,5,i,j) =  dt * ty2 * c2
+
+               bu(4,1,i,j) =  dt * ty2
+     >              * ( - ( u(3,i,j+1,k)*u(4,i,j+1,k) ) * tmp2 )
+     >       - dt * ty1 * ( - c34 * tmp2 * u(4,i,j+1,k) )
+               bu(4,2,i,j) = 0.0d+00
+               bu(4,3,i,j) =  dt * ty2 * ( u(4,i,j+1,k) * tmp1 )
+               bu(4,4,i,j) =  dt * ty2 * ( u(3,i,j+1,k) * tmp1 )
+     >                        - dt * ty1 * ( c34 * tmp1 )
+     >                        - dt * ty1 * dy4
+               bu(4,5,i,j) = 0.0d+00
+
+               bu(5,1,i,j) =  dt * ty2
+     >          * ( ( c2 * 2.0d0 * qs(i,j+1,k)
+     >               - c1 * u(5,i,j+1,k) )
+     >          * ( u(3,i,j+1,k) * tmp2 ) )
+     >          - dt * ty1
+     >          * ( - (     c34 - c1345 )*tmp3*(u(2,i,j+1,k)**2)
+     >              - ( r43*c34 - c1345 )*tmp3*(u(3,i,j+1,k)**2)
+     >              - (     c34 - c1345 )*tmp3*(u(4,i,j+1,k)**2)
+     >              - c1345*tmp2*u(5,i,j+1,k) )
+               bu(5,2,i,j) =  dt * ty2
+     >          * ( - c2 * ( u(2,i,j+1,k)*u(3,i,j+1,k) ) * tmp2 )
+     >          - dt * ty1
+     >          * ( c34 - c1345 ) * tmp2 * u(2,i,j+1,k)
+               bu(5,3,i,j) =  dt * ty2
+     >          * ( c1 * ( u(5,i,j+1,k) * tmp1 )
+     >          - c2 
+     >          * ( qs(i,j+1,k) * tmp1
+     >               + u(3,i,j+1,k)*u(3,i,j+1,k) * tmp2 ) )
+     >          - dt * ty1
+     >          * ( r43*c34 - c1345 ) * tmp2 * u(3,i,j+1,k)
+               bu(5,4,i,j) =  dt * ty2
+     >          * ( - c2 * ( u(3,i,j+1,k)*u(4,i,j+1,k) ) * tmp2 )
+     >          - dt * ty1 * ( c34 - c1345 ) * tmp2 * u(4,i,j+1,k)
+               bu(5,5,i,j) =  dt * ty2
+     >          * ( c1 * ( u(3,i,j+1,k) * tmp1 ) )
+     >          - dt * ty1 * c1345 * tmp1
+     >          - dt * ty1 * dy5
+
+c---------------------------------------------------------------------
+c   form the third block super-diagonal
+c---------------------------------------------------------------------
+               tmp1 = rho_i(i,j,k+1)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               cu(1,1,i,j) = - dt * tz1 * dz1
+               cu(1,2,i,j) =   0.0d+00
+               cu(1,3,i,j) =   0.0d+00
+               cu(1,4,i,j) = dt * tz2
+               cu(1,5,i,j) =   0.0d+00
+
+               cu(2,1,i,j) = dt * tz2
+     >           * ( - ( u(2,i,j,k+1)*u(4,i,j,k+1) ) * tmp2 )
+     >           - dt * tz1 * ( - c34 * tmp2 * u(2,i,j,k+1) )
+               cu(2,2,i,j) = dt * tz2 * ( u(4,i,j,k+1) * tmp1 )
+     >           - dt * tz1 * c34 * tmp1
+     >           - dt * tz1 * dz2 
+               cu(2,3,i,j) = 0.0d+00
+               cu(2,4,i,j) = dt * tz2 * ( u(2,i,j,k+1) * tmp1 )
+               cu(2,5,i,j) = 0.0d+00
+
+               cu(3,1,i,j) = dt * tz2
+     >           * ( - ( u(3,i,j,k+1)*u(4,i,j,k+1) ) * tmp2 )
+     >           - dt * tz1 * ( - c34 * tmp2 * u(3,i,j,k+1) )
+               cu(3,2,i,j) = 0.0d+00
+               cu(3,3,i,j) = dt * tz2 * ( u(4,i,j,k+1) * tmp1 )
+     >           - dt * tz1 * ( c34 * tmp1 )
+     >           - dt * tz1 * dz3
+               cu(3,4,i,j) = dt * tz2 * ( u(3,i,j,k+1) * tmp1 )
+               cu(3,5,i,j) = 0.0d+00
+
+               cu(4,1,i,j) = dt * tz2
+     >        * ( - ( u(4,i,j,k+1) * tmp1 ) ** 2
+     >            + c2 * ( qs(i,j,k+1) * tmp1 ) )
+     >        - dt * tz1 * ( - r43 * c34 * tmp2 * u(4,i,j,k+1) )
+               cu(4,2,i,j) = dt * tz2
+     >             * ( - c2 * ( u(2,i,j,k+1) * tmp1 ) )
+               cu(4,3,i,j) = dt * tz2
+     >             * ( - c2 * ( u(3,i,j,k+1) * tmp1 ) )
+               cu(4,4,i,j) = dt * tz2 * ( 2.0d+00 - c2 )
+     >             * ( u(4,i,j,k+1) * tmp1 )
+     >             - dt * tz1 * ( r43 * c34 * tmp1 )
+     >             - dt * tz1 * dz4
+               cu(4,5,i,j) = dt * tz2 * c2
+
+               cu(5,1,i,j) = dt * tz2
+     >     * ( ( c2 * 2.0d0 * qs(i,j,k+1)
+     >       - c1 * u(5,i,j,k+1) )
+     >            * ( u(4,i,j,k+1) * tmp2 ) )
+     >       - dt * tz1
+     >       * ( - ( c34 - c1345 ) * tmp3 * (u(2,i,j,k+1)**2)
+     >           - ( c34 - c1345 ) * tmp3 * (u(3,i,j,k+1)**2)
+     >           - ( r43*c34 - c1345 )* tmp3 * (u(4,i,j,k+1)**2)
+     >          - c1345 * tmp2 * u(5,i,j,k+1) )
+               cu(5,2,i,j) = dt * tz2
+     >       * ( - c2 * ( u(2,i,j,k+1)*u(4,i,j,k+1) ) * tmp2 )
+     >       - dt * tz1 * ( c34 - c1345 ) * tmp2 * u(2,i,j,k+1)
+               cu(5,3,i,j) = dt * tz2
+     >       * ( - c2 * ( u(3,i,j,k+1)*u(4,i,j,k+1) ) * tmp2 )
+     >       - dt * tz1 * ( c34 - c1345 ) * tmp2 * u(3,i,j,k+1)
+               cu(5,4,i,j) = dt * tz2
+     >       * ( c1 * ( u(5,i,j,k+1) * tmp1 )
+     >       - c2
+     >       * ( qs(i,j,k+1) * tmp1
+     >            + u(4,i,j,k+1)*u(4,i,j,k+1) * tmp2 ) )
+     >       - dt * tz1 * ( r43*c34 - c1345 ) * tmp2 * u(4,i,j,k+1)
+               cu(5,5,i,j) = dt * tz2
+     >       * ( c1 * ( u(4,i,j,k+1) * tmp1 ) )
+     >       - dt * tz1 * c1345 * tmp1
+     >       - dt * tz1 * dz5
+
+            end do
+         end do
+!$omp end do nowait
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/l2norm.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/l2norm.f
new file mode 100644
index 0000000..a1c3108
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/l2norm.f
@@ -0,0 +1,69 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      subroutine l2norm ( ldx, ldy, ldz, 
+     >                    nx0, ny0, nz0,
+     >                    ist, iend, 
+     >                    jst, jend,
+     >                    v, sum )
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   to compute the l2-norm of vector v.
+c---------------------------------------------------------------------
+
+      implicit none
+
+
+c---------------------------------------------------------------------
+c  input parameters
+c---------------------------------------------------------------------
+      integer ldx, ldy, ldz
+      integer nx0, ny0, nz0
+      integer ist, iend
+      integer jst, jend
+c---------------------------------------------------------------------
+c   To improve cache performance, the second and third dimensions
+c   are padded by 1 when their sizes are even.  Only needed in v.
+c---------------------------------------------------------------------
+      double precision  v(5,ldx/2*2+1,ldy/2*2+1,*), sum(5)
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      double precision  sum_local(5)
+      integer i, j, k, m
+
+
+      do m = 1, 5
+         sum(m) = 0.0d+00
+      end do
+
+!$omp parallel default(shared) private(i,j,k,m,sum_local)
+      do m = 1, 5
+         sum_local(m) = 0.0d+00
+      end do
+!$omp do
+      do k = 2, nz0-1
+         do j = jst, jend
+            do i = ist, iend
+               do m = 1, 5
+                  sum_local(m) = sum_local(m) + v(m,i,j,k)*v(m,i,j,k)
+               end do
+            end do
+         end do
+      end do
+!$omp end do nowait
+      do m = 1, 5
+!$omp atomic
+         sum(m) = sum(m) + sum_local(m)
+      end do
+!$omp end parallel
+
+      do m = 1, 5
+         sum(m) = sqrt ( sum(m) / ( (nx0-2)*(ny0-2)*(nz0-2) ) )
+      end do
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/lu.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/lu.f
new file mode 100644
index 0000000..397d810
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/lu.f
@@ -0,0 +1,204 @@
+!-------------------------------------------------------------------------!
+!                                                                         !
+!        N  A  S     P A R A L L E L     B E N C H M A R K S  3.3         !
+!                                                                         !
+!                       O p e n M P     V E R S I O N                     !
+!                                                                         !
+!                                   L U                                   !
+!                                                                         !
+!-------------------------------------------------------------------------!
+!                                                                         !
+!    This benchmark is an OpenMP version of the NPB LU code.              !
+!    It is described in NAS Technical Report 99-011.                      !
+!                                                                         !
+!    Permission to use, copy, distribute and modify this software         !
+!    for any purpose with or without fee is hereby granted.  We           !
+!    request, however, that all derived work reference the NAS            !
+!    Parallel Benchmarks 3.3. This software is provided "as is"           !
+!    without express or implied warranty.                                 !
+!                                                                         !
+!    Information on NPB 3.3, including the technical report, the          !
+!    original specifications, source code, results and information        !
+!    on how to submit new results, is available at:                       !
+!                                                                         !
+!           http://www.nas.nasa.gov/Software/NPB/                         !
+!                                                                         !
+!    Send comments or suggestions to  npb@nas.nasa.gov                    !
+!                                                                         !
+!          NAS Parallel Benchmarks Group                                  !
+!          NASA Ames Research Center                                      !
+!          Mail Stop: T27A-1                                              !
+!          Moffett Field, CA   94035-1000                                 !
+!                                                                         !
+!          E-mail:  npb@nas.nasa.gov                                      !
+!          Fax:     (650) 604-3957                                        !
+!                                                                         !
+!-------------------------------------------------------------------------!
+
+c---------------------------------------------------------------------
+c
+c Authors: S. Weeratunga
+c          V. Venkatakrishnan
+c          E. Barszcz
+c          M. Yarrow
+c          H. Jin
+c
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+      program applu
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c
+c   driver for the performance evaluation of the solver for
+c   five coupled parabolic/elliptic partial differential equations.
+c
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'applu.incl'
+      character class
+      logical verified
+      double precision mflops
+
+      double precision t, tmax, timer_read, trecs(t_last)
+      external timer_read
+      integer i, fstatus
+      character t_names(t_last)*8
+
+c---------------------------------------------------------------------
+c     Setup info for timers
+c---------------------------------------------------------------------
+
+      open (unit=2,file='timer.flag',status='old',iostat=fstatus)
+      if (fstatus .eq. 0) then
+         timeron = .true.
+         t_names(t_total) = 'total'
+         t_names(t_rhsx) = 'rhsx'
+         t_names(t_rhsy) = 'rhsy'
+         t_names(t_rhsz) = 'rhsz'
+         t_names(t_rhs) = 'rhs'
+         t_names(t_jacld) = 'jacld'
+         t_names(t_blts) = 'blts'
+         t_names(t_jacu) = 'jacu'
+         t_names(t_buts) = 'buts'
+         t_names(t_add) = 'add'
+         t_names(t_l2norm) = 'l2norm'
+         close(2)
+      else
+         timeron = .false.
+      endif
+
+c---------------------------------------------------------------------
+c   read input data
+c---------------------------------------------------------------------
+      call read_input()
+
+#ifdef HOOKS
+      call roi_begin
+#endif
+
+c---------------------------------------------------------------------
+c   set up domain sizes
+c---------------------------------------------------------------------
+      call domain()
+
+c---------------------------------------------------------------------
+c   set up coefficients
+c---------------------------------------------------------------------
+      call setcoeff()
+
+c---------------------------------------------------------------------
+c   set the boundary values for dependent variables
+c---------------------------------------------------------------------
+      call setbv()
+
+c---------------------------------------------------------------------
+c   set the initial values for dependent variables
+c---------------------------------------------------------------------
+      call setiv()
+
+c---------------------------------------------------------------------
+c   compute the forcing term based on prescribed exact solution
+c---------------------------------------------------------------------
+      call erhs()
+
+c---------------------------------------------------------------------
+c   perform one SSOR iteration to touch all data pages
+c---------------------------------------------------------------------
+      call ssor(1)
+
+c---------------------------------------------------------------------
+c   reset the boundary and initial values
+c---------------------------------------------------------------------
+      call setbv()
+      call setiv()
+
+c---------------------------------------------------------------------
+c   perform the SSOR iterations
+c---------------------------------------------------------------------
+      call ssor(itmax)
+
+c---------------------------------------------------------------------
+c   compute the solution error
+c---------------------------------------------------------------------
+      call error()
+
+c---------------------------------------------------------------------
+c   compute the surface integral
+c---------------------------------------------------------------------
+      call pintgr()
+
+#ifdef HOOKS
+      call roi_end
+#endif
+
+c---------------------------------------------------------------------
+c   verification test
+c---------------------------------------------------------------------
+      call verify ( rsdnm, errnm, frc, class, verified )
+      mflops = float(itmax)*(1984.77*float( nx0 )
+     >     *float( ny0 )
+     >     *float( nz0 )
+     >     -10923.3*(float( nx0+ny0+nz0 )/3.)**2 
+     >     +27770.9* float( nx0+ny0+nz0 )/3.
+     >     -144010.)
+     >     / (maxtime*1000000.)
+
+      call print_results('LU', class, nx0,
+     >  ny0, nz0, itmax,
+     >  maxtime, mflops, '          floating point', verified, 
+     >  npbversion, compiletime, cs1, cs2, cs3, cs4, cs5, cs6, 
+     >  '(none)')
+
+c---------------------------------------------------------------------
+c      More timers
+c---------------------------------------------------------------------
+      if (.not.timeron) goto 999
+
+      do i=1, t_last
+         trecs(i) = timer_read(i)
+      end do
+      tmax = maxtime
+      if ( tmax .eq. 0. ) tmax = 1.0
+
+      write(*,800)
+ 800  format('  SECTION     Time (secs)')
+      do i=1, t_last
+         write(*,810) t_names(i), trecs(i), trecs(i)*100./tmax
+         if (i.eq.t_rhs) then
+            t = trecs(t_rhsx) + trecs(t_rhsy) + trecs(t_rhsz)
+            write(*,820) 'sub-rhs', t, t*100./tmax
+            t = trecs(i) - t
+            write(*,820) 'rest-rhs', t, t*100./tmax
+         endif
+ 810     format(2x,a8,':',f9.3,'  (',f6.2,'%)')
+ 820     format(5x,'--> ',a8,':',f9.3,'  (',f6.2,'%)')
+      end do
+
+ 999  continue
+      end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/pintgr.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/pintgr.f
new file mode 100644
index 0000000..e554886
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/pintgr.f
@@ -0,0 +1,195 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine pintgr
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j, k
+      integer ibeg, ifin, ifin1
+      integer jbeg, jfin, jfin1
+      double precision  phi1(0:isiz2+1,0:isiz3+1),
+     >                  phi2(0:isiz2+1,0:isiz3+1)
+      double precision  frc1, frc2, frc3
+
+
+
+c---------------------------------------------------------------------
+c   set up the sub-domains for integration in each processor
+c---------------------------------------------------------------------
+      ibeg = ii1
+      ifin = ii2
+      jbeg = ji1
+      jfin = ji2
+      ifin1 = ifin - 1
+      jfin1 = jfin - 1
+
+!$omp parallel default(shared) private(i,j,k)
+!$omp&  shared(ki1,ki2,ifin,ibeg,jfin,jbeg,ifin1,jfin1)
+
+!$omp do
+      do j = jbeg,jfin
+         do i = ibeg,ifin
+
+            k = ki1
+
+            phi1(i,j) = c2*(  u(5,i,j,k)
+     >           - 0.50d+00 * (  u(2,i,j,k) ** 2
+     >                         + u(3,i,j,k) ** 2
+     >                         + u(4,i,j,k) ** 2 )
+     >                        / u(1,i,j,k) )
+
+            k = ki2
+
+            phi2(i,j) = c2*(  u(5,i,j,k)
+     >           - 0.50d+00 * (  u(2,i,j,k) ** 2
+     >                         + u(3,i,j,k) ** 2
+     >                         + u(4,i,j,k) ** 2 )
+     >                        / u(1,i,j,k) )
+         end do
+      end do
+!$omp end do nowait
+
+
+!$omp single
+      frc1 = 0.0d+00
+!$omp end single
+
+!$omp do reduction(+:frc1)
+      do j = jbeg,jfin1
+         do i = ibeg, ifin1
+            frc1 = frc1 + (  phi1(i,j)
+     >                     + phi1(i+1,j)
+     >                     + phi1(i,j+1)
+     >                     + phi1(i+1,j+1)
+     >                     + phi2(i,j)
+     >                     + phi2(i+1,j)
+     >                     + phi2(i,j+1)
+     >                     + phi2(i+1,j+1) )
+         end do
+      end do
+!$omp end do
+
+
+!$omp single
+      frc1 = dxi * deta * frc1
+!$omp end single nowait
+
+
+!$omp do
+      do k = ki1, ki2
+         do i = ibeg, ifin
+            phi1(i,k) = c2*(  u(5,i,jbeg,k)
+     >           - 0.50d+00 * (  u(2,i,jbeg,k) ** 2
+     >                         + u(3,i,jbeg,k) ** 2
+     >                         + u(4,i,jbeg,k) ** 2 )
+     >                        / u(1,i,jbeg,k) )
+         end do
+      end do
+!$omp end do nowait
+
+!$omp do
+      do k = ki1, ki2
+         do i = ibeg, ifin
+            phi2(i,k) = c2*(  u(5,i,jfin,k)
+     >           - 0.50d+00 * (  u(2,i,jfin,k) ** 2
+     >                         + u(3,i,jfin,k) ** 2
+     >                         + u(4,i,jfin,k) ** 2 )
+     >                        / u(1,i,jfin,k) )
+         end do
+      end do
+!$omp end do nowait
+
+
+!$omp single
+      frc2 = 0.0d+00
+!$omp end single
+
+!$omp do reduction(+:frc2)
+      do k = ki1, ki2-1
+         do i = ibeg, ifin1
+            frc2 = frc2 + (  phi1(i,k)
+     >                     + phi1(i+1,k)
+     >                     + phi1(i,k+1)
+     >                     + phi1(i+1,k+1)
+     >                     + phi2(i,k)
+     >                     + phi2(i+1,k)
+     >                     + phi2(i,k+1)
+     >                     + phi2(i+1,k+1) )
+         end do
+      end do
+!$omp end do
+
+
+!$omp single
+      frc2 = dxi * dzeta * frc2
+!$omp end single nowait
+
+
+!$omp do
+      do k = ki1, ki2
+         do j = jbeg, jfin
+            phi1(j,k) = c2*(  u(5,ibeg,j,k)
+     >           - 0.50d+00 * (  u(2,ibeg,j,k) ** 2
+     >                         + u(3,ibeg,j,k) ** 2
+     >                         + u(4,ibeg,j,k) ** 2 )
+     >                        / u(1,ibeg,j,k) )
+         end do
+      end do
+!$omp end do nowait
+
+!$omp do
+      do k = ki1, ki2
+         do j = jbeg, jfin
+            phi2(j,k) = c2*(  u(5,ifin,j,k)
+     >           - 0.50d+00 * (  u(2,ifin,j,k) ** 2
+     >                         + u(3,ifin,j,k) ** 2
+     >                         + u(4,ifin,j,k) ** 2 )
+     >                        / u(1,ifin,j,k) )
+         end do
+      end do
+!$omp end do nowait
+
+
+!$omp single
+      frc3 = 0.0d+00
+!$omp end single
+
+!$omp do reduction(+:frc3)
+      do k = ki1, ki2-1
+         do j = jbeg, jfin1
+            frc3 = frc3 + (  phi1(j,k)
+     >                     + phi1(j+1,k)
+     >                     + phi1(j,k+1)
+     >                     + phi1(j+1,k+1)
+     >                     + phi2(j,k)
+     >                     + phi2(j+1,k)
+     >                     + phi2(j,k+1)
+     >                     + phi2(j+1,k+1) )
+         end do
+      end do
+!$omp end do
+
+
+!$omp single
+      frc3 = deta * dzeta * frc3
+!$omp end single nowait
+!$omp end parallel
+
+      frc = 0.25d+00 * ( frc1 + frc2 + frc3 )
+c      write (*,1001) frc
+
+      return
+
+c 1001 format (//5x,'surface integral = ',1pe12.5//)
+
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/read_input.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/read_input.f
new file mode 100644
index 0000000..a215550
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/read_input.f
@@ -0,0 +1,118 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine read_input
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'applu.incl'
+
+      integer  fstatus
+!$    integer  omp_get_max_threads
+!$    external omp_get_max_threads
+
+
+c---------------------------------------------------------------------
+c    if input file does not exist, it uses defaults
+c       ipr = 1 for detailed progress output
+c       inorm = how often the norm is printed (once every inorm iterations)
+c       itmax = number of pseudo time steps
+c       dt = time step
+c       omega = over-relaxation factor for SSOR
+c       tolrsd = steady state residual tolerance levels
+c       nx, ny, nz = number of grid points in x, y, z directions
+c---------------------------------------------------------------------
+
+         write(*, 1000)
+
+         open (unit=3,file='inputlu.data',status='old',
+     >         access='sequential',form='formatted', iostat=fstatus)
+         if (fstatus .eq. 0) then
+
+            write(*, *) 'Reading from input file inputlu.data'
+
+            read (3,*)
+            read (3,*)
+            read (3,*) ipr, inorm
+            read (3,*)
+            read (3,*)
+            read (3,*) itmax
+            read (3,*)
+            read (3,*)
+            read (3,*) dt
+            read (3,*)
+            read (3,*)
+            read (3,*) omega
+            read (3,*)
+            read (3,*)
+            read (3,*) tolrsd(1),tolrsd(2),tolrsd(3),tolrsd(4),tolrsd(5)
+            read (3,*)
+            read (3,*)
+            read (3,*) nx0, ny0, nz0
+            close(3)
+         else
+            ipr = ipr_default
+            inorm = inorm_default
+            itmax = itmax_default
+            dt = dt_default
+            omega = omega_default
+            tolrsd(1) = tolrsd1_def
+            tolrsd(2) = tolrsd2_def
+            tolrsd(3) = tolrsd3_def
+            tolrsd(4) = tolrsd4_def
+            tolrsd(5) = tolrsd5_def
+            nx0 = isiz1
+            ny0 = isiz2
+            nz0 = isiz3
+         endif
+
+c---------------------------------------------------------------------
+c   check problem size
+c---------------------------------------------------------------------
+
+         if ( ( nx0 .lt. 4 ) .or.
+     >        ( ny0 .lt. 4 ) .or.
+     >        ( nz0 .lt. 4 ) ) then
+
+            write (*,2001)
+ 2001       format (5x,'PROBLEM SIZE IS TOO SMALL - ',
+     >           /5x,'SET EACH OF NX, NY AND NZ AT LEAST EQUAL TO 5')
+            stop
+
+         end if
+
+         if ( ( nx0 .gt. isiz1 ) .or.
+     >        ( ny0 .gt. isiz2 ) .or.
+     >        ( nz0 .gt. isiz3 ) ) then
+
+            write (*,2002)
+ 2002       format (5x,'PROBLEM SIZE IS TOO LARGE - ',
+     >           /5x,'NX, NY AND NZ SHOULD NOT EXCEED ',
+     >           /5x,'ISIZ1, ISIZ2 AND ISIZ3 RESPECTIVELY')
+            stop
+
+         end if
+
+
+         write(*, 1001) nx0, ny0, nz0
+         write(*, 1002) itmax
+!$       write(*, 1003) omp_get_max_threads()
+         write(*, *)
+
+
+ 1000 format(//,' NAS Parallel Benchmarks (NPB3.3-OMP)',
+     >          ' - LU Benchmark', /)
+ 1001    format(' Size: ', i4, 'x', i4, 'x', i4)
+ 1002    format(' Iterations:                  ', i5)
+ 1003    format(' Number of available threads: ', i5)
+         
+
+
+      return
+      end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/rhs.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/rhs.f
new file mode 100644
index 0000000..f668ba2
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/rhs.f
@@ -0,0 +1,455 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine rhs
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   compute the right hand sides
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j, k, m
+      double precision  q
+      double precision  tmp, utmp(6,isiz3), rtmp(5,isiz3)
+      double precision  u21, u31, u41
+      double precision  u21i, u31i, u41i, u51i
+      double precision  u21j, u31j, u41j, u51j
+      double precision  u21k, u31k, u41k, u51k
+      double precision  u21im1, u31im1, u41im1, u51im1
+      double precision  u21jm1, u31jm1, u41jm1, u51jm1
+      double precision  u21km1, u31km1, u41km1, u51km1
+
+
+
+      if (timeron) call timer_start(t_rhs)
+!$omp parallel default(shared) private(i,j,k,m,q,flux,tmp,utmp,rtmp,
+!$omp& u51im1,u41im1,u31im1,u21im1,u51i,u41i,u31i,u21i,u21,
+!$omp& u51jm1,u41jm1,u31jm1,u21jm1,u51j,u41j,u31j,u21j,u31,
+!$omp& u51km1,u41km1,u31km1,u21km1,u51k,u41k,u31k,u21k,u41)
+!$omp do schedule(static)
+      do k = 1, nz
+         do j = 1, ny
+            do i = 1, nx
+               do m = 1, 5
+                  rsd(m,i,j,k) = - frct(m,i,j,k)
+               end do
+               tmp = 1.0d+00 / u(1,i,j,k)
+               rho_i(i,j,k) = tmp
+               qs(i,j,k) = 0.50d+00 * (  u(2,i,j,k) * u(2,i,j,k)
+     >                         + u(3,i,j,k) * u(3,i,j,k)
+     >                         + u(4,i,j,k) * u(4,i,j,k) )
+     >                      * tmp
+            end do
+         end do
+      end do
+!$omp end do
+
+!$omp master
+      if (timeron) call timer_start(t_rhsx)
+!$omp end master
+c---------------------------------------------------------------------
+c   xi-direction flux differences
+c---------------------------------------------------------------------
+
+!$omp do schedule(static)
+      do k = 2, nz - 1
+         do j = jst, jend
+            do i = 1, nx
+               flux(1,i) = u(2,i,j,k)
+               u21 = u(2,i,j,k) * rho_i(i,j,k)
+
+               q = qs(i,j,k)
+
+               flux(2,i) = u(2,i,j,k) * u21 + c2 * 
+     >                        ( u(5,i,j,k) - q )
+               flux(3,i) = u(3,i,j,k) * u21
+               flux(4,i) = u(4,i,j,k) * u21
+               flux(5,i) = ( c1 * u(5,i,j,k) - c2 * q ) * u21
+            end do
+
+            do i = ist, iend
+               do m = 1, 5
+                  rsd(m,i,j,k) =  rsd(m,i,j,k)
+     >                 - tx2 * ( flux(m,i+1) - flux(m,i-1) )
+               end do
+            end do
+
+            do i = ist, nx
+               tmp = rho_i(i,j,k)
+
+               u21i = tmp * u(2,i,j,k)
+               u31i = tmp * u(3,i,j,k)
+               u41i = tmp * u(4,i,j,k)
+               u51i = tmp * u(5,i,j,k)
+
+               tmp = rho_i(i-1,j,k)
+
+               u21im1 = tmp * u(2,i-1,j,k)
+               u31im1 = tmp * u(3,i-1,j,k)
+               u41im1 = tmp * u(4,i-1,j,k)
+               u51im1 = tmp * u(5,i-1,j,k)
+
+               flux(2,i) = (4.0d+00/3.0d+00) * tx3 * (u21i-u21im1)
+               flux(3,i) = tx3 * ( u31i - u31im1 )
+               flux(4,i) = tx3 * ( u41i - u41im1 )
+               flux(5,i) = 0.50d+00 * ( 1.0d+00 - c1*c5 )
+     >              * tx3 * ( ( u21i  **2 + u31i  **2 + u41i  **2 )
+     >                      - ( u21im1**2 + u31im1**2 + u41im1**2 ) )
+     >              + (1.0d+00/6.0d+00)
+     >              * tx3 * ( u21i**2 - u21im1**2 )
+     >              + c1 * c5 * tx3 * ( u51i - u51im1 )
+            end do
+
+            do i = ist, iend
+               rsd(1,i,j,k) = rsd(1,i,j,k)
+     >              + dx1 * tx1 * (            u(1,i-1,j,k)
+     >                             - 2.0d+00 * u(1,i,j,k)
+     >                             +           u(1,i+1,j,k) )
+               rsd(2,i,j,k) = rsd(2,i,j,k)
+     >          + tx3 * c3 * c4 * ( flux(2,i+1) - flux(2,i) )
+     >              + dx2 * tx1 * (            u(2,i-1,j,k)
+     >                             - 2.0d+00 * u(2,i,j,k)
+     >                             +           u(2,i+1,j,k) )
+               rsd(3,i,j,k) = rsd(3,i,j,k)
+     >          + tx3 * c3 * c4 * ( flux(3,i+1) - flux(3,i) )
+     >              + dx3 * tx1 * (            u(3,i-1,j,k)
+     >                             - 2.0d+00 * u(3,i,j,k)
+     >                             +           u(3,i+1,j,k) )
+               rsd(4,i,j,k) = rsd(4,i,j,k)
+     >          + tx3 * c3 * c4 * ( flux(4,i+1) - flux(4,i) )
+     >              + dx4 * tx1 * (            u(4,i-1,j,k)
+     >                             - 2.0d+00 * u(4,i,j,k)
+     >                             +           u(4,i+1,j,k) )
+               rsd(5,i,j,k) = rsd(5,i,j,k)
+     >          + tx3 * c3 * c4 * ( flux(5,i+1) - flux(5,i) )
+     >              + dx5 * tx1 * (            u(5,i-1,j,k)
+     >                             - 2.0d+00 * u(5,i,j,k)
+     >                             +           u(5,i+1,j,k) )
+            end do
+
+c---------------------------------------------------------------------
+c   Fourth-order dissipation
+c---------------------------------------------------------------------
+            do m = 1, 5
+               rsd(m,2,j,k) = rsd(m,2,j,k)
+     >           - dssp * ( + 5.0d+00 * u(m,2,j,k)
+     >                      - 4.0d+00 * u(m,3,j,k)
+     >                      +           u(m,4,j,k) )
+               rsd(m,3,j,k) = rsd(m,3,j,k)
+     >           - dssp * ( - 4.0d+00 * u(m,2,j,k)
+     >                      + 6.0d+00 * u(m,3,j,k)
+     >                      - 4.0d+00 * u(m,4,j,k)
+     >                      +           u(m,5,j,k) )
+            end do
+
+            do i = 4, nx - 3
+               do m = 1, 5
+                  rsd(m,i,j,k) = rsd(m,i,j,k)
+     >              - dssp * (            u(m,i-2,j,k)
+     >                        - 4.0d+00 * u(m,i-1,j,k)
+     >                        + 6.0d+00 * u(m,i,j,k)
+     >                        - 4.0d+00 * u(m,i+1,j,k)
+     >                        +           u(m,i+2,j,k) )
+               end do
+            end do
+
+
+            do m = 1, 5
+               rsd(m,nx-2,j,k) = rsd(m,nx-2,j,k)
+     >           - dssp * (             u(m,nx-4,j,k)
+     >                      - 4.0d+00 * u(m,nx-3,j,k)
+     >                      + 6.0d+00 * u(m,nx-2,j,k)
+     >                      - 4.0d+00 * u(m,nx-1,j,k)  )
+               rsd(m,nx-1,j,k) = rsd(m,nx-1,j,k)
+     >           - dssp * (             u(m,nx-3,j,k)
+     >                      - 4.0d+00 * u(m,nx-2,j,k)
+     >                      + 5.0d+00 * u(m,nx-1,j,k) )
+            end do
+
+         end do
+      end do
+!$omp end do nowait
+!$omp master
+      if (timeron) call timer_stop(t_rhsx)
+
+      if (timeron) call timer_start(t_rhsy)
+!$omp end master
+c---------------------------------------------------------------------
+c   eta-direction flux differences
+c---------------------------------------------------------------------
+!$omp do schedule(static)
+      do k = 2, nz - 1
+         do i = ist, iend
+            do j = 1, ny
+               flux(1,j) = u(3,i,j,k)
+               u31 = u(3,i,j,k) * rho_i(i,j,k)
+
+               q = qs(i,j,k)
+
+               flux(2,j) = u(2,i,j,k) * u31 
+               flux(3,j) = u(3,i,j,k) * u31 + c2 * (u(5,i,j,k)-q)
+               flux(4,j) = u(4,i,j,k) * u31
+               flux(5,j) = ( c1 * u(5,i,j,k) - c2 * q ) * u31
+            end do
+
+            do j = jst, jend
+               do m = 1, 5
+                  rsd(m,i,j,k) =  rsd(m,i,j,k)
+     >                   - ty2 * ( flux(m,j+1) - flux(m,j-1) )
+               end do
+            end do
+
+            do j = jst, ny
+               tmp = rho_i(i,j,k)
+
+               u21j = tmp * u(2,i,j,k)
+               u31j = tmp * u(3,i,j,k)
+               u41j = tmp * u(4,i,j,k)
+               u51j = tmp * u(5,i,j,k)
+
+               tmp = rho_i(i,j-1,k)
+               u21jm1 = tmp * u(2,i,j-1,k)
+               u31jm1 = tmp * u(3,i,j-1,k)
+               u41jm1 = tmp * u(4,i,j-1,k)
+               u51jm1 = tmp * u(5,i,j-1,k)
+
+               flux(2,j) = ty3 * ( u21j - u21jm1 )
+               flux(3,j) = (4.0d+00/3.0d+00) * ty3 * (u31j-u31jm1)
+               flux(4,j) = ty3 * ( u41j - u41jm1 )
+               flux(5,j) = 0.50d+00 * ( 1.0d+00 - c1*c5 )
+     >              * ty3 * ( ( u21j  **2 + u31j  **2 + u41j  **2 )
+     >                      - ( u21jm1**2 + u31jm1**2 + u41jm1**2 ) )
+     >              + (1.0d+00/6.0d+00)
+     >              * ty3 * ( u31j**2 - u31jm1**2 )
+     >              + c1 * c5 * ty3 * ( u51j - u51jm1 )
+            end do
+
+            do j = jst, jend
+
+               rsd(1,i,j,k) = rsd(1,i,j,k)
+     >              + dy1 * ty1 * (            u(1,i,j-1,k)
+     >                             - 2.0d+00 * u(1,i,j,k)
+     >                             +           u(1,i,j+1,k) )
+
+               rsd(2,i,j,k) = rsd(2,i,j,k)
+     >          + ty3 * c3 * c4 * ( flux(2,j+1) - flux(2,j) )
+     >              + dy2 * ty1 * (            u(2,i,j-1,k)
+     >                             - 2.0d+00 * u(2,i,j,k)
+     >                             +           u(2,i,j+1,k) )
+
+               rsd(3,i,j,k) = rsd(3,i,j,k)
+     >          + ty3 * c3 * c4 * ( flux(3,j+1) - flux(3,j) )
+     >              + dy3 * ty1 * (            u(3,i,j-1,k)
+     >                             - 2.0d+00 * u(3,i,j,k)
+     >                             +           u(3,i,j+1,k) )
+
+               rsd(4,i,j,k) = rsd(4,i,j,k)
+     >          + ty3 * c3 * c4 * ( flux(4,j+1) - flux(4,j) )
+     >              + dy4 * ty1 * (            u(4,i,j-1,k)
+     >                             - 2.0d+00 * u(4,i,j,k)
+     >                             +           u(4,i,j+1,k) )
+
+               rsd(5,i,j,k) = rsd(5,i,j,k)
+     >          + ty3 * c3 * c4 * ( flux(5,j+1) - flux(5,j) )
+     >              + dy5 * ty1 * (            u(5,i,j-1,k)
+     >                             - 2.0d+00 * u(5,i,j,k)
+     >                             +           u(5,i,j+1,k) )
+
+            end do
+         end do
+
+c---------------------------------------------------------------------
+c   fourth-order dissipation
+c---------------------------------------------------------------------
+         do i = ist, iend
+            do m = 1, 5
+               rsd(m,i,2,k) = rsd(m,i,2,k)
+     >           - dssp * ( + 5.0d+00 * u(m,i,2,k)
+     >                      - 4.0d+00 * u(m,i,3,k)
+     >                      +           u(m,i,4,k) )
+               rsd(m,i,3,k) = rsd(m,i,3,k)
+     >           - dssp * ( - 4.0d+00 * u(m,i,2,k)
+     >                      + 6.0d+00 * u(m,i,3,k)
+     >                      - 4.0d+00 * u(m,i,4,k)
+     >                      +           u(m,i,5,k) )
+            end do
+         end do
+
+         do j = 4, ny - 3
+            do i = ist, iend
+               do m = 1, 5
+                  rsd(m,i,j,k) = rsd(m,i,j,k)
+     >              - dssp * (            u(m,i,j-2,k)
+     >                        - 4.0d+00 * u(m,i,j-1,k)
+     >                        + 6.0d+00 * u(m,i,j,k)
+     >                        - 4.0d+00 * u(m,i,j+1,k)
+     >                        +           u(m,i,j+2,k) )
+               end do
+            end do
+         end do
+
+         do i = ist, iend
+            do m = 1, 5
+               rsd(m,i,ny-2,k) = rsd(m,i,ny-2,k)
+     >           - dssp * (             u(m,i,ny-4,k)
+     >                      - 4.0d+00 * u(m,i,ny-3,k)
+     >                      + 6.0d+00 * u(m,i,ny-2,k)
+     >                      - 4.0d+00 * u(m,i,ny-1,k)  )
+               rsd(m,i,ny-1,k) = rsd(m,i,ny-1,k)
+     >           - dssp * (             u(m,i,ny-3,k)
+     >                      - 4.0d+00 * u(m,i,ny-2,k)
+     >                      + 5.0d+00 * u(m,i,ny-1,k) )
+            end do
+         end do
+
+      end do
+!$omp end do
+!$omp master
+      if (timeron) call timer_stop(t_rhsy)
+
+      if (timeron) call timer_start(t_rhsz)
+!$omp end master
+c---------------------------------------------------------------------
+c   zeta-direction flux differences
+c---------------------------------------------------------------------
+!$omp do schedule(static)
+      do j = jst, jend
+         do i = ist, iend
+            do k = 1, nz
+               utmp(1,k) = u(1,i,j,k)
+               utmp(2,k) = u(2,i,j,k)
+               utmp(3,k) = u(3,i,j,k)
+               utmp(4,k) = u(4,i,j,k)
+               utmp(5,k) = u(5,i,j,k)
+               utmp(6,k) = rho_i(i,j,k)
+            end do
+            do k = 1, nz
+               flux(1,k) = utmp(4,k)
+               u41 = utmp(4,k) * utmp(6,k)
+
+               q = qs(i,j,k)
+
+               flux(2,k) = utmp(2,k) * u41 
+               flux(3,k) = utmp(3,k) * u41 
+               flux(4,k) = utmp(4,k) * u41 + c2 * (utmp(5,k)-q)
+               flux(5,k) = ( c1 * utmp(5,k) - c2 * q ) * u41
+            end do
+
+            do k = 2, nz - 1
+               do m = 1, 5
+                  rtmp(m,k) =  rsd(m,i,j,k)
+     >                - tz2 * ( flux(m,k+1) - flux(m,k-1) )
+               end do
+            end do
+
+            do k = 2, nz
+               tmp = utmp(6,k)
+
+               u21k = tmp * utmp(2,k)
+               u31k = tmp * utmp(3,k)
+               u41k = tmp * utmp(4,k)
+               u51k = tmp * utmp(5,k)
+
+               tmp = utmp(6,k-1)
+
+               u21km1 = tmp * utmp(2,k-1)
+               u31km1 = tmp * utmp(3,k-1)
+               u41km1 = tmp * utmp(4,k-1)
+               u51km1 = tmp * utmp(5,k-1)
+
+               flux(2,k) = tz3 * ( u21k - u21km1 )
+               flux(3,k) = tz3 * ( u31k - u31km1 )
+               flux(4,k) = (4.0d+00/3.0d+00) * tz3 * (u41k-u41km1)
+               flux(5,k) = 0.50d+00 * ( 1.0d+00 - c1*c5 )
+     >              * tz3 * ( ( u21k  **2 + u31k  **2 + u41k  **2 )
+     >                      - ( u21km1**2 + u31km1**2 + u41km1**2 ) )
+     >              + (1.0d+00/6.0d+00)
+     >              * tz3 * ( u41k**2 - u41km1**2 )
+     >              + c1 * c5 * tz3 * ( u51k - u51km1 )
+            end do
+
+            do k = 2, nz - 1
+               rtmp(1,k) = rtmp(1,k)
+     >              + dz1 * tz1 * (            utmp(1,k-1)
+     >                             - 2.0d+00 * utmp(1,k)
+     >                             +           utmp(1,k+1) )
+               rtmp(2,k) = rtmp(2,k)
+     >          + tz3 * c3 * c4 * ( flux(2,k+1) - flux(2,k) )
+     >              + dz2 * tz1 * (            utmp(2,k-1)
+     >                             - 2.0d+00 * utmp(2,k)
+     >                             +           utmp(2,k+1) )
+               rtmp(3,k) = rtmp(3,k)
+     >          + tz3 * c3 * c4 * ( flux(3,k+1) - flux(3,k) )
+     >              + dz3 * tz1 * (            utmp(3,k-1)
+     >                             - 2.0d+00 * utmp(3,k)
+     >                             +           utmp(3,k+1) )
+               rtmp(4,k) = rtmp(4,k)
+     >          + tz3 * c3 * c4 * ( flux(4,k+1) - flux(4,k) )
+     >              + dz4 * tz1 * (            utmp(4,k-1)
+     >                             - 2.0d+00 * utmp(4,k)
+     >                             +           utmp(4,k+1) )
+               rtmp(5,k) = rtmp(5,k)
+     >          + tz3 * c3 * c4 * ( flux(5,k+1) - flux(5,k) )
+     >              + dz5 * tz1 * (            utmp(5,k-1)
+     >                             - 2.0d+00 * utmp(5,k)
+     >                             +           utmp(5,k+1) )
+            end do
+
+c---------------------------------------------------------------------
+c   fourth-order dissipation
+c---------------------------------------------------------------------
+            do m = 1, 5
+               rsd(m,i,j,2) = rtmp(m,2)
+     >           - dssp * ( + 5.0d+00 * utmp(m,2)
+     >                      - 4.0d+00 * utmp(m,3)
+     >                      +           utmp(m,4) )
+               rsd(m,i,j,3) = rtmp(m,3)
+     >           - dssp * ( - 4.0d+00 * utmp(m,2)
+     >                      + 6.0d+00 * utmp(m,3)
+     >                      - 4.0d+00 * utmp(m,4)
+     >                      +           utmp(m,5) )
+            end do
+
+            do k = 4, nz - 3
+               do m = 1, 5
+                  rsd(m,i,j,k) = rtmp(m,k)
+     >              - dssp * (            utmp(m,k-2)
+     >                        - 4.0d+00 * utmp(m,k-1)
+     >                        + 6.0d+00 * utmp(m,k)
+     >                        - 4.0d+00 * utmp(m,k+1)
+     >                        +           utmp(m,k+2) )
+               end do
+            end do
+
+            do m = 1, 5
+               rsd(m,i,j,nz-2) = rtmp(m,nz-2)
+     >           - dssp * (             utmp(m,nz-4)
+     >                      - 4.0d+00 * utmp(m,nz-3)
+     >                      + 6.0d+00 * utmp(m,nz-2)
+     >                      - 4.0d+00 * utmp(m,nz-1)  )
+               rsd(m,i,j,nz-1) = rtmp(m,nz-1)
+     >           - dssp * (             utmp(m,nz-3)
+     >                      - 4.0d+00 * utmp(m,nz-2)
+     >                      + 5.0d+00 * utmp(m,nz-1) )
+            end do
+         end do
+      end do
+!$omp end do nowait
+!$omp master
+      if (timeron) call timer_stop(t_rhsz)
+!$omp end master
+!$omp end parallel
+      if (timeron) call timer_stop(t_rhs)
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/rhs_vec.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/rhs_vec.f
new file mode 100644
index 0000000..bd6f113
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/rhs_vec.f
@@ -0,0 +1,459 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine rhs
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   compute the right hand sides
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j, k, m
+      double precision  q
+      double precision  tmp
+      double precision  u21, u31, u41
+      double precision  u21i, u31i, u41i, u51i
+      double precision  u21j, u31j, u41j, u51j
+      double precision  u21k, u31k, u41k, u51k
+      double precision  u21im1, u31im1, u41im1, u51im1
+      double precision  u21jm1, u31jm1, u41jm1, u51jm1
+      double precision  u21km1, u31km1, u41km1, u51km1
+
+
+
+      if (timeron) call timer_start(t_rhs)
+!$omp parallel default(shared) private(i,j,k,m,q,flux,tmp,
+!$omp& u51im1,u41im1,u31im1,u21im1,u51i,u41i,u31i,u21i,u21,
+!$omp& u51jm1,u41jm1,u31jm1,u21jm1,u51j,u41j,u31j,u21j,u31,
+!$omp& u51km1,u41km1,u31km1,u21km1,u51k,u41k,u31k,u21k,u41)
+!$omp do schedule(static)
+      do k = 1, nz
+         do j = 1, ny
+            do i = 1, nx
+               do m = 1, 5
+                  rsd(m,i,j,k) = - frct(m,i,j,k)
+               end do
+               tmp = 1.0d+00 / u(1,i,j,k)
+               rho_i(i,j,k) = tmp
+               qs(i,j,k) = 0.50d+00 * (  u(2,i,j,k) * u(2,i,j,k)
+     >                         + u(3,i,j,k) * u(3,i,j,k)
+     >                         + u(4,i,j,k) * u(4,i,j,k) )
+     >                      * tmp
+            end do
+         end do
+      end do
+!$omp end do
+
+!$omp master
+      if (timeron) call timer_start(t_rhsx)
+!$omp end master
+c---------------------------------------------------------------------
+c   xi-direction flux differences
+c---------------------------------------------------------------------
+
+!$omp do schedule(static)
+      do k = 2, nz - 1
+         do j = jst, jend
+            do i = 1, nx
+               flux(1,i) = u(2,i,j,k)
+               u21 = u(2,i,j,k) * rho_i(i,j,k)
+
+               q = qs(i,j,k)
+
+               flux(2,i) = u(2,i,j,k) * u21 + c2 * 
+     >                        ( u(5,i,j,k) - q )
+               flux(3,i) = u(3,i,j,k) * u21
+               flux(4,i) = u(4,i,j,k) * u21
+               flux(5,i) = ( c1 * u(5,i,j,k) - c2 * q ) * u21
+            end do
+
+            do i = ist, iend
+               do m = 1, 5
+                  rsd(m,i,j,k) =  rsd(m,i,j,k)
+     >                 - tx2 * ( flux(m,i+1) - flux(m,i-1) )
+               end do
+            end do
+
+            do i = ist, nx
+               tmp = rho_i(i,j,k)
+
+               u21i = tmp * u(2,i,j,k)
+               u31i = tmp * u(3,i,j,k)
+               u41i = tmp * u(4,i,j,k)
+               u51i = tmp * u(5,i,j,k)
+
+               tmp = rho_i(i-1,j,k)
+
+               u21im1 = tmp * u(2,i-1,j,k)
+               u31im1 = tmp * u(3,i-1,j,k)
+               u41im1 = tmp * u(4,i-1,j,k)
+               u51im1 = tmp * u(5,i-1,j,k)
+
+               flux(2,i) = (4.0d+00/3.0d+00) * tx3 * (u21i-u21im1)
+               flux(3,i) = tx3 * ( u31i - u31im1 )
+               flux(4,i) = tx3 * ( u41i - u41im1 )
+               flux(5,i) = 0.50d+00 * ( 1.0d+00 - c1*c5 )
+     >              * tx3 * ( ( u21i  **2 + u31i  **2 + u41i  **2 )
+     >                      - ( u21im1**2 + u31im1**2 + u41im1**2 ) )
+     >              + (1.0d+00/6.0d+00)
+     >              * tx3 * ( u21i**2 - u21im1**2 )
+     >              + c1 * c5 * tx3 * ( u51i - u51im1 )
+            end do
+
+            do i = ist, iend
+               rsd(1,i,j,k) = rsd(1,i,j,k)
+     >              + dx1 * tx1 * (            u(1,i-1,j,k)
+     >                             - 2.0d+00 * u(1,i,j,k)
+     >                             +           u(1,i+1,j,k) )
+               rsd(2,i,j,k) = rsd(2,i,j,k)
+     >          + tx3 * c3 * c4 * ( flux(2,i+1) - flux(2,i) )
+     >              + dx2 * tx1 * (            u(2,i-1,j,k)
+     >                             - 2.0d+00 * u(2,i,j,k)
+     >                             +           u(2,i+1,j,k) )
+               rsd(3,i,j,k) = rsd(3,i,j,k)
+     >          + tx3 * c3 * c4 * ( flux(3,i+1) - flux(3,i) )
+     >              + dx3 * tx1 * (            u(3,i-1,j,k)
+     >                             - 2.0d+00 * u(3,i,j,k)
+     >                             +           u(3,i+1,j,k) )
+               rsd(4,i,j,k) = rsd(4,i,j,k)
+     >          + tx3 * c3 * c4 * ( flux(4,i+1) - flux(4,i) )
+     >              + dx4 * tx1 * (            u(4,i-1,j,k)
+     >                             - 2.0d+00 * u(4,i,j,k)
+     >                             +           u(4,i+1,j,k) )
+               rsd(5,i,j,k) = rsd(5,i,j,k)
+     >          + tx3 * c3 * c4 * ( flux(5,i+1) - flux(5,i) )
+     >              + dx5 * tx1 * (            u(5,i-1,j,k)
+     >                             - 2.0d+00 * u(5,i,j,k)
+     >                             +           u(5,i+1,j,k) )
+            end do
+         end do
+
+c---------------------------------------------------------------------
+c   Fourth-order dissipation
+c---------------------------------------------------------------------
+         do j = jst, jend
+            do m = 1, 5
+               rsd(m,2,j,k) = rsd(m,2,j,k)
+     >           - dssp * ( + 5.0d+00 * u(m,2,j,k)
+     >                      - 4.0d+00 * u(m,3,j,k)
+     >                      +           u(m,4,j,k) )
+               rsd(m,3,j,k) = rsd(m,3,j,k)
+     >           - dssp * ( - 4.0d+00 * u(m,2,j,k)
+     >                      + 6.0d+00 * u(m,3,j,k)
+     >                      - 4.0d+00 * u(m,4,j,k)
+     >                      +           u(m,5,j,k) )
+            end do
+         end do
+
+         do j = jst, jend
+            do i = 4, nx - 3
+               do m = 1, 5
+                  rsd(m,i,j,k) = rsd(m,i,j,k)
+     >              - dssp * (            u(m,i-2,j,k)
+     >                        - 4.0d+00 * u(m,i-1,j,k)
+     >                        + 6.0d+00 * u(m,i,j,k)
+     >                        - 4.0d+00 * u(m,i+1,j,k)
+     >                        +           u(m,i+2,j,k) )
+               end do
+            end do
+         end do
+
+
+         do j = jst, jend
+            do m = 1, 5
+               rsd(m,nx-2,j,k) = rsd(m,nx-2,j,k)
+     >           - dssp * (             u(m,nx-4,j,k)
+     >                      - 4.0d+00 * u(m,nx-3,j,k)
+     >                      + 6.0d+00 * u(m,nx-2,j,k)
+     >                      - 4.0d+00 * u(m,nx-1,j,k)  )
+               rsd(m,nx-1,j,k) = rsd(m,nx-1,j,k)
+     >           - dssp * (             u(m,nx-3,j,k)
+     >                      - 4.0d+00 * u(m,nx-2,j,k)
+     >                      + 5.0d+00 * u(m,nx-1,j,k) )
+            end do
+         end do
+
+      end do
+!$omp end do nowait
+!$omp master
+      if (timeron) call timer_stop(t_rhsx)
+
+      if (timeron) call timer_start(t_rhsy)
+!$omp end master
+c---------------------------------------------------------------------
+c   eta-direction flux differences
+c---------------------------------------------------------------------
+!$omp do schedule(static)
+      do k = 2, nz - 1
+         do i = ist, iend
+            do j = 1, ny
+               flux(1,j) = u(3,i,j,k)
+               u31 = u(3,i,j,k) * rho_i(i,j,k)
+
+               q = qs(i,j,k)
+
+               flux(2,j) = u(2,i,j,k) * u31 
+               flux(3,j) = u(3,i,j,k) * u31 + c2 * (u(5,i,j,k)-q)
+               flux(4,j) = u(4,i,j,k) * u31
+               flux(5,j) = ( c1 * u(5,i,j,k) - c2 * q ) * u31
+            end do
+
+            do j = jst, jend
+               do m = 1, 5
+                  rsd(m,i,j,k) =  rsd(m,i,j,k)
+     >                   - ty2 * ( flux(m,j+1) - flux(m,j-1) )
+               end do
+            end do
+
+            do j = jst, ny
+               tmp = rho_i(i,j,k)
+
+               u21j = tmp * u(2,i,j,k)
+               u31j = tmp * u(3,i,j,k)
+               u41j = tmp * u(4,i,j,k)
+               u51j = tmp * u(5,i,j,k)
+
+               tmp = rho_i(i,j-1,k)
+               u21jm1 = tmp * u(2,i,j-1,k)
+               u31jm1 = tmp * u(3,i,j-1,k)
+               u41jm1 = tmp * u(4,i,j-1,k)
+               u51jm1 = tmp * u(5,i,j-1,k)
+
+               flux(2,j) = ty3 * ( u21j - u21jm1 )
+               flux(3,j) = (4.0d+00/3.0d+00) * ty3 * (u31j-u31jm1)
+               flux(4,j) = ty3 * ( u41j - u41jm1 )
+               flux(5,j) = 0.50d+00 * ( 1.0d+00 - c1*c5 )
+     >              * ty3 * ( ( u21j  **2 + u31j  **2 + u41j  **2 )
+     >                      - ( u21jm1**2 + u31jm1**2 + u41jm1**2 ) )
+     >              + (1.0d+00/6.0d+00)
+     >              * ty3 * ( u31j**2 - u31jm1**2 )
+     >              + c1 * c5 * ty3 * ( u51j - u51jm1 )
+            end do
+
+            do j = jst, jend
+
+               rsd(1,i,j,k) = rsd(1,i,j,k)
+     >              + dy1 * ty1 * (            u(1,i,j-1,k)
+     >                             - 2.0d+00 * u(1,i,j,k)
+     >                             +           u(1,i,j+1,k) )
+
+               rsd(2,i,j,k) = rsd(2,i,j,k)
+     >          + ty3 * c3 * c4 * ( flux(2,j+1) - flux(2,j) )
+     >              + dy2 * ty1 * (            u(2,i,j-1,k)
+     >                             - 2.0d+00 * u(2,i,j,k)
+     >                             +           u(2,i,j+1,k) )
+
+               rsd(3,i,j,k) = rsd(3,i,j,k)
+     >          + ty3 * c3 * c4 * ( flux(3,j+1) - flux(3,j) )
+     >              + dy3 * ty1 * (            u(3,i,j-1,k)
+     >                             - 2.0d+00 * u(3,i,j,k)
+     >                             +           u(3,i,j+1,k) )
+
+               rsd(4,i,j,k) = rsd(4,i,j,k)
+     >          + ty3 * c3 * c4 * ( flux(4,j+1) - flux(4,j) )
+     >              + dy4 * ty1 * (            u(4,i,j-1,k)
+     >                             - 2.0d+00 * u(4,i,j,k)
+     >                             +           u(4,i,j+1,k) )
+
+               rsd(5,i,j,k) = rsd(5,i,j,k)
+     >          + ty3 * c3 * c4 * ( flux(5,j+1) - flux(5,j) )
+     >              + dy5 * ty1 * (            u(5,i,j-1,k)
+     >                             - 2.0d+00 * u(5,i,j,k)
+     >                             +           u(5,i,j+1,k) )
+
+            end do
+         end do
+
+c---------------------------------------------------------------------
+c   fourth-order dissipation
+c---------------------------------------------------------------------
+         do i = ist, iend
+            do m = 1, 5
+               rsd(m,i,2,k) = rsd(m,i,2,k)
+     >           - dssp * ( + 5.0d+00 * u(m,i,2,k)
+     >                      - 4.0d+00 * u(m,i,3,k)
+     >                      +           u(m,i,4,k) )
+               rsd(m,i,3,k) = rsd(m,i,3,k)
+     >           - dssp * ( - 4.0d+00 * u(m,i,2,k)
+     >                      + 6.0d+00 * u(m,i,3,k)
+     >                      - 4.0d+00 * u(m,i,4,k)
+     >                      +           u(m,i,5,k) )
+            end do
+         end do
+
+         do j = 4, ny - 3
+            do i = ist, iend
+               do m = 1, 5
+                  rsd(m,i,j,k) = rsd(m,i,j,k)
+     >              - dssp * (            u(m,i,j-2,k)
+     >                        - 4.0d+00 * u(m,i,j-1,k)
+     >                        + 6.0d+00 * u(m,i,j,k)
+     >                        - 4.0d+00 * u(m,i,j+1,k)
+     >                        +           u(m,i,j+2,k) )
+               end do
+            end do
+         end do
+
+         do i = ist, iend
+            do m = 1, 5
+               rsd(m,i,ny-2,k) = rsd(m,i,ny-2,k)
+     >           - dssp * (             u(m,i,ny-4,k)
+     >                      - 4.0d+00 * u(m,i,ny-3,k)
+     >                      + 6.0d+00 * u(m,i,ny-2,k)
+     >                      - 4.0d+00 * u(m,i,ny-1,k)  )
+               rsd(m,i,ny-1,k) = rsd(m,i,ny-1,k)
+     >           - dssp * (             u(m,i,ny-3,k)
+     >                      - 4.0d+00 * u(m,i,ny-2,k)
+     >                      + 5.0d+00 * u(m,i,ny-1,k) )
+            end do
+         end do
+
+      end do
+!$omp end do
+!$omp master
+      if (timeron) call timer_stop(t_rhsy)
+
+      if (timeron) call timer_start(t_rhsz)
+!$omp end master
+c---------------------------------------------------------------------
+c   zeta-direction flux differences
+c---------------------------------------------------------------------
+!$omp do schedule(static)
+      do j = jst, jend
+         do i = ist, iend
+            do k = 1, nz
+               flux(1,k) = u(4,i,j,k)
+               u41 = u(4,i,j,k) * rho_i(i,j,k)
+
+               q = qs(i,j,k)
+
+               flux(2,k) = u(2,i,j,k) * u41 
+               flux(3,k) = u(3,i,j,k) * u41 
+               flux(4,k) = u(4,i,j,k) * u41 + c2 * (u(5,i,j,k)-q)
+               flux(5,k) = ( c1 * u(5,i,j,k) - c2 * q ) * u41
+            end do
+
+            do k = 2, nz - 1
+               do m = 1, 5
+                  rsd(m,i,j,k) =  rsd(m,i,j,k)
+     >                - tz2 * ( flux(m,k+1) - flux(m,k-1) )
+               end do
+            end do
+
+            do k = 2, nz
+               tmp = rho_i(i,j,k)
+
+               u21k = tmp * u(2,i,j,k)
+               u31k = tmp * u(3,i,j,k)
+               u41k = tmp * u(4,i,j,k)
+               u51k = tmp * u(5,i,j,k)
+
+               tmp = rho_i(i,j,k-1)
+
+               u21km1 = tmp * u(2,i,j,k-1)
+               u31km1 = tmp * u(3,i,j,k-1)
+               u41km1 = tmp * u(4,i,j,k-1)
+               u51km1 = tmp * u(5,i,j,k-1)
+
+               flux(2,k) = tz3 * ( u21k - u21km1 )
+               flux(3,k) = tz3 * ( u31k - u31km1 )
+               flux(4,k) = (4.0d+00/3.0d+00) * tz3 * (u41k-u41km1)
+               flux(5,k) = 0.50d+00 * ( 1.0d+00 - c1*c5 )
+     >              * tz3 * ( ( u21k  **2 + u31k  **2 + u41k  **2 )
+     >                      - ( u21km1**2 + u31km1**2 + u41km1**2 ) )
+     >              + (1.0d+00/6.0d+00)
+     >              * tz3 * ( u41k**2 - u41km1**2 )
+     >              + c1 * c5 * tz3 * ( u51k - u51km1 )
+            end do
+
+            do k = 2, nz - 1
+               rsd(1,i,j,k) = rsd(1,i,j,k)
+     >              + dz1 * tz1 * (            u(1,i,j,k-1)
+     >                             - 2.0d+00 * u(1,i,j,k)
+     >                             +           u(1,i,j,k+1) )
+               rsd(2,i,j,k) = rsd(2,i,j,k)
+     >          + tz3 * c3 * c4 * ( flux(2,k+1) - flux(2,k) )
+     >              + dz2 * tz1 * (            u(2,i,j,k-1)
+     >                             - 2.0d+00 * u(2,i,j,k)
+     >                             +           u(2,i,j,k+1) )
+               rsd(3,i,j,k) = rsd(3,i,j,k)
+     >          + tz3 * c3 * c4 * ( flux(3,k+1) - flux(3,k) )
+     >              + dz3 * tz1 * (            u(3,i,j,k-1)
+     >                             - 2.0d+00 * u(3,i,j,k)
+     >                             +           u(3,i,j,k+1) )
+               rsd(4,i,j,k) = rsd(4,i,j,k)
+     >          + tz3 * c3 * c4 * ( flux(4,k+1) - flux(4,k) )
+     >              + dz4 * tz1 * (            u(4,i,j,k-1)
+     >                             - 2.0d+00 * u(4,i,j,k)
+     >                             +           u(4,i,j,k+1) )
+               rsd(5,i,j,k) = rsd(5,i,j,k)
+     >          + tz3 * c3 * c4 * ( flux(5,k+1) - flux(5,k) )
+     >              + dz5 * tz1 * (            u(5,i,j,k-1)
+     >                             - 2.0d+00 * u(5,i,j,k)
+     >                             +           u(5,i,j,k+1) )
+            end do
+         end do
+
+c---------------------------------------------------------------------
+c   fourth-order dissipation
+c---------------------------------------------------------------------
+         do i = ist, iend
+            do m = 1, 5
+               rsd(m,i,j,2) = rsd(m,i,j,2)
+     >           - dssp * ( + 5.0d+00 * u(m,i,j,2)
+     >                      - 4.0d+00 * u(m,i,j,3)
+     >                      +           u(m,i,j,4) )
+               rsd(m,i,j,3) = rsd(m,i,j,3)
+     >           - dssp * ( - 4.0d+00 * u(m,i,j,2)
+     >                      + 6.0d+00 * u(m,i,j,3)
+     >                      - 4.0d+00 * u(m,i,j,4)
+     >                      +           u(m,i,j,5) )
+            end do
+         end do
+
+         do k = 4, nz - 3
+            do i = ist, iend
+               do m = 1, 5
+                  rsd(m,i,j,k) = rsd(m,i,j,k)
+     >              - dssp * (            u(m,i,j,k-2)
+     >                        - 4.0d+00 * u(m,i,j,k-1)
+     >                        + 6.0d+00 * u(m,i,j,k)
+     >                        - 4.0d+00 * u(m,i,j,k+1)
+     >                        +           u(m,i,j,k+2) )
+               end do
+            end do
+         end do
+
+         do i = ist, iend
+            do m = 1, 5
+               rsd(m,i,j,nz-2) = rsd(m,i,j,nz-2)
+     >           - dssp * (             u(m,i,j,nz-4)
+     >                      - 4.0d+00 * u(m,i,j,nz-3)
+     >                      + 6.0d+00 * u(m,i,j,nz-2)
+     >                      - 4.0d+00 * u(m,i,j,nz-1)  )
+               rsd(m,i,j,nz-1) = rsd(m,i,j,nz-1)
+     >           - dssp * (             u(m,i,j,nz-3)
+     >                      - 4.0d+00 * u(m,i,j,nz-2)
+     >                      + 5.0d+00 * u(m,i,j,nz-1) )
+            end do
+         end do
+      end do
+!$omp end do nowait
+!$omp master
+      if (timeron) call timer_stop(t_rhsz)
+!$omp end master
+!$omp end parallel
+      if (timeron) call timer_stop(t_rhs)
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/setbv.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/setbv.f
new file mode 100644
index 0000000..7957a16
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/setbv.f
@@ -0,0 +1,76 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine setbv
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   set the boundary values of dependent variables
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c   local variables
+c---------------------------------------------------------------------
+      integer i, j, k, m
+      double precision temp1(5), temp2(5)
+
+c---------------------------------------------------------------------
+c   set the dependent variable values along the top and bottom faces
+c---------------------------------------------------------------------
+!$omp parallel default(shared) private(i,j,k,m,temp1,temp2)
+!$omp& shared(nx,ny,nz)
+!$omp do schedule(static)
+      do j = 1, ny
+         do i = 1, nx
+            call exact( i, j, 1, temp1 )
+            call exact( i, j, nz, temp2 )
+            do m = 1, 5
+               u( m, i, j, 1 ) = temp1(m)
+               u( m, i, j, nz ) = temp2(m)
+            end do
+         end do
+      end do
+!$omp end do
+
+c---------------------------------------------------------------------
+c   set the dependent variable values along north and south faces
+c---------------------------------------------------------------------
+!$omp do schedule(static)
+      do k = 1, nz
+         do i = 1, nx
+            call exact( i, 1, k, temp1 )
+            call exact( i, ny, k, temp2 )
+            do m = 1, 5
+               u( m, i, 1, k ) = temp1(m)
+               u( m, i, ny, k ) = temp2(m)
+            end do
+         end do
+      end do
+!$omp end do nowait
+
+c---------------------------------------------------------------------
+c   set the dependent variable values along east and west faces
+c---------------------------------------------------------------------
+!$omp do schedule(static)
+      do k = 1, nz
+         do j = 1, ny
+            call exact( 1, j, k, temp1 )
+            call exact( nx, j, k, temp2 )
+            do m = 1, 5
+               u( m, 1, j, k ) = temp1(m)
+               u( m, nx, j, k ) = temp2(m)
+            end do
+         end do
+      end do
+!$omp end do nowait
+!$omp end parallel
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/setcoeff.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/setcoeff.f
new file mode 100644
index 0000000..a1fb473
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/setcoeff.f
@@ -0,0 +1,152 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine setcoeff
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+
+
+c---------------------------------------------------------------------
+c   set up coefficients
+c---------------------------------------------------------------------
+      dxi = 1.0d+00 / ( nx0 - 1 )
+      deta = 1.0d+00 / ( ny0 - 1 )
+      dzeta = 1.0d+00 / ( nz0 - 1 )
+
+      tx1 = 1.0d+00 / ( dxi * dxi )
+      tx2 = 1.0d+00 / ( 2.0d+00 * dxi )
+      tx3 = 1.0d+00 / dxi
+
+      ty1 = 1.0d+00 / ( deta * deta )
+      ty2 = 1.0d+00 / ( 2.0d+00 * deta )
+      ty3 = 1.0d+00 / deta
+
+      tz1 = 1.0d+00 / ( dzeta * dzeta )
+      tz2 = 1.0d+00 / ( 2.0d+00 * dzeta )
+      tz3 = 1.0d+00 / dzeta
+
+c---------------------------------------------------------------------
+c   diffusion coefficients
+c---------------------------------------------------------------------
+      dx1 = 0.75d+00
+      dx2 = dx1
+      dx3 = dx1
+      dx4 = dx1
+      dx5 = dx1
+
+      dy1 = 0.75d+00
+      dy2 = dy1
+      dy3 = dy1
+      dy4 = dy1
+      dy5 = dy1
+
+      dz1 = 1.00d+00
+      dz2 = dz1
+      dz3 = dz1
+      dz4 = dz1
+      dz5 = dz1
+
+c---------------------------------------------------------------------
+c   fourth difference dissipation
+c---------------------------------------------------------------------
+      dssp = ( max (dx1, dy1, dz1 ) ) / 4.0d+00
+
+c---------------------------------------------------------------------
+c   coefficients of the exact solution to the first pde
+c---------------------------------------------------------------------
+      ce(1,1) = 2.0d+00
+      ce(1,2) = 0.0d+00
+      ce(1,3) = 0.0d+00
+      ce(1,4) = 4.0d+00
+      ce(1,5) = 5.0d+00
+      ce(1,6) = 3.0d+00
+      ce(1,7) = 5.0d-01
+      ce(1,8) = 2.0d-02
+      ce(1,9) = 1.0d-02
+      ce(1,10) = 3.0d-02
+      ce(1,11) = 5.0d-01
+      ce(1,12) = 4.0d-01
+      ce(1,13) = 3.0d-01
+
+c---------------------------------------------------------------------
+c   coefficients of the exact solution to the second pde
+c---------------------------------------------------------------------
+      ce(2,1) = 1.0d+00
+      ce(2,2) = 0.0d+00
+      ce(2,3) = 0.0d+00
+      ce(2,4) = 0.0d+00
+      ce(2,5) = 1.0d+00
+      ce(2,6) = 2.0d+00
+      ce(2,7) = 3.0d+00
+      ce(2,8) = 1.0d-02
+      ce(2,9) = 3.0d-02
+      ce(2,10) = 2.0d-02
+      ce(2,11) = 4.0d-01
+      ce(2,12) = 3.0d-01
+      ce(2,13) = 5.0d-01
+
+c---------------------------------------------------------------------
+c   coefficients of the exact solution to the third pde
+c---------------------------------------------------------------------
+      ce(3,1) = 2.0d+00
+      ce(3,2) = 2.0d+00
+      ce(3,3) = 0.0d+00
+      ce(3,4) = 0.0d+00
+      ce(3,5) = 0.0d+00
+      ce(3,6) = 2.0d+00
+      ce(3,7) = 3.0d+00
+      ce(3,8) = 4.0d-02
+      ce(3,9) = 3.0d-02
+      ce(3,10) = 5.0d-02
+      ce(3,11) = 3.0d-01
+      ce(3,12) = 5.0d-01
+      ce(3,13) = 4.0d-01
+
+c---------------------------------------------------------------------
+c   coefficients of the exact solution to the fourth pde
+c---------------------------------------------------------------------
+      ce(4,1) = 2.0d+00
+      ce(4,2) = 2.0d+00
+      ce(4,3) = 0.0d+00
+      ce(4,4) = 0.0d+00
+      ce(4,5) = 0.0d+00
+      ce(4,6) = 2.0d+00
+      ce(4,7) = 3.0d+00
+      ce(4,8) = 3.0d-02
+      ce(4,9) = 5.0d-02
+      ce(4,10) = 4.0d-02
+      ce(4,11) = 2.0d-01
+      ce(4,12) = 1.0d-01
+      ce(4,13) = 3.0d-01
+
+c---------------------------------------------------------------------
+c   coefficients of the exact solution to the fifth pde
+c---------------------------------------------------------------------
+      ce(5,1) = 5.0d+00
+      ce(5,2) = 4.0d+00
+      ce(5,3) = 3.0d+00
+      ce(5,4) = 2.0d+00
+      ce(5,5) = 1.0d-01
+      ce(5,6) = 4.0d-01
+      ce(5,7) = 3.0d-01
+      ce(5,8) = 5.0d-02
+      ce(5,9) = 4.0d-02
+      ce(5,10) = 3.0d-02
+      ce(5,11) = 1.0d-01
+      ce(5,12) = 3.0d-01
+      ce(5,13) = 2.0d-01
+
+      return
+      end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/setiv.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/setiv.f
new file mode 100644
index 0000000..61e775a
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/setiv.f
@@ -0,0 +1,64 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      subroutine setiv
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c
+c   set the initial values of dependent variables based on tri-linear
+c   interpolation of boundary values in the computational space.
+c
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j, k, m
+      double precision  xi, eta, zeta
+      double precision  pxi, peta, pzeta
+      double precision  ue_1jk(5),ue_nx0jk(5),ue_i1k(5),
+     >        ue_iny0k(5),ue_ij1(5),ue_ijnz(5)
+
+
+!$omp parallel do default(shared) private(i,j,k,m,pxi,peta,pzeta,
+!$omp& xi,eta,zeta,ue_ijnz,ue_ij1,ue_iny0k,ue_i1k,ue_nx0jk,ue_1jk)
+!$omp& shared(nx0,ny0,nz)
+      do k = 2, nz - 1
+         zeta = ( dble (k-1) ) / (nz-1)
+         do j = 2, ny - 1
+            eta = ( dble (j-1) ) / (ny0-1)
+            do i = 2, nx - 1
+               xi = ( dble (i-1) ) / (nx0-1)
+               call exact (1,j,k,ue_1jk)
+               call exact (nx0,j,k,ue_nx0jk)
+               call exact (i,1,k,ue_i1k)
+               call exact (i,ny0,k,ue_iny0k)
+               call exact (i,j,1,ue_ij1)
+               call exact (i,j,nz,ue_ijnz)
+               do m = 1, 5
+                  pxi =   ( 1.0d+00 - xi ) * ue_1jk(m)
+     >                              + xi   * ue_nx0jk(m)
+                  peta =  ( 1.0d+00 - eta ) * ue_i1k(m)
+     >                              + eta   * ue_iny0k(m)
+                  pzeta = ( 1.0d+00 - zeta ) * ue_ij1(m)
+     >                              + zeta   * ue_ijnz(m)
+
+                  u( m, i, j, k ) = pxi + peta + pzeta
+     >                 - pxi * peta - peta * pzeta - pzeta * pxi
+     >                 + pxi * peta * pzeta
+
+               end do
+            end do
+         end do
+      end do
+!$omp end parallel do
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/ssor.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/ssor.f
new file mode 100644
index 0000000..5b3d174
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/ssor.f
@@ -0,0 +1,314 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine ssor(niter)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   to perform pseudo-time stepping SSOR iterations
+c   for five nonlinear pde's.
+c---------------------------------------------------------------------
+
+      implicit none
+      integer niter
+
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j, k, m, n
+      integer istep
+      double precision  tmp, tmp2, tv(5*isiz1*isiz2)
+      double precision  delunm(5)
+
+      external timer_read
+      double precision timer_read
+
+c---------------------------------------------------------------------
+c  Thread synchronization for pipeline operation
+c---------------------------------------------------------------------
+      integer isync(0:isiz2), mthreadnum, iam
+      common /threadinfo1/ isync
+      common /threadinfo2/ mthreadnum, iam
+!$omp threadprivate(/threadinfo2/)
+
+!$    external omp_get_thread_num
+!$    integer  omp_get_thread_num
+!$    external omp_get_num_threads
+!$    integer  omp_get_num_threads
+
+ 
+c---------------------------------------------------------------------
+c   begin pseudo-time stepping iterations
+c---------------------------------------------------------------------
+      tmp = 1.0d+00 / ( omega * ( 2.0d+00 - omega ) ) 
+
+c---------------------------------------------------------------------
+c   initialize a,b,c,d and au,bu,cu,du to zero (guarantees that page
+c   tables have been formed, if applicable, before timestepping).
+c---------------------------------------------------------------------
+!$omp parallel default(shared) private(m,n,i,j)
+!$omp do
+      do j=jst,jend
+         do i=ist,iend
+            do m=1,5
+               do n=1,5
+                  a(m,n,i,j) = 0.d0
+                  b(m,n,i,j) = 0.d0
+                  c(m,n,i,j) = 0.d0
+                  d(m,n,i,j) = 0.d0
+               enddo
+            enddo
+         enddo
+      enddo
+!$omp end do nowait
+!$omp do
+      do j=jend,jst,-1
+         do i=iend,ist,-1
+            do m=1,5
+               do n=1,5
+                  au(m,n,i,j) = 0.d0
+                  bu(m,n,i,j) = 0.d0
+                  cu(m,n,i,j) = 0.d0
+                  du(m,n,i,j) = 0.d0
+               enddo
+            enddo
+         enddo
+      enddo
+!$omp end do nowait
+!$omp end parallel
+      do i = 1, t_last
+         call timer_clear(i)
+      end do
+
+c---------------------------------------------------------------------
+c   compute the steady-state residuals
+c---------------------------------------------------------------------
+      call rhs
+ 
+c---------------------------------------------------------------------
+c   compute the L2 norms of newton iteration residuals
+c---------------------------------------------------------------------
+      call l2norm( isiz1, isiz2, isiz3, nx0, ny0, nz0,
+     >             ist, iend, jst, jend,
+     >             rsd, rsdnm )
+
+ 
+      do i = 1, t_last
+         call timer_clear(i)
+      end do
+      call timer_start(1)
+ 
+c---------------------------------------------------------------------
+c   the timestep loop
+c---------------------------------------------------------------------
+      do istep = 1, niter
+
+         if (mod ( istep, 20) .eq. 0 .or.
+     >         istep .eq. itmax .or.
+     >         istep .eq. 1) then
+            if (niter .gt. 1) write( *, 200) istep
+ 200        format(' Time step ', i4)
+         endif
+ 
+c---------------------------------------------------------------------
+c   perform SSOR iteration
+c---------------------------------------------------------------------
+!$omp parallel default(shared) private(i,j,k,m,tmp2)
+!$omp&  shared(ist,iend,jst,jend,nx,ny,nz,nx0,ny0,omega)
+!$omp master
+         if (timeron) call timer_start(t_rhs)
+!$omp end master
+         tmp2 = dt
+!$omp do
+         do k = 2, nz - 1
+            do j = jst, jend
+               do i = ist, iend
+                  do m = 1, 5
+                     rsd(m,i,j,k) = tmp2 * rsd(m,i,j,k)
+                  end do
+               end do
+            end do
+         end do
+!$omp end do nowait
+!$omp master
+         if (timeron) call timer_stop(t_rhs)
+!$omp end master
+
+         mthreadnum = 0
+!$       mthreadnum = omp_get_num_threads() - 1
+         if (mthreadnum .gt. jend - jst) mthreadnum = jend - jst
+         iam = 0
+!$       iam = omp_get_thread_num()
+         if (iam .le. mthreadnum) isync(iam) = 0
+!$omp barrier
+
+         do k = 2, nz -1 
+c---------------------------------------------------------------------
+c   form the lower triangular part of the jacobian matrix
+c---------------------------------------------------------------------
+!$omp master
+            if (timeron) call timer_start(t_jacld)
+!$omp end master
+            call jacld(k)
+!$omp master
+            if (timeron) call timer_stop(t_jacld)
+ 
+c---------------------------------------------------------------------
+c   perform the lower triangular solution
+c---------------------------------------------------------------------
+            if (timeron) call timer_start(t_blts)
+!$omp end master
+            call blts( isiz1, isiz2, isiz3,
+     >                 nx, ny, nz, k,
+     >                 omega,
+     >                 rsd, 
+     >                 a, b, c, d,
+     >                 ist, iend, jst, jend, 
+     >                 nx0, ny0 )
+!$omp master
+            if (timeron) call timer_stop(t_blts)
+!$omp end master
+          end do
+!$omp barrier
+
+          do k = nz - 1, 2, -1
+c---------------------------------------------------------------------
+c   form the strictly upper triangular part of the jacobian matrix
+c---------------------------------------------------------------------
+!$omp master
+            if (timeron) call timer_start(t_jacu)
+!$omp end master
+            call jacu(k)
+!$omp master
+            if (timeron) call timer_stop(t_jacu)
+
+c---------------------------------------------------------------------
+c   perform the upper triangular solution
+c---------------------------------------------------------------------
+            if (timeron) call timer_start(t_buts)
+!$omp end master
+            call buts( isiz1, isiz2, isiz3,
+     >                 nx, ny, nz, k,
+     >                 omega,
+     >                 rsd, tv,
+     >                 du, au, bu, cu,
+     >                 ist, iend, jst, jend,
+     >                 nx0, ny0 )
+!$omp master
+            if (timeron) call timer_stop(t_buts)
+!$omp end master
+          end do
+!$omp barrier
+ 
+c---------------------------------------------------------------------
+c   update the variables
+c---------------------------------------------------------------------
+
+!$omp master
+         if (timeron) call timer_start(t_add)
+!$omp end master
+         tmp2 = tmp
+!$omp do
+         do k = 2, nz-1
+            do j = jst, jend
+               do i = ist, iend
+                  do m = 1, 5
+                     u( m, i, j, k ) = u( m, i, j, k )
+     >                    + tmp2 * rsd( m, i, j, k )
+                  end do
+               end do
+            end do
+         end do
+!$omp end do nowait
+!$omp end parallel
+         if (timeron) call timer_stop(t_add)
+ 
+c---------------------------------------------------------------------
+c   compute the L2 norms of newton iteration corrections
+c---------------------------------------------------------------------
+         if ( mod ( istep, inorm ) .eq. 0 ) then
+            if (timeron) call timer_start(t_l2norm)
+            call l2norm( isiz1, isiz2, isiz3, nx0, ny0, nz0,
+     >                   ist, iend, jst, jend,
+     >                   rsd, delunm )
+            if (timeron) call timer_stop(t_l2norm)
+c            if ( ipr .eq. 1 ) then
+c                write (*,1006) ( delunm(m), m = 1, 5 )
+c            else if ( ipr .eq. 2 ) then
+c                write (*,'(i5,f15.6)') istep,delunm(5)
+c            end if
+         end if
+ 
+c---------------------------------------------------------------------
+c   compute the steady-state residuals
+c---------------------------------------------------------------------
+         call rhs
+ 
+c---------------------------------------------------------------------
+c   compute the L2 norms of newton iteration residuals
+c---------------------------------------------------------------------
+         if ( ( mod ( istep, inorm ) .eq. 0 ) .or.
+     >        ( istep .eq. itmax ) ) then
+            if (timeron) call timer_start(t_l2norm)
+            call l2norm( isiz1, isiz2, isiz3, nx0, ny0, nz0,
+     >                   ist, iend, jst, jend,
+     >                   rsd, rsdnm )
+            if (timeron) call timer_stop(t_l2norm)
+c            if ( ipr .eq. 1 ) then
+c                write (*,1007) ( rsdnm(m), m = 1, 5 )
+c            end if
+         end if
+
+c---------------------------------------------------------------------
+c   check the newton-iteration residuals against the tolerance levels
+c---------------------------------------------------------------------
+         if ( ( rsdnm(1) .lt. tolrsd(1) ) .and.
+     >        ( rsdnm(2) .lt. tolrsd(2) ) .and.
+     >        ( rsdnm(3) .lt. tolrsd(3) ) .and.
+     >        ( rsdnm(4) .lt. tolrsd(4) ) .and.
+     >        ( rsdnm(5) .lt. tolrsd(5) ) ) then
+c            if (ipr .eq. 1 ) then
+               write (*,1004) istep
+c            end if
+            go to 900
+         end if
+ 
+      end do
+  900 continue
+ 
+      call timer_stop(1)
+      maxtime= timer_read(1)
+ 
+
+
+      return
+      
+ 1001 format (1x/5x,'pseudo-time SSOR iteration no.=',i4/)
+ 1004 format (1x/1x,'convergence was achieved after ',i4,
+     >   ' pseudo-time steps' )
+ 1006 format (1x/1x,'RMS-norm of SSOR-iteration correction ',
+     > 'for first pde  = ',1pe12.5/,
+     > 1x,'RMS-norm of SSOR-iteration correction ',
+     > 'for second pde = ',1pe12.5/,
+     > 1x,'RMS-norm of SSOR-iteration correction ',
+     > 'for third pde  = ',1pe12.5/,
+     > 1x,'RMS-norm of SSOR-iteration correction ',
+     > 'for fourth pde = ',1pe12.5/,
+     > 1x,'RMS-norm of SSOR-iteration correction ',
+     > 'for fifth pde  = ',1pe12.5)
+ 1007 format (1x/1x,'RMS-norm of steady-state residual for ',
+     > 'first pde  = ',1pe12.5/,
+     > 1x,'RMS-norm of steady-state residual for ',
+     > 'second pde = ',1pe12.5/,
+     > 1x,'RMS-norm of steady-state residual for ',
+     > 'third pde  = ',1pe12.5/,
+     > 1x,'RMS-norm of steady-state residual for ',
+     > 'fourth pde = ',1pe12.5/,
+     > 1x,'RMS-norm of steady-state residual for ',
+     > 'fifth pde  = ',1pe12.5)
+ 
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/ssor_vec.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/ssor_vec.f
new file mode 100644
index 0000000..fab2e2e
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/ssor_vec.f
@@ -0,0 +1,297 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine ssor(niter)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   to perform pseudo-time stepping SSOR iterations
+c   for five nonlinear PDEs.
+c---------------------------------------------------------------------
+
+      implicit none
+      integer niter
+
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j, k, m, n, lst, lend
+      integer istep
+      double precision  tmp, tmp2, tv(5*isiz1*isiz2)
+      double precision  delunm(5)
+
+      external timer_read
+      double precision timer_read
+
+ 
+c---------------------------------------------------------------------
+c   begin pseudo-time stepping iterations
+c---------------------------------------------------------------------
+      tmp = 1.0d+00 / ( omega * ( 2.0d+00 - omega ) ) 
+
+c---------------------------------------------------------------------
+c   initialize a,b,c,d and au,bu,cu,du to zero (guarantees that page
+c   tables have been formed, if applicable, before timestepping).
+c---------------------------------------------------------------------
+!$omp parallel default(shared) private(m,n,i,j)
+!$omp do
+      do j=jst,jend
+         do i=ist,iend
+            do m=1,5
+               do n=1,5
+                  a(m,n,i,j) = 0.d0
+                  b(m,n,i,j) = 0.d0
+                  c(m,n,i,j) = 0.d0
+                  d(m,n,i,j) = 0.d0
+               enddo
+            enddo
+         enddo
+      enddo
+!$omp end do nowait
+!$omp do
+      do j=jend,jst,-1
+         do i=iend,ist,-1
+            do m=1,5
+               do n=1,5
+                  au(m,n,i,j) = 0.d0
+                  bu(m,n,i,j) = 0.d0
+                  cu(m,n,i,j) = 0.d0
+                  du(m,n,i,j) = 0.d0
+               enddo
+            enddo
+         enddo
+      enddo
+!$omp end do nowait
+!$omp end parallel
+      do i = 1, t_last
+         call timer_clear(i)
+      end do
+
+c---------------------------------------------------------------------
+c   compute the steady-state residuals
+c---------------------------------------------------------------------
+      call rhs
+ 
+c---------------------------------------------------------------------
+c   compute the L2 norms of newton iteration residuals
+c---------------------------------------------------------------------
+      call l2norm( isiz1, isiz2, isiz3, nx0, ny0, nz0,
+     >             ist, iend, jst, jend,
+     >             rsd, rsdnm )
+
+ 
+      do i = 1, t_last
+         call timer_clear(i)
+      end do
+      call timer_start(1)
+ 
+c---------------------------------------------------------------------
+c   the timestep loop
+c---------------------------------------------------------------------
+      do istep = 1, niter
+
+         if (mod ( istep, 20) .eq. 0 .or.
+     >         istep .eq. itmax .or.
+     >         istep .eq. 1) then
+            if (niter .gt. 1) write( *, 200) istep
+ 200        format(' Time step ', i4)
+         endif
+ 
+c---------------------------------------------------------------------
+c   perform SSOR iteration
+c---------------------------------------------------------------------
+!$omp parallel default(shared) private(i,j,k,m,tmp2,lst,lend)
+!$omp&  shared(ist,iend,jst,jend,nx,ny,nz,nx0,ny0,omega)
+!$omp master
+         if (timeron) call timer_start(t_rhs)
+!$omp end master
+         tmp2 = dt
+!$omp do
+         do k = 2, nz - 1
+            do j = jst, jend
+               do i = ist, iend
+                  do m = 1, 5
+                     rsd(m,i,j,k) = tmp2 * rsd(m,i,j,k)
+                  end do
+               end do
+            end do
+         end do
+!$omp end do nowait
+!$omp master
+         if (timeron) call timer_stop(t_rhs)
+!$omp end master
+
+         lst = ist + jst
+         lend = iend + jend
+!$omp barrier
+
+         do k = 2, nz -1 
+c---------------------------------------------------------------------
+c   form the lower triangular part of the jacobian matrix
+c---------------------------------------------------------------------
+!$omp master
+            if (timeron) call timer_start(t_jacld)
+!$omp end master
+            call jacld(k)
+!$omp master
+            if (timeron) call timer_stop(t_jacld)
+ 
+c---------------------------------------------------------------------
+c   perform the lower triangular solution
+c---------------------------------------------------------------------
+            if (timeron) call timer_start(t_blts)
+!$omp end master
+            call blts( isiz1, isiz2, isiz3,
+     >                 nx, ny, nz, k,
+     >                 omega,
+     >                 rsd, 
+     >                 a, b, c, d,
+     >                 ist, iend, jst, jend, 
+     >                 lst, lend )
+!$omp master
+            if (timeron) call timer_stop(t_blts)
+!$omp end master
+          end do
+
+
+          do k = nz - 1, 2, -1
+c---------------------------------------------------------------------
+c   form the strictly upper triangular part of the jacobian matrix
+c---------------------------------------------------------------------
+!$omp master
+            if (timeron) call timer_start(t_jacu)
+!$omp end master
+            call jacu(k)
+!$omp master
+            if (timeron) call timer_stop(t_jacu)
+
+c---------------------------------------------------------------------
+c   perform the upper triangular solution
+c---------------------------------------------------------------------
+            if (timeron) call timer_start(t_buts)
+!$omp end master
+            call buts( isiz1, isiz2, isiz3,
+     >                 nx, ny, nz, k,
+     >                 omega,
+     >                 rsd, tv,
+     >                 du, au, bu, cu,
+     >                 ist, iend, jst, jend,
+     >                 lst, lend )
+!$omp master
+            if (timeron) call timer_stop(t_buts)
+!$omp end master
+          end do
+
+ 
+c---------------------------------------------------------------------
+c   update the variables
+c---------------------------------------------------------------------
+
+!$omp master
+         if (timeron) call timer_start(t_add)
+!$omp end master
+         tmp2 = tmp
+!$omp do
+         do k = 2, nz-1
+            do j = jst, jend
+               do i = ist, iend
+                  do m = 1, 5
+                     u( m, i, j, k ) = u( m, i, j, k )
+     >                    + tmp2 * rsd( m, i, j, k )
+                  end do
+               end do
+            end do
+         end do
+!$omp end do nowait
+!$omp end parallel
+         if (timeron) call timer_stop(t_add)
+ 
+c---------------------------------------------------------------------
+c   compute the L2 norms of newton iteration corrections
+c---------------------------------------------------------------------
+         if ( mod ( istep, inorm ) .eq. 0 ) then
+            if (timeron) call timer_start(t_l2norm)
+            call l2norm( isiz1, isiz2, isiz3, nx0, ny0, nz0,
+     >                   ist, iend, jst, jend,
+     >                   rsd, delunm )
+            if (timeron) call timer_stop(t_l2norm)
+c            if ( ipr .eq. 1 ) then
+c                write (*,1006) ( delunm(m), m = 1, 5 )
+c            else if ( ipr .eq. 2 ) then
+c                write (*,'(i5,f15.6)') istep,delunm(5)
+c            end if
+         end if
+ 
+c---------------------------------------------------------------------
+c   compute the steady-state residuals
+c---------------------------------------------------------------------
+         call rhs
+ 
+c---------------------------------------------------------------------
+c   compute the L2 norms of newton iteration residuals
+c---------------------------------------------------------------------
+         if ( ( mod ( istep, inorm ) .eq. 0 ) .or.
+     >        ( istep .eq. itmax ) ) then
+            if (timeron) call timer_start(t_l2norm)
+            call l2norm( isiz1, isiz2, isiz3, nx0, ny0, nz0,
+     >                   ist, iend, jst, jend,
+     >                   rsd, rsdnm )
+            if (timeron) call timer_stop(t_l2norm)
+c            if ( ipr .eq. 1 ) then
+c                write (*,1007) ( rsdnm(m), m = 1, 5 )
+c            end if
+         end if
+
+c---------------------------------------------------------------------
+c   check the newton-iteration residuals against the tolerance levels
+c---------------------------------------------------------------------
+         if ( ( rsdnm(1) .lt. tolrsd(1) ) .and.
+     >        ( rsdnm(2) .lt. tolrsd(2) ) .and.
+     >        ( rsdnm(3) .lt. tolrsd(3) ) .and.
+     >        ( rsdnm(4) .lt. tolrsd(4) ) .and.
+     >        ( rsdnm(5) .lt. tolrsd(5) ) ) then
+c            if (ipr .eq. 1 ) then
+               write (*,1004) istep
+c            end if
+            go to 900
+         end if
+ 
+      end do
+  900 continue
+ 
+      call timer_stop(1)
+      maxtime= timer_read(1)
+ 
+
+
+      return
+      
+ 1001 format (1x/5x,'pseudo-time SSOR iteration no.=',i4/)
+ 1004 format (1x/1x,'convergence was achieved after ',i4,
+     >   ' pseudo-time steps' )
+ 1006 format (1x/1x,'RMS-norm of SSOR-iteration correction ',
+     > 'for first pde  = ',1pe12.5/,
+     > 1x,'RMS-norm of SSOR-iteration correction ',
+     > 'for second pde = ',1pe12.5/,
+     > 1x,'RMS-norm of SSOR-iteration correction ',
+     > 'for third pde  = ',1pe12.5/,
+     > 1x,'RMS-norm of SSOR-iteration correction ',
+     > 'for fourth pde = ',1pe12.5/,
+     > 1x,'RMS-norm of SSOR-iteration correction ',
+     > 'for fifth pde  = ',1pe12.5)
+ 1007 format (1x/1x,'RMS-norm of steady-state residual for ',
+     > 'first pde  = ',1pe12.5/,
+     > 1x,'RMS-norm of steady-state residual for ',
+     > 'second pde = ',1pe12.5/,
+     > 1x,'RMS-norm of steady-state residual for ',
+     > 'third pde  = ',1pe12.5/,
+     > 1x,'RMS-norm of steady-state residual for ',
+     > 'fourth pde = ',1pe12.5/,
+     > 1x,'RMS-norm of steady-state residual for ',
+     > 'fifth pde  = ',1pe12.5)
+ 
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/syncs.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/syncs.f
new file mode 100644
index 0000000..6690a3a
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/syncs.f
@@ -0,0 +1,77 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine sync_left( ldmx, ldmy, ldmz, v )
+
+c---------------------------------------------------------------------
+c   Thread synchronization for pipeline operation
+c---------------------------------------------------------------------
+
+      implicit none
+
+      integer ldmx, ldmy, ldmz
+      double precision  v( 5, ldmx/2*2+1, ldmy/2*2+1, ldmz)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      include 'npbparams.h'
+
+      integer isync(0:isiz2), mthreadnum, iam
+      common /threadinfo1/ isync
+      common /threadinfo2/ mthreadnum, iam
+!$omp threadprivate(/threadinfo2/)
+
+      integer neigh
+
+
+      if (iam .gt. 0 .and. iam .le. mthreadnum) then
+         neigh = iam - 1
+         do while (isync(neigh) .eq. 0)
+!$omp flush(isync)
+         end do
+         isync(neigh) = 0
+!$omp flush(isync,v)
+      endif
+
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine sync_right( ldmx, ldmy, ldmz, v )
+
+c---------------------------------------------------------------------
+c   Thread synchronization for pipeline operation
+c---------------------------------------------------------------------
+
+      implicit none
+
+      integer ldmx, ldmy, ldmz
+      double precision  v( 5, ldmx/2*2+1, ldmy/2*2+1, ldmz)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      include 'npbparams.h'
+
+      integer isync(0:isiz2), mthreadnum, iam
+      common /threadinfo1/ isync
+      common /threadinfo2/ mthreadnum, iam
+!$omp threadprivate(/threadinfo2/)
+
+
+      if (iam .lt. mthreadnum) then
+!$omp flush(isync,v)
+         do while (isync(iam) .eq. 1)
+!$omp flush(isync)
+         end do
+         isync(iam) = 1
+!$omp flush(isync)
+      endif
+
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/verify.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/verify.f
new file mode 100644
index 0000000..0628800
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/LU/verify.f
@@ -0,0 +1,408 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+        subroutine verify(xcr, xce, xci, class, verified)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c  verification routine                         
+c---------------------------------------------------------------------
+
+        implicit none
+        include 'applu.incl'
+
+        double precision xcr(5), xce(5), xci
+        double precision xcrref(5),xceref(5),xciref, 
+     >                   xcrdif(5),xcedif(5),xcidif,
+     >                   epsilon, dtref
+        integer m
+        character class
+        logical verified
+
+c---------------------------------------------------------------------
+c   tolerance level
+c---------------------------------------------------------------------
+        epsilon = 1.0d-08
+
+        class = 'U'
+        verified = .true.
+
+        do m = 1,5
+           xcrref(m) = 1.0
+           xceref(m) = 1.0
+        end do
+        xciref = 1.0
+
+        if ( (nx0  .eq. 12     ) .and. 
+     >       (ny0  .eq. 12     ) .and.
+     >       (nz0  .eq. 12     ) .and.
+     >       (itmax   .eq. 50    ))  then
+
+           class = 'S'
+           dtref = 5.0d-1
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of residual, for the (12X12X12) grid,
+c   after 50 time steps, with  DT = 5.0d-01
+c---------------------------------------------------------------------
+         xcrref(1) = 1.6196343210976702d-02
+         xcrref(2) = 2.1976745164821318d-03
+         xcrref(3) = 1.5179927653399185d-03
+         xcrref(4) = 1.5029584435994323d-03
+         xcrref(5) = 3.4264073155896461d-02
+
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of solution error, for the (12X12X12) grid,
+c   after 50 time steps, with  DT = 5.0d-01
+c---------------------------------------------------------------------
+         xceref(1) = 6.4223319957960924d-04
+         xceref(2) = 8.4144342047347926d-05
+         xceref(3) = 5.8588269616485186d-05
+         xceref(4) = 5.8474222595157350d-05
+         xceref(5) = 1.3103347914111294d-03
+
+c---------------------------------------------------------------------
+c   Reference value of surface integral, for the (12X12X12) grid,
+c   after 50 time steps, with DT = 5.0d-01
+c---------------------------------------------------------------------
+         xciref = 7.8418928865937083d+00
+
+
+        elseif ( (nx0 .eq. 33) .and. 
+     >           (ny0 .eq. 33) .and.
+     >           (nz0 .eq. 33) .and.
+     >           (itmax .eq. 300) ) then
+
+           class = 'W'   !SPEC95fp size
+           dtref = 1.5d-3
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of residual, for the (33x33x33) grid,
+c   after 300 time steps, with  DT = 1.5d-3
+c---------------------------------------------------------------------
+           xcrref(1) =   0.1236511638192d+02
+           xcrref(2) =   0.1317228477799d+01
+           xcrref(3) =   0.2550120713095d+01
+           xcrref(4) =   0.2326187750252d+01
+           xcrref(5) =   0.2826799444189d+02
+
+
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of solution error, for the (33X33X33) grid,
+c---------------------------------------------------------------------
+           xceref(1) =   0.4867877144216d+00
+           xceref(2) =   0.5064652880982d-01
+           xceref(3) =   0.9281818101960d-01
+           xceref(4) =   0.8570126542733d-01
+           xceref(5) =   0.1084277417792d+01
+
+
+c---------------------------------------------------------------------
+c   Reference value of surface integral, for the (33X33X33) grid,
+c   after 300 time steps, with  DT = 1.5d-3
+c---------------------------------------------------------------------
+           xciref    =   0.1161399311023d+02
+
+        elseif ( (nx0 .eq. 64) .and. 
+     >           (ny0 .eq. 64) .and.
+     >           (nz0 .eq. 64) .and.
+     >           (itmax .eq. 250) ) then
+
+           class = 'A'
+           dtref = 2.0d+0
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of residual, for the (64X64X64) grid,
+c   after 250 time steps, with  DT = 2.0d+00
+c---------------------------------------------------------------------
+         xcrref(1) = 7.7902107606689367d+02
+         xcrref(2) = 6.3402765259692870d+01
+         xcrref(3) = 1.9499249727292479d+02
+         xcrref(4) = 1.7845301160418537d+02
+         xcrref(5) = 1.8384760349464247d+03
+
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of solution error, for the (64X64X64) grid,
+c   after 250 time steps, with  DT = 2.0d+00
+c---------------------------------------------------------------------
+         xceref(1) = 2.9964085685471943d+01
+         xceref(2) = 2.8194576365003349d+00
+         xceref(3) = 7.3473412698774742d+00
+         xceref(4) = 6.7139225687777051d+00
+         xceref(5) = 7.0715315688392578d+01
+
+c---------------------------------------------------------------------
+c   Reference value of surface integral, for the (64X64X64) grid,
+c   after 250 time steps, with DT = 2.0d+00
+c---------------------------------------------------------------------
+         xciref = 2.6030925604886277d+01
+
+
+        elseif ( (nx0 .eq. 102) .and. 
+     >           (ny0 .eq. 102) .and.
+     >           (nz0 .eq. 102) .and.
+     >           (itmax .eq. 250) ) then
+
+           class = 'B'
+           dtref = 2.0d+0
+
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of residual, for the (102X102X102) grid,
+c   after 250 time steps, with  DT = 2.0d+00
+c---------------------------------------------------------------------
+         xcrref(1) = 3.5532672969982736d+03
+         xcrref(2) = 2.6214750795310692d+02
+         xcrref(3) = 8.8333721850952190d+02
+         xcrref(4) = 7.7812774739425265d+02
+         xcrref(5) = 7.3087969592545314d+03
+
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of solution error, for the (102X102X102) 
+c   grid, after 250 time steps, with  DT = 2.0d+00
+c---------------------------------------------------------------------
+         xceref(1) = 1.1401176380212709d+02
+         xceref(2) = 8.1098963655421574d+00
+         xceref(3) = 2.8480597317698308d+01
+         xceref(4) = 2.5905394567832939d+01
+         xceref(5) = 2.6054907504857413d+02
+
+c---------------------------------------------------------------------
+c   Reference value of surface integral, for the (102X102X102) grid,
+c   after 250 time steps, with DT = 2.0d+00
+c---------------------------------------------------------------------
+         xciref = 4.7887162703308227d+01
+
+        elseif ( (nx0 .eq. 162) .and. 
+     >           (ny0 .eq. 162) .and.
+     >           (nz0 .eq. 162) .and.
+     >           (itmax .eq. 250) ) then
+
+           class = 'C'
+           dtref = 2.0d+0
+
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of residual, for the (162X162X162) grid,
+c   after 250 time steps, with  DT = 2.0d+00
+c---------------------------------------------------------------------
+         xcrref(1) = 1.03766980323537846d+04
+         xcrref(2) = 8.92212458801008552d+02
+         xcrref(3) = 2.56238814582660871d+03
+         xcrref(4) = 2.19194343857831427d+03
+         xcrref(5) = 1.78078057261061185d+04
+
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of solution error, for the (162X162X162) 
+c   grid, after 250 time steps, with  DT = 2.0d+00
+c---------------------------------------------------------------------
+         xceref(1) = 2.15986399716949279d+02
+         xceref(2) = 1.55789559239863600d+01
+         xceref(3) = 5.41318863077207766d+01
+         xceref(4) = 4.82262643154045421d+01
+         xceref(5) = 4.55902910043250358d+02
+
+c---------------------------------------------------------------------
+c   Reference value of surface integral, for the (162X162X162) grid,
+c   after 250 time steps, with DT = 2.0d+00
+c---------------------------------------------------------------------
+         xciref = 6.66404553572181300d+01
+
+c---------------------------------------------------------------------
+c   Reference value of surface integral, for the (162X162X162) grid,
+c   after 250 time steps, with DT = 2.0d+00
+c---------------------------------------------------------------------
+         xciref = 6.66404553572181300d+01
+
+        elseif ( (nx0 .eq. 408) .and. 
+     >           (ny0 .eq. 408) .and.
+     >           (nz0 .eq. 408) .and.
+     >           (itmax . eq. 300) ) then
+
+           class = 'D'
+           dtref = 1.0d+0
+
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of residual, for the (408X408X408) grid,
+c   after 300 time steps, with  DT = 1.0d+00
+c---------------------------------------------------------------------
+         xcrref(1) = 0.4868417937025d+05
+         xcrref(2) = 0.4696371050071d+04
+         xcrref(3) = 0.1218114549776d+05 
+         xcrref(4) = 0.1033801493461d+05
+         xcrref(5) = 0.7142398413817d+05
+
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of solution error, for the (408X408X408) 
+c   grid, after 300 time steps, with  DT = 1.0d+00
+c---------------------------------------------------------------------
+         xceref(1) = 0.3752393004482d+03
+         xceref(2) = 0.3084128893659d+02
+         xceref(3) = 0.9434276905469d+02
+         xceref(4) = 0.8230686681928d+02
+         xceref(5) = 0.7002620636210d+03
+
+c---------------------------------------------------------------------
+c   Reference value of surface integral, for the (408X408X408) grid,
+c   after 300 time steps, with DT = 1.0d+00
+c---------------------------------------------------------------------
+         xciref =    0.8334101392503d+02
+
+        elseif ( (nx0 .eq. 1020) .and. 
+     >           (ny0 .eq. 1020) .and.
+     >           (nz0 .eq. 1020) .and.
+     >           (itmax . eq. 300) ) then
+
+           class = 'E'
+           dtref = 0.5d+0
+
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of residual, for the (1020X1020X1020) grid,
+c   after 300 time steps, with  DT = 0.5d+00
+c---------------------------------------------------------------------
+         xcrref(1) = 0.2099641687874d+06
+         xcrref(2) = 0.2130403143165d+05
+         xcrref(3) = 0.5319228789371d+05 
+         xcrref(4) = 0.4509761639833d+05
+         xcrref(5) = 0.2932360006590d+06
+
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of solution error, for the (1020X1020X1020) 
+c   grid, after 300 time steps, with  DT = 0.5d+00
+c---------------------------------------------------------------------
+         xceref(1) = 0.4800572578333d+03
+         xceref(2) = 0.4221993400184d+02
+         xceref(3) = 0.1210851906824d+03
+         xceref(4) = 0.1047888986770d+03
+         xceref(5) = 0.8363028257389d+03
+
+c---------------------------------------------------------------------
+c   Reference value of surface integral, for the (1020X1020X1020) grid,
+c   after 300 time steps, with DT = 0.5d+00
+c---------------------------------------------------------------------
+         xciref =    0.9512163272273d+02
+
+        else
+           verified = .FALSE.
+        endif
+
+c---------------------------------------------------------------------
+c    verification test for residuals if gridsize is one of 
+c    the defined grid sizes above (class .ne. 'U')
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c    Compute the difference of solution values and the known reference values.
+c---------------------------------------------------------------------
+        do m = 1, 5
+           
+           xcrdif(m) = dabs((xcr(m)-xcrref(m))/xcrref(m)) 
+           xcedif(m) = dabs((xce(m)-xceref(m))/xceref(m))
+           
+        enddo
+        xcidif = dabs((xci - xciref)/xciref)
+
+
+c---------------------------------------------------------------------
+c    Output the comparison of computed results to known cases.
+c---------------------------------------------------------------------
+
+        if (class .ne. 'U') then
+           write(*, 1990) class
+ 1990      format(/, ' Verification being performed for class ', a)
+           write (*,2000) epsilon
+ 2000      format(' Accuracy setting for epsilon = ', E20.13)
+           verified = (dabs(dt-dtref) .le. epsilon)
+           if (.not.verified) then
+              class = 'U'
+              write (*,1000) dtref
+ 1000         format(' DT does not match the reference value of ', 
+     >                 E15.8)
+           endif
+        else 
+           write(*, 1995)
+ 1995      format(' Unknown class')
+        endif
+
+
+        if (class .ne. 'U') then
+           write (*, 2001) 
+        else
+           write (*, 2005)
+        endif
+
+ 2001   format(' Comparison of RMS-norms of residual')
+ 2005   format(' RMS-norms of residual')
+        do m = 1, 5
+           if (class .eq. 'U') then
+              write(*, 2015) m, xcr(m)
+           else if (xcrdif(m) .le. epsilon) then
+              write (*,2011) m,xcr(m),xcrref(m),xcrdif(m)
+           else 
+              verified = .false.
+              write (*,2010) m,xcr(m),xcrref(m),xcrdif(m)
+           endif
+        enddo
+
+        if (class .ne. 'U') then
+           write (*,2002)
+        else
+           write (*,2006)
+        endif
+ 2002   format(' Comparison of RMS-norms of solution error')
+ 2006   format(' RMS-norms of solution error')
+        
+        do m = 1, 5
+           if (class .eq. 'U') then
+              write(*, 2015) m, xce(m)
+           else if (xcedif(m) .le. epsilon) then
+              write (*,2011) m,xce(m),xceref(m),xcedif(m)
+           else
+              verified = .false.
+              write (*,2010) m,xce(m),xceref(m),xcedif(m)
+           endif
+        enddo
+        
+ 2010   format(' FAILURE: ', i2, 2x, E20.13, E20.13, E20.13)
+ 2011   format('          ', i2, 2x, E20.13, E20.13, E20.13)
+ 2015   format('          ', i2, 2x, E20.13)
+        
+        if (class .ne. 'U') then
+           write (*,2025)
+        else
+           write (*,2026)
+        endif
+ 2025   format(' Comparison of surface integral')
+ 2026   format(' Surface integral')
+
+
+        if (class .eq. 'U') then
+           write(*, 2030) xci
+        else if (xcidif .le. epsilon) then
+           write(*, 2032) xci, xciref, xcidif
+        else
+           verified = .false.
+           write(*, 2031) xci, xciref, xcidif
+        endif
+
+ 2030   format('          ', 4x, E20.13)
+ 2031   format(' FAILURE: ', 4x, E20.13, E20.13, E20.13)
+ 2032   format('          ', 4x, E20.13, E20.13, E20.13)
+
+
+
+        if (class .eq. 'U') then
+           write(*, 2022)
+           write(*, 2023)
+ 2022      format(' No reference values provided')
+ 2023      format(' No verification performed')
+        else if (verified) then
+           write(*, 2020)
+ 2020      format(' Verification Successful')
+        else
+           write(*, 2021)
+ 2021      format(' Verification failed')
+        endif
+
+        return
+
+
+        end
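Reviewer note (illustrative only, not part of the patch): the tail of the verification routine above reduces to a relative-error test of each computed norm against its per-class reference value, preceded by a check that the time step matches `dtref`. The sketch below restates that decision in Python. The function names `relative_diffs` and `verify` are hypothetical, and the default `epsilon` of 1.0e-8 is an assumption, since the assignment to `epsilon` lies outside the lines shown here.

```python
# Illustrative sketch; names are hypothetical and do not appear in the patch.

def relative_diffs(computed, reference):
    """Return |(x - ref) / ref| per element, like the xcrdif/xcedif loop."""
    return [abs((x - ref) / ref) for x, ref in zip(computed, reference)]


def verify(xcr, xcrref, xce, xceref, xci, xciref, dt, dtref, epsilon=1.0e-8):
    """Return True/False for pass/fail, or None when no verification is done.

    None mirrors the Fortran path where a dt mismatch resets class to 'U'.
    The epsilon default is an assumption, not taken from the lines above.
    """
    if abs(dt - dtref) > epsilon:
        return None
    diffs = relative_diffs(xcr, xcrref) + relative_diffs(xce, xceref)
    diffs.append(abs((xci - xciref) / xciref))
    return all(d <= epsilon for d in diffs)


# Class B reference values copied from the source above.
XCRREF_B = [3.5532672969982736e+03, 2.6214750795310692e+02,
            8.8333721850952190e+02, 7.7812774739425265e+02,
            7.3087969592545314e+03]
XCEREF_B = [1.1401176380212709e+02, 8.1098963655421574e+00,
            2.8480597317698308e+01, 2.5905394567832939e+01,
            2.6054907504857413e+02]
XCIREF_B = 4.7887162703308227e+01
```

Passing the reference values in as the computed values gives a trivially successful verification, while perturbing any single norm by more than `epsilon` in relative terms makes it fail.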
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/MG/Makefile b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/MG/Makefile
new file mode 100644
index 0000000..ef0fff9
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/MG/Makefile
@@ -0,0 +1,28 @@
+SHELL=/bin/sh
+BENCHMARK=mg
+BENCHMARKU=MG
+
+include ../config/make.def
+
+OBJS = mg.o ${COMMON}/print_results.o  \
+       ${COMMON}/${RAND}.o ${COMMON}/timers.o ${COMMON}/wtime.o
+ifeq (${HOOKS}, 1)
+        OBJS += ${COMMON}/hooks.o ${COMMON}/m5op_x86.o ${COMMON}/m5_mmap.o
+endif
+
+include ../sys/make.common
+
+${PROGRAM}: config ${OBJS}
+	${FLINK} ${FLINKFLAGS} -no-pie -o ${PROGRAM} ${OBJS} ${F_LIB}
+
+mg.o:		mg.f globals.h npbparams.h
+ifeq (${HOOKS}, 1)
+	${FCOMPILE} -DHOOKS mg.f
+else
+	${FCOMPILE} mg.f
+endif
+
+clean:
+	- rm -f *.o *~
+	- rm -f npbparams.h core
+	- if [ -d rii_files ]; then rm -r rii_files; fi
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/MG/README b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/MG/README
new file mode 100644
index 0000000..566d71d
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/MG/README
@@ -0,0 +1,141 @@
+Some info about the MG benchmark
+(Note: this info applies to the parallel version and mostly concerns
+the processor decomposition.  Info not concerning the decomposition
+still applies to the serial version.)
+================================
+    
+'mg_demo' demonstrates the capabilities of a very simple multigrid
+solver in computing a three dimensional potential field.  This is
+a simplified multigrid solver in two important respects:
+
+  (1) it solves only a constant coefficient equation,
+  and that only on a uniform cubical grid,
+    
+  (2) it solves only a single equation, representing
+  a scalar field rather than a vector field.
+
+We chose it for its portability and simplicity, and expect that a
+supercomputer which can run it effectively will also be able to
+run more complex multigrid programs at least as well.
+     
+     Eric Barszcz                         Paul Frederickson
+     RIACS
+     NASA Ames Research Center            NASA Ames Research Center
+
+========================================================================
+Running the program:  (Note: also see parameter lm information in the
+                       two sections immediately below this section)
+
+The program may be run with or without an input deck (called "mg.input"). 
+The following describes a few things about the input deck if you want to 
+use one. 
+
+The four lines below are the "mg.input" file required to run a
+problem of total size 256x256x256, for 4 iterations (Class "A"),
+and presumes the use of 8 processors:
+
+   8 = top level
+   256 256 256 = nx ny nz
+   4 = nit
+   0 0 0 0 0 0 0 0 = debug_vec
+
+The first line of input indicates how many levels of multi-grid
+cycle will be applied to a particular subpartition.  Presuming that
+8 processors are solving this problem (recall that the number of 
+processors is specified to MPI as a run parameter, and MPI subsequently
+determines this for the code via an MPI subroutine call), a 2x2x2 
+processor grid is  formed, and thus each partition on a processor is 
+of size 128x128x128.  Therefore, a maximum of 8 multi-grid levels may 
+be used.  These are of size 128,64,32,16,8,4,2,1, with the coarsest 
+level being a single point on a given processor.
+
+
+Next, consider the same size problem but running on 1 processor.  The
+following "mg.input" file is appropriate:
+
+    9 = top level
+    256 256 256 = nx ny nz
+    4 = nit
+    0 0 0 0 0 0 0 0 = debug_vec
+
+Since this processor must solve the full 256x256x256 problem, this
+permits 9 multi-grid levels (256,128,64,32,16,8,4,2,1), resulting in 
+a coarsest multi-grid level of a single point on the processor.
+
+
+Next, consider the same size problem but running on 2 processors.  The
+following "mg.input" file is required:
+
+    8 = top level
+    256 256 256 = nx ny nz
+    4 = nit
+    0 0 0 0 0 0 0 0 = debug_vec
+
+The algorithm for partitioning the full grid onto some power of 2 number 
+of processors is to start by splitting the last dimension of the grid
+(z dimension) in 2: the problem is now partitioned onto 2 processors.
+Next the middle dimension (y dimension) is split in 2: the problem is now
+partitioned onto 4 processors.  Next, the first dimension (x dimension) is
+split in 2: the problem is now partitioned onto 8 processors.  Next, the
+last dimension (z dimension) is split again in 2: the problem is now
+partitioned onto 16 processors.  This partitioning is repeated until all 
+of the power of 2 processors have been allocated.
+
+Thus to run the above problem on 2 processors, the grid partitioning 
+algorithm will allocate the two processors across the last dimension, 
+creating two partitions each of size 256x256x128. The coarsest level of 
+multi-grid must be a single point surrounded by a cubic number of grid 
+points.  Therefore, each of the two processor partitions will contain 4 
+coarsest multi-grid level points, each surrounded by a cube of grid points 
+of size 128x128x128, indicated by a top level of 8.
+
+
+Next, consider the same size problem but running on 4 processors.  The
+following "mg.input" file is required:
+
+    8 = top level
+    256 256 256 = nx ny nz
+    4 = nit
+    0 0 0 0 0 0 0 0 = debug_vec
+
+The partitioning algorithm will create 4 partitions, each of size
+256x128x128.  Each partition will contain 2 coarsest multi-grid level
+points each surrounded by a cube of grid points of size 128x128x128, 
+indicated by a top level of 8.
+
+
+Next, consider the same size problem but running on 16 processors.  The
+following "mg.input" file is required:
+
+    7 = top level
+    256 256 256 = nx ny nz
+    4 = nit
+    0 0 0 0 0 0 0 0 = debug_vec
+
+On each node a partition of size 128x128x64 will be created.  A maximum
+of 7 multi-grid levels (64,32,16,8,4,2,1) may be used, resulting in each 
+partition containing 4 coarsest multi-grid level points, each surrounded 
+by a cube of grid points of size 64x64x64, indicated by a top level of 7.
+
+
+
+
+Note that non-cubic problem sizes may also be considered:
+
+The four lines below are the "mg.input" file appropriate for running a
+problem of total size 256x512x512, for 20 iterations and presumes the 
+use of 32 processors (note: this is NOT a class C problem):
+
+    8 = top level
+    256 512 512 = nx ny nz
+    20 = nit
+    0 0 0 0 0 0 0 0 = debug_vec
+
+The first line of input indicates how many levels of multi-grid
+cycle will be applied to a particular subpartition.  Presuming that
+32 processors are solving this problem, a 2x4x4 processor grid is
+formed, and thus each partition on a processor is of size 128x128x128.
+Therefore, a maximum of 8 multi-grid levels may be used.  These are of
+size 128,64,32,16,8,4,2,1, with the coarsest level being a single 
+point on a given processor.
+
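Reviewer note (illustrative only, not part of the patch): the README above describes how the grid is split across a power-of-2 number of processors (z first, then y, then x, repeating) and how the "top level" value follows from the resulting partition. The Python sketch below reproduces the README's worked examples. The function names are hypothetical, and deriving the top level from the smallest partition edge is an inference from those examples rather than something the README states as a rule.

```python
# Illustrative sketch; names are hypothetical and do not appear in the patch.

def partition(nx, ny, nz, nprocs):
    """Halve the grid along z, then y, then x, cycling, once per factor of 2."""
    if nprocs < 1 or nprocs & (nprocs - 1):
        raise ValueError("nprocs must be a power of 2")
    dims = [nx, ny, nz]
    axis = 2  # start with the last (z) dimension
    remaining = nprocs
    while remaining > 1:
        dims[axis] //= 2
        axis = (axis - 1) % 3  # z -> y -> x -> z -> ...
        remaining //= 2
    return tuple(dims)


def top_level(local_dims):
    """Levels available for a partition: log2(smallest edge) + 1.

    Assumes the smallest edge is a power of 2, as in every README example;
    for such values int.bit_length() equals log2(edge) + 1.
    """
    return min(local_dims).bit_length()
```

For the 256x256x256 problem this gives a top level of 9 on 1 processor, 8 on 2, 4, and 8 processors, and 7 on 16 processors, matching the input decks listed in the README.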
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/MG/globals.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/MG/globals.h
new file mode 100644
index 0000000..6179eaa
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/MG/globals.h
@@ -0,0 +1,51 @@
+c---------------------------------------------------------------------
+c  Parameter lm (declared and set in "npbparams.h") is the log-base2 of 
+c  the edge size max for the partition on a given node, so must be changed 
+c  either to save space (if running a small case) or made bigger for larger 
+c  cases, for example, 512^3. Thus lm=7 means that the largest dimension 
+c  of a partition that can be solved on a node is 2^7 = 128. lm is set 
+c  automatically in npbparams.h
+c  Parameters ndim1, ndim2, ndim3 are the local problem dimensions. 
+c---------------------------------------------------------------------
+
+      include 'npbparams.h'
+
+      integer nm      ! actual dimension including ghost cells for communications
+c  ***  type of nv, nr and ir is set in npbparams.h
+c     >      , nv      ! size of rhs array
+c     >      , nr      ! size of residual array
+     >      , maxlevel! maximum number of levels
+
+      parameter( nm=2+2**lm, maxlevel=(lt_default+1) )
+      parameter( nv=one*(2+2**ndim1)*(2+2**ndim2)*(2+2**ndim3) )
+      parameter( nr = ((nv+nm**2+5*nm+7*lm+6)/7)*8 )
+c---------------------------------------------------------------------
+      integer  nx(maxlevel),ny(maxlevel),nz(maxlevel)
+      common /mg3/ nx,ny,nz
+
+      character class
+      common /ClassType/class
+
+      integer debug_vec(0:7)
+      common /my_debug/ debug_vec
+
+      integer m1(maxlevel), m2(maxlevel), m3(maxlevel)
+      integer lt, lb
+      common /fap/ ir(maxlevel),m1,m2,m3,lt,lb
+
+c---------------------------------------------------------------------
+c  Set at m=1024, can handle cases up to 1024^3
+c---------------------------------------------------------------------
+      integer m
+c      parameter( m=1037 )
+      parameter( m=nm+1 )
+
+      logical timeron
+      common /timers/ timeron
+      integer T_init, T_bench, T_psinv, T_resid, T_rprj3, T_interp,
+     >        T_norm2, T_mg3P, T_resid2, T_comm3, T_last
+      parameter (T_init=1, T_bench=2, T_mg3P=3,
+     >        T_psinv=4, T_resid=5, T_resid2=6, T_rprj3=7,
+     >        T_interp=8, T_norm2=9, T_comm3=10, T_last=10)
+
+
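Reviewer note (illustrative only, not part of the patch): the `parameter` statements in `globals.h` derive the array sizes `nm`, `nv`, `nr`, and `m` from values defined in `npbparams.h`, which is not shown here. The Python sketch below evaluates the same formulas so the sizes can be checked by hand. The function name is hypothetical, and the inputs in the usage line (all set to 5, with `one` equal to 1) are assumed example values, not values read from any generated `npbparams.h`.

```python
# Illustrative sketch; the function name is hypothetical.

def mg_array_sizes(lm, ndim1, ndim2, ndim3, lt_default, one=1):
    """Evaluate the globals.h parameter formulas with integer arithmetic."""
    nm = 2 + 2**lm                      # edge length including ghost cells
    maxlevel = lt_default + 1
    nv = one * (2 + 2**ndim1) * (2 + 2**ndim2) * (2 + 2**ndim3)
    # Fortran integer division truncates; for positive values that is floor.
    nr = ((nv + nm**2 + 5 * nm + 7 * lm + 6) // 7) * 8
    m = nm + 1
    return {"nm": nm, "maxlevel": maxlevel, "nv": nv, "nr": nr, "m": m}


# Assumed example inputs (a 32^3 local grid), not taken from npbparams.h.
sizes = mg_array_sizes(lm=5, ndim1=5, ndim2=5, ndim3=5, lt_default=5)
```

With these assumed inputs, `nm` is 34, `nv` is 34 cubed (39304), and `nr` comes out as 46480.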
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/MG/mg.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/MG/mg.f
new file mode 100644
index 0000000..ce339d0
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/MG/mg.f
@@ -0,0 +1,1452 @@
+!-------------------------------------------------------------------------!
+!                                                                         !
+!        N  A  S     P A R A L L E L     B E N C H M A R K S  3.3         !
+!                                                                         !
+!                       O p e n M P     V E R S I O N                     !
+!                                                                         !
+!                                   M G                                   !
+!                                                                         !
+!-------------------------------------------------------------------------!
+!                                                                         !
+!    This benchmark is an OpenMP version of the NPB MG code.              !
+!    It is described in NAS Technical Report 99-011.                      !
+!                                                                         !
+!    Permission to use, copy, distribute and modify this software         !
+!    for any purpose with or without fee is hereby granted.  We           !
+!    request, however, that all derived work reference the NAS            !
+!    Parallel Benchmarks 3.3. This software is provided "as is"           !
+!    without express or implied warranty.                                 !
+!                                                                         !
+!    Information on NPB 3.3, including the technical report, the          !
+!    original specifications, source code, results and information        !
+!    on how to submit new results, is available at:                       !
+!                                                                         !
+!           http://www.nas.nasa.gov/Software/NPB/                         !
+!                                                                         !
+!    Send comments or suggestions to  npb@nas.nasa.gov                    !
+!                                                                         !
+!          NAS Parallel Benchmarks Group                                  !
+!          NASA Ames Research Center                                      !
+!          Mail Stop: T27A-1                                              !
+!          Moffett Field, CA   94035-1000                                 !
+!                                                                         !
+!          E-mail:  npb@nas.nasa.gov                                      !
+!          Fax:     (650) 604-3957                                        !
+!                                                                         !
+!-------------------------------------------------------------------------!
+
+
+c---------------------------------------------------------------------
+c
+c Authors: E. Barszcz
+c          P. Frederickson
+c          A. Woo
+c          M. Yarrow
+c          H. Jin
+c
+c---------------------------------------------------------------------
+
+
+c---------------------------------------------------------------------
+      program mg
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'globals.h'
+
+c---------------------------------------------------------------------------c
+c k is the current level. It is passed down through subroutine args
+c and is NOT global. The variable "it" is the current iteration.
+c---------------------------------------------------------------------------c
+
+      integer k, it
+      
+      external timer_read
+      double precision t, tinit, mflops, timer_read
+
+c---------------------------------------------------------------------------c
+c These arrays are in common because they are quite large
+c and probably shouldn't be allocated on the stack. They
+c are always passed as subroutine args. 
+c---------------------------------------------------------------------------c
+
+      double precision u(nr),v(nv),r(nr),a(0:3),c(0:3)
+      common /noautom/ u,v,r   
+
+      double precision rnm2, rnmu, old2, oldu, epsilon
+      integer n1, n2, n3, nit
+      double precision nn, verify_value, err
+      logical verified
+
+      integer i, fstatus
+      character t_names(t_last)*8
+      double precision tmax
+!$    integer  omp_get_max_threads
+!$    external omp_get_max_threads
+
+      do i = T_init, T_last
+         call timer_clear(i)
+      end do
+
+      call timer_start(T_init)
+
+c---------------------------------------------------------------------
+c Read in and broadcast input data
+c---------------------------------------------------------------------
+
+      open(unit=7,file='timer.flag', status='old', iostat=fstatus)
+      if (fstatus .eq. 0) then
+         timeron = .true.
+         t_names(t_init) = 'init'
+         t_names(t_bench) = 'benchmk'
+         t_names(t_mg3P) = 'mg3P'
+         t_names(t_psinv) = 'psinv'
+         t_names(t_resid) = 'resid'
+         t_names(t_rprj3) = 'rprj3'
+         t_names(t_interp) = 'interp'
+         t_names(t_norm2) = 'norm2'
+         t_names(t_comm3) = 'comm3'
+         close(7)
+      else
+         timeron = .false.
+      endif
+
+      write (*, 1000) 
+
+      open(unit=7,file="mg.input", status="old", iostat=fstatus)
+      if (fstatus .eq. 0) then
+         write(*,50) 
+ 50      format(' Reading from input file mg.input')
+         read(7,*) lt
+         read(7,*) nx(lt), ny(lt), nz(lt)
+         read(7,*) nit
+         read(7,*) (debug_vec(i),i=0,7)
+      else
+         write(*,51) 
+ 51      format(' No input file. Using compiled defaults ')
+         lt = lt_default
+         nit = nit_default
+         nx(lt) = nx_default
+         ny(lt) = ny_default
+         nz(lt) = nz_default
+         do i = 0,7
+            debug_vec(i) = debug_default
+         end do
+      endif
+
+
+      if ( (nx(lt) .ne. ny(lt)) .or. (nx(lt) .ne. nz(lt)) ) then
+         Class = 'U' 
+      else if( nx(lt) .eq. 32 .and. nit .eq. 4 ) then
+         Class = 'S'
+      else if( nx(lt) .eq. 128 .and. nit .eq. 4 ) then
+         Class = 'W'
+      else if( nx(lt) .eq. 256 .and. nit .eq. 4 ) then  
+         Class = 'A'
+      else if( nx(lt) .eq. 256 .and. nit .eq. 20 ) then
+         Class = 'B'
+      else if( nx(lt) .eq. 512 .and. nit .eq. 20 ) then  
+         Class = 'C'
+      else if( nx(lt) .eq. 1024 .and. nit .eq. 50 ) then  
+         Class = 'D'
+      else if( nx(lt) .eq. 2048 .and. nit .eq. 50 ) then  
+         Class = 'E'
+      else
+         Class = 'U'
+      endif
+
+c---------------------------------------------------------------------
+c  Use these for debug info:
+c---------------------------------------------------------------------
+c     debug_vec(0) = 1 !=> report all norms
+c     debug_vec(1) = 1 !=> some setup information
+c     debug_vec(1) = 2 !=> more setup information
+c     debug_vec(2) = k => at level k or below, show result of resid
+c     debug_vec(3) = k => at level k or below, show result of psinv
+c     debug_vec(4) = k => at level k or below, show result of rprj
+c     debug_vec(5) = k => at level k or below, show result of interp
+c     debug_vec(6) = 1 => (unused)
+c     debug_vec(7) = 1 => (unused)
+c---------------------------------------------------------------------
+      a(0) = -8.0D0/3.0D0 
+      a(1) =  0.0D0 
+      a(2) =  1.0D0/6.0D0 
+      a(3) =  1.0D0/12.0D0
+      
+      if(Class .eq. 'A' .or. Class .eq. 'S'.or. Class .eq.'W') then
+c---------------------------------------------------------------------
+c     Coefficients for the S(a) smoother
+c---------------------------------------------------------------------
+         c(0) =  -3.0D0/8.0D0
+         c(1) =  +1.0D0/32.0D0
+         c(2) =  -1.0D0/64.0D0
+         c(3) =   0.0D0
+      else
+c---------------------------------------------------------------------
+c     Coefficients for the S(b) smoother
+c---------------------------------------------------------------------
+         c(0) =  -3.0D0/17.0D0
+         c(1) =  +1.0D0/33.0D0
+         c(2) =  -1.0D0/61.0D0
+         c(3) =   0.0D0
+      endif
+      lb = 1
+      k  = lt
+
+      call setup(n1,n2,n3,k)
+      call zero3(u,n1,n2,n3)
+      call zran3(v,n1,n2,n3,nx(lt),ny(lt),k)
+
+      call norm2u3(v,n1,n2,n3,rnm2,rnmu,nx(lt),ny(lt),nz(lt))
+c     write(*,*)
+c     write(*,*)' norms of random v are'
+c     write(*,600) 0, rnm2, rnmu
+c     write(*,*)' about to evaluate resid, k=',k
+
+      write (*, 1001) nx(lt),ny(lt),nz(lt), Class
+      write (*, 1002) nit
+!$    write (*, 1003) omp_get_max_threads()
+      write (*, *)
+
+ 1000 format(//,' NAS Parallel Benchmarks (NPB3.3-OMP)',
+     >          ' - MG Benchmark', /)
+ 1001 format(' Size: ', i4, 'x', i4, 'x', i4, '  (class ', A, ')' )
+ 1002 format(' Iterations:                  ', i5)
+ 1003 format(' Number of available threads: ', i5)
+
+
+      call resid(u,v,r,n1,n2,n3,a,k)
+      call norm2u3(r,n1,n2,n3,rnm2,rnmu,nx(lt),ny(lt),nz(lt))
+      old2 = rnm2
+      oldu = rnmu
+
+c---------------------------------------------------------------------
+c     One iteration for startup
+c---------------------------------------------------------------------
+      call mg3P(u,v,r,a,c,n1,n2,n3,k)
+      call resid(u,v,r,n1,n2,n3,a,k)
+      call setup(n1,n2,n3,k)
+      call zero3(u,n1,n2,n3)
+      call zran3(v,n1,n2,n3,nx(lt),ny(lt),k)
+
+      call timer_stop(T_init)
+      tinit = timer_read(T_init)
+
+      write( *,'(A,F15.3,A/)' ) 
+     >     ' Initialization time: ',tinit, ' seconds'
+
+#ifdef HOOKS
+      call roi_begin
+#endif
+
+      do i = T_bench, T_last
+         call timer_clear(i)
+      end do
+
+      call timer_start(T_bench)
+
+      if (timeron) call timer_start(T_resid2)
+      call resid(u,v,r,n1,n2,n3,a,k)
+      if (timeron) call timer_stop(T_resid2)
+      call norm2u3(r,n1,n2,n3,rnm2,rnmu,nx(lt),ny(lt),nz(lt))
+      old2 = rnm2
+      oldu = rnmu
+
+      do  it=1,nit
+         if (it.eq.1 .or. it.eq.nit .or. mod(it,5).eq.0) then
+            write(*,80) it
+   80       format('  iter ',i3)
+         endif
+         if (timeron) call timer_start(T_mg3P)
+         call mg3P(u,v,r,a,c,n1,n2,n3,k)
+         if (timeron) call timer_stop(T_mg3P)
+         if (timeron) call timer_start(T_resid2)
+         call resid(u,v,r,n1,n2,n3,a,k)
+         if (timeron) call timer_stop(T_resid2)
+      enddo
+
+
+      call norm2u3(r,n1,n2,n3,rnm2,rnmu,nx(lt),ny(lt),nz(lt))
+
+      call timer_stop(T_bench)
+
+      t = timer_read(T_bench)
+
+#ifdef HOOKS
+      call roi_end
+#endif
+
+      verified = .FALSE.
+      verify_value = 0.0
+
+      write(*,100)
+ 100  format(/' Benchmark completed ')
+
+      epsilon = 1.d-8
+      if (Class .ne. 'U') then
+         if(Class.eq.'S') then
+            verify_value = 0.5307707005734d-04
+         elseif(Class.eq.'W') then
+            verify_value = 0.6467329375339d-05
+         elseif(Class.eq.'A') then
+            verify_value = 0.2433365309069d-05
+         elseif(Class.eq.'B') then
+            verify_value = 0.1800564401355d-05
+         elseif(Class.eq.'C') then
+            verify_value = 0.5706732285740d-06
+         elseif(Class.eq.'D') then
+            verify_value = 0.1583275060440d-09
+         elseif(Class.eq.'E') then
+            verify_value = 0.5630442584711d-10
+         endif
+
+         err = abs( rnm2 - verify_value ) / verify_value
+         if( err .le. epsilon ) then
+            verified = .TRUE.
+            write(*, 200)
+            write(*, 201) rnm2
+            write(*, 202) err
+ 200        format(' VERIFICATION SUCCESSFUL ')
+ 201        format(' L2 Norm is ', E20.13)
+ 202        format(' Error is   ', E20.13)
+         else
+            verified = .FALSE.
+            write(*, 300) 
+            write(*, 301) rnm2
+            write(*, 302) verify_value
+ 300        format(' VERIFICATION FAILED')
+ 301        format(' L2 Norm is             ', E20.13)
+ 302        format(' The correct L2 Norm is ', E20.13)
+         endif
+      else
+         verified = .FALSE.
+         write (*, 400)
+         write (*, 401)
+         write (*, 201) rnm2
+ 400     format(' Problem size unknown')
+ 401     format(' NO VERIFICATION PERFORMED')
+      endif
+
+      nn = 1.0d0*nx(lt)*ny(lt)*nz(lt)
+
+      if( t .ne. 0. ) then
+         mflops = 58.*nit*nn*1.0D-6 /t
+      else
+         mflops = 0.0
+      endif
+
+      call print_results('MG', class, nx(lt), ny(lt), nz(lt), 
+     >                   nit, t,
+     >                   mflops, '          floating point', 
+     >                   verified, npbversion, compiletime,
+     >                   cs1, cs2, cs3, cs4, cs5, cs6, cs7)
+
+
+ 600  format( i4, 2e19.12)
+
+c---------------------------------------------------------------------
+c      More timers
+c---------------------------------------------------------------------
+      if (.not.timeron) goto 999
+
+      tmax = timer_read(t_bench)
+      if (tmax .eq. 0.0) tmax = 1.0
+
+      write(*,800)
+ 800  format('  SECTION   Time (secs)')
+      do i=t_bench, t_last
+         t = timer_read(i)
+         if (i.eq.t_resid2) then
+            t = timer_read(T_resid) - t
+            write(*,820) 'mg-resid', t, t*100./tmax
+         else
+            write(*,810) t_names(i), t, t*100./tmax
+         endif
+ 810     format(2x,a8,':',f9.3,'  (',f6.2,'%)')
+ 820     format('    --> ',a8,':',f9.3,'  (',f6.2,'%)')
+      end do
+
+ 999  continue
+
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine setup(n1,n2,n3,k)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'globals.h'
+
+      integer  is1, is2, is3, ie1, ie2, ie3
+      common /grid/ is1,is2,is3,ie1,ie2,ie3
+
+      integer n1,n2,n3,k
+      integer j
+
+      integer ax, mi(3,maxlevel)
+      integer ng(3,maxlevel)
+
+
+      ng(1,lt) = nx(lt)
+      ng(2,lt) = ny(lt)
+      ng(3,lt) = nz(lt)
+      do  ax=1,3
+         do  k=lt-1,1,-1
+            ng(ax,k) = ng(ax,k+1)/2
+         enddo
+      enddo
+ 61   format(10i4)
+      do  k=lt,1,-1
+         nx(k) = ng(1,k)
+         ny(k) = ng(2,k)
+         nz(k) = ng(3,k)
+      enddo
+
+      do  k = lt,1,-1
+         do  ax = 1,3
+            mi(ax,k) = 2 + ng(ax,k) 
+         enddo
+
+         m1(k) = mi(1,k)
+         m2(k) = mi(2,k)
+         m3(k) = mi(3,k)
+
+      enddo
+
+      k = lt
+      is1 = 2 + ng(1,k) - ng(1,lt)
+      ie1 = 1 + ng(1,k)
+      n1 = 3 + ie1 - is1
+      is2 = 2 + ng(2,k) - ng(2,lt)
+      ie2 = 1 + ng(2,k) 
+      n2 = 3 + ie2 - is2
+      is3 = 2 + ng(3,k) - ng(3,lt)
+      ie3 = 1 + ng(3,k) 
+      n3 = 3 + ie3 - is3
+
+
+      ir(lt)=1
+      do  j = lt-1, 1, -1
+         ir(j)=ir(j+1)+one*m1(j+1)*m2(j+1)*m3(j+1)
+      enddo
+
+
+      if( debug_vec(1) .ge. 1 )then
+         write(*,*)' in setup, '
+         write(*,*)' k  lt  nx  ny  nz ',
+     >        ' n1  n2  n3 is1 is2 is3 ie1 ie2 ie3'
+         write(*,9) k,lt,ng(1,k),ng(2,k),ng(3,k),
+     >              n1,n2,n3,is1,is2,is3,ie1,ie2,ie3
+ 9       format(15i4)
+      endif
+
+      k = lt
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine mg3P(u,v,r,a,c,n1,n2,n3,k)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     multigrid V-cycle routine
+c---------------------------------------------------------------------
+      implicit none
+
+      include 'globals.h'
+
+      integer n1, n2, n3, k
+      double precision u(nr),v(nv),r(nr)
+      double precision a(0:3),c(0:3)
+
+      integer j
+
+c---------------------------------------------------------------------
+c     down cycle.
+c     restrict the residual from the fine grid to the coarse
+c---------------------------------------------------------------------
+
+      do  k= lt, lb+1 , -1
+         j = k-1
+         call rprj3(r(ir(k)),m1(k),m2(k),m3(k),
+     >        r(ir(j)),m1(j),m2(j),m3(j),k)
+      enddo
+
+      k = lb
+c---------------------------------------------------------------------
+c     compute an approximate solution on the coarsest grid
+c---------------------------------------------------------------------
+      call zero3(u(ir(k)),m1(k),m2(k),m3(k))
+      call psinv(r(ir(k)),u(ir(k)),m1(k),m2(k),m3(k),c,k)
+
+      do  k = lb+1, lt-1     
+          j = k-1
+c---------------------------------------------------------------------
+c        prolongate from level k-1  to k
+c---------------------------------------------------------------------
+         call zero3(u(ir(k)),m1(k),m2(k),m3(k))
+         call interp(u(ir(j)),m1(j),m2(j),m3(j),
+     >               u(ir(k)),m1(k),m2(k),m3(k),k)
+c---------------------------------------------------------------------
+c        compute residual for level k
+c---------------------------------------------------------------------
+         call resid(u(ir(k)),r(ir(k)),r(ir(k)),m1(k),m2(k),m3(k),a,k)
+c---------------------------------------------------------------------
+c        apply smoother
+c---------------------------------------------------------------------
+         call psinv(r(ir(k)),u(ir(k)),m1(k),m2(k),m3(k),c,k)
+      enddo
+ 200  continue
+      j = lt - 1
+      k = lt
+      call interp(u(ir(j)),m1(j),m2(j),m3(j),u,n1,n2,n3,k)
+      call resid(u,v,r,n1,n2,n3,a,k)
+      call psinv(r,u,n1,n2,n3,c,k)
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine psinv( r,u,n1,n2,n3,c,k)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     psinv applies an approximate inverse as smoother:  u = u + Cr
+c
+c     This  implementation costs  15A + 4M per result, where
+c     A and M denote the costs of Addition and Multiplication.  
+c     Presuming coefficient c(3) is zero (the NPB assumes this,
+c     but it is thus not a general case), 2A + 1M may be eliminated,
+c     resulting in 13A + 3M.
+c     Note that this vectorizes, and is also fine for cache 
+c     based machines.  
+c---------------------------------------------------------------------
+      implicit none
+
+      include 'globals.h'
+
+      integer n1,n2,n3,k
+      double precision u(n1,n2,n3),r(n1,n2,n3),c(0:3)
+      integer i3, i2, i1
+
+      double precision r1(m), r2(m)
+
+      if (timeron) call timer_start(T_psinv)
+!$omp parallel do default(shared) private(i1,i2,i3,r1,r2)
+      do i3=2,n3-1
+         do i2=2,n2-1
+            do i1=1,n1
+               r1(i1) = r(i1,i2-1,i3) + r(i1,i2+1,i3)
+     >                + r(i1,i2,i3-1) + r(i1,i2,i3+1)
+               r2(i1) = r(i1,i2-1,i3-1) + r(i1,i2+1,i3-1)
+     >                + r(i1,i2-1,i3+1) + r(i1,i2+1,i3+1)
+            enddo
+            do i1=2,n1-1
+               u(i1,i2,i3) = u(i1,i2,i3)
+     >                     + c(0) * r(i1,i2,i3)
+     >                     + c(1) * ( r(i1-1,i2,i3) + r(i1+1,i2,i3)
+     >                              + r1(i1) )
+     >                     + c(2) * ( r2(i1) + r1(i1-1) + r1(i1+1) )
+c---------------------------------------------------------------------
+c  Assume c(3) = 0    (Enable line below if c(3) not= 0)
+c---------------------------------------------------------------------
+c    >                     + c(3) * ( r2(i1-1) + r2(i1+1) )
+c---------------------------------------------------------------------
+            enddo
+         enddo
+      enddo
+      if (timeron) call timer_stop(T_psinv)
+
+c---------------------------------------------------------------------
+c     exchange boundary points
+c---------------------------------------------------------------------
+      call comm3(u,n1,n2,n3,k)
+
+      if( debug_vec(0) .ge. 1 )then
+         call rep_nrm(u,n1,n2,n3,'   psinv',k)
+      endif
+
+      if( debug_vec(3) .ge. k )then
+         call showall(u,n1,n2,n3)
+      endif
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine resid( u,v,r,n1,n2,n3,a,k )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     resid computes the residual:  r = v - Au
+c
+c     This  implementation costs  15A + 4M per result, where
+c     A and M denote the costs of Addition (or Subtraction) and 
+c     Multiplication, respectively. 
+c     Presuming coefficient a(1) is zero (the NPB assumes this,
+c     but it is thus not a general case), 3A + 1M may be eliminated,
+c     resulting in 12A + 3M.
+c     Note that this vectorizes, and is also fine for cache 
+c     based machines.  
+c---------------------------------------------------------------------
+      implicit none
+
+      include 'globals.h'
+
+      integer n1,n2,n3,k
+      double precision u(n1,n2,n3),v(n1,n2,n3),r(n1,n2,n3),a(0:3)
+      integer i3, i2, i1
+      double precision u1(m), u2(m)
+
+      if (timeron) call timer_start(T_resid)
+!$omp parallel do default(shared) private(i1,i2,i3,u1,u2)
+      do i3=2,n3-1
+         do i2=2,n2-1
+            do i1=1,n1
+               u1(i1) = u(i1,i2-1,i3) + u(i1,i2+1,i3)
+     >                + u(i1,i2,i3-1) + u(i1,i2,i3+1)
+               u2(i1) = u(i1,i2-1,i3-1) + u(i1,i2+1,i3-1)
+     >                + u(i1,i2-1,i3+1) + u(i1,i2+1,i3+1)
+            enddo
+            do i1=2,n1-1
+               r(i1,i2,i3) = v(i1,i2,i3)
+     >                     - a(0) * u(i1,i2,i3)
+c---------------------------------------------------------------------
+c  Assume a(1) = 0      (Enable 2 lines below if a(1) not= 0)
+c---------------------------------------------------------------------
+c    >                     - a(1) * ( u(i1-1,i2,i3) + u(i1+1,i2,i3)
+c    >                              + u1(i1) )
+c---------------------------------------------------------------------
+     >                     - a(2) * ( u2(i1) + u1(i1-1) + u1(i1+1) )
+     >                     - a(3) * ( u2(i1-1) + u2(i1+1) )
+            enddo
+         enddo
+      enddo
+      if (timeron) call timer_stop(T_resid)
+
+c---------------------------------------------------------------------
+c     exchange boundary data
+c---------------------------------------------------------------------
+      call comm3(r,n1,n2,n3,k)
+
+      if( debug_vec(0) .ge. 1 )then
+         call rep_nrm(r,n1,n2,n3,'   resid',k)
+      endif
+
+      if( debug_vec(2) .ge. k )then
+         call showall(r,n1,n2,n3)
+      endif
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine rprj3( r,m1k,m2k,m3k,s,m1j,m2j,m3j,k )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     rprj3 projects onto the next coarser grid, 
+c     using a trilinear Finite Element projection:  s = r' = P r
+c     
+c     This  implementation costs  20A + 4M per result, where
+c     A and M denote the costs of Addition and Multiplication.  
+c     Note that this vectorizes, and is also fine for cache 
+c     based machines.  
+c---------------------------------------------------------------------
+      implicit none
+
+      include 'globals.h'
+
+      integer m1k, m2k, m3k, m1j, m2j, m3j,k
+      double precision r(m1k,m2k,m3k), s(m1j,m2j,m3j)
+      integer j3, j2, j1, i3, i2, i1, d1, d2, d3, j
+
+      double precision x1(m), y1(m), x2,y2
+
+      if (timeron) call timer_start(T_rprj3)
+      if(m1k.eq.3)then
+        d1 = 2
+      else
+        d1 = 1
+      endif
+
+      if(m2k.eq.3)then
+        d2 = 2
+      else
+        d2 = 1
+      endif
+
+      if(m3k.eq.3)then
+        d3 = 2
+      else
+        d3 = 1
+      endif
+
+!$omp parallel do default(shared)
+!$omp& private(j1,j2,j3,i1,i2,i3,x1,y1,x2,y2)
+      do  j3=2,m3j-1
+         i3 = 2*j3-d3
+         do  j2=2,m2j-1
+            i2 = 2*j2-d2
+
+            do j1=2,m1j
+              i1 = 2*j1-d1
+              x1(i1-1) = r(i1-1,i2-1,i3  ) + r(i1-1,i2+1,i3  )
+     >                 + r(i1-1,i2,  i3-1) + r(i1-1,i2,  i3+1)
+              y1(i1-1) = r(i1-1,i2-1,i3-1) + r(i1-1,i2-1,i3+1)
+     >                 + r(i1-1,i2+1,i3-1) + r(i1-1,i2+1,i3+1)
+            enddo
+
+            do  j1=2,m1j-1
+              i1 = 2*j1-d1
+              y2 = r(i1,  i2-1,i3-1) + r(i1,  i2-1,i3+1)
+     >           + r(i1,  i2+1,i3-1) + r(i1,  i2+1,i3+1)
+              x2 = r(i1,  i2-1,i3  ) + r(i1,  i2+1,i3  )
+     >           + r(i1,  i2,  i3-1) + r(i1,  i2,  i3+1)
+              s(j1,j2,j3) =
+     >               0.5D0 * r(i1,i2,i3)
+     >             + 0.25D0 * ( r(i1-1,i2,i3) + r(i1+1,i2,i3) + x2)
+     >             + 0.125D0 * ( x1(i1-1) + x1(i1+1) + y2)
+     >             + 0.0625D0 * ( y1(i1-1) + y1(i1+1) )
+            enddo
+
+         enddo
+      enddo
+      if (timeron) call timer_stop(T_rprj3)
+
+
+      j = k-1
+      call comm3(s,m1j,m2j,m3j,j)
+
+      if( debug_vec(0) .ge. 1 )then
+         call rep_nrm(s,m1j,m2j,m3j,'   rprj3',k-1)
+      endif
+
+      if( debug_vec(4) .ge. k )then
+         call showall(s,m1j,m2j,m3j)
+      endif
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine interp( z,mm1,mm2,mm3,u,n1,n2,n3,k )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     interp adds the trilinear interpolation of the correction
+c     from the coarser grid to the current approximation:  u = u + Qu'
+c     
+c     Observe that this  implementation costs  16A + 4M, where
+c     A and M denote the costs of Addition and Multiplication.  
+c     Note that this vectorizes, and is also fine for cache 
+c     based machines.  Vector machines may get slightly better 
+c     performance however, with 8 separate "do i1" loops, rather than 4.
+c---------------------------------------------------------------------
+      implicit none
+
+      include 'globals.h'
+
+      integer mm1, mm2, mm3, n1, n2, n3,k
+      double precision z(mm1,mm2,mm3),u(n1,n2,n3)
+      integer i3, i2, i1, d1, d2, d3, t1, t2, t3
+
+c note that m = 1037 in globals.h, but here it only needs to be
+c 535 to handle up to 1024^3
+c      integer m
+c      parameter( m=535 )
+      double precision z1(m),z2(m),z3(m)
+
+      if (timeron) call timer_start(T_interp)
+      if( n1 .ne. 3 .and. n2 .ne. 3 .and. n3 .ne. 3 ) then
+
+!$omp parallel do default(shared) private(i1,i2,i3,z1,z2,z3)
+         do  i3=1,mm3-1
+            do  i2=1,mm2-1
+
+               do i1=1,mm1
+                  z1(i1) = z(i1,i2+1,i3) + z(i1,i2,i3)
+                  z2(i1) = z(i1,i2,i3+1) + z(i1,i2,i3)
+                  z3(i1) = z(i1,i2+1,i3+1) + z(i1,i2,i3+1) + z1(i1)
+               enddo
+
+               do  i1=1,mm1-1
+                  u(2*i1-1,2*i2-1,2*i3-1)=u(2*i1-1,2*i2-1,2*i3-1)
+     >                 +z(i1,i2,i3)
+                  u(2*i1,2*i2-1,2*i3-1)=u(2*i1,2*i2-1,2*i3-1)
+     >                 +0.5d0*(z(i1+1,i2,i3)+z(i1,i2,i3))
+               enddo
+               do i1=1,mm1-1
+                  u(2*i1-1,2*i2,2*i3-1)=u(2*i1-1,2*i2,2*i3-1)
+     >                 +0.5d0 * z1(i1)
+                  u(2*i1,2*i2,2*i3-1)=u(2*i1,2*i2,2*i3-1)
+     >                 +0.25d0*( z1(i1) + z1(i1+1) )
+               enddo
+               do i1=1,mm1-1
+                  u(2*i1-1,2*i2-1,2*i3)=u(2*i1-1,2*i2-1,2*i3)
+     >                 +0.5d0 * z2(i1)
+                  u(2*i1,2*i2-1,2*i3)=u(2*i1,2*i2-1,2*i3)
+     >                 +0.25d0*( z2(i1) + z2(i1+1) )
+               enddo
+               do i1=1,mm1-1
+                  u(2*i1-1,2*i2,2*i3)=u(2*i1-1,2*i2,2*i3)
+     >                 +0.25d0* z3(i1)
+                  u(2*i1,2*i2,2*i3)=u(2*i1,2*i2,2*i3)
+     >                 +0.125d0*( z3(i1) + z3(i1+1) )
+               enddo
+            enddo
+         enddo
+
+      else
+
+         if(n1.eq.3)then
+            d1 = 2
+            t1 = 1
+         else
+            d1 = 1
+            t1 = 0
+         endif
+         
+         if(n2.eq.3)then
+            d2 = 2
+            t2 = 1
+         else
+            d2 = 1
+            t2 = 0
+         endif
+         
+         if(n3.eq.3)then
+            d3 = 2
+            t3 = 1
+         else
+            d3 = 1
+            t3 = 0
+         endif
+         
+!$omp parallel default(shared) private(i1,i2,i3)
+!$omp do
+         do  i3=d3,mm3-1
+            do  i2=d2,mm2-1
+               do  i1=d1,mm1-1
+                  u(2*i1-d1,2*i2-d2,2*i3-d3)=u(2*i1-d1,2*i2-d2,2*i3-d3)
+     >                 +z(i1,i2,i3)
+               enddo
+               do  i1=1,mm1-1
+                  u(2*i1-t1,2*i2-d2,2*i3-d3)=u(2*i1-t1,2*i2-d2,2*i3-d3)
+     >                 +0.5D0*(z(i1+1,i2,i3)+z(i1,i2,i3))
+               enddo
+            enddo
+            do  i2=1,mm2-1
+               do  i1=d1,mm1-1
+                  u(2*i1-d1,2*i2-t2,2*i3-d3)=u(2*i1-d1,2*i2-t2,2*i3-d3)
+     >                 +0.5D0*(z(i1,i2+1,i3)+z(i1,i2,i3))
+               enddo
+               do  i1=1,mm1-1
+                  u(2*i1-t1,2*i2-t2,2*i3-d3)=u(2*i1-t1,2*i2-t2,2*i3-d3)
+     >                 +0.25D0*(z(i1+1,i2+1,i3)+z(i1+1,i2,i3)
+     >                 +z(i1,  i2+1,i3)+z(i1,  i2,i3))
+               enddo
+            enddo
+         enddo
+
+!$omp do
+         do  i3=1,mm3-1
+            do  i2=d2,mm2-1
+               do  i1=d1,mm1-1
+                  u(2*i1-d1,2*i2-d2,2*i3-t3)=u(2*i1-d1,2*i2-d2,2*i3-t3)
+     >                 +0.5D0*(z(i1,i2,i3+1)+z(i1,i2,i3))
+               enddo
+               do  i1=1,mm1-1
+                  u(2*i1-t1,2*i2-d2,2*i3-t3)=u(2*i1-t1,2*i2-d2,2*i3-t3)
+     >                 +0.25D0*(z(i1+1,i2,i3+1)+z(i1,i2,i3+1)
+     >                 +z(i1+1,i2,i3  )+z(i1,i2,i3  ))
+               enddo
+            enddo
+            do  i2=1,mm2-1
+               do  i1=d1,mm1-1
+                  u(2*i1-d1,2*i2-t2,2*i3-t3)=u(2*i1-d1,2*i2-t2,2*i3-t3)
+     >                 +0.25D0*(z(i1,i2+1,i3+1)+z(i1,i2,i3+1)
+     >                 +z(i1,i2+1,i3  )+z(i1,i2,i3  ))
+               enddo
+               do  i1=1,mm1-1
+                  u(2*i1-t1,2*i2-t2,2*i3-t3)=u(2*i1-t1,2*i2-t2,2*i3-t3)
+     >                 +0.125D0*(z(i1+1,i2+1,i3+1)+z(i1+1,i2,i3+1)
+     >                 +z(i1  ,i2+1,i3+1)+z(i1  ,i2,i3+1)
+     >                 +z(i1+1,i2+1,i3  )+z(i1+1,i2,i3  )
+     >                 +z(i1  ,i2+1,i3  )+z(i1  ,i2,i3  ))
+               enddo
+            enddo
+         enddo
+!$omp end do nowait
+!$omp end parallel
+
+      endif
+      if (timeron) call timer_stop(T_interp)
+
+      if( debug_vec(0) .ge. 1 )then
+         call rep_nrm(z,mm1,mm2,mm3,'z: inter',k-1)
+         call rep_nrm(u,n1,n2,n3,'u: inter',k)
+      endif
+
+      if( debug_vec(5) .ge. k )then
+         call showall(z,mm1,mm2,mm3)
+         call showall(u,n1,n2,n3)
+      endif
+
+      return 
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine norm2u3(r,n1,n2,n3,rnm2,rnmu,nx,ny,nz)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     norm2u3 evaluates approximations to the L2 norm and the
+c     uniform (or L-infinity or Chebyshev) norm, under the
+c     assumption that the boundaries are periodic or zero.  Add the
+c     boundaries in with half weight (quarter weight on the edges
+c     and eighth weight at the corners) for inhomogeneous boundaries.
+c---------------------------------------------------------------------
+      implicit none
+
+
+      integer n1, n2, n3, nx, ny, nz
+      double precision rnm2, rnmu, r(n1,n2,n3)
+      double precision s, a
+      integer i3, i2, i1
+
+      double precision dn
+
+      logical timeron
+      common /timers/ timeron
+      integer T_norm2
+      parameter (T_norm2=9)
+
+      if (timeron) call timer_start(T_norm2)
+      dn = 1.0d0*nx*ny*nz
+
+      s=0.0D0
+      rnmu = 0.0D0
+!$omp parallel do default(shared) private(i1,i2,i3,a)
+!$omp& reduction(+:s) reduction(max:rnmu)
+      do  i3=2,n3-1
+         do  i2=2,n2-1
+            do  i1=2,n1-1
+               s=s+r(i1,i2,i3)**2
+               a=abs(r(i1,i2,i3))
+               rnmu=dmax1(rnmu,a)
+            enddo
+         enddo
+      enddo
+
+      rnm2=sqrt( s / dn )
+      if (timeron) call timer_stop(T_norm2)
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine rep_nrm(u,n1,n2,n3,title,kk)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     report on norm
+c---------------------------------------------------------------------
+      implicit none
+
+      include 'globals.h'
+
+      integer n1, n2, n3, kk
+      double precision u(n1,n2,n3)
+      character*8 title
+
+      double precision rnm2, rnmu
+
+
+      call norm2u3(u,n1,n2,n3,rnm2,rnmu,nx(kk),ny(kk),nz(kk))
+      write(*,7)kk,title,rnm2,rnmu
+ 7    format(' Level',i2,' in ',a8,': norms =',D21.14,D21.14)
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine comm3(u,n1,n2,n3,kk)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     comm3 organizes the communication on all borders 
+c---------------------------------------------------------------------
+      implicit none
+
+      include 'globals.h'
+
+      integer n1, n2, n3, kk
+      double precision u(n1,n2,n3)
+      integer i1, i2, i3
+
+      if (timeron) call timer_start(T_comm3)
+!$omp parallel default(shared) private(i1,i2,i3)
+!$omp do
+      do  i3=2,n3-1
+         do  i2=2,n2-1
+            u( 1,i2,i3) = u(n1-1,i2,i3)
+            u(n1,i2,i3) = u(   2,i2,i3)
+         enddo
+c      enddo
+
+c      do  i3=2,n3-1
+         do  i1=1,n1
+            u(i1, 1,i3) = u(i1,n2-1,i3)
+            u(i1,n2,i3) = u(i1,   2,i3)
+         enddo
+      enddo
+
+!$omp do
+      do  i2=1,n2
+         do  i1=1,n1
+            u(i1,i2, 1) = u(i1,i2,n3-1)
+            u(i1,i2,n3) = u(i1,i2,   2)
+         enddo
+      enddo
+!$omp end do nowait
+!$omp end parallel
+      if (timeron) call timer_stop(T_comm3)
+
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine zran3(z,n1,n2,n3,nx1,ny1,k)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     zran3  loads +1 at ten randomly chosen points,
+c     loads -1 at a different ten random points,
+c     and zero elsewhere.
+c---------------------------------------------------------------------
+      implicit none
+
+      include 'globals.h'
+
+      integer  is1, is2, is3, ie1, ie2, ie3
+      common /grid/ is1,is2,is3,ie1,ie2,ie3
+
+      integer n1, n2, n3, k, nx1, ny1, i0, mm0, mm1
+      double precision z(n1,n2,n3)
+
+      integer mm, i1, i2, i3, d1, e1, e2, e3
+      double precision x, a
+      double precision xx, x0, x1, a1, a2, ai, power
+      parameter( mm = 10,  a = 5.D0 ** 13, x = 314159265.D0)
+      double precision ten( mm, 0:1 ), best0, best1
+      integer i, j1( mm, 0:1 ), j2( mm, 0:1 ), j3( mm, 0:1 )
+      integer jg( 0:3, mm, 0:1 )
+
+      double precision starts(nm)
+      common /rans_save/ starts
+
+      external randlc
+      double precision randlc, rdummy
+!$    integer  omp_get_thread_num, omp_get_num_threads
+!$    external omp_get_thread_num, omp_get_num_threads
+      integer myid, num_threads
+
+      a1 = power( a, nx1 )
+      a2 = power( a, nx1*ny1 )
+
+      call zero3(z,n1,n2,n3)
+
+      i = is1-2+nx1*(is2-2+ny1*(is3-2))
+
+      ai = power( a, i )
+      d1 = ie1 - is1 + 1
+      e1 = ie1 - is1 + 2
+      e2 = ie2 - is2 + 2
+      e3 = ie3 - is3 + 2
+      x0 = x
+      rdummy = randlc( x0, ai )
+
+c---------------------------------------------------------------------
+c     save the starting seeds for the following loop
+c---------------------------------------------------------------------
+      do  i3 = 2, e3
+         starts(i3) = x0
+         rdummy = randlc( x0, a2 )
+      end do
+
+c---------------------------------------------------------------------
+c     fill array
+c---------------------------------------------------------------------
+!$omp parallel do default(shared) private(i2,i3,x1,xx,rdummy)
+!$omp&  shared(e2,e3,d1,a1)
+      do  i3 = 2, e3
+         x1 = starts(i3)
+         do  i2 = 2, e2
+            xx = x1
+            call vranlc( d1, xx, a, z( 2, i2, i3 ))
+            rdummy = randlc( x1, a1 )
+         enddo
+      enddo
+!$omp end parallel do
+
+c---------------------------------------------------------------------
+c       call comm3(z,n1,n2,n3)
+c       call showall(z,n1,n2,n3)
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     each thread looks for twenty candidates
+c---------------------------------------------------------------------
+!$omp parallel default(shared) private(i,i0,i1,i2,i3,j1,j2,j3,ten,
+!$omp&  myid,num_threads) shared(best0,best1,n1,n2,n3)
+      do  i=1,mm
+         ten( i, 1 ) = 0.0D0
+         j1( i, 1 ) = 0
+         j2( i, 1 ) = 0
+         j3( i, 1 ) = 0
+         ten( i, 0 ) = 1.0D0
+         j1( i, 0 ) = 0
+         j2( i, 0 ) = 0
+         j3( i, 0 ) = 0
+      enddo
+
+!$omp do
+      do  i3=2,n3-1
+         do  i2=2,n2-1
+            do  i1=2,n1-1
+               if( z(i1,i2,i3) .gt. ten( 1, 1 ) )then
+                  ten(1,1) = z(i1,i2,i3) 
+                  j1(1,1) = i1
+                  j2(1,1) = i2
+                  j3(1,1) = i3
+                  call bubble( ten, j1, j2, j3, mm, 1 )
+               endif
+               if( z(i1,i2,i3) .lt. ten( 1, 0 ) )then
+                  ten(1,0) = z(i1,i2,i3) 
+                  j1(1,0) = i1
+                  j2(1,0) = i2
+                  j3(1,0) = i3
+                  call bubble( ten, j1, j2, j3, mm, 0 )
+               endif
+            enddo
+         enddo
+      enddo
+!$omp end do
+
+
+c---------------------------------------------------------------------
+c     Now which of these are globally best?
+c---------------------------------------------------------------------
+      i1 = mm
+      i0 = mm
+      myid = 0
+!$    myid = omp_get_thread_num()
+!$    num_threads = omp_get_num_threads()
+      do  i=mm,1,-1
+
+c ... ORDERED access is required here for sequential consistency
+c ... in case two values are identical.
+c ... Since an "ORDERED" section is only defined in OpenMP 2,
+c ... we use a dummy loop to emulate ordered access in OpenMP 1.x.
+!$omp master
+         best1 = 0.0D0
+         best0 = 1.0D0
+!$omp end master
+
+!$omp do ordered schedule(static)
+!$       do i2=1,num_threads
+!$omp ordered
+         if (ten(i1,1) .gt. best1) then
+            best1 = ten(i1,1)
+            jg( 0, i, 1 ) = myid
+         endif
+         if (ten(i0,0) .lt. best0) then
+            best0 = ten(i0,0)
+            jg( 0, i, 0 ) = myid
+         endif
+!$omp end ordered
+!$       end do
+
+         if (myid .eq. jg( 0, i, 1 )) then
+            jg( 1, i, 1 ) = j1( i1, 1 )
+            jg( 2, i, 1 ) = j2( i1, 1 )
+            jg( 3, i, 1 ) = j3( i1, 1 )
+            i1 = i1-1
+         endif
+
+         if (myid .eq. jg( 0, i, 0 )) then
+            jg( 1, i, 0 ) = j1( i0, 0 )
+            jg( 2, i, 0 ) = j2( i0, 0 )
+            jg( 3, i, 0 ) = j3( i0, 0 )
+            i0 = i0-1
+         endif
+
+      enddo
+!$omp end parallel
+
+c      mm1 = i1+1
+c      mm0 = i0+1
+      mm1 = 1
+      mm0 = 1
+
+c     write(*,*)' '
+c     write(*,*)' negative charges at'
+c     write(*,9)(jg(1,i,0),jg(2,i,0),jg(3,i,0),i=1,mm)
+c     write(*,*)' positive charges at'
+c     write(*,9)(jg(1,i,1),jg(2,i,1),jg(3,i,1),i=1,mm)
+c     write(*,*)' small random numbers were'
+c     write(*,8)(ten( i,0),i=mm,1,-1)
+c     write(*,*)' and they were found on processor number'
+c     write(*,7)(jg(0,i,0),i=mm,1,-1)
+c     write(*,*)' large random numbers were'
+c     write(*,8)(ten( i,1),i=mm,1,-1)
+c     write(*,*)' and they were found on processor number'
+c     write(*,7)(jg(0,i,1),i=mm,1,-1)
+c 9    format(5(' (',i3,2(',',i3),')'))
+c 8    format(5D15.8)
+c 7    format(10i4)
+
+!$omp parallel do default(shared) private(i1,i2,i3)
+      do  i3=1,n3
+         do  i2=1,n2
+            do  i1=1,n1
+               z(i1,i2,i3) = 0.0D0
+            enddo
+         enddo
+      enddo
+!$omp end parallel do
+
+      do  i=mm,mm0,-1
+         z( jg(1,i,0), jg(2,i,0), jg(3,i,0) ) = -1.0D0
+      enddo
+      do  i=mm,mm1,-1
+         z( jg(1,i,1), jg(2,i,1), jg(3,i,1) ) = +1.0D0
+      enddo
+
+      call comm3(z,n1,n2,n3,k)
+
+c---------------------------------------------------------------------
+c          call showall(z,n1,n2,n3)
+c---------------------------------------------------------------------
+
+      return 
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine showall(z,n1,n2,n3)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+
+      integer n1,n2,n3,i1,i2,i3
+      double precision z(n1,n2,n3)
+      integer m1, m2, m3
+
+      m1 = min(n1,18)
+      m2 = min(n2,14)
+      m3 = min(n3,18)
+
+      write(*,*)'  '
+      do  i3=1,m3
+         do  i1=1,m1
+            write(*,6)(z(i1,i2,i3),i2=1,m2)
+         enddo
+         write(*,*)' - - - - - - - '
+      enddo
+      write(*,*)'  '
+ 6    format(15f6.3)
+
+      return 
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      double precision function power( a, n )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     power  raises an integer, disguised as a double
+c     precision real, to an integer power
+c---------------------------------------------------------------------
+      implicit none
+
+      double precision a, aj
+      integer n, nj
+      external randlc
+      double precision randlc, rdummy
+
+      power = 1.0D0
+      nj = n
+      aj = a
+ 100  continue
+
+      if( nj .eq. 0 ) goto 200
+      if( mod(nj,2) .eq. 1 ) rdummy =  randlc( power, aj )
+      rdummy = randlc( aj, aj )
+      nj = nj/2
+      go to 100
+
+ 200  continue
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine bubble( ten, j1, j2, j3, m, ind )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     bubble        does a bubble sort in direction ind
+c---------------------------------------------------------------------
+      implicit none
+
+
+      integer m, ind, j1( m, 0:1 ), j2( m, 0:1 ), j3( m, 0:1 )
+      double precision ten( m, 0:1 )
+      double precision temp
+      integer i, j_temp
+
+      if( ind .eq. 1 )then
+
+         do  i=1,m-1
+            if( ten(i,ind) .gt. ten(i+1,ind) )then
+
+               temp = ten( i+1, ind )
+               ten( i+1, ind ) = ten( i, ind )
+               ten( i, ind ) = temp
+
+               j_temp           = j1( i+1, ind )
+               j1( i+1, ind ) = j1( i,   ind )
+               j1( i,   ind ) = j_temp
+
+               j_temp           = j2( i+1, ind )
+               j2( i+1, ind ) = j2( i,   ind )
+               j2( i,   ind ) = j_temp
+
+               j_temp           = j3( i+1, ind )
+               j3( i+1, ind ) = j3( i,   ind )
+               j3( i,   ind ) = j_temp
+
+            else 
+               return
+            endif
+         enddo
+
+      else
+
+         do  i=1,m-1
+            if( ten(i,ind) .lt. ten(i+1,ind) )then
+
+               temp = ten( i+1, ind )
+               ten( i+1, ind ) = ten( i, ind )
+               ten( i, ind ) = temp
+
+               j_temp           = j1( i+1, ind )
+               j1( i+1, ind ) = j1( i,   ind )
+               j1( i,   ind ) = j_temp
+
+               j_temp           = j2( i+1, ind )
+               j2( i+1, ind ) = j2( i,   ind )
+               j2( i,   ind ) = j_temp
+
+               j_temp           = j3( i+1, ind )
+               j3( i+1, ind ) = j3( i,   ind )
+               j3( i,   ind ) = j_temp
+
+            else 
+               return
+            endif
+         enddo
+
+      endif
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine zero3(z,n1,n2,n3)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+
+      integer n1, n2, n3
+      double precision z(n1,n2,n3)
+      integer i1, i2, i3
+
+!$omp parallel do default(shared) private(i1,i2,i3)
+      do  i3=1,n3
+         do  i2=1,n2
+            do  i1=1,n1
+               z(i1,i2,i3)=0.0D0
+            enddo
+         enddo
+      enddo
+
+      return
+      end
+
+
+c----- end of program ------------------------------------------------
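Review note on `power` in mg.f above: the loop is binary (square-and-multiply) exponentiation, with every multiplication routed through `randlc`, the NPB linear congruential generator that works modulo 2^46. The following is a minimal Python sketch of the same control flow, assuming plain integers in place of the double-precision encoding used by the Fortran code. The names `mul_mod46` and `power_mod46` are hypothetical and do not appear in NPB.

```python
MOD = 1 << 46  # randlc updates its state modulo 2**46


def mul_mod46(x, a):
    """Integer stand-in for the state update x <- a*x mod 2**46."""
    return (a * x) % MOD


def power_mod46(a, n):
    """Compute a**n mod 2**46 with the same loop shape as `power`."""
    result = 1
    aj, nj = a, n
    while nj != 0:
        if nj % 2 == 1:
            result = mul_mod46(result, aj)  # randlc( power, aj )
        aj = mul_mod46(aj, aj)              # randlc( aj, aj )
        nj //= 2
    return result
```

The loop runs once per bit of `n`, so the cost grows with log2(n) rather than with n, which is what makes it practical to derive a multiplier that skips the generator ahead by many steps.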
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/MG/mg.input.sample b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/MG/mg.input.sample
new file mode 100644
index 0000000..a4dcf81
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/MG/mg.input.sample
@@ -0,0 +1,4 @@
+ 8 = top level
+ 256 256 256 = nx ny nz
+ 20 = nit
+ 0 0 0 0 0 0 0 0 = debug_vec
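Review note on mg.input.sample above: each line carries its values first and a label after the `=` sign. The following is a minimal Python sketch of reading that layout, assuming the text after `=` is only a label (which is how the sample reads). The function name and dictionary keys are hypothetical, chosen to echo the labels in the file.

```python
SAMPLE = """\
 8 = top level
 256 256 256 = nx ny nz
 20 = nit
 0 0 0 0 0 0 0 0 = debug_vec
"""


def parse_mg_input(text):
    """Return the integers that precede '=' on each of the four lines."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        values = line.split("=", 1)[0].split()
        rows.append([int(v) for v in values])
    (lt,), (nx, ny, nz), (nit,), debug_vec = rows
    return {"lt": lt, "nx": nx, "ny": ny, "nz": nz,
            "nit": nit, "debug_vec": debug_vec}
```

Calling `parse_mg_input(SAMPLE)` yields a top level of 8, a 256 x 256 x 256 grid, 20 iterations, and eight zero debug flags.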
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/Makefile b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/Makefile
new file mode 100644
index 0000000..b78cc10
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/Makefile
@@ -0,0 +1,72 @@
+SHELL=/bin/sh
+CLASS=W
+VERSION=
+SFILE=config/suite.def
+
+default: header
+	@ sys/print_instructions
+
+BT: bt
+bt: header
+	cd BT; $(MAKE) CLASS=$(CLASS) VERSION=$(VERSION)
+
+SP: sp		       
+sp: header	       
+	cd SP; $(MAKE) CLASS=$(CLASS)
+
+LU: lu		       
+lu: header	       
+	cd LU; $(MAKE) CLASS=$(CLASS) VERSION=$(VERSION)
+
+MG: mg		       
+mg: header	       
+	cd MG; $(MAKE) CLASS=$(CLASS)
+
+FT: ft		       
+ft: header	       
+	cd FT; $(MAKE) CLASS=$(CLASS)
+
+IS: is		       
+is: header	       
+	cd IS; $(MAKE) CLASS=$(CLASS)
+
+CG: cg		       
+cg: header	       
+	cd CG; $(MAKE) CLASS=$(CLASS)
+
+EP: ep		       
+ep: header	       
+	cd EP; $(MAKE) CLASS=$(CLASS)
+
+UA: ua
+ua: header	       
+	cd UA; $(MAKE) CLASS=$(CLASS)
+
+DC: dc
+dc: header	       
+	cd DC; $(MAKE) CLASS=$(CLASS)
+
+# Awk script courtesy cmg@cray.com, modified by Haoqiang Jin
+suite:
+	@ awk -f sys/suite.awk SMAKE=$(MAKE) $(SFILE) | $(SHELL)
+
+
+# It would be nice to make clean in each subdirectory (the targets
+# are defined) but on a really clean system this won't work
+# because those makefiles need config/make.def
+clean:
+	- rm -f core 
+	- rm -f *~ */core */*~ */*.o */npbparams.h */*.obj */*.exe
+	- rm -f sys/setparams sys/makesuite sys/setparams.h
+	- rm -rf */rii_files
+
+veryclean: clean
+	- rm -f config/make.def config/suite.def 
+	- rm -f bin/sp.* bin/lu.* bin/mg.* bin/ft.* bin/bt.* bin/is.*
+	- rm -f bin/ep.* bin/cg.* bin/ua.* bin/dc.*
+
+header:
+	@ sys/print_header
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/README b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/README
new file mode 100644
index 0000000..6264d26
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/README
@@ -0,0 +1,37 @@
+The OpenMP implementation of NPB 3.3 (NPB3.3-OMP)
+--------------------------------------------------
+
+For problem reports and suggestions on the implementation, 
+please contact:
+
+   NAS Parallel Benchmark Team
+   npb@nas.nasa.gov
+
+
+This directory contains the OpenMP implementation of the NAS
+Parallel Benchmarks, Version 3.3 (NPB3.3-OMP).  A brief
+summary of the new features introduced in this version is
+given below.
+
+For changes from different versions, see the Changes.log file
+included in the upper directory of this distribution.
+
+For explanation of compilation and running of the benchmarks,
+please refer to README.install.
+
+
+NPB3.3-OMP introduces a new problem size (class E) to seven of 
+the benchmarks (BT, SP, LU, CG, MG, FT, and EP). The version 
+also includes a new problem size (class D) for the IS benchmark, 
+which was not present in the previous releases.
+
+The release is merged with the vector codes for the BT and LU 
+benchmarks, which can be selected with the VERSION=VEC option 
+during compilation.  However, successful vectorization depends 
+heavily on the compiler used.  Some changes to compiler directives 
+for vectorization in the current codes (see *_vec.f files)
+may be required.
+
+OMP/LU-HP (the hyper-plane implementation of LU) is no longer 
+included in the distribution.  To get this version, please 
+download NPB3.2.1 instead.
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/README.install b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/README.install
new file mode 100644
index 0000000..9a7a423
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/README.install
@@ -0,0 +1,157 @@
+Some explanations on the OpenMP implementation of NPB 3.3 (NPB3.3-OMP)
+----------------------------------------------------------------------
+
+NPB-OMP is a sample OpenMP implementation based on NPB3-SER,
+the sequential implementation of the NAS Parallel Benchmarks.
+This implementation (NPB3.3-OMP) contains all ten benchmarks:
+eight in Fortran: BT, SP, LU, FT, CG, MG, EP, and UA; two in C: IS
+and DC.  Starting in version 3.3, only the pipeline OpenMP 
+implementation of LU is included.
+
+For changes from different versions, see the Changes.log file
+included in the upper directory of this distribution.
+
+This version has been tested, among others, on an SGI Origin3000 and
+an SGI Altix.  For problem reports and suggestions on the implementation, 
+please contact
+
+   NAS Parallel Benchmark Team
+   npb@nas.nasa.gov
+
+
+1. Compilation
+
+   NPB3.x-OMP uses the same directory tree as NPB3.x-SER (and NPB2.x) does.
+   Before compilation, one needs to check the configuration file
+   'make.def' in the config directory and modify the file if necessary. 
+   If it does not (yet) exist, copy 'make.def.template' or one of the
+   sample files in the NAS.samples subdirectory to 'make.def' and
+   edit the content for site- and machine-specific data.  Then
+
+      make <benchmark> CLASS=<class> [VERSION=VEC]
+
+   <benchmark> is one of (BT, SP, LU, FT, CG, MG, EP, UA, IS, DC) 
+   and <class> is one of (S, W, A, B, C, D, E), except that classes C, 
+   D, and E are not defined for DC and class E is not defined for IS and UA.
+
+   The "VERSION=VEC" option is used for selecting the vectorized 
+   versions of BT and LU.
+
+   Class D for IS (Integer Sort) requires a compiler/system in which
+   the "long" type in C is 64-bit.  As examples, the SGI 
+   MIPS compiler for the SGI Origin using the "-64" compilation flag and
+   the Intel compiler for IA64 are known to work.
+
+   In order to build the class E version of CG, the integer type
+   needs to be promoted to 64-bit, which is usually done through a
+   compilation flag (such as "-i8" for FFLAGS in config/make.def).
+
+   To build a suite of benchmarks, one can create the file 
+   "config/suite.def", which contains a list of executables to build.
+   Each line in the file contains the name of a benchmark and the class,
+   separated by spaces or tabs (see suite.def.template for an example).
+   Then
+
+      make suite
+
+
+   ================================
+   
+   The "RAND" variable in make.def
+   --------------------------------
+   
+   Most of the NPBs use a random number generator. In two of the NPBs (FT
+   and EP) the computation of random numbers is included in the timed
+   part of the calculation, and it is important that the random number
+   generator be efficient.  The default random number generator package
+   provided is called "randi8" and should be used where possible. It has 
+   the following requirements:
+   
+   randi8:
+     1. Uses integer*8 arithmetic. Compiler must support integer*8
+     2. Uses the Fortran 90 IAND intrinsic. Compiler must support IAND.
+     3. Assumes overflow bits are discarded by the hardware. In particular,
+        it assumes that the lowest 46 bits of a*b are always correct, even
+        if the result a*b is larger than 2^64. 
+   
+   Since randi8 may not work on all machines, we supply the following
+   alternatives:
+   
+   randi8_safe
+     1. Uses integer*8 arithmetic
+     2. Uses the Fortran 90 IBITS intrinsic. 
+     3. Does not make any assumptions about overflow. Should always
+        work correctly if compiler supports integer*8 and IBITS. 
+   
+   randdp
+     1. Uses double precision arithmetic (to simulate integer*8 operations). 
+        Should work with any system with support for 64-bit floating
+        point arithmetic.      
+   
+   randdpvec
+     1. Similar to randdp but written to be easier to vectorize. 
+   
+
+2. Execution
+
+   The executable is named <benchmark-name>.<class>.x and is placed
+   in the bin subdirectory (or in the directory BINDIR specified in
+   make.def, if you've defined it).  Following is an example of running 
+   a benchmark in csh:
+
+      setenv OMP_NUM_THREADS 4
+      bin/bt.A.x > BT.A_out.4
+
+   It runs BT Class A problem on 4 threads and the output is stored
+   in BT.A_out.4.
+
+   Each benchmark includes a set of additional timers for profiling purposes
+   (reporting timing for selected code blocks).  By default, these timers
+   are disabled.  To enable the timers, create a dummy file 'timer.flag' 
+   in the current working directory (not necessarily where the executable 
+   is located) before running a benchmark.
+
+   The printed number of threads is the number of threads active during
+   the run, which may not be the same as the number requested.
+
+3. Known issues
+
+   NPB-OMP assumes 'deterministic' static scheduling at run-time to 
+   ensure the correctness of the results.  Verification in some
+   benchmarks might fail if this condition is not met. 
+
+   For larger problem sizes, the default stack size for slave threads
+   may need to be increased on certain platforms.  For example on SGI
+   Origin 3000, the following command can be used:
+      setenv MP_SLAVE_STACKSIZE 50000000 (to about 50MB)
+
+   On SGI Altix using the Intel compiler, the runtime variable would be
+      setenv KMP_STACKSIZE 50m  (for 50MB)
+
+   In order to build the class E version of CG, the integer type
+   needs to be promoted to 64-bit, which is usually done through a
+   compilation flag (such as "-i8" for FFLAGS in config/make.def).
+
+4. Notes on the implementation
+
+   - Based on NPB3.0-SER, except that FT was kept closer to
+     the original version in NPB2.3-serial.
+
+   - OpenMP directives were added to the outer-most parallel loops. 
+     No nested parallelism was considered.
+
+   - Extra loops were added in the beginning of most of the benchmarks
+     to touch data pages.  This is to set up a data layout based on the
+     'first touch' policy.
+
+   - For LU, the pipeline algorithm outperforms the hyperplane algorithm
+     consistently on most modern platforms.  So, only the pipeline 
+     implementation is included.
+
+   - The IS OpenMP benchmark enables bucket sort by default.  To disable
+     bucket sort, comment out the line in IS/is.c:
+     #define USE_BUCKETS
+     See IS/README.carefully for additional information.
+
+   - For Unstructured Adaptive (UA) and DC benchmarks, please see 
+     UA/README or DC/README for additional instruction.
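Review note on README.install above: the execution steps (executable naming, `OMP_NUM_THREADS`, and the optional `timer.flag` file) can be sketched as a small Python helper. This is an illustration under stated assumptions, not part of NPB: all function names are hypothetical, and the class restrictions are copied from the compilation section of the README.

```python
import os

BENCHMARKS = ("bt", "sp", "lu", "ft", "cg", "mg", "ep", "ua", "is", "dc")
ALL_CLASSES = set("SWABCDE")
# Classes the README lists as not defined for a given benchmark.
UNDEFINED = {"dc": set("CDE"), "is": set("E"), "ua": set("E")}


def executable_path(benchmark, cls, bindir="bin"):
    """Build <bindir>/<benchmark-name>.<class>.x, rejecting undefined pairs."""
    name = benchmark.lower()
    cls = cls.upper()
    if name not in BENCHMARKS:
        raise ValueError("unknown benchmark: " + benchmark)
    if cls not in ALL_CLASSES or cls in UNDEFINED.get(name, set()):
        raise ValueError("class %s is not defined for %s" % (cls, name))
    return "%s/%s.%s.x" % (bindir, name, cls)


def run_environment(num_threads, base_env=None):
    """Copy the environment and set OMP_NUM_THREADS, like `setenv` in csh."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(num_threads)
    return env


def enable_timers(workdir="."):
    """Create the empty 'timer.flag' file in the working directory."""
    path = os.path.join(workdir, "timer.flag")
    open(path, "a").close()
    return path
```

With these helpers, the csh example corresponds roughly to `subprocess.run([executable_path("bt", "A")], env=run_environment(4))` with standard output redirected to a file of the caller's choosing.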
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/Makefile b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/Makefile
new file mode 100644
index 0000000..e38e907
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/Makefile
@@ -0,0 +1,63 @@
+SHELL=/bin/sh
+BENCHMARK=sp
+BENCHMARKU=SP
+
+include ../config/make.def
+
+
+OBJS = sp.o initialize.o exact_solution.o exact_rhs.o \
+       set_constants.o adi.o rhs.o      \
+       x_solve.o ninvr.o y_solve.o pinvr.o    \
+       z_solve.o tzetar.o add.o txinvr.o error.o verify.o  \
+       ${COMMON}/print_results.o ${COMMON}/timers.o ${COMMON}/wtime.o
+ifeq (${HOOKS}, 1)
+        OBJS += ${COMMON}/hooks.o ${COMMON}/m5op_x86.o ${COMMON}/m5_mmap.o
+endif
+
+include ../sys/make.common
+
+# npbparams.h is included by header.h
+# The following rule should do the trick but many make programs (not gmake)
+# will do the wrong thing and rebuild the world every time (because the
+# mod time on header.h is not changed). One solution would be to
+# touch header.h but this might cause confusion if someone has
+# accidentally deleted it. Instead, make the dependency on npbparams.h
+# explicit in all the lines below (even though dependence is indirect).
+
+# header.h: npbparams.h
+
+${PROGRAM}: config ${OBJS}
+	${FLINK} ${FLINKFLAGS} -no-pie -o ${PROGRAM} ${OBJS} ${F_LIB}
+
+.f.o:
+ifeq (${HOOKS}, 1)
+	${FCOMPILE} -DHOOKS $<
+else
+	${FCOMPILE} $<
+endif
+
+sp.o:             sp.f  header.h npbparams.h
+initialize.o:     initialize.f  header.h npbparams.h
+exact_solution.o: exact_solution.f  header.h npbparams.h
+exact_rhs.o:      exact_rhs.f  header.h npbparams.h
+set_constants.o:  set_constants.f  header.h npbparams.h
+adi.o:            adi.f  header.h npbparams.h
+rhs.o:            rhs.f  header.h npbparams.h
+#lhsx.o:           lhsx.f  header.h npbparams.h
+#lhsy.o:           lhsy.f  header.h npbparams.h
+#lhsz.o:           lhsz.f  header.h npbparams.h
+x_solve.o:        x_solve.f  header.h npbparams.h
+ninvr.o:          ninvr.f  header.h npbparams.h
+y_solve.o:        y_solve.f  header.h npbparams.h
+pinvr.o:          pinvr.f  header.h npbparams.h
+z_solve.o:        z_solve.f  header.h npbparams.h
+tzetar.o:         tzetar.f  header.h npbparams.h
+add.o:            add.f  header.h npbparams.h
+txinvr.o:         txinvr.f  header.h npbparams.h
+error.o:          error.f  header.h npbparams.h
+verify.o:         verify.f  header.h npbparams.h
+
+clean:
+	- rm -f *.o *~ mputil*
+	- rm -f npbparams.h core
+	- if [ -d rii_files ]; then rm -r rii_files; fi
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/add.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/add.f
new file mode 100644
index 0000000..a38f5c1
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/add.f
@@ -0,0 +1,33 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine  add
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c addition of update to the vector u
+c---------------------------------------------------------------------
+
+       include 'header.h'
+
+       integer i,j,k,m
+
+       if (timeron) call timer_start(t_add)
+!$omp parallel do default(shared) private(i,j,k,m)
+       do k = 1, nz2
+          do j = 1, ny2
+             do i = 1, nx2
+                do m = 1, 5
+                   u(m,i,j,k) = u(m,i,j,k) + rhs(m,i,j,k)
+                end do
+             end do
+          end do
+       end do
+       if (timeron) call timer_stop(t_add)
+
+       return
+       end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/adi.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/adi.f
new file mode 100644
index 0000000..6e46da9
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/adi.f
@@ -0,0 +1,24 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine  adi
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       call compute_rhs
+
+       call txinvr
+
+       call x_solve
+
+       call y_solve
+
+       call z_solve
+
+       call add
+
+       return
+       end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/error.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/error.f
new file mode 100644
index 0000000..58e773b
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/error.f
@@ -0,0 +1,111 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine error_norm(rms)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c this function computes the norm of the difference between the
+c computed solution and the exact solution
+c---------------------------------------------------------------------
+
+       include 'header.h'
+
+       integer i, j, k, m, d
+       double precision xi, eta, zeta, u_exact(5), rms(5), add
+       double precision rms_local(5)
+
+       do m = 1, 5
+          rms(m) = 0.0d0
+       enddo
+
+!$omp parallel default(shared)
+!$omp&        private(i,j,k,m,zeta,eta,xi,add,u_exact,rms_local)
+!$omp&        shared(rms)
+       do m = 1, 5
+          rms_local(m) = 0.0d0
+       enddo
+!$omp do
+       do   k = 0, grid_points(3)-1
+          zeta = dble(k) * dnzm1
+          do   j = 0, grid_points(2)-1
+             eta = dble(j) * dnym1
+             do   i = 0, grid_points(1)-1
+                xi = dble(i) * dnxm1
+                call exact_solution(xi, eta, zeta, u_exact)
+
+                do   m = 1, 5
+                   add = u(m,i,j,k)-u_exact(m)
+                   rms_local(m) = rms_local(m) + add*add
+                end do
+             end do
+          end do
+       end do
+!$omp end do nowait
+       do m = 1, 5
+!$omp atomic
+          rms(m) = rms(m) + rms_local(m)
+       end do
+!$omp end parallel
+
+       do    m = 1, 5
+          do    d = 1, 3
+             rms(m) = rms(m) / dble(grid_points(d)-2)
+          end do
+          rms(m) = dsqrt(rms(m))
+       end do
+
+       return
+       end
+
+
+
+       subroutine rhs_norm(rms)
+
+       include 'header.h'
+
+       integer i, j, k, d, m
+       double precision rms(5), add
+       double precision rms_local(5)
+
+       do m = 1, 5
+          rms(m) = 0.0d0
+       enddo
+
+!$omp parallel default(shared) private(i,j,k,m,add,rms_local)
+!$omp&        shared(rms)
+       do m = 1, 5
+          rms_local(m) = 0.0d0
+       enddo
+!$omp do
+       do k = 1, nz2
+          do j = 1, ny2
+             do i = 1, nx2
+                do m = 1, 5
+                   add = rhs(m,i,j,k)
+                   rms_local(m) = rms_local(m) + add*add
+                end do 
+             end do 
+          end do 
+       end do 
+!$omp end do nowait
+       do m = 1, 5
+!$omp atomic
+          rms(m) = rms(m) + rms_local(m)
+       end do
+!$omp end parallel
+
+       do   m = 1, 5
+          do   d = 1, 3
+             rms(m) = rms(m) / dble(grid_points(d)-2)
+          end do
+          rms(m) = dsqrt(rms(m))
+       end do
+
+       return
+       end
+
+
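Review note on error.f above: both `error_norm` and `rhs_norm` use the same reduction pattern. Each thread accumulates squared terms into a private `rms_local`, the partial sums are then merged into the shared `rms` under `!$omp atomic`, and the result is divided by `grid_points(d)-2` for each of the three dimensions before the square root is taken. The following is a minimal Python sketch of that pattern, assuming a flat list of 5-component points and a lock standing in for the atomic update. The function name `rms_norm` and its parameters are hypothetical.

```python
import math
import threading


def rms_norm(points, grid_points, num_threads=4):
    """Return 5 RMS values computed with per-thread partial sums."""
    rms = [0.0] * 5
    lock = threading.Lock()

    def worker(chunk):
        local = [0.0] * 5            # private accumulator (rms_local)
        for p in chunk:
            for m in range(5):
                local[m] += p[m] * p[m]
        with lock:                   # stands in for `!$omp atomic`
            for m in range(5):
                rms[m] += local[m]

    # Contiguous chunks, one per thread, similar to a static schedule.
    n = len(points)
    size = max(1, -(-n // num_threads))
    chunks = [points[i:i + size] for i in range(0, n, size)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    for m in range(5):
        for d in range(3):
            rms[m] /= float(grid_points[d] - 2)
        rms[m] = math.sqrt(rms[m])
    return rms
```

Because the order in which threads merge their partial sums is not fixed, the floating-point result can differ in the last bits from run to run; the same caveat applies to any reduction built this way.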
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/exact_rhs.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/exact_rhs.f
new file mode 100644
index 0000000..f6b74eb
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/exact_rhs.f
@@ -0,0 +1,355 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine exact_rhs
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c compute the right hand side based on exact solution
+c---------------------------------------------------------------------
+
+       include 'header.h'
+
+       double precision dtemp(5), xi, eta, zeta, dtpp
+       integer          m, i, j, k, ip1, im1, jp1, 
+     >                  jm1, km1, kp1
+
+!$omp parallel default(shared)
+!$omp& private(i,j,k,m,zeta,eta,xi,dtpp,im1,ip1,
+!$omp&         jm1,jp1,km1,kp1,dtemp)
+c---------------------------------------------------------------------
+c      initialize                                  
+c---------------------------------------------------------------------
+!$omp do schedule(static)
+       do   k= 0, grid_points(3)-1
+          do   j = 0, grid_points(2)-1
+             do   i = 0, grid_points(1)-1
+                do   m = 1, 5
+                   forcing(m,i,j,k) = 0.0d0
+                end do
+             end do
+          end do
+       end do
+
+c---------------------------------------------------------------------
+c      xi-direction flux differences                      
+c---------------------------------------------------------------------
+!$omp do schedule(static)
+       do   k = 1, grid_points(3)-2
+          zeta = dble(k) * dnzm1
+          do   j = 1, grid_points(2)-2
+             eta = dble(j) * dnym1
+
+             do  i=0, grid_points(1)-1
+                xi = dble(i) * dnxm1
+
+                call exact_solution(xi, eta, zeta, dtemp)
+                do  m = 1, 5
+                   ue(i,m) = dtemp(m)
+                end do
+
+                dtpp = 1.0d0 / dtemp(1)
+
+                do  m = 2, 5
+                   buf(i,m) = dtpp * dtemp(m)
+                end do
+
+                cuf(i)   = buf(i,2) * buf(i,2)
+                buf(i,1) = cuf(i) + buf(i,3) * buf(i,3) + 
+     >                     buf(i,4) * buf(i,4) 
+                q(i) = 0.5d0*(buf(i,2)*ue(i,2) + buf(i,3)*ue(i,3) +
+     >                        buf(i,4)*ue(i,4))
+
+             end do
+ 
+             do  i = 1, grid_points(1)-2
+                im1 = i-1
+                ip1 = i+1
+
+                forcing(1,i,j,k) = forcing(1,i,j,k) -
+     >                 tx2*( ue(ip1,2)-ue(im1,2) )+
+     >                 dx1tx1*(ue(ip1,1)-2.0d0*ue(i,1)+ue(im1,1))
+
+                forcing(2,i,j,k) = forcing(2,i,j,k) - tx2 * (
+     >                (ue(ip1,2)*buf(ip1,2)+c2*(ue(ip1,5)-q(ip1)))-
+     >                (ue(im1,2)*buf(im1,2)+c2*(ue(im1,5)-q(im1))))+
+     >                 xxcon1*(buf(ip1,2)-2.0d0*buf(i,2)+buf(im1,2))+
+     >                 dx2tx1*( ue(ip1,2)-2.0d0* ue(i,2)+ue(im1,2))
+
+                forcing(3,i,j,k) = forcing(3,i,j,k) - tx2 * (
+     >                 ue(ip1,3)*buf(ip1,2)-ue(im1,3)*buf(im1,2))+
+     >                 xxcon2*(buf(ip1,3)-2.0d0*buf(i,3)+buf(im1,3))+
+     >                 dx3tx1*( ue(ip1,3)-2.0d0*ue(i,3) +ue(im1,3))
+                  
+                forcing(4,i,j,k) = forcing(4,i,j,k) - tx2*(
+     >                 ue(ip1,4)*buf(ip1,2)-ue(im1,4)*buf(im1,2))+
+     >                 xxcon2*(buf(ip1,4)-2.0d0*buf(i,4)+buf(im1,4))+
+     >                 dx4tx1*( ue(ip1,4)-2.0d0* ue(i,4)+ ue(im1,4))
+
+                forcing(5,i,j,k) = forcing(5,i,j,k) - tx2*(
+     >                 buf(ip1,2)*(c1*ue(ip1,5)-c2*q(ip1))-
+     >                 buf(im1,2)*(c1*ue(im1,5)-c2*q(im1)))+
+     >                 0.5d0*xxcon3*(buf(ip1,1)-2.0d0*buf(i,1)+
+     >                               buf(im1,1))+
+     >                 xxcon4*(cuf(ip1)-2.0d0*cuf(i)+cuf(im1))+
+     >                 xxcon5*(buf(ip1,5)-2.0d0*buf(i,5)+buf(im1,5))+
+     >                 dx5tx1*( ue(ip1,5)-2.0d0* ue(i,5)+ ue(im1,5))
+             end do
+
+c---------------------------------------------------------------------
+c            Fourth-order dissipation                         
+c---------------------------------------------------------------------
+             do   m = 1, 5
+                i = 1
+                forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                    (5.0d0*ue(i,m) - 4.0d0*ue(i+1,m) +ue(i+2,m))
+                i = 2
+                forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                   (-4.0d0*ue(i-1,m) + 6.0d0*ue(i,m) -
+     >                     4.0d0*ue(i+1,m) +       ue(i+2,m))
+             end do
+
+             do   m = 1, 5
+                do  i = 3, grid_points(1)-4
+                   forcing(m,i,j,k) = forcing(m,i,j,k) - dssp*
+     >                   (ue(i-2,m) - 4.0d0*ue(i-1,m) +
+     >                    6.0d0*ue(i,m) - 4.0d0*ue(i+1,m) + ue(i+2,m))
+                end do
+             end do
+
+             do   m = 1, 5
+                i = grid_points(1)-3
+                forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                   (ue(i-2,m) - 4.0d0*ue(i-1,m) +
+     >                    6.0d0*ue(i,m) - 4.0d0*ue(i+1,m))
+                i = grid_points(1)-2
+                forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                   (ue(i-2,m) - 4.0d0*ue(i-1,m) + 5.0d0*ue(i,m))
+             end do
+
+          end do
+       end do
+!$omp end do nowait
+
+c---------------------------------------------------------------------
+c  eta-direction flux differences             
+c---------------------------------------------------------------------
+!$omp do schedule(static)
+       do   k = 1, grid_points(3)-2          
+          zeta = dble(k) * dnzm1
+          do   i=1, grid_points(1)-2
+             xi = dble(i) * dnxm1
+
+             do  j=0, grid_points(2)-1
+                eta = dble(j) * dnym1
+
+                call exact_solution(xi, eta, zeta, dtemp)
+                do   m = 1, 5 
+                   ue(j,m) = dtemp(m)
+                end do
+                dtpp = 1.0d0/dtemp(1)
+
+                do  m = 2, 5
+                   buf(j,m) = dtpp * dtemp(m)
+                end do
+
+                cuf(j)   = buf(j,3) * buf(j,3)
+                buf(j,1) = cuf(j) + buf(j,2) * buf(j,2) + 
+     >                     buf(j,4) * buf(j,4)
+                q(j) = 0.5d0*(buf(j,2)*ue(j,2) + buf(j,3)*ue(j,3) +
+     >                        buf(j,4)*ue(j,4))
+             end do
+
+             do  j = 1, grid_points(2)-2
+                jm1 = j-1
+                jp1 = j+1
+                  
+                forcing(1,i,j,k) = forcing(1,i,j,k) -
+     >                ty2*( ue(jp1,3)-ue(jm1,3) )+
+     >                dy1ty1*(ue(jp1,1)-2.0d0*ue(j,1)+ue(jm1,1))
+
+                forcing(2,i,j,k) = forcing(2,i,j,k) - ty2*(
+     >                ue(jp1,2)*buf(jp1,3)-ue(jm1,2)*buf(jm1,3))+
+     >                yycon2*(buf(jp1,2)-2.0d0*buf(j,2)+buf(jm1,2))+
+     >                dy2ty1*( ue(jp1,2)-2.0d0* ue(j,2)+ ue(jm1,2))
+
+                forcing(3,i,j,k) = forcing(3,i,j,k) - ty2*(
+     >                (ue(jp1,3)*buf(jp1,3)+c2*(ue(jp1,5)-q(jp1)))-
+     >                (ue(jm1,3)*buf(jm1,3)+c2*(ue(jm1,5)-q(jm1))))+
+     >                yycon1*(buf(jp1,3)-2.0d0*buf(j,3)+buf(jm1,3))+
+     >                dy3ty1*( ue(jp1,3)-2.0d0*ue(j,3) +ue(jm1,3))
+
+                forcing(4,i,j,k) = forcing(4,i,j,k) - ty2*(
+     >                ue(jp1,4)*buf(jp1,3)-ue(jm1,4)*buf(jm1,3))+
+     >                yycon2*(buf(jp1,4)-2.0d0*buf(j,4)+buf(jm1,4))+
+     >                dy4ty1*( ue(jp1,4)-2.0d0*ue(j,4)+ ue(jm1,4))
+
+                forcing(5,i,j,k) = forcing(5,i,j,k) - ty2*(
+     >                buf(jp1,3)*(c1*ue(jp1,5)-c2*q(jp1))-
+     >                buf(jm1,3)*(c1*ue(jm1,5)-c2*q(jm1)))+
+     >                0.5d0*yycon3*(buf(jp1,1)-2.0d0*buf(j,1)+
+     >                              buf(jm1,1))+
+     >                yycon4*(cuf(jp1)-2.0d0*cuf(j)+cuf(jm1))+
+     >                yycon5*(buf(jp1,5)-2.0d0*buf(j,5)+buf(jm1,5))+
+     >                dy5ty1*(ue(jp1,5)-2.0d0*ue(j,5)+ue(jm1,5))
+             end do
+
+c---------------------------------------------------------------------
+c            Fourth-order dissipation                      
+c---------------------------------------------------------------------
+             do   m = 1, 5
+                j = 1
+                forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                    (5.0d0*ue(j,m) - 4.0d0*ue(j+1,m) +ue(j+2,m))
+                j = 2
+                forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                   (-4.0d0*ue(j-1,m) + 6.0d0*ue(j,m) -
+     >                     4.0d0*ue(j+1,m) +       ue(j+2,m))
+             end do
+
+             do   m = 1, 5
+                do  j = 3, grid_points(2)-4
+                   forcing(m,i,j,k) = forcing(m,i,j,k) - dssp*
+     >                   (ue(j-2,m) - 4.0d0*ue(j-1,m) +
+     >                    6.0d0*ue(j,m) - 4.0d0*ue(j+1,m) + ue(j+2,m))
+                end do
+             end do
+
+             do   m = 1, 5
+                j = grid_points(2)-3
+                forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                   (ue(j-2,m) - 4.0d0*ue(j-1,m) +
+     >                    6.0d0*ue(j,m) - 4.0d0*ue(j+1,m))
+                j = grid_points(2)-2
+                forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                   (ue(j-2,m) - 4.0d0*ue(j-1,m) + 5.0d0*ue(j,m))
+
+             end do
+
+          end do
+       end do
+
+c---------------------------------------------------------------------
+c      zeta-direction flux differences                      
+c---------------------------------------------------------------------
+!$omp do schedule(static)
+       do  j=1, grid_points(2)-2
+          eta = dble(j) * dnym1
+          do   i = 1, grid_points(1)-2
+             xi = dble(i) * dnxm1
+
+             do k=0, grid_points(3)-1
+                zeta = dble(k) * dnzm1
+
+                call exact_solution(xi, eta, zeta, dtemp)
+                do   m = 1, 5
+                   ue(k,m) = dtemp(m)
+                end do
+
+                dtpp = 1.0d0/dtemp(1)
+
+                do   m = 2, 5
+                   buf(k,m) = dtpp * dtemp(m)
+                end do
+
+                cuf(k)   = buf(k,4) * buf(k,4)
+                buf(k,1) = cuf(k) + buf(k,2) * buf(k,2) +
+     >                     buf(k,3) * buf(k,3)
+                q(k) = 0.5d0*(buf(k,2)*ue(k,2) + buf(k,3)*ue(k,3) +
+     >                        buf(k,4)*ue(k,4))
+             end do
+
+             do    k=1, grid_points(3)-2
+                km1 = k-1
+                kp1 = k+1
+
+                forcing(1,i,j,k) = forcing(1,i,j,k) -
+     >                 tz2*( ue(kp1,4)-ue(km1,4) )+
+     >                 dz1tz1*(ue(kp1,1)-2.0d0*ue(k,1)+ue(km1,1))
+
+                forcing(2,i,j,k) = forcing(2,i,j,k) - tz2 * (
+     >                 ue(kp1,2)*buf(kp1,4)-ue(km1,2)*buf(km1,4))+
+     >                 zzcon2*(buf(kp1,2)-2.0d0*buf(k,2)+buf(km1,2))+
+     >                 dz2tz1*( ue(kp1,2)-2.0d0* ue(k,2)+ ue(km1,2))
+
+                forcing(3,i,j,k) = forcing(3,i,j,k) - tz2 * (
+     >                 ue(kp1,3)*buf(kp1,4)-ue(km1,3)*buf(km1,4))+
+     >                 zzcon2*(buf(kp1,3)-2.0d0*buf(k,3)+buf(km1,3))+
+     >                 dz3tz1*(ue(kp1,3)-2.0d0*ue(k,3)+ue(km1,3))
+
+                forcing(4,i,j,k) = forcing(4,i,j,k) - tz2 * (
+     >                (ue(kp1,4)*buf(kp1,4)+c2*(ue(kp1,5)-q(kp1)))-
+     >                (ue(km1,4)*buf(km1,4)+c2*(ue(km1,5)-q(km1))))+
+     >                zzcon1*(buf(kp1,4)-2.0d0*buf(k,4)+buf(km1,4))+
+     >                dz4tz1*( ue(kp1,4)-2.0d0*ue(k,4) +ue(km1,4))
+
+                forcing(5,i,j,k) = forcing(5,i,j,k) - tz2 * (
+     >                 buf(kp1,4)*(c1*ue(kp1,5)-c2*q(kp1))-
+     >                 buf(km1,4)*(c1*ue(km1,5)-c2*q(km1)))+
+     >                 0.5d0*zzcon3*(buf(kp1,1)-2.0d0*buf(k,1)
+     >                              +buf(km1,1))+
+     >                 zzcon4*(cuf(kp1)-2.0d0*cuf(k)+cuf(km1))+
+     >                 zzcon5*(buf(kp1,5)-2.0d0*buf(k,5)+buf(km1,5))+
+     >                 dz5tz1*( ue(kp1,5)-2.0d0*ue(k,5)+ ue(km1,5))
+             end do
+
+c---------------------------------------------------------------------
+c            Fourth-order dissipation
+c---------------------------------------------------------------------
+             do   m = 1, 5
+                k = 1
+                forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                    (5.0d0*ue(k,m) - 4.0d0*ue(k+1,m) +ue(k+2,m))
+                k = 2
+                forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                   (-4.0d0*ue(k-1,m) + 6.0d0*ue(k,m) -
+     >                     4.0d0*ue(k+1,m) +       ue(k+2,m))
+             end do
+
+             do   m = 1, 5
+                do  k = 3, grid_points(3)-4
+                   forcing(m,i,j,k) = forcing(m,i,j,k) - dssp*
+     >                   (ue(k-2,m) - 4.0d0*ue(k-1,m) +
+     >                    6.0d0*ue(k,m) - 4.0d0*ue(k+1,m) + ue(k+2,m))
+                end do
+             end do
+
+             do    m = 1, 5
+                k = grid_points(3)-3
+                forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                   (ue(k-2,m) - 4.0d0*ue(k-1,m) +
+     >                    6.0d0*ue(k,m) - 4.0d0*ue(k+1,m))
+                k = grid_points(3)-2
+                forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                   (ue(k-2,m) - 4.0d0*ue(k-1,m) + 5.0d0*ue(k,m))
+             end do
+
+          end do
+       end do
+
+c---------------------------------------------------------------------
+c now change the sign of the forcing function
+c---------------------------------------------------------------------
+!$omp do schedule(static)
+       do   k = 1, grid_points(3)-2
+          do   j = 1, grid_points(2)-2
+             do   i = 1, grid_points(1)-2
+                do   m = 1, 5
+                   forcing(m,i,j,k) = -1.d0 * forcing(m,i,j,k)
+                end do
+             end do
+          end do
+       end do
+!$omp end do nowait
+!$omp end parallel
+
+       return
+       end
+
+
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/exact_solution.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/exact_solution.f
new file mode 100644
index 0000000..772adc0
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/exact_solution.f
@@ -0,0 +1,61 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine exact_solution(xi,eta,zeta,dtemp)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c this subroutine returns the exact solution at point xi, eta, zeta
+c---------------------------------------------------------------------
+
+c       include 'header.h'
+      implicit none
+      double precision  tx1, tx2, tx3, ty1, ty2, ty3, tz1, tz2, tz3, 
+     >                  dx1, dx2, dx3, dx4, dx5, dy1, dy2, dy3, dy4, 
+     >                  dy5, dz1, dz2, dz3, dz4, dz5, dssp, dt, 
+     >                  ce(5,13), dxmax, dymax, dzmax, xxcon1, xxcon2, 
+     >                  xxcon3, xxcon4, xxcon5, dx1tx1, dx2tx1, dx3tx1,
+     >                  dx4tx1, dx5tx1, yycon1, yycon2, yycon3, yycon4,
+     >                  yycon5, dy1ty1, dy2ty1, dy3ty1, dy4ty1, dy5ty1,
+     >                  zzcon1, zzcon2, zzcon3, zzcon4, zzcon5, dz1tz1, 
+     >                  dz2tz1, dz3tz1, dz4tz1, dz5tz1, dnxm1, dnym1, 
+     >                  dnzm1, c1c2, c1c5, c3c4, c1345, conz1, c1, c2, 
+     >                  c3, c4, c5, c4dssp, c5dssp, dtdssp, dttx1, bt,
+     >                  dttx2, dtty1, dtty2, dttz1, dttz2, c2dttx1, 
+     >                  c2dtty1, c2dttz1, comz1, comz4, comz5, comz6, 
+     >                  c3c4tx3, c3c4ty3, c3c4tz3, c2iv, con43, con16
+
+      common /constants/ tx1, tx2, tx3, ty1, ty2, ty3, tz1, tz2, tz3,
+     >                  dx1, dx2, dx3, dx4, dx5, dy1, dy2, dy3, dy4, 
+     >                  dy5, dz1, dz2, dz3, dz4, dz5, dssp, dt, 
+     >                  ce, dxmax, dymax, dzmax, xxcon1, xxcon2, 
+     >                  xxcon3, xxcon4, xxcon5, dx1tx1, dx2tx1, dx3tx1,
+     >                  dx4tx1, dx5tx1, yycon1, yycon2, yycon3, yycon4,
+     >                  yycon5, dy1ty1, dy2ty1, dy3ty1, dy4ty1, dy5ty1,
+     >                  zzcon1, zzcon2, zzcon3, zzcon4, zzcon5, dz1tz1, 
+     >                  dz2tz1, dz3tz1, dz4tz1, dz5tz1, dnxm1, dnym1, 
+     >                  dnzm1, c1c2, c1c5, c3c4, c1345, conz1, c1, c2, 
+     >                  c3, c4, c5, c4dssp, c5dssp, dtdssp, dttx1, bt,
+     >                  dttx2, dtty1, dtty2, dttz1, dttz2, c2dttx1, 
+     >                  c2dtty1, c2dttz1, comz1, comz4, comz5, comz6, 
+     >                  c3c4tx3, c3c4ty3, c3c4tz3, c2iv, con43, con16
+
+       double precision  xi, eta, zeta
+       double precision  dtemp(5)
+       integer m
+
+       do  m = 1, 5
+          dtemp(m) =  ce(m,1) +
+     >    xi*(ce(m,2) + xi*(ce(m,5) + xi*(ce(m,8) + xi*ce(m,11)))) +
+     >    eta*(ce(m,3) + eta*(ce(m,6) + eta*(ce(m,9) + eta*ce(m,12))))+
+     >    zeta*(ce(m,4) + zeta*(ce(m,7) + zeta*(ce(m,10) + 
+     >    zeta*ce(m,13))))
+       end do
+
+       return
+       end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/header.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/header.h
new file mode 100644
index 0000000..aaaf147
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/header.h
@@ -0,0 +1,112 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+c---------------------------------------------------------------------
+c The following include file is generated automatically by the
+c "setparams" utility. It defines 
+c      problem_size:  12, 64, 102, 162 (for class T, A, B, C)
+c      dt_default:    default time step for this problem size if no
+c                     config file
+c      niter_default: default number of iterations for this problem size
+c---------------------------------------------------------------------
+
+      include 'npbparams.h'
+
+      integer           grid_points(3), nx2, ny2, nz2
+      common /global/   grid_points, nx2, ny2, nz2, timeron
+
+      double precision  tx1, tx2, tx3, ty1, ty2, ty3, tz1, tz2, tz3, 
+     >                  dx1, dx2, dx3, dx4, dx5, dy1, dy2, dy3, dy4, 
+     >                  dy5, dz1, dz2, dz3, dz4, dz5, dssp, dt, 
+     >                  ce(5,13), dxmax, dymax, dzmax, xxcon1, xxcon2, 
+     >                  xxcon3, xxcon4, xxcon5, dx1tx1, dx2tx1, dx3tx1,
+     >                  dx4tx1, dx5tx1, yycon1, yycon2, yycon3, yycon4,
+     >                  yycon5, dy1ty1, dy2ty1, dy3ty1, dy4ty1, dy5ty1,
+     >                  zzcon1, zzcon2, zzcon3, zzcon4, zzcon5, dz1tz1, 
+     >                  dz2tz1, dz3tz1, dz4tz1, dz5tz1, dnxm1, dnym1, 
+     >                  dnzm1, c1c2, c1c5, c3c4, c1345, conz1, c1, c2, 
+     >                  c3, c4, c5, c4dssp, c5dssp, dtdssp, dttx1, bt,
+     >                  dttx2, dtty1, dtty2, dttz1, dttz2, c2dttx1, 
+     >                  c2dtty1, c2dttz1, comz1, comz4, comz5, comz6, 
+     >                  c3c4tx3, c3c4ty3, c3c4tz3, c2iv, con43, con16
+
+      common /constants/ tx1, tx2, tx3, ty1, ty2, ty3, tz1, tz2, tz3,
+     >                  dx1, dx2, dx3, dx4, dx5, dy1, dy2, dy3, dy4, 
+     >                  dy5, dz1, dz2, dz3, dz4, dz5, dssp, dt, 
+     >                  ce, dxmax, dymax, dzmax, xxcon1, xxcon2, 
+     >                  xxcon3, xxcon4, xxcon5, dx1tx1, dx2tx1, dx3tx1,
+     >                  dx4tx1, dx5tx1, yycon1, yycon2, yycon3, yycon4,
+     >                  yycon5, dy1ty1, dy2ty1, dy3ty1, dy4ty1, dy5ty1,
+     >                  zzcon1, zzcon2, zzcon3, zzcon4, zzcon5, dz1tz1, 
+     >                  dz2tz1, dz3tz1, dz4tz1, dz5tz1, dnxm1, dnym1, 
+     >                  dnzm1, c1c2, c1c5, c3c4, c1345, conz1, c1, c2, 
+     >                  c3, c4, c5, c4dssp, c5dssp, dtdssp, dttx1, bt,
+     >                  dttx2, dtty1, dtty2, dttz1, dttz2, c2dttx1, 
+     >                  c2dtty1, c2dttz1, comz1, comz4, comz5, comz6, 
+     >                  c3c4tx3, c3c4ty3, c3c4tz3, c2iv, con43, con16
+
+      integer IMAX, JMAX, KMAX, IMAXP, JMAXP
+
+      parameter (IMAX=problem_size,JMAX=problem_size,KMAX=problem_size)
+      parameter (IMAXP=IMAX/2*2,JMAXP=JMAX/2*2)
+
+c---------------------------------------------------------------------
+c   To improve cache performance, first two dimensions padded by 1 
+c   for even number sizes only
+c---------------------------------------------------------------------
+      double precision 
+     >   u       (5, 0:IMAXP, 0:JMAXP, 0:KMAX-1),
+     >   us      (   0:IMAXP, 0:JMAXP, 0:KMAX-1),
+     >   vs      (   0:IMAXP, 0:JMAXP, 0:KMAX-1),
+     >   ws      (   0:IMAXP, 0:JMAXP, 0:KMAX-1),
+     >   qs      (   0:IMAXP, 0:JMAXP, 0:KMAX-1),
+     >   rho_i   (   0:IMAXP, 0:JMAXP, 0:KMAX-1),
+     >   speed   (   0:IMAXP, 0:JMAXP, 0:KMAX-1),
+     >   square  (   0:IMAXP, 0:JMAXP, 0:KMAX-1),
+     >   rhs     (5, 0:IMAXP, 0:JMAXP, 0:KMAX-1),
+     >   forcing (5, 0:IMAXP, 0:JMAXP, 0:KMAX-1)
+
+      common /fields/  u, us, vs, ws, qs, rho_i, speed, square, 
+     >                 rhs, forcing
+
+      double precision cv(0:problem_size-1),   rhon(0:problem_size-1),
+     >                 rhos(0:problem_size-1), rhoq(0:problem_size-1),
+     >                 cuf(0:problem_size-1),  q(0:problem_size-1),
+     >                 ue(0:problem_size-1,5), buf(0:problem_size-1,5)
+      common /work_1d/ cv, rhon, rhos, rhoq, cuf, q, ue, buf
+!$omp threadprivate(/work_1d/)
+
+      double precision
+     >   lhs (5,0:IMAXP,0:IMAXP),
+     >   lhsp(5,0:IMAXP,0:IMAXP),
+     >   lhsm(5,0:IMAXP,0:IMAXP)
+      common /work_lhs/ lhs, lhsp, lhsm
+!$omp threadprivate(/work_lhs/)
+
+c-----------------------------------------------------------------------
+c   Timer constants
+c-----------------------------------------------------------------------
+      integer t_rhsx,t_rhsy,t_rhsz,t_xsolve,t_ysolve,t_zsolve,
+     >        t_rdis1,t_rdis2,t_tzetar,t_ninvr,t_pinvr,t_add,
+     >        t_rhs,t_txinvr,t_last,t_total
+      logical timeron
+      parameter (t_total = 1)
+      parameter (t_rhsx = 2)
+      parameter (t_rhsy = 3)
+      parameter (t_rhsz = 4)
+      parameter (t_rhs = 5)
+      parameter (t_xsolve = 6)
+      parameter (t_ysolve = 7)
+      parameter (t_zsolve = 8)
+      parameter (t_rdis1 = 9)
+      parameter (t_rdis2 = 10)
+      parameter (t_txinvr = 11)
+      parameter (t_pinvr = 12)
+      parameter (t_ninvr = 13)
+      parameter (t_tzetar = 14)
+      parameter (t_add = 15)
+      parameter (t_last = 15)
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/initialize.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/initialize.f
new file mode 100644
index 0000000..ac47c6d
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/initialize.f
@@ -0,0 +1,281 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine  initialize
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c This subroutine initializes the field variable u using 
+c tri-linear transfinite interpolation of the boundary values     
+c---------------------------------------------------------------------
+
+       include 'header.h'
+  
+       integer i, j, k, m, ix, iy, iz
+       double precision  xi, eta, zeta, Pface(5,3,2), Pxi, Peta, 
+     >                   Pzeta, temp(5)
+    
+
+!$omp parallel default(shared)
+!$omp& private(i,j,k,m,zeta,eta,xi,ix,iy,iz,Pxi,Peta,Pzeta,Pface,temp)
+c---------------------------------------------------------------------
+c  Later (in compute_rhs) we compute 1/u for every element. A few of 
+c  the corner elements are not used, but it is convenient (and faster)
+c  to compute the whole thing with a simple loop. Make sure those 
+c  values are nonzero by initializing the whole thing here. 
+c---------------------------------------------------------------------
+!$omp do schedule(static)
+      do k = 0, grid_points(3)-1
+         do j = 0, grid_points(2)-1
+            do i = 0, grid_points(1)-1
+               u(1,i,j,k) = 1.0d0
+               u(2,i,j,k) = 0.0d0
+               u(3,i,j,k) = 0.0d0
+               u(4,i,j,k) = 0.0d0
+               u(5,i,j,k) = 1.0d0
+            end do
+         end do
+      end do
+!$omp end do
+
+c---------------------------------------------------------------------
+c first store the "interpolated" values everywhere on the grid    
+c---------------------------------------------------------------------
+!$omp do schedule(static)
+          do  k = 0, grid_points(3)-1
+             zeta = dble(k) * dnzm1
+             do  j = 0, grid_points(2)-1
+                eta = dble(j) * dnym1
+                do   i = 0, grid_points(1)-1
+                   xi = dble(i) * dnxm1
+                  
+                   do ix = 1, 2
+                      Pxi = dble(ix-1)
+                      call exact_solution(Pxi, eta, zeta, 
+     >                                    Pface(1,1,ix))
+                   end do
+
+                   do    iy = 1, 2
+                      Peta = dble(iy-1)
+                      call exact_solution(xi, Peta, zeta, 
+     >                                    Pface(1,2,iy))
+                   end do
+
+                   do    iz = 1, 2
+                      Pzeta = dble(iz-1)
+                      call exact_solution(xi, eta, Pzeta,   
+     >                                    Pface(1,3,iz))
+                   end do
+
+                   do   m = 1, 5
+                      Pxi   = xi   * Pface(m,1,2) + 
+     >                        (1.0d0-xi)   * Pface(m,1,1)
+                      Peta  = eta  * Pface(m,2,2) + 
+     >                        (1.0d0-eta)  * Pface(m,2,1)
+                      Pzeta = zeta * Pface(m,3,2) + 
+     >                        (1.0d0-zeta) * Pface(m,3,1)
+ 
+                      u(m,i,j,k) = Pxi + Peta + Pzeta - 
+     >                          Pxi*Peta - Pxi*Pzeta - Peta*Pzeta + 
+     >                          Pxi*Peta*Pzeta
+
+                   end do
+                end do
+             end do
+          end do
+!$omp end do nowait
+
+c---------------------------------------------------------------------
+c now store the exact values on the boundaries        
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c west face                                                  
+c---------------------------------------------------------------------
+
+       xi = 0.0d0
+       i  = 0
+!$omp do schedule(static)
+       do  k = 0, grid_points(3)-1
+          zeta = dble(k) * dnzm1
+          do   j = 0, grid_points(2)-1
+             eta = dble(j) * dnym1
+             call exact_solution(xi, eta, zeta, temp)
+             do   m = 1, 5
+                u(m,i,j,k) = temp(m)
+             end do
+          end do
+       end do
+!$omp end do nowait
+
+c---------------------------------------------------------------------
+c east face                                                      
+c---------------------------------------------------------------------
+
+       xi = 1.0d0
+       i  = grid_points(1)-1
+!$omp do schedule(static)
+       do   k = 0, grid_points(3)-1
+          zeta = dble(k) * dnzm1
+          do   j = 0, grid_points(2)-1
+             eta = dble(j) * dnym1
+             call exact_solution(xi, eta, zeta, temp)
+             do   m = 1, 5
+                u(m,i,j,k) = temp(m)
+             end do
+          end do
+       end do
+!$omp end do nowait
+
+c---------------------------------------------------------------------
+c south face                                                 
+c---------------------------------------------------------------------
+
+       eta = 0.0d0
+       j   = 0
+!$omp do schedule(static)
+       do  k = 0, grid_points(3)-1
+          zeta = dble(k) * dnzm1
+          do   i = 0, grid_points(1)-1
+             xi = dble(i) * dnxm1
+             call exact_solution(xi, eta, zeta, temp)
+             do   m = 1, 5
+                u(m,i,j,k) = temp(m)
+             end do
+          end do
+       end do
+!$omp end do nowait
+
+
+c---------------------------------------------------------------------
+c north face                                    
+c---------------------------------------------------------------------
+
+       eta = 1.0d0
+       j   = grid_points(2)-1
+!$omp do schedule(static)
+       do   k = 0, grid_points(3)-1
+          zeta = dble(k) * dnzm1
+          do   i = 0, grid_points(1)-1
+             xi = dble(i) * dnxm1
+             call exact_solution(xi, eta, zeta, temp)
+             do   m = 1, 5
+                u(m,i,j,k) = temp(m)
+             end do
+          end do
+       end do
+!$omp end do
+
+c---------------------------------------------------------------------
+c bottom face                                       
+c---------------------------------------------------------------------
+
+       zeta = 0.0d0
+       k    = 0
+!$omp do schedule(static)
+       do   j = 0, grid_points(2)-1
+          eta = dble(j) * dnym1
+          do   i =0, grid_points(1)-1
+             xi = dble(i) *dnxm1
+             call exact_solution(xi, eta, zeta, temp)
+             do   m = 1, 5
+                u(m,i,j,k) = temp(m)
+             end do
+          end do
+       end do
+!$omp end do nowait
+
+c---------------------------------------------------------------------
+c top face     
+c---------------------------------------------------------------------
+
+       zeta = 1.0d0
+       k    = grid_points(3)-1
+!$omp do schedule(static)
+       do   j = 0, grid_points(2)-1
+          eta = dble(j) * dnym1
+          do   i =0, grid_points(1)-1
+             xi = dble(i) * dnxm1
+             call exact_solution(xi, eta, zeta, temp)
+             do   m = 1, 5
+                u(m,i,j,k) = temp(m)
+             end do
+          end do
+       end do
+!$omp end do nowait
+!$omp end parallel
+
+       return
+       end
+
+
+       subroutine lhsinit(ni, nj)
+
+       include 'header.h'
+
+       integer ni, nj
+
+       integer j, m
+
+c---------------------------------------------------------------------
+c     zap the whole left hand side for starters
+c     set all diagonal values to 1. This is overkill, but convenient
+c---------------------------------------------------------------------
+       do j = 1, nj
+          do   m = 1, 5
+             lhs (m,0,j) = 0.0d0
+             lhsp(m,0,j) = 0.0d0
+             lhsm(m,0,j) = 0.0d0
+             lhs (m,ni,j) = 0.0d0
+             lhsp(m,ni,j) = 0.0d0
+             lhsm(m,ni,j) = 0.0d0
+          end do
+          lhs (3,0,j) = 1.0d0
+          lhsp(3,0,j) = 1.0d0
+          lhsm(3,0,j) = 1.0d0
+          lhs (3,ni,j) = 1.0d0
+          lhsp(3,ni,j) = 1.0d0
+          lhsm(3,ni,j) = 1.0d0
+       end do
+ 
+       return
+       end
+
+
+       subroutine lhsinitj(nj, ni)
+
+       include 'header.h'
+
+       integer nj, ni
+
+       integer i, m
+
+c---------------------------------------------------------------------
+c     zap the whole left hand side for starters
+c     set all diagonal values to 1. This is overkill, but convenient
+c---------------------------------------------------------------------
+       do i = 1, ni
+          do   m = 1, 5
+             lhs (m,i,0) = 0.0d0
+             lhsp(m,i,0) = 0.0d0
+             lhsm(m,i,0) = 0.0d0
+             lhs (m,i,nj) = 0.0d0
+             lhsp(m,i,nj) = 0.0d0
+             lhsm(m,i,nj) = 0.0d0
+          end do
+          lhs (3,i,0) = 1.0d0
+          lhsp(3,i,0) = 1.0d0
+          lhsm(3,i,0) = 1.0d0
+          lhs (3,i,nj) = 1.0d0
+          lhsp(3,i,nj) = 1.0d0
+          lhsm(3,i,nj) = 1.0d0
+       end do
+ 
+       return
+       end
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/inputsp.data.sample b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/inputsp.data.sample
new file mode 100644
index 0000000..ae3801f
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/inputsp.data.sample
@@ -0,0 +1,3 @@
+400       number of time steps
+0.0015d0  dt for class A = 0.0015d0. class B = 0.001d0  class C = 0.00067d0
+64 64 64
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/ninvr.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/ninvr.f
new file mode 100644
index 0000000..2d9dd67
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/ninvr.f
@@ -0,0 +1,45 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine  ninvr
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   block-diagonal matrix-vector multiplication              
+c---------------------------------------------------------------------
+
+       include 'header.h'
+
+       integer  i, j, k
+       double precision r1, r2, r3, r4, r5, t1, t2
+
+       if (timeron) call timer_start(t_ninvr)
+!$omp parallel do default(shared) private(i,j,k,r1,r2,r3,r4,r5,t1,t2)
+       do k = 1, nz2
+          do j = 1, ny2
+             do i = 1, nx2
+
+                r1 = rhs(1,i,j,k)
+                r2 = rhs(2,i,j,k)
+                r3 = rhs(3,i,j,k)
+                r4 = rhs(4,i,j,k)
+                r5 = rhs(5,i,j,k)
+               
+                t1 = bt * r3
+                t2 = 0.5d0 * ( r4 + r5 )
+
+                rhs(1,i,j,k) = -r2
+                rhs(2,i,j,k) =  r1
+                rhs(3,i,j,k) = bt * ( r4 - r5 )
+                rhs(4,i,j,k) = -t1 + t2
+                rhs(5,i,j,k) =  t1 + t2
+             enddo    
+          enddo
+       enddo
+       if (timeron) call timer_stop(t_ninvr)
+
+       return
+       end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/pinvr.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/pinvr.f
new file mode 100644
index 0000000..9cc2747
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/pinvr.f
@@ -0,0 +1,48 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine pinvr
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   block-diagonal matrix-vector multiplication                       
+c---------------------------------------------------------------------
+
+       include 'header.h'
+
+       integer i, j, k
+       double precision r1, r2, r3, r4, r5, t1, t2
+
+       if (timeron) call timer_start(t_pinvr)
+!$omp parallel do default(shared) private(i,j,k,r1,r2,r3,r4,r5,t1,t2)
+       do   k = 1, nz2
+          do   j = 1, ny2
+             do   i = 1, nx2
+
+                r1 = rhs(1,i,j,k)
+                r2 = rhs(2,i,j,k)
+                r3 = rhs(3,i,j,k)
+                r4 = rhs(4,i,j,k)
+                r5 = rhs(5,i,j,k)
+
+                t1 = bt * r1
+                t2 = 0.5d0 * ( r4 + r5 )
+
+                rhs(1,i,j,k) =  bt * ( r4 - r5 )
+                rhs(2,i,j,k) = -r3
+                rhs(3,i,j,k) =  r2
+                rhs(4,i,j,k) = -t1 + t2
+                rhs(5,i,j,k) =  t1 + t2
+             end do
+          end do
+       end do
+       if (timeron) call timer_stop(t_pinvr)
+
+       return
+       end
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/rhs.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/rhs.f
new file mode 100644
index 0000000..1805442
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/rhs.f
@@ -0,0 +1,441 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine compute_rhs
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       include 'header.h'
+
+       integer i, j, k, m
+       double precision aux, rho_inv, uijk, up1, um1, vijk, vp1, vm1,
+     >                  wijk, wp1, wm1
+
+
+       if (timeron) call timer_start(t_rhs)
+!$omp parallel default(shared) private(i,j,k,m,rho_inv,aux,uijk,up1,um1,
+!$omp&   vijk,vp1,vm1,wijk,wp1,wm1)
+c---------------------------------------------------------------------
+c      compute the reciprocal of density, and the kinetic energy, 
+c      and the speed of sound. 
+c---------------------------------------------------------------------
+
+!$omp do schedule(static)
+       do    k = 0, grid_points(3)-1
+          do    j = 0, grid_points(2)-1
+             do    i = 0, grid_points(1)-1
+                rho_inv = 1.0d0/u(1,i,j,k)
+                rho_i(i,j,k) = rho_inv
+                us(i,j,k) = u(2,i,j,k) * rho_inv
+                vs(i,j,k) = u(3,i,j,k) * rho_inv
+                ws(i,j,k) = u(4,i,j,k) * rho_inv
+                square(i,j,k)     = 0.5d0* (
+     >                        u(2,i,j,k)*u(2,i,j,k) + 
+     >                        u(3,i,j,k)*u(3,i,j,k) +
+     >                        u(4,i,j,k)*u(4,i,j,k) ) * rho_inv
+                qs(i,j,k) = square(i,j,k) * rho_inv
+c---------------------------------------------------------------------
+c               (don't need speed and ainx until the lhs computation)
+c---------------------------------------------------------------------
+                aux = c1c2*rho_inv* (u(5,i,j,k) - square(i,j,k))
+                speed(i,j,k) = dsqrt(aux)
+             end do
+          end do
+       end do
+!$omp end do nowait
+
+c---------------------------------------------------------------------
+c copy the exact forcing term to the right hand side;  because 
+c this forcing term is known, we can store it on the whole grid
+c including the boundary                   
+c---------------------------------------------------------------------
+
+!$omp do schedule(static)
+       do    k = 0, nz2+1
+          do    j = 0, ny2+1
+             do    i = 0, nx2+1
+                do    m = 1, 5
+                   rhs(m,i,j,k) = forcing(m,i,j,k)
+                end do
+             end do
+          end do
+       end do
+!$omp end do
+
+c---------------------------------------------------------------------
+c      compute xi-direction fluxes 
+c---------------------------------------------------------------------
+!$omp master
+       if (timeron) call timer_start(t_rhsx)
+!$omp end master
+!$omp do schedule(static)
+       do    k = 1, nz2
+          do    j = 1, ny2
+             do    i = 1, nx2
+                uijk = us(i,j,k)
+                up1  = us(i+1,j,k)
+                um1  = us(i-1,j,k)
+
+                rhs(1,i,j,k) = rhs(1,i,j,k) + dx1tx1 * 
+     >                    (u(1,i+1,j,k) - 2.0d0*u(1,i,j,k) + 
+     >                     u(1,i-1,j,k)) -
+     >                    tx2 * (u(2,i+1,j,k) - u(2,i-1,j,k))
+
+                rhs(2,i,j,k) = rhs(2,i,j,k) + dx2tx1 * 
+     >                    (u(2,i+1,j,k) - 2.0d0*u(2,i,j,k) + 
+     >                     u(2,i-1,j,k)) +
+     >                    xxcon2*con43 * (up1 - 2.0d0*uijk + um1) -
+     >                    tx2 * (u(2,i+1,j,k)*up1 - 
+     >                           u(2,i-1,j,k)*um1 +
+     >                           (u(5,i+1,j,k)- square(i+1,j,k)-
+     >                            u(5,i-1,j,k)+ square(i-1,j,k))*
+     >                            c2)
+
+                rhs(3,i,j,k) = rhs(3,i,j,k) + dx3tx1 * 
+     >                    (u(3,i+1,j,k) - 2.0d0*u(3,i,j,k) +
+     >                     u(3,i-1,j,k)) +
+     >                    xxcon2 * (vs(i+1,j,k) - 2.0d0*vs(i,j,k) +
+     >                              vs(i-1,j,k)) -
+     >                    tx2 * (u(3,i+1,j,k)*up1 - 
+     >                           u(3,i-1,j,k)*um1)
+
+                rhs(4,i,j,k) = rhs(4,i,j,k) + dx4tx1 * 
+     >                    (u(4,i+1,j,k) - 2.0d0*u(4,i,j,k) +
+     >                     u(4,i-1,j,k)) +
+     >                    xxcon2 * (ws(i+1,j,k) - 2.0d0*ws(i,j,k) +
+     >                              ws(i-1,j,k)) -
+     >                    tx2 * (u(4,i+1,j,k)*up1 - 
+     >                           u(4,i-1,j,k)*um1)
+
+                rhs(5,i,j,k) = rhs(5,i,j,k) + dx5tx1 * 
+     >                    (u(5,i+1,j,k) - 2.0d0*u(5,i,j,k) +
+     >                     u(5,i-1,j,k)) +
+     >                    xxcon3 * (qs(i+1,j,k) - 2.0d0*qs(i,j,k) +
+     >                              qs(i-1,j,k)) +
+     >                    xxcon4 * (up1*up1 -       2.0d0*uijk*uijk + 
+     >                              um1*um1) +
+     >                    xxcon5 * (u(5,i+1,j,k)*rho_i(i+1,j,k) - 
+     >                              2.0d0*u(5,i,j,k)*rho_i(i,j,k) +
+     >                              u(5,i-1,j,k)*rho_i(i-1,j,k)) -
+     >                    tx2 * ( (c1*u(5,i+1,j,k) - 
+     >                             c2*square(i+1,j,k))*up1 -
+     >                            (c1*u(5,i-1,j,k) - 
+     >                             c2*square(i-1,j,k))*um1 )
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c      add fourth order xi-direction dissipation               
+c---------------------------------------------------------------------
+          do    j = 1, ny2
+             i = 1
+             do    m = 1, 5
+                rhs(m,i,j,k) = rhs(m,i,j,k)- dssp * 
+     >                    ( 5.0d0*u(m,i,j,k) - 4.0d0*u(m,i+1,j,k) +
+     >                            u(m,i+2,j,k))
+             end do
+
+             i = 2
+             do    m = 1, 5
+                rhs(m,i,j,k) = rhs(m,i,j,k) - dssp * 
+     >                    (-4.0d0*u(m,i-1,j,k) + 6.0d0*u(m,i,j,k) -
+     >                      4.0d0*u(m,i+1,j,k) + u(m,i+2,j,k))
+             end do
+          end do
+
+          do    j = 1, ny2
+             do  i = 3, nx2-2
+                do     m = 1, 5
+                   rhs(m,i,j,k) = rhs(m,i,j,k) - dssp * 
+     >                    (  u(m,i-2,j,k) - 4.0d0*u(m,i-1,j,k) + 
+     >                   6.0d0*u(m,i,j,k) - 4.0d0*u(m,i+1,j,k) + 
+     >                         u(m,i+2,j,k) )
+                end do
+             end do
+          end do
+
+          do    j = 1, ny2
+             i = nx2-1
+             do     m = 1, 5
+                rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *
+     >                    ( u(m,i-2,j,k) - 4.0d0*u(m,i-1,j,k) + 
+     >                      6.0d0*u(m,i,j,k) - 4.0d0*u(m,i+1,j,k) )
+             end do
+
+             i = nx2
+             do     m = 1, 5
+                rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *
+     >                    ( u(m,i-2,j,k) - 4.d0*u(m,i-1,j,k) +
+     >                      5.d0*u(m,i,j,k) )
+             end do
+          end do
+       end do
+!$omp end do nowait
+!$omp master
+       if (timeron) call timer_stop(t_rhsx)
+
+c---------------------------------------------------------------------
+c      compute eta-direction fluxes 
+c---------------------------------------------------------------------
+       if (timeron) call timer_start(t_rhsy)
+!$omp end master
+!$omp do schedule(static)
+       do     k = 1, nz2
+          do     j = 1, ny2
+             do     i = 1, nx2
+                vijk = vs(i,j,k)
+                vp1  = vs(i,j+1,k)
+                vm1  = vs(i,j-1,k)
+                rhs(1,i,j,k) = rhs(1,i,j,k) + dy1ty1 * 
+     >                   (u(1,i,j+1,k) - 2.0d0*u(1,i,j,k) + 
+     >                    u(1,i,j-1,k)) -
+     >                   ty2 * (u(3,i,j+1,k) - u(3,i,j-1,k))
+                rhs(2,i,j,k) = rhs(2,i,j,k) + dy2ty1 * 
+     >                   (u(2,i,j+1,k) - 2.0d0*u(2,i,j,k) + 
+     >                    u(2,i,j-1,k)) +
+     >                   yycon2 * (us(i,j+1,k) - 2.0d0*us(i,j,k) + 
+     >                             us(i,j-1,k)) -
+     >                   ty2 * (u(2,i,j+1,k)*vp1 - 
+     >                          u(2,i,j-1,k)*vm1)
+                rhs(3,i,j,k) = rhs(3,i,j,k) + dy3ty1 * 
+     >                   (u(3,i,j+1,k) - 2.0d0*u(3,i,j,k) + 
+     >                    u(3,i,j-1,k)) +
+     >                   yycon2*con43 * (vp1 - 2.0d0*vijk + vm1) -
+     >                   ty2 * (u(3,i,j+1,k)*vp1 - 
+     >                          u(3,i,j-1,k)*vm1 +
+     >                          (u(5,i,j+1,k) - square(i,j+1,k) - 
+     >                           u(5,i,j-1,k) + square(i,j-1,k))
+     >                          *c2)
+                rhs(4,i,j,k) = rhs(4,i,j,k) + dy4ty1 * 
+     >                   (u(4,i,j+1,k) - 2.0d0*u(4,i,j,k) + 
+     >                    u(4,i,j-1,k)) +
+     >                   yycon2 * (ws(i,j+1,k) - 2.0d0*ws(i,j,k) + 
+     >                             ws(i,j-1,k)) -
+     >                   ty2 * (u(4,i,j+1,k)*vp1 - 
+     >                          u(4,i,j-1,k)*vm1)
+                rhs(5,i,j,k) = rhs(5,i,j,k) + dy5ty1 * 
+     >                   (u(5,i,j+1,k) - 2.0d0*u(5,i,j,k) + 
+     >                    u(5,i,j-1,k)) +
+     >                   yycon3 * (qs(i,j+1,k) - 2.0d0*qs(i,j,k) + 
+     >                             qs(i,j-1,k)) +
+     >                   yycon4 * (vp1*vp1       - 2.0d0*vijk*vijk + 
+     >                             vm1*vm1) +
+     >                   yycon5 * (u(5,i,j+1,k)*rho_i(i,j+1,k) - 
+     >                             2.0d0*u(5,i,j,k)*rho_i(i,j,k) +
+     >                             u(5,i,j-1,k)*rho_i(i,j-1,k)) -
+     >                   ty2 * ((c1*u(5,i,j+1,k) - 
+     >                           c2*square(i,j+1,k)) * vp1 -
+     >                          (c1*u(5,i,j-1,k) - 
+     >                           c2*square(i,j-1,k)) * vm1)
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c      add fourth order eta-direction dissipation         
+c---------------------------------------------------------------------
+
+          j = 1
+          do     i = 1, nx2
+             do     m = 1, 5
+                rhs(m,i,j,k) = rhs(m,i,j,k)- dssp * 
+     >                    ( 5.0d0*u(m,i,j,k) - 4.0d0*u(m,i,j+1,k) +
+     >                            u(m,i,j+2,k))
+             end do
+          end do
+
+          j = 2
+          do     i = 1, nx2
+             do     m = 1, 5
+                rhs(m,i,j,k) = rhs(m,i,j,k) - dssp * 
+     >                    (-4.0d0*u(m,i,j-1,k) + 6.0d0*u(m,i,j,k) -
+     >                      4.0d0*u(m,i,j+1,k) + u(m,i,j+2,k))
+             end do
+          end do
+
+          do    j = 3, ny2-2
+             do  i = 1,nx2
+                do     m = 1, 5
+                   rhs(m,i,j,k) = rhs(m,i,j,k) - dssp * 
+     >                    (  u(m,i,j-2,k) - 4.0d0*u(m,i,j-1,k) + 
+     >                   6.0d0*u(m,i,j,k) - 4.0d0*u(m,i,j+1,k) +
+     >                         u(m,i,j+2,k) )
+                end do
+             end do
+          end do
+ 
+          j = ny2-1
+          do     i = 1, nx2
+             do     m = 1, 5
+                rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *
+     >                    ( u(m,i,j-2,k) - 4.0d0*u(m,i,j-1,k) + 
+     >                      6.0d0*u(m,i,j,k) - 4.0d0*u(m,i,j+1,k) )
+             end do
+          end do
+
+          j = ny2
+          do     i = 1, nx2
+             do     m = 1, 5
+                rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *
+     >                    ( u(m,i,j-2,k) - 4.d0*u(m,i,j-1,k) +
+     >                      5.d0*u(m,i,j,k) )
+             end do
+          end do
+       end do
+!$omp end do
+!$omp master
+       if (timeron) call timer_stop(t_rhsy)
+
+c---------------------------------------------------------------------
+c      compute zeta-direction fluxes 
+c---------------------------------------------------------------------
+       if (timeron) call timer_start(t_rhsz)
+!$omp end master
+!$omp do schedule(static)
+       do    k = 1, grid_points(3)-2
+          do     j = 1, grid_points(2)-2
+             do     i = 1, grid_points(1)-2
+                wijk = ws(i,j,k)
+                wp1  = ws(i,j,k+1)
+                wm1  = ws(i,j,k-1)
+
+                rhs(1,i,j,k) = rhs(1,i,j,k) + dz1tz1 * 
+     >                   (u(1,i,j,k+1) - 2.0d0*u(1,i,j,k) + 
+     >                    u(1,i,j,k-1)) -
+     >                   tz2 * (u(4,i,j,k+1) - u(4,i,j,k-1))
+                rhs(2,i,j,k) = rhs(2,i,j,k) + dz2tz1 * 
+     >                   (u(2,i,j,k+1) - 2.0d0*u(2,i,j,k) + 
+     >                    u(2,i,j,k-1)) +
+     >                   zzcon2 * (us(i,j,k+1) - 2.0d0*us(i,j,k) + 
+     >                             us(i,j,k-1)) -
+     >                   tz2 * (u(2,i,j,k+1)*wp1 - 
+     >                          u(2,i,j,k-1)*wm1)
+                rhs(3,i,j,k) = rhs(3,i,j,k) + dz3tz1 * 
+     >                   (u(3,i,j,k+1) - 2.0d0*u(3,i,j,k) + 
+     >                    u(3,i,j,k-1)) +
+     >                   zzcon2 * (vs(i,j,k+1) - 2.0d0*vs(i,j,k) + 
+     >                             vs(i,j,k-1)) -
+     >                   tz2 * (u(3,i,j,k+1)*wp1 - 
+     >                          u(3,i,j,k-1)*wm1)
+                rhs(4,i,j,k) = rhs(4,i,j,k) + dz4tz1 * 
+     >                   (u(4,i,j,k+1) - 2.0d0*u(4,i,j,k) + 
+     >                    u(4,i,j,k-1)) +
+     >                   zzcon2*con43 * (wp1 - 2.0d0*wijk + wm1) -
+     >                   tz2 * (u(4,i,j,k+1)*wp1 - 
+     >                          u(4,i,j,k-1)*wm1 +
+     >                          (u(5,i,j,k+1) - square(i,j,k+1) - 
+     >                           u(5,i,j,k-1) + square(i,j,k-1))
+     >                          *c2)
+                rhs(5,i,j,k) = rhs(5,i,j,k) + dz5tz1 * 
+     >                   (u(5,i,j,k+1) - 2.0d0*u(5,i,j,k) + 
+     >                    u(5,i,j,k-1)) +
+     >                   zzcon3 * (qs(i,j,k+1) - 2.0d0*qs(i,j,k) + 
+     >                             qs(i,j,k-1)) +
+     >                   zzcon4 * (wp1*wp1 - 2.0d0*wijk*wijk + 
+     >                             wm1*wm1) +
+     >                   zzcon5 * (u(5,i,j,k+1)*rho_i(i,j,k+1) - 
+     >                             2.0d0*u(5,i,j,k)*rho_i(i,j,k) +
+     >                             u(5,i,j,k-1)*rho_i(i,j,k-1)) -
+     >                   tz2 * ( (c1*u(5,i,j,k+1) - 
+     >                            c2*square(i,j,k+1))*wp1 -
+     >                           (c1*u(5,i,j,k-1) - 
+     >                            c2*square(i,j,k-1))*wm1)
+             end do
+          end do
+       end do
+!$omp end do
+
+c---------------------------------------------------------------------
+c      add fourth order zeta-direction dissipation                
+c---------------------------------------------------------------------
+
+       k = 1
+!$omp do schedule(static)
+       do     j = 1, grid_points(2)-2
+          do     i = 1, grid_points(1)-2
+             do     m = 1, 5
+                rhs(m,i,j,k) = rhs(m,i,j,k)- dssp * 
+     >                    ( 5.0d0*u(m,i,j,k) - 4.0d0*u(m,i,j,k+1) +
+     >                            u(m,i,j,k+2))
+             end do
+          end do
+       end do
+!$omp end do nowait
+
+       k = 2
+!$omp do schedule(static)
+       do     j = 1, grid_points(2)-2
+          do     i = 1, grid_points(1)-2
+             do     m = 1, 5
+                rhs(m,i,j,k) = rhs(m,i,j,k) - dssp * 
+     >                    (-4.0d0*u(m,i,j,k-1) + 6.0d0*u(m,i,j,k) -
+     >                      4.0d0*u(m,i,j,k+1) + u(m,i,j,k+2))
+             end do
+          end do
+       end do
+!$omp end do nowait
+
+!$omp do schedule(static)
+       do     k = 3, grid_points(3)-4
+          do     j = 1, grid_points(2)-2
+             do     i = 1,grid_points(1)-2
+                do     m = 1, 5
+                   rhs(m,i,j,k) = rhs(m,i,j,k) - dssp * 
+     >                    (  u(m,i,j,k-2) - 4.0d0*u(m,i,j,k-1) + 
+     >                   6.0d0*u(m,i,j,k) - 4.0d0*u(m,i,j,k+1) +
+     >                         u(m,i,j,k+2) )
+                end do
+             end do
+          end do
+       end do
+!$omp end do nowait
+ 
+       k = grid_points(3)-3
+!$omp do schedule(static)
+       do     j = 1, grid_points(2)-2
+          do     i = 1, grid_points(1)-2
+             do     m = 1, 5
+                rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *
+     >                    ( u(m,i,j,k-2) - 4.0d0*u(m,i,j,k-1) + 
+     >                      6.0d0*u(m,i,j,k) - 4.0d0*u(m,i,j,k+1) )
+             end do
+          end do
+       end do
+!$omp end do nowait
+
+       k = grid_points(3)-2
+!$omp do schedule(static)
+       do     j = 1, grid_points(2)-2
+          do     i = 1, grid_points(1)-2
+             do     m = 1, 5
+                rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *
+     >                    ( u(m,i,j,k-2) - 4.d0*u(m,i,j,k-1) +
+     >                      5.d0*u(m,i,j,k) )
+             end do
+          end do
+       end do
+!$omp end do
+!$omp master
+       if (timeron) call timer_stop(t_rhsz)
+!$omp end master
+
+!$omp do schedule(static)
+       do    k = 1, nz2
+          do    j = 1, ny2
+             do    i = 1, nx2
+                do    m = 1, 5
+                   rhs(m,i,j,k) = rhs(m,i,j,k) * dt
+                end do
+             end do
+          end do
+       end do
+!$omp end do nowait
+!$omp end parallel
+        if (timeron) call timer_stop(t_rhs)
+   
+       return
+       end
+
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/set_constants.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/set_constants.f
new file mode 100644
index 0000000..63ce72b
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/set_constants.f
@@ -0,0 +1,203 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine  set_constants
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       include 'header.h'
+  
+       ce(1,1)  = 2.0d0
+       ce(1,2)  = 0.0d0
+       ce(1,3)  = 0.0d0
+       ce(1,4)  = 4.0d0
+       ce(1,5)  = 5.0d0
+       ce(1,6)  = 3.0d0
+       ce(1,7)  = 0.5d0
+       ce(1,8)  = 0.02d0
+       ce(1,9)  = 0.01d0
+       ce(1,10) = 0.03d0
+       ce(1,11) = 0.5d0
+       ce(1,12) = 0.4d0
+       ce(1,13) = 0.3d0
+ 
+       ce(2,1)  = 1.0d0
+       ce(2,2)  = 0.0d0
+       ce(2,3)  = 0.0d0
+       ce(2,4)  = 0.0d0
+       ce(2,5)  = 1.0d0
+       ce(2,6)  = 2.0d0
+       ce(2,7)  = 3.0d0
+       ce(2,8)  = 0.01d0
+       ce(2,9)  = 0.03d0
+       ce(2,10) = 0.02d0
+       ce(2,11) = 0.4d0
+       ce(2,12) = 0.3d0
+       ce(2,13) = 0.5d0
+
+       ce(3,1)  = 2.0d0
+       ce(3,2)  = 2.0d0
+       ce(3,3)  = 0.0d0
+       ce(3,4)  = 0.0d0
+       ce(3,5)  = 0.0d0
+       ce(3,6)  = 2.0d0
+       ce(3,7)  = 3.0d0
+       ce(3,8)  = 0.04d0
+       ce(3,9)  = 0.03d0
+       ce(3,10) = 0.05d0
+       ce(3,11) = 0.3d0
+       ce(3,12) = 0.5d0
+       ce(3,13) = 0.4d0
+
+       ce(4,1)  = 2.0d0
+       ce(4,2)  = 2.0d0
+       ce(4,3)  = 0.0d0
+       ce(4,4)  = 0.0d0
+       ce(4,5)  = 0.0d0
+       ce(4,6)  = 2.0d0
+       ce(4,7)  = 3.0d0
+       ce(4,8)  = 0.03d0
+       ce(4,9)  = 0.05d0
+       ce(4,10) = 0.04d0
+       ce(4,11) = 0.2d0
+       ce(4,12) = 0.1d0
+       ce(4,13) = 0.3d0
+
+       ce(5,1)  = 5.0d0
+       ce(5,2)  = 4.0d0
+       ce(5,3)  = 3.0d0
+       ce(5,4)  = 2.0d0
+       ce(5,5)  = 0.1d0
+       ce(5,6)  = 0.4d0
+       ce(5,7)  = 0.3d0
+       ce(5,8)  = 0.05d0
+       ce(5,9)  = 0.04d0
+       ce(5,10) = 0.03d0
+       ce(5,11) = 0.1d0
+       ce(5,12) = 0.3d0
+       ce(5,13) = 0.2d0
+
+       c1 = 1.4d0
+       c2 = 0.4d0
+       c3 = 0.1d0
+       c4 = 1.0d0
+       c5 = 1.4d0
+
+       bt = dsqrt(0.5d0)
+
+       dnxm1 = 1.0d0 / dble(grid_points(1)-1)
+       dnym1 = 1.0d0 / dble(grid_points(2)-1)
+       dnzm1 = 1.0d0 / dble(grid_points(3)-1)
+
+       c1c2 = c1 * c2
+       c1c5 = c1 * c5
+       c3c4 = c3 * c4
+       c1345 = c1c5 * c3c4
+
+       conz1 = (1.0d0-c1c5)
+
+       tx1 = 1.0d0 / (dnxm1 * dnxm1)
+       tx2 = 1.0d0 / (2.0d0 * dnxm1)
+       tx3 = 1.0d0 / dnxm1
+
+       ty1 = 1.0d0 / (dnym1 * dnym1)
+       ty2 = 1.0d0 / (2.0d0 * dnym1)
+       ty3 = 1.0d0 / dnym1
+ 
+       tz1 = 1.0d0 / (dnzm1 * dnzm1)
+       tz2 = 1.0d0 / (2.0d0 * dnzm1)
+       tz3 = 1.0d0 / dnzm1
+
+       dx1 = 0.75d0
+       dx2 = 0.75d0
+       dx3 = 0.75d0
+       dx4 = 0.75d0
+       dx5 = 0.75d0
+
+       dy1 = 0.75d0
+       dy2 = 0.75d0
+       dy3 = 0.75d0
+       dy4 = 0.75d0
+       dy5 = 0.75d0
+
+       dz1 = 1.0d0
+       dz2 = 1.0d0
+       dz3 = 1.0d0
+       dz4 = 1.0d0
+       dz5 = 1.0d0
+
+       dxmax = dmax1(dx3, dx4)
+       dymax = dmax1(dy2, dy4)
+       dzmax = dmax1(dz2, dz3)
+
+       dssp = 0.25d0 * dmax1(dx1, dmax1(dy1, dz1) )
+
+       c4dssp = 4.0d0 * dssp
+       c5dssp = 5.0d0 * dssp
+
+       dttx1 = dt*tx1
+       dttx2 = dt*tx2
+       dtty1 = dt*ty1
+       dtty2 = dt*ty2
+       dttz1 = dt*tz1
+       dttz2 = dt*tz2
+
+       c2dttx1 = 2.0d0*dttx1
+       c2dtty1 = 2.0d0*dtty1
+       c2dttz1 = 2.0d0*dttz1
+
+       dtdssp = dt*dssp
+
+       comz1  = dtdssp
+       comz4  = 4.0d0*dtdssp
+       comz5  = 5.0d0*dtdssp
+       comz6  = 6.0d0*dtdssp
+
+       c3c4tx3 = c3c4*tx3
+       c3c4ty3 = c3c4*ty3
+       c3c4tz3 = c3c4*tz3
+
+       dx1tx1 = dx1*tx1
+       dx2tx1 = dx2*tx1
+       dx3tx1 = dx3*tx1
+       dx4tx1 = dx4*tx1
+       dx5tx1 = dx5*tx1
+        
+       dy1ty1 = dy1*ty1
+       dy2ty1 = dy2*ty1
+       dy3ty1 = dy3*ty1
+       dy4ty1 = dy4*ty1
+       dy5ty1 = dy5*ty1
+        
+       dz1tz1 = dz1*tz1
+       dz2tz1 = dz2*tz1
+       dz3tz1 = dz3*tz1
+       dz4tz1 = dz4*tz1
+       dz5tz1 = dz5*tz1
+
+       c2iv  = 2.5d0
+       con43 = 4.0d0/3.0d0
+       con16 = 1.0d0/6.0d0
+        
+       xxcon1 = c3c4tx3*con43*tx3
+       xxcon2 = c3c4tx3*tx3
+       xxcon3 = c3c4tx3*conz1*tx3
+       xxcon4 = c3c4tx3*con16*tx3
+       xxcon5 = c3c4tx3*c1c5*tx3
+
+       yycon1 = c3c4ty3*con43*ty3
+       yycon2 = c3c4ty3*ty3
+       yycon3 = c3c4ty3*conz1*ty3
+       yycon4 = c3c4ty3*con16*ty3
+       yycon5 = c3c4ty3*c1c5*ty3
+
+       zzcon1 = c3c4tz3*con43*tz3
+       zzcon2 = c3c4tz3*tz3
+       zzcon3 = c3c4tz3*conz1*tz3
+       zzcon4 = c3c4tz3*con16*tz3
+       zzcon5 = c3c4tz3*c1c5*tz3
+
+       return
+       end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/sp.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/sp.f
new file mode 100644
index 0000000..726a6ba
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/sp.f
@@ -0,0 +1,225 @@
+!-------------------------------------------------------------------------!
+!                                                                         !
+!        N  A  S     P A R A L L E L     B E N C H M A R K S  3.3         !
+!                                                                         !
+!                       O p e n M P     V E R S I O N                     !
+!                                                                         !
+!                                   S P                                   !
+!                                                                         !
+!-------------------------------------------------------------------------!
+!                                                                         !
+!    This benchmark is an OpenMP version of the NPB SP code.              !
+!    It is described in NAS Technical Report 99-011.                      !
+!                                                                         !
+!    Permission to use, copy, distribute and modify this software         !
+!    for any purpose with or without fee is hereby granted.  We           !
+!    request, however, that all derived work reference the NAS            !
+!    Parallel Benchmarks 3.3. This software is provided "as is"           !
+!    without express or implied warranty.                                 !
+!                                                                         !
+!    Information on NPB 3.3, including the technical report, the          !
+!    original specifications, source code, results and information        !
+!    on how to submit new results, is available at:                       !
+!                                                                         !
+!           http://www.nas.nasa.gov/Software/NPB/                         !
+!                                                                         !
+!    Send comments or suggestions to  npb@nas.nasa.gov                    !
+!                                                                         !
+!          NAS Parallel Benchmarks Group                                  !
+!          NASA Ames Research Center                                      !
+!          Mail Stop: T27A-1                                              !
+!          Moffett Field, CA   94035-1000                                 !
+!                                                                         !
+!          E-mail:  npb@nas.nasa.gov                                      !
+!          Fax:     (650) 604-3957                                        !
+!                                                                         !
+!-------------------------------------------------------------------------!
+
+c---------------------------------------------------------------------
+c
+c Authors: R. Van der Wijngaart
+c          W. Saphir
+c          H. Jin
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+       program SP
+c---------------------------------------------------------------------
+
+       include  'header.h'
+      
+       integer          i, niter, step, fstatus, n3
+       external         timer_read
+       double precision mflops, t, tmax, timer_read, trecs(t_last)
+       logical          verified
+       character        class
+       character        t_names(t_last)*8
+!$     integer  omp_get_max_threads
+!$     external omp_get_max_threads
+
+c---------------------------------------------------------------------
+c      Read input file (if it exists), else take
+c      defaults from parameters
+c---------------------------------------------------------------------
+          
+       open (unit=2,file='timer.flag',status='old', iostat=fstatus)
+       if (fstatus .eq. 0) then
+         timeron = .true.
+         t_names(t_total) = 'total'
+         t_names(t_rhsx) = 'rhsx'
+         t_names(t_rhsy) = 'rhsy'
+         t_names(t_rhsz) = 'rhsz'
+         t_names(t_rhs) = 'rhs'
+         t_names(t_xsolve) = 'xsolve'
+         t_names(t_ysolve) = 'ysolve'
+         t_names(t_zsolve) = 'zsolve'
+         t_names(t_rdis1) = 'redist1'
+         t_names(t_rdis2) = 'redist2'
+         t_names(t_tzetar) = 'tzetar'
+         t_names(t_ninvr) = 'ninvr'
+         t_names(t_pinvr) = 'pinvr'
+         t_names(t_txinvr) = 'txinvr'
+         t_names(t_add) = 'add'
+         close(2)
+       else
+         timeron = .false.
+       endif
+
+       write(*, 1000)
+       open (unit=2,file='inputsp.data',status='old', iostat=fstatus)
+
+       if (fstatus .eq. 0) then
+         write(*,233) 
+ 233     format(' Reading from input file inputsp.data')
+         read (2,*) niter
+         read (2,*) dt
+         read (2,*) grid_points(1), grid_points(2), grid_points(3)
+         close(2)
+       else
+         write(*,234) 
+         niter = niter_default
+         dt    = dt_default
+         grid_points(1) = problem_size
+         grid_points(2) = problem_size
+         grid_points(3) = problem_size
+       endif
+ 234   format(' No input file inputsp.data. Using compiled defaults')
+
+       write(*, 1001) grid_points(1), grid_points(2), grid_points(3)
+       write(*, 1002) niter, dt
+!$     write(*, 1003) omp_get_max_threads()
+       write(*, *)
+
+ 1000  format(//, ' NAS Parallel Benchmarks (NPB3.3-OMP)',
+     >            ' - SP Benchmark', /)
+ 1001  format(' Size: ', i4, 'x', i4, 'x', i4)
+ 1002  format(' Iterations: ', i4, '    dt:  ', F11.7)
+ 1003  format(' Number of available threads: ', i5)
+
+       if ( (grid_points(1) .gt. IMAX) .or.
+     >      (grid_points(2) .gt. JMAX) .or.
+     >      (grid_points(3) .gt. KMAX) ) then
+             print *, (grid_points(i),i=1,3)
+             print *,' Problem size too big for compiled array sizes'
+             goto 999
+       endif
+       nx2 = grid_points(1) - 2
+       ny2 = grid_points(2) - 2
+       nz2 = grid_points(3) - 2
+
+       call set_constants
+
+       do i = 1, t_last
+          call timer_clear(i)
+       end do
+
+       call exact_rhs
+
+       call initialize
+
+c---------------------------------------------------------------------
+c      do one time step to touch all code, and reinitialize
+c---------------------------------------------------------------------
+       call adi
+       call initialize
+
+#ifdef HOOKS
+       call roi_begin
+#endif
+
+       do i = 1, t_last
+          call timer_clear(i)
+       end do
+       call timer_start(1)
+
+       do  step = 1, niter
+
+          if (mod(step, 20) .eq. 0 .or. step .eq. 1) then
+             write(*, 200) step
+ 200         format(' Time step ', i4)
+          endif
+
+          call adi
+
+       end do
+
+       call timer_stop(1)
+       tmax = timer_read(1)
+
+#ifdef HOOKS
+       call roi_end
+#endif
+
+       call verify(niter, class, verified)
+
+       if( tmax .ne. 0. ) then
+          n3 = grid_points(1)*grid_points(2)*grid_points(3)
+          t = (grid_points(1)+grid_points(2)+grid_points(3))/3.0
+          mflops = (881.174 * float( n3 )
+     >             -4683.91 * t**2
+     >             +11484.5 * t
+     >             -19272.4) * float( niter ) / (tmax*1000000.0d0)
+       else
+          mflops = 0.0
+       endif
+
+      call print_results('SP', class, grid_points(1), 
+     >     grid_points(2), grid_points(3), niter, 
+     >     tmax, mflops, '          floating point', 
+     >     verified, npbversion,compiletime, cs1, cs2, cs3, cs4, cs5, 
+     >     cs6, '(none)')
+
+c---------------------------------------------------------------------
+c      More timers
+c---------------------------------------------------------------------
+       if (.not.timeron) goto 999
+
+       do i=1, t_last
+          trecs(i) = timer_read(i)
+       end do
+       if (tmax .eq. 0.0) tmax = 1.0
+
+       write(*,800)
+ 800   format('  SECTION   Time (secs)')
+
+       do i=1, t_last
+          write(*,810) t_names(i), trecs(i), trecs(i)*100./tmax
+          if (i.eq.t_rhs) then
+             t = trecs(t_rhsx) + trecs(t_rhsy) + trecs(t_rhsz)
+             write(*,820) 'sub-rhs', t, t*100./tmax
+             t = trecs(t_rhs) - t
+             write(*,820) 'rest-rhs', t, t*100./tmax
+          elseif (i.eq.t_zsolve) then
+             t = trecs(t_zsolve) - trecs(t_rdis1) - trecs(t_rdis2)
+             write(*,820) 'sub-zsol', t, t*100./tmax
+          elseif (i.eq.t_rdis2) then
+             t = trecs(t_rdis1) + trecs(t_rdis2)
+             write(*,820) 'redist', t, t*100./tmax
+          endif
+ 810      format(2x,a8,':',f9.3,'  (',f6.2,'%)')
+ 820      format('    --> ',a8,':',f9.3,'  (',f6.2,'%)')
+       end do
+
+ 999   continue
+
+       end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/txinvr.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/txinvr.f
new file mode 100644
index 0000000..8ae1df6
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/txinvr.f
@@ -0,0 +1,60 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine  txinvr
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c block-diagonal matrix-vector multiplication                  
+c---------------------------------------------------------------------
+
+       include 'header.h'
+
+       integer i, j, k
+       double precision t1, t2, t3, ac, ru1, uu, vv, ww, r1, r2, r3, 
+     >                  r4, r5, ac2inv
+
+
+       if (timeron) call timer_start(t_txinvr)
+!$omp parallel do default(shared)
+!$omp& private(i,j,k,t1,t2,t3,ac,ru1,uu,vv,ww,r1,r2,r3,r4,r5,ac2inv)
+       do    k = 1, nz2
+          do    j = 1, ny2
+             do    i = 1, nx2
+
+                ru1 = rho_i(i,j,k)
+                uu = us(i,j,k)
+                vv = vs(i,j,k)
+                ww = ws(i,j,k)
+                ac = speed(i,j,k)
+                ac2inv = ac*ac
+
+                r1 = rhs(1,i,j,k)
+                r2 = rhs(2,i,j,k)
+                r3 = rhs(3,i,j,k)
+                r4 = rhs(4,i,j,k)
+                r5 = rhs(5,i,j,k)
+
+                t1 = c2 / ac2inv * ( qs(i,j,k)*r1 - uu*r2  - 
+     >                  vv*r3 - ww*r4 + r5 )
+                t2 = bt * ru1 * ( uu * r1 - r2 )
+                t3 = ( bt * ru1 * ac ) * t1
+
+                rhs(1,i,j,k) = r1 - t1
+                rhs(2,i,j,k) = - ru1 * ( ww*r1 - r4 )
+                rhs(3,i,j,k) =   ru1 * ( vv*r1 - r3 )
+                rhs(4,i,j,k) = - t2 + t3
+                rhs(5,i,j,k) =   t2 + t3
+
+             end do
+          end do
+       end do
+       if (timeron) call timer_stop(t_txinvr)
+
+       return
+       end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/tzetar.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/tzetar.f
new file mode 100644
index 0000000..38274d1
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/tzetar.f
@@ -0,0 +1,62 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine  tzetar
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   block-diagonal matrix-vector multiplication                       
+c---------------------------------------------------------------------
+
+       include 'header.h'
+
+       integer i, j, k
+       double precision  t1, t2, t3, ac, xvel, yvel, zvel, r1, r2, r3, 
+     >                   r4, r5, btuz, ac2u, uzik1
+
+
+       if (timeron) call timer_start(t_tzetar)
+!$omp parallel do default(shared)
+!$omp& private(i,j,k,t1,t2,t3,ac,xvel,yvel,zvel,r1,r2,r3, 
+!$omp&              r4,r5,btuz,ac2u,uzik1)
+       do    k = 1, nz2
+          do    j = 1, ny2
+             do    i = 1, nx2
+
+                xvel = us(i,j,k)
+                yvel = vs(i,j,k)
+                zvel = ws(i,j,k)
+                ac   = speed(i,j,k)
+
+                ac2u = ac*ac
+
+                r1 = rhs(1,i,j,k)
+                r2 = rhs(2,i,j,k)
+                r3 = rhs(3,i,j,k)
+                r4 = rhs(4,i,j,k)
+                r5 = rhs(5,i,j,k)      
+
+                uzik1 = u(1,i,j,k)
+                btuz  = bt * uzik1
+
+                t1 = btuz/ac * (r4 + r5)
+                t2 = r3 + t1
+                t3 = btuz * (r4 - r5)
+
+                rhs(1,i,j,k) = t2
+                rhs(2,i,j,k) = -uzik1*r2 + xvel*t2
+                rhs(3,i,j,k) =  uzik1*r1 + yvel*t2
+                rhs(4,i,j,k) =  zvel*t2  + t3
+                rhs(5,i,j,k) =  uzik1*(-xvel*r2 + yvel*r1) + 
+     >                    qs(i,j,k)*t2 + c2iv*ac2u*t1 + zvel*t3
+
+             end do
+          end do
+       end do
+       if (timeron) call timer_stop(t_tzetar)
+
+       return
+       end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/verify.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/verify.f
new file mode 100644
index 0000000..44e11c2
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/verify.f
@@ -0,0 +1,356 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+        subroutine verify(no_time_steps, class, verified)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c  verification routine                         
+c---------------------------------------------------------------------
+
+        include 'header.h'
+
+        double precision xcrref(5),xceref(5),xcrdif(5),xcedif(5), 
+     >                   epsilon, xce(5), xcr(5), dtref
+        integer m, no_time_steps
+        character class
+        logical verified
+
+c---------------------------------------------------------------------
+c   tolerance level
+c---------------------------------------------------------------------
+        epsilon = 1.0d-08
+
+
+c---------------------------------------------------------------------
+c   compute the error norm and the residual norm
+c---------------------------------------------------------------------
+        call error_norm(xce)
+        call compute_rhs
+
+        call rhs_norm(xcr)
+
+        do m = 1, 5
+           xcr(m) = xcr(m) / dt
+        enddo
+
+        class = 'U'
+        verified = .true.
+
+        do m = 1,5
+           xcrref(m) = 1.0
+           xceref(m) = 1.0
+        end do
+
+c---------------------------------------------------------------------
+c    reference data for 12X12X12 grids after 100 time steps, with DT = 1.50d-02
+c---------------------------------------------------------------------
+        if ( (grid_points(1)  .eq. 12     ) .and. 
+     >       (grid_points(2)  .eq. 12     ) .and.
+     >       (grid_points(3)  .eq. 12     ) .and.
+     >       (no_time_steps   .eq. 100    ))  then
+
+           class = 'S'
+           dtref = 1.5d-2
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+           xcrref(1) = 2.7470315451339479d-02
+           xcrref(2) = 1.0360746705285417d-02
+           xcrref(3) = 1.6235745065095532d-02
+           xcrref(4) = 1.5840557224455615d-02
+           xcrref(5) = 3.4849040609362460d-02
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+           xceref(1) = 2.7289258557377227d-05
+           xceref(2) = 1.0364446640837285d-05
+           xceref(3) = 1.6154798287166471d-05
+           xceref(4) = 1.5750704994480102d-05
+           xceref(5) = 3.4177666183390531d-05
+
+
+c---------------------------------------------------------------------
+c    reference data for 36X36X36 grids after 400 time steps, with DT = 1.5d-03
+c---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 36) .and. 
+     >           (grid_points(2) .eq. 36) .and.
+     >           (grid_points(3) .eq. 36) .and.
+     >           (no_time_steps .eq. 400) ) then
+
+           class = 'W'
+           dtref = 1.5d-3
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+           xcrref(1) = 0.1893253733584d-02
+           xcrref(2) = 0.1717075447775d-03
+           xcrref(3) = 0.2778153350936d-03
+           xcrref(4) = 0.2887475409984d-03
+           xcrref(5) = 0.3143611161242d-02
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+           xceref(1) = 0.7542088599534d-04
+           xceref(2) = 0.6512852253086d-05
+           xceref(3) = 0.1049092285688d-04
+           xceref(4) = 0.1128838671535d-04
+           xceref(5) = 0.1212845639773d-03
+
+c---------------------------------------------------------------------
+c    reference data for 64X64X64 grids after 400 time steps, with DT = 1.5d-03
+c---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 64) .and. 
+     >           (grid_points(2) .eq. 64) .and.
+     >           (grid_points(3) .eq. 64) .and.
+     >           (no_time_steps .eq. 400) ) then
+
+           class = 'A'
+           dtref = 1.5d-3
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+           xcrref(1) = 2.4799822399300195d0
+           xcrref(2) = 1.1276337964368832d0
+           xcrref(3) = 1.5028977888770491d0
+           xcrref(4) = 1.4217816211695179d0
+           xcrref(5) = 2.1292113035138280d0
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+           xceref(1) = 1.0900140297820550d-04
+           xceref(2) = 3.7343951769282091d-05
+           xceref(3) = 5.0092785406541633d-05
+           xceref(4) = 4.7671093939528255d-05
+           xceref(5) = 1.3621613399213001d-04
+
+c---------------------------------------------------------------------
+c    reference data for 102X102X102 grids after 400 time steps,
+c    with DT = 1.0d-03
+c---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 102) .and. 
+     >           (grid_points(2) .eq. 102) .and.
+     >           (grid_points(3) .eq. 102) .and.
+     >           (no_time_steps .eq. 400) ) then
+
+           class = 'B'
+           dtref = 1.0d-3
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+           xcrref(1) = 0.6903293579998d+02
+           xcrref(2) = 0.3095134488084d+02
+           xcrref(3) = 0.4103336647017d+02
+           xcrref(4) = 0.3864769009604d+02
+           xcrref(5) = 0.5643482272596d+02
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+           xceref(1) = 0.9810006190188d-02
+           xceref(2) = 0.1022827905670d-02
+           xceref(3) = 0.1720597911692d-02
+           xceref(4) = 0.1694479428231d-02
+           xceref(5) = 0.1847456263981d-01
+
+c---------------------------------------------------------------------
+c    reference data for 162X162X162 grids after 400 time steps,
+c    with DT = 0.67d-03
+c---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 162) .and. 
+     >           (grid_points(2) .eq. 162) .and.
+     >           (grid_points(3) .eq. 162) .and.
+     >           (no_time_steps .eq. 400) ) then
+
+           class = 'C'
+           dtref = 0.67d-3
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+           xcrref(1) = 0.5881691581829d+03
+           xcrref(2) = 0.2454417603569d+03
+           xcrref(3) = 0.3293829191851d+03
+           xcrref(4) = 0.3081924971891d+03
+           xcrref(5) = 0.4597223799176d+03
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+           xceref(1) = 0.2598120500183d+00
+           xceref(2) = 0.2590888922315d-01
+           xceref(3) = 0.5132886416320d-01
+           xceref(4) = 0.4806073419454d-01
+           xceref(5) = 0.5483377491301d+00
+
+c---------------------------------------------------------------------
+c    reference data for 408X408X408 grids after 500 time steps,
+c    with DT = 0.3d-03
+c---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 408) .and. 
+     >           (grid_points(2) .eq. 408) .and.
+     >           (grid_points(3) .eq. 408) .and.
+     >           (no_time_steps .eq. 500) ) then
+
+           class = 'D'
+           dtref = 0.30d-3
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+           xcrref(1) = 0.1044696216887d+05
+           xcrref(2) = 0.3204427762578d+04
+           xcrref(3) = 0.4648680733032d+04
+           xcrref(4) = 0.4238923283697d+04
+           xcrref(5) = 0.7588412036136d+04
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+           xceref(1) = 0.5089471423669d+01
+           xceref(2) = 0.5323514855894d+00
+           xceref(3) = 0.1187051008971d+01
+           xceref(4) = 0.1083734951938d+01
+           xceref(5) = 0.1164108338568d+02
+
+c---------------------------------------------------------------------
+c    reference data for 1020X1020X1020 grids after 500 time steps,
+c    with DT = 0.1d-03
+c---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 1020) .and. 
+     >           (grid_points(2) .eq. 1020) .and.
+     >           (grid_points(3) .eq. 1020) .and.
+     >           (no_time_steps .eq. 500) ) then
+
+           class = 'E'
+           dtref = 0.10d-3
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+           xcrref(1) = 0.6255387422609d+05
+           xcrref(2) = 0.1495317020012d+05
+           xcrref(3) = 0.2347595750586d+05
+           xcrref(4) = 0.2091099783534d+05
+           xcrref(5) = 0.4770412841218d+05
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+           xceref(1) = 0.6742735164909d+02
+           xceref(2) = 0.5390656036938d+01
+           xceref(3) = 0.1680647196477d+02
+           xceref(4) = 0.1536963126457d+02
+           xceref(5) = 0.1575330146156d+03
+
+
+        else
+           verified = .false.
+        endif
+
+c---------------------------------------------------------------------
+c    verification test for residuals if gridsize is one of 
+c    the defined grid sizes above (class .ne. 'U')
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c    Compute the relative difference between the computed norms and the known reference values.
+c---------------------------------------------------------------------
+        do m = 1, 5
+           
+           xcrdif(m) = dabs((xcr(m)-xcrref(m))/xcrref(m)) 
+           xcedif(m) = dabs((xce(m)-xceref(m))/xceref(m))
+           
+        enddo
+
+c---------------------------------------------------------------------
+c    Output the comparison of computed results to known cases.
+c---------------------------------------------------------------------
+
+        if (class .ne. 'U') then
+           write(*, 1990) class
+ 1990      format(' Verification being performed for class ', a)
+           write (*,2000) epsilon
+ 2000      format(' accuracy setting for epsilon = ', E20.13)
+           verified = (dabs(dt-dtref) .le. epsilon)
+           if (.not.verified) then  
+              class = 'U'
+              write (*,1000) dtref
+ 1000         format(' DT does not match the reference value of ', 
+     >                 E15.8)
+           endif
+        else 
+           write(*, 1995)
+ 1995      format(' Unknown class')
+        endif
+
+
+        if (class .ne. 'U') then
+           write (*, 2001) 
+        else
+           write (*, 2005)
+        endif
+
+ 2001   format(' Comparison of RMS-norms of residual')
+ 2005   format(' RMS-norms of residual')
+        do m = 1, 5
+           if (class .eq. 'U') then
+              write(*, 2015) m, xcr(m)
+           else if (xcrdif(m) .le. epsilon) then
+              write (*,2011) m,xcr(m),xcrref(m),xcrdif(m)
+           else 
+              verified = .false.
+              write (*,2010) m,xcr(m),xcrref(m),xcrdif(m)
+           endif
+        enddo
+
+        if (class .ne. 'U') then
+           write (*,2002)
+        else
+           write (*,2006)
+        endif
+ 2002   format(' Comparison of RMS-norms of solution error')
+ 2006   format(' RMS-norms of solution error')
+        
+        do m = 1, 5
+           if (class .eq. 'U') then
+              write(*, 2015) m, xce(m)
+           else if (xcedif(m) .le. epsilon) then
+              write (*,2011) m,xce(m),xceref(m),xcedif(m)
+           else
+              verified = .false.
+              write (*,2010) m,xce(m),xceref(m),xcedif(m)
+           endif
+        enddo
+        
+ 2010   format(' FAILURE: ', i2, E20.13, E20.13, E20.13)
+ 2011   format('          ', i2, E20.13, E20.13, E20.13)
+ 2015   format('          ', i2, E20.13)
+        
+        if (class .eq. 'U') then
+           write(*, 2022)
+           write(*, 2023)
+ 2022      format(' No reference values provided')
+ 2023      format(' No verification performed')
+        else if (verified) then
+           write(*, 2020)
+ 2020      format(' Verification Successful')
+        else
+           write(*, 2021)
+ 2021      format(' Verification failed')
+        endif
+
+        return
+
+
+        end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/x_solve.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/x_solve.f
new file mode 100644
index 0000000..3757775
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/x_solve.f
@@ -0,0 +1,327 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine x_solve
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c this subroutine performs the solution of the approximate factorization
+c step in the x-direction for all five matrix components
+c simultaneously. The Thomas algorithm is employed to solve the
+c systems for the x-lines. Boundary conditions are non-periodic
+c---------------------------------------------------------------------
+
+       include 'header.h'
+
+       integer i, j, k, i1, i2, m
+       double precision  ru1, fac1, fac2
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       if (timeron) call timer_start(t_xsolve)
+!$omp parallel do default(shared) private(i,j,k,i1,i2,m,
+!$omp&    ru1,fac1,fac2)
+       do  k = 1, nz2
+
+          call lhsinit(nx2+1, ny2)
+
+c---------------------------------------------------------------------
+c Computes the left hand side for the three x-factors  
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c      first fill the lhs for the u-eigenvalue                   
+c---------------------------------------------------------------------
+          do  j = 1, ny2
+             do  i = 0, grid_points(1)-1
+                ru1 = c3c4*rho_i(i,j,k)
+                cv(i) = us(i,j,k)
+                rhon(i) = dmax1(dx2+con43*ru1, 
+     >                          dx5+c1c5*ru1,
+     >                          dxmax+ru1,
+     >                          dx1)
+             end do
+
+             do  i = 1, nx2
+                lhs(1,i,j) =   0.0d0
+                lhs(2,i,j) = - dttx2 * cv(i-1) - dttx1 * rhon(i-1)
+                lhs(3,i,j) =   1.0d0 + c2dttx1 * rhon(i)
+                lhs(4,i,j) =   dttx2 * cv(i+1) - dttx1 * rhon(i+1)
+                lhs(5,i,j) =   0.0d0
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c      add fourth order dissipation                             
+c---------------------------------------------------------------------
+
+          do  j = 1, ny2
+             i = 1
+             lhs(3,i,j) = lhs(3,i,j) + comz5
+             lhs(4,i,j) = lhs(4,i,j) - comz4
+             lhs(5,i,j) = lhs(5,i,j) + comz1
+  
+             lhs(2,i+1,j) = lhs(2,i+1,j) - comz4
+             lhs(3,i+1,j) = lhs(3,i+1,j) + comz6
+             lhs(4,i+1,j) = lhs(4,i+1,j) - comz4
+             lhs(5,i+1,j) = lhs(5,i+1,j) + comz1
+          end do
+
+          do  j = 1, ny2
+             do   i=3, grid_points(1)-4
+                lhs(1,i,j) = lhs(1,i,j) + comz1
+                lhs(2,i,j) = lhs(2,i,j) - comz4
+                lhs(3,i,j) = lhs(3,i,j) + comz6
+                lhs(4,i,j) = lhs(4,i,j) - comz4
+                lhs(5,i,j) = lhs(5,i,j) + comz1
+             end do
+          end do
+
+
+          do  j = 1, ny2
+             i = grid_points(1)-3
+             lhs(1,i,j) = lhs(1,i,j) + comz1
+             lhs(2,i,j) = lhs(2,i,j) - comz4
+             lhs(3,i,j) = lhs(3,i,j) + comz6
+             lhs(4,i,j) = lhs(4,i,j) - comz4
+
+             lhs(1,i+1,j) = lhs(1,i+1,j) + comz1
+             lhs(2,i+1,j) = lhs(2,i+1,j) - comz4
+             lhs(3,i+1,j) = lhs(3,i+1,j) + comz5
+          end do
+
+c---------------------------------------------------------------------
+c      subsequently, fill the other factors (u+c), (u-c) by adding to 
+c      the first  
+c---------------------------------------------------------------------
+          do  j = 1, ny2
+             do   i = 1, nx2
+                lhsp(1,i,j) = lhs(1,i,j)
+                lhsp(2,i,j) = lhs(2,i,j) - 
+     >                            dttx2 * speed(i-1,j,k)
+                lhsp(3,i,j) = lhs(3,i,j)
+                lhsp(4,i,j) = lhs(4,i,j) + 
+     >                            dttx2 * speed(i+1,j,k)
+                lhsp(5,i,j) = lhs(5,i,j)
+                lhsm(1,i,j) = lhs(1,i,j)
+                lhsm(2,i,j) = lhs(2,i,j) + 
+     >                            dttx2 * speed(i-1,j,k)
+                lhsm(3,i,j) = lhs(3,i,j)
+                lhsm(4,i,j) = lhs(4,i,j) - 
+     >                            dttx2 * speed(i+1,j,k)
+                lhsm(5,i,j) = lhs(5,i,j)
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c                          FORWARD ELIMINATION  
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c      perform the Thomas algorithm; first, FORWARD ELIMINATION     
+c---------------------------------------------------------------------
+
+          do  j = 1, ny2
+             do    i = 0, grid_points(1)-3
+                i1 = i  + 1
+                i2 = i  + 2
+                fac1      = 1.d0/lhs(3,i,j)
+                lhs(4,i,j)  = fac1*lhs(4,i,j)
+                lhs(5,i,j)  = fac1*lhs(5,i,j)
+                do    m = 1, 3
+                   rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+                end do
+                lhs(3,i1,j) = lhs(3,i1,j) -
+     >                         lhs(2,i1,j)*lhs(4,i,j)
+                lhs(4,i1,j) = lhs(4,i1,j) -
+     >                         lhs(2,i1,j)*lhs(5,i,j)
+                do    m = 1, 3
+                   rhs(m,i1,j,k) = rhs(m,i1,j,k) -
+     >                         lhs(2,i1,j)*rhs(m,i,j,k)
+                end do
+                lhs(2,i2,j) = lhs(2,i2,j) -
+     >                         lhs(1,i2,j)*lhs(4,i,j)
+                lhs(3,i2,j) = lhs(3,i2,j) -
+     >                         lhs(1,i2,j)*lhs(5,i,j)
+                do    m = 1, 3
+                   rhs(m,i2,j,k) = rhs(m,i2,j,k) -
+     >                         lhs(1,i2,j)*rhs(m,i,j,k)
+                end do
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c      The last two rows in this grid block are a bit different, 
+c      since they do not have two more rows available for the
+c      elimination of off-diagonal entries
+c---------------------------------------------------------------------
+
+          do  j = 1, ny2
+             i  = grid_points(1)-2
+             i1 = grid_points(1)-1
+             fac1      = 1.d0/lhs(3,i,j)
+             lhs(4,i,j)  = fac1*lhs(4,i,j)
+             lhs(5,i,j)  = fac1*lhs(5,i,j)
+             do    m = 1, 3
+                rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+             end do
+             lhs(3,i1,j) = lhs(3,i1,j) -
+     >                      lhs(2,i1,j)*lhs(4,i,j)
+             lhs(4,i1,j) = lhs(4,i1,j) -
+     >                      lhs(2,i1,j)*lhs(5,i,j)
+             do    m = 1, 3
+                rhs(m,i1,j,k) = rhs(m,i1,j,k) -
+     >                      lhs(2,i1,j)*rhs(m,i,j,k)
+             end do
+c---------------------------------------------------------------------
+c            scale the last row immediately 
+c---------------------------------------------------------------------
+             fac2             = 1.d0/lhs(3,i1,j)
+             do    m = 1, 3
+                rhs(m,i1,j,k) = fac2*rhs(m,i1,j,k)
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c      do the u+c and the u-c factors                 
+c---------------------------------------------------------------------
+
+          do  j = 1, ny2
+             do    i = 0, grid_points(1)-3
+                i1 = i  + 1
+                i2 = i  + 2
+                m = 4
+                fac1       = 1.d0/lhsp(3,i,j)
+                lhsp(4,i,j)  = fac1*lhsp(4,i,j)
+                lhsp(5,i,j)  = fac1*lhsp(5,i,j)
+                rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+                lhsp(3,i1,j) = lhsp(3,i1,j) -
+     >                        lhsp(2,i1,j)*lhsp(4,i,j)
+                lhsp(4,i1,j) = lhsp(4,i1,j) -
+     >                        lhsp(2,i1,j)*lhsp(5,i,j)
+                rhs(m,i1,j,k) = rhs(m,i1,j,k) -
+     >                        lhsp(2,i1,j)*rhs(m,i,j,k)
+                lhsp(2,i2,j) = lhsp(2,i2,j) -
+     >                        lhsp(1,i2,j)*lhsp(4,i,j)
+                lhsp(3,i2,j) = lhsp(3,i2,j) -
+     >                        lhsp(1,i2,j)*lhsp(5,i,j)
+                rhs(m,i2,j,k) = rhs(m,i2,j,k) -
+     >                        lhsp(1,i2,j)*rhs(m,i,j,k)
+                m = 5
+                fac1       = 1.d0/lhsm(3,i,j)
+                lhsm(4,i,j)  = fac1*lhsm(4,i,j)
+                lhsm(5,i,j)  = fac1*lhsm(5,i,j)
+                rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+                lhsm(3,i1,j) = lhsm(3,i1,j) -
+     >                        lhsm(2,i1,j)*lhsm(4,i,j)
+                lhsm(4,i1,j) = lhsm(4,i1,j) -
+     >                        lhsm(2,i1,j)*lhsm(5,i,j)
+                rhs(m,i1,j,k) = rhs(m,i1,j,k) -
+     >                        lhsm(2,i1,j)*rhs(m,i,j,k)
+                lhsm(2,i2,j) = lhsm(2,i2,j) -
+     >                        lhsm(1,i2,j)*lhsm(4,i,j)
+                lhsm(3,i2,j) = lhsm(3,i2,j) -
+     >                        lhsm(1,i2,j)*lhsm(5,i,j)
+                rhs(m,i2,j,k) = rhs(m,i2,j,k) -
+     >                        lhsm(1,i2,j)*rhs(m,i,j,k)
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c         And again the last two rows separately
+c---------------------------------------------------------------------
+          do  j = 1, ny2
+             i  = grid_points(1)-2
+             i1 = grid_points(1)-1
+             m = 4
+             fac1       = 1.d0/lhsp(3,i,j)
+             lhsp(4,i,j)  = fac1*lhsp(4,i,j)
+             lhsp(5,i,j)  = fac1*lhsp(5,i,j)
+             rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+             lhsp(3,i1,j) = lhsp(3,i1,j) -
+     >                      lhsp(2,i1,j)*lhsp(4,i,j)
+             lhsp(4,i1,j) = lhsp(4,i1,j) -
+     >                      lhsp(2,i1,j)*lhsp(5,i,j)
+             rhs(m,i1,j,k) = rhs(m,i1,j,k) -
+     >                      lhsp(2,i1,j)*rhs(m,i,j,k)
+             m = 5
+             fac1       = 1.d0/lhsm(3,i,j)
+             lhsm(4,i,j)  = fac1*lhsm(4,i,j)
+             lhsm(5,i,j)  = fac1*lhsm(5,i,j)
+             rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+             lhsm(3,i1,j) = lhsm(3,i1,j) -
+     >                      lhsm(2,i1,j)*lhsm(4,i,j)
+             lhsm(4,i1,j) = lhsm(4,i1,j) -
+     >                      lhsm(2,i1,j)*lhsm(5,i,j)
+             rhs(m,i1,j,k) = rhs(m,i1,j,k) -
+     >                      lhsm(2,i1,j)*rhs(m,i,j,k)
+c---------------------------------------------------------------------
+c               Scale the last row immediately
+c---------------------------------------------------------------------
+             rhs(4,i1,j,k) = rhs(4,i1,j,k)/lhsp(3,i1,j)
+             rhs(5,i1,j,k) = rhs(5,i1,j,k)/lhsm(3,i1,j)
+          end do
+
+
+c---------------------------------------------------------------------
+c                         BACKSUBSTITUTION 
+c---------------------------------------------------------------------
+
+
+          do  j = 1, ny2
+             i  = grid_points(1)-2
+             i1 = grid_points(1)-1
+             do   m = 1, 3
+                rhs(m,i,j,k) = rhs(m,i,j,k) -
+     >                             lhs(4,i,j)*rhs(m,i1,j,k)
+             end do
+
+             rhs(4,i,j,k) = rhs(4,i,j,k) -
+     >                          lhsp(4,i,j)*rhs(4,i1,j,k)
+             rhs(5,i,j,k) = rhs(5,i,j,k) -
+     >                          lhsm(4,i,j)*rhs(5,i1,j,k)
+          end do
+
+c---------------------------------------------------------------------
+c      The first three factors
+c---------------------------------------------------------------------
+          do  j = 1, ny2
+             do    i = grid_points(1)-3, 0, -1
+                i1 = i  + 1
+                i2 = i  + 2
+                do   m = 1, 3
+                   rhs(m,i,j,k) = rhs(m,i,j,k) - 
+     >                          lhs(4,i,j)*rhs(m,i1,j,k) -
+     >                          lhs(5,i,j)*rhs(m,i2,j,k)
+                end do
+
+c---------------------------------------------------------------------
+c      And the remaining two
+c---------------------------------------------------------------------
+                rhs(4,i,j,k) = rhs(4,i,j,k) - 
+     >                          lhsp(4,i,j)*rhs(4,i1,j,k) -
+     >                          lhsp(5,i,j)*rhs(4,i2,j,k)
+                rhs(5,i,j,k) = rhs(5,i,j,k) - 
+     >                          lhsm(4,i,j)*rhs(5,i1,j,k) -
+     >                          lhsm(5,i,j)*rhs(5,i2,j,k)
+             end do
+          end do
+
+       end do
+       if (timeron) call timer_stop(t_xsolve)
+
+c---------------------------------------------------------------------
+c      Do the block-diagonal inversion          
+c---------------------------------------------------------------------
+       call ninvr
+
+       return
+       end
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/y_solve.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/y_solve.f
new file mode 100644
index 0000000..6fb4a6f
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/y_solve.f
@@ -0,0 +1,319 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine y_solve
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c this subroutine performs the solution of the approximate factorization
+c step in the y-direction for all five matrix components
+c simultaneously. The Thomas algorithm is employed to solve the
+c systems for the y-lines. Boundary conditions are non-periodic
+c---------------------------------------------------------------------
+
+       include 'header.h'
+
+       integer i, j, k, j1, j2, m
+       double precision ru1, fac1, fac2
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       if (timeron) call timer_start(t_ysolve)
+!$omp parallel do default(shared) private(i,j,k,j1,j2,m,
+!$omp&    ru1,fac1,fac2)
+       do  k = 1, nz2
+
+          call lhsinitj(ny2+1, nx2)
+
+c---------------------------------------------------------------------
+c Computes the left hand side for the three y-factors   
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c      first fill the lhs for the u-eigenvalue         
+c---------------------------------------------------------------------
+
+          do  i = 1, grid_points(1)-2
+             do  j = 0, grid_points(2)-1
+                ru1 = c3c4*rho_i(i,j,k)
+                cv(j) = vs(i,j,k)
+                rhoq(j) = dmax1( dy3 + con43 * ru1,
+     >                           dy5 + c1c5*ru1,
+     >                           dymax + ru1,
+     >                           dy1)
+             end do
+            
+             do  j = 1, grid_points(2)-2
+                lhs(1,i,j) =  0.0d0
+                lhs(2,i,j) = -dtty2 * cv(j-1) - dtty1 * rhoq(j-1)
+                lhs(3,i,j) =  1.0d0 + c2dtty1 * rhoq(j)
+                lhs(4,i,j) =  dtty2 * cv(j+1) - dtty1 * rhoq(j+1)
+                lhs(5,i,j) =  0.0d0
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c      add fourth order dissipation                             
+c---------------------------------------------------------------------
+
+          do  i = 1, grid_points(1)-2
+             j = 1
+             lhs(3,i,j) = lhs(3,i,j) + comz5
+             lhs(4,i,j) = lhs(4,i,j) - comz4
+             lhs(5,i,j) = lhs(5,i,j) + comz1
+       
+             lhs(2,i,j+1) = lhs(2,i,j+1) - comz4
+             lhs(3,i,j+1) = lhs(3,i,j+1) + comz6
+             lhs(4,i,j+1) = lhs(4,i,j+1) - comz4
+             lhs(5,i,j+1) = lhs(5,i,j+1) + comz1
+          end do
+
+          do   j=3, grid_points(2)-4
+             do  i = 1, grid_points(1)-2
+
+                lhs(1,i,j) = lhs(1,i,j) + comz1
+                lhs(2,i,j) = lhs(2,i,j) - comz4
+                lhs(3,i,j) = lhs(3,i,j) + comz6
+                lhs(4,i,j) = lhs(4,i,j) - comz4
+                lhs(5,i,j) = lhs(5,i,j) + comz1
+             end do
+          end do
+
+          do  i = 1, grid_points(1)-2
+             j = grid_points(2)-3
+             lhs(1,i,j) = lhs(1,i,j) + comz1
+             lhs(2,i,j) = lhs(2,i,j) - comz4
+             lhs(3,i,j) = lhs(3,i,j) + comz6
+             lhs(4,i,j) = lhs(4,i,j) - comz4
+
+             lhs(1,i,j+1) = lhs(1,i,j+1) + comz1
+             lhs(2,i,j+1) = lhs(2,i,j+1) - comz4
+             lhs(3,i,j+1) = lhs(3,i,j+1) + comz5
+          end do
+
+c---------------------------------------------------------------------
+c      subsequently, do the other two factors                    
+c---------------------------------------------------------------------
+          do    j = 1, grid_points(2)-2
+             do  i = 1, grid_points(1)-2
+                lhsp(1,i,j) = lhs(1,i,j)
+                lhsp(2,i,j) = lhs(2,i,j) - 
+     >                            dtty2 * speed(i,j-1,k)
+                lhsp(3,i,j) = lhs(3,i,j)
+                lhsp(4,i,j) = lhs(4,i,j) + 
+     >                            dtty2 * speed(i,j+1,k)
+                lhsp(5,i,j) = lhs(5,i,j)
+                lhsm(1,i,j) = lhs(1,i,j)
+                lhsm(2,i,j) = lhs(2,i,j) + 
+     >                            dtty2 * speed(i,j-1,k)
+                lhsm(3,i,j) = lhs(3,i,j)
+                lhsm(4,i,j) = lhs(4,i,j) - 
+     >                            dtty2 * speed(i,j+1,k)
+                lhsm(5,i,j) = lhs(5,i,j)
+             end do
+          end do
+
+
+c---------------------------------------------------------------------
+c                          FORWARD ELIMINATION  
+c---------------------------------------------------------------------
+
+          do    j = 0, grid_points(2)-3
+             j1 = j  + 1
+             j2 = j  + 2
+             do  i = 1, grid_points(1)-2
+                fac1      = 1.d0/lhs(3,i,j)
+                lhs(4,i,j)  = fac1*lhs(4,i,j)
+                lhs(5,i,j)  = fac1*lhs(5,i,j)
+                do    m = 1, 3
+                   rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+                end do
+                lhs(3,i,j1) = lhs(3,i,j1) -
+     >                         lhs(2,i,j1)*lhs(4,i,j)
+                lhs(4,i,j1) = lhs(4,i,j1) -
+     >                         lhs(2,i,j1)*lhs(5,i,j)
+                do    m = 1, 3
+                   rhs(m,i,j1,k) = rhs(m,i,j1,k) -
+     >                         lhs(2,i,j1)*rhs(m,i,j,k)
+                end do
+                lhs(2,i,j2) = lhs(2,i,j2) -
+     >                         lhs(1,i,j2)*lhs(4,i,j)
+                lhs(3,i,j2) = lhs(3,i,j2) -
+     >                         lhs(1,i,j2)*lhs(5,i,j)
+                do    m = 1, 3
+                   rhs(m,i,j2,k) = rhs(m,i,j2,k) -
+     >                         lhs(1,i,j2)*rhs(m,i,j,k)
+                end do
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c      The last two rows in this grid block are a bit different, 
+c      since they do not have two more rows available for the
+c      elimination of off-diagonal entries
+c---------------------------------------------------------------------
+
+          j  = grid_points(2)-2
+          j1 = grid_points(2)-1
+          do  i = 1, grid_points(1)-2
+             fac1      = 1.d0/lhs(3,i,j)
+             lhs(4,i,j)  = fac1*lhs(4,i,j)
+             lhs(5,i,j)  = fac1*lhs(5,i,j)
+             do    m = 1, 3
+                rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+             end do
+             lhs(3,i,j1) = lhs(3,i,j1) -
+     >                      lhs(2,i,j1)*lhs(4,i,j)
+             lhs(4,i,j1) = lhs(4,i,j1) -
+     >                      lhs(2,i,j1)*lhs(5,i,j)
+             do    m = 1, 3
+                rhs(m,i,j1,k) = rhs(m,i,j1,k) -
+     >                      lhs(2,i,j1)*rhs(m,i,j,k)
+             end do
+c---------------------------------------------------------------------
+c            scale the last row immediately 
+c---------------------------------------------------------------------
+             fac2      = 1.d0/lhs(3,i,j1)
+             do    m = 1, 3
+                rhs(m,i,j1,k) = fac2*rhs(m,i,j1,k)
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c      do the u+c and the u-c factors                 
+c---------------------------------------------------------------------
+          do    j = 0, grid_points(2)-3
+             j1 = j  + 1
+             j2 = j  + 2
+             do  i = 1, grid_points(1)-2
+                m = 4
+                fac1       = 1.d0/lhsp(3,i,j)
+                lhsp(4,i,j)  = fac1*lhsp(4,i,j)
+                lhsp(5,i,j)  = fac1*lhsp(5,i,j)
+                rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+                lhsp(3,i,j1) = lhsp(3,i,j1) -
+     >                       lhsp(2,i,j1)*lhsp(4,i,j)
+                lhsp(4,i,j1) = lhsp(4,i,j1) -
+     >                       lhsp(2,i,j1)*lhsp(5,i,j)
+                rhs(m,i,j1,k) = rhs(m,i,j1,k) -
+     >                       lhsp(2,i,j1)*rhs(m,i,j,k)
+                lhsp(2,i,j2) = lhsp(2,i,j2) -
+     >                       lhsp(1,i,j2)*lhsp(4,i,j)
+                lhsp(3,i,j2) = lhsp(3,i,j2) -
+     >                       lhsp(1,i,j2)*lhsp(5,i,j)
+                rhs(m,i,j2,k) = rhs(m,i,j2,k) -
+     >                       lhsp(1,i,j2)*rhs(m,i,j,k)
+                m = 5
+                fac1       = 1.d0/lhsm(3,i,j)
+                lhsm(4,i,j)  = fac1*lhsm(4,i,j)
+                lhsm(5,i,j)  = fac1*lhsm(5,i,j)
+                rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+                lhsm(3,i,j1) = lhsm(3,i,j1) -
+     >                       lhsm(2,i,j1)*lhsm(4,i,j)
+                lhsm(4,i,j1) = lhsm(4,i,j1) -
+     >                       lhsm(2,i,j1)*lhsm(5,i,j)
+                rhs(m,i,j1,k) = rhs(m,i,j1,k) -
+     >                       lhsm(2,i,j1)*rhs(m,i,j,k)
+                lhsm(2,i,j2) = lhsm(2,i,j2) -
+     >                       lhsm(1,i,j2)*lhsm(4,i,j)
+                lhsm(3,i,j2) = lhsm(3,i,j2) -
+     >                       lhsm(1,i,j2)*lhsm(5,i,j)
+                rhs(m,i,j2,k) = rhs(m,i,j2,k) -
+     >                       lhsm(1,i,j2)*rhs(m,i,j,k)
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c         And again the last two rows separately
+c---------------------------------------------------------------------
+          j  = grid_points(2)-2
+          j1 = grid_points(2)-1
+          do  i = 1, grid_points(1)-2
+             m = 4
+             fac1       = 1.d0/lhsp(3,i,j)
+             lhsp(4,i,j)  = fac1*lhsp(4,i,j)
+             lhsp(5,i,j)  = fac1*lhsp(5,i,j)
+             rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+             lhsp(3,i,j1) = lhsp(3,i,j1) -
+     >                    lhsp(2,i,j1)*lhsp(4,i,j)
+             lhsp(4,i,j1) = lhsp(4,i,j1) -
+     >                    lhsp(2,i,j1)*lhsp(5,i,j)
+             rhs(m,i,j1,k)   = rhs(m,i,j1,k) -
+     >                    lhsp(2,i,j1)*rhs(m,i,j,k)
+             m = 5
+             fac1       = 1.d0/lhsm(3,i,j)
+             lhsm(4,i,j)  = fac1*lhsm(4,i,j)
+             lhsm(5,i,j)  = fac1*lhsm(5,i,j)
+             rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+             lhsm(3,i,j1) = lhsm(3,i,j1) -
+     >                    lhsm(2,i,j1)*lhsm(4,i,j)
+             lhsm(4,i,j1) = lhsm(4,i,j1) -
+     >                    lhsm(2,i,j1)*lhsm(5,i,j)
+             rhs(m,i,j1,k)   = rhs(m,i,j1,k) -
+     >                    lhsm(2,i,j1)*rhs(m,i,j,k)
+c---------------------------------------------------------------------
+c               Scale the last row immediately 
+c---------------------------------------------------------------------
+             rhs(4,i,j1,k)   = rhs(4,i,j1,k)/lhsp(3,i,j1)
+             rhs(5,i,j1,k)   = rhs(5,i,j1,k)/lhsm(3,i,j1)
+          end do
+
+
+c---------------------------------------------------------------------
+c                         BACKSUBSTITUTION 
+c---------------------------------------------------------------------
+
+          j  = grid_points(2)-2
+          j1 = grid_points(2)-1
+          do  i = 1, grid_points(1)-2
+             do   m = 1, 3
+                rhs(m,i,j,k) = rhs(m,i,j,k) -
+     >                           lhs(4,i,j)*rhs(m,i,j1,k)
+             end do
+
+             rhs(4,i,j,k) = rhs(4,i,j,k) -
+     >                           lhsp(4,i,j)*rhs(4,i,j1,k)
+             rhs(5,i,j,k) = rhs(5,i,j,k) -
+     >                           lhsm(4,i,j)*rhs(5,i,j1,k)
+          end do
+
+c---------------------------------------------------------------------
+c      The first three factors
+c---------------------------------------------------------------------
+          do   j = grid_points(2)-3, 0, -1
+             j1 = j  + 1
+             j2 = j  + 2
+             do  i = 1, grid_points(1)-2
+                do   m = 1, 3
+                   rhs(m,i,j,k) = rhs(m,i,j,k) - 
+     >                          lhs(4,i,j)*rhs(m,i,j1,k) -
+     >                          lhs(5,i,j)*rhs(m,i,j2,k)
+                end do
+
+c---------------------------------------------------------------------
+c      And the remaining two
+c---------------------------------------------------------------------
+                rhs(4,i,j,k) = rhs(4,i,j,k) - 
+     >                          lhsp(4,i,j)*rhs(4,i,j1,k) -
+     >                          lhsp(5,i,j)*rhs(4,i,j2,k)
+                rhs(5,i,j,k) = rhs(5,i,j,k) - 
+     >                          lhsm(4,i,j)*rhs(5,i,j1,k) -
+     >                          lhsm(5,i,j)*rhs(5,i,j2,k)
+             end do
+          end do
+
+       end do
+       if (timeron) call timer_stop(t_ysolve)
+
+
+       call pinvr
+
+       return
+       end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/z_solve.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/z_solve.f
new file mode 100644
index 0000000..b012d99
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/SP/z_solve.f
@@ -0,0 +1,330 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine z_solve
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c This subroutine performs the solution of the approximate factorization
+c step in the z-direction for all five matrix components
+c simultaneously. The Thomas algorithm is employed to solve the
+c systems for the z-lines. Boundary conditions are non-periodic
+c---------------------------------------------------------------------
+
+       include 'header.h'
+
+       integer i, j, k, k1, k2, m
+       double precision ru1, fac1, fac2
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c Prepare for z-solve, array redistribution   
+c---------------------------------------------------------------------
+
+       if (timeron) call timer_start(t_zsolve)
+!$omp parallel do default(shared) private(i,j,k,k1,k2,m,
+!$omp&    ru1,fac1,fac2)
+       do   j = 1, ny2
+
+          call lhsinitj(nz2+1, nx2)
+
+c---------------------------------------------------------------------
+c Computes the left hand side for the three z-factors   
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c first fill the lhs for the u-eigenvalue                          
+c---------------------------------------------------------------------
+
+          do   i = 1, nx2
+             do   k = 0, nz2 + 1
+                ru1 = c3c4*rho_i(i,j,k)
+                cv(k) = ws(i,j,k)
+                rhos(k) = dmax1(dz4 + con43 * ru1,
+     >                          dz5 + c1c5 * ru1,
+     >                          dzmax + ru1,
+     >                          dz1)
+             end do
+
+             do   k =  1, nz2
+                lhs(1,i,k) =  0.0d0
+                lhs(2,i,k) = -dttz2 * cv(k-1) - dttz1 * rhos(k-1)
+                lhs(3,i,k) =  1.0d0 + c2dttz1 * rhos(k)
+                lhs(4,i,k) =  dttz2 * cv(k+1) - dttz1 * rhos(k+1)
+                lhs(5,i,k) =  0.0d0
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c      add fourth order dissipation                                  
+c---------------------------------------------------------------------
+
+          do   i = 1, nx2
+             k = 1
+             lhs(3,i,k) = lhs(3,i,k) + comz5
+             lhs(4,i,k) = lhs(4,i,k) - comz4
+             lhs(5,i,k) = lhs(5,i,k) + comz1
+
+             k = 2
+             lhs(2,i,k) = lhs(2,i,k) - comz4
+             lhs(3,i,k) = lhs(3,i,k) + comz6
+             lhs(4,i,k) = lhs(4,i,k) - comz4
+             lhs(5,i,k) = lhs(5,i,k) + comz1
+          end do
+
+          do    k = 3, nz2-2
+             do   i = 1, nx2
+                lhs(1,i,k) = lhs(1,i,k) + comz1
+                lhs(2,i,k) = lhs(2,i,k) - comz4
+                lhs(3,i,k) = lhs(3,i,k) + comz6
+                lhs(4,i,k) = lhs(4,i,k) - comz4
+                lhs(5,i,k) = lhs(5,i,k) + comz1
+             end do
+          end do
+
+          do   i = 1, nx2
+             k = nz2-1
+             lhs(1,i,k) = lhs(1,i,k) + comz1
+             lhs(2,i,k) = lhs(2,i,k) - comz4
+             lhs(3,i,k) = lhs(3,i,k) + comz6
+             lhs(4,i,k) = lhs(4,i,k) - comz4
+
+             k = nz2
+             lhs(1,i,k) = lhs(1,i,k) + comz1
+             lhs(2,i,k) = lhs(2,i,k) - comz4
+             lhs(3,i,k) = lhs(3,i,k) + comz5
+          end do
+
+
+c---------------------------------------------------------------------
+c      subsequently, fill the other factors (u+c), (u-c) 
+c---------------------------------------------------------------------
+          do    k = 1, nz2
+             do   i = 1, nx2
+                lhsp(1,i,k) = lhs(1,i,k)
+                lhsp(2,i,k) = lhs(2,i,k) - 
+     >                            dttz2 * speed(i,j,k-1)
+                lhsp(3,i,k) = lhs(3,i,k)
+                lhsp(4,i,k) = lhs(4,i,k) + 
+     >                            dttz2 * speed(i,j,k+1)
+                lhsp(5,i,k) = lhs(5,i,k)
+                lhsm(1,i,k) = lhs(1,i,k)
+                lhsm(2,i,k) = lhs(2,i,k) + 
+     >                            dttz2 * speed(i,j,k-1)
+                lhsm(3,i,k) = lhs(3,i,k)
+                lhsm(4,i,k) = lhs(4,i,k) - 
+     >                            dttz2 * speed(i,j,k+1)
+                lhsm(5,i,k) = lhs(5,i,k)
+             end do
+          end do
+
+
+c---------------------------------------------------------------------
+c                          FORWARD ELIMINATION  
+c---------------------------------------------------------------------
+
+          do    k = 0, grid_points(3)-3
+             k1 = k  + 1
+             k2 = k  + 2
+             do   i = 1, nx2
+                fac1      = 1.d0/lhs(3,i,k)
+                lhs(4,i,k)  = fac1*lhs(4,i,k)
+                lhs(5,i,k)  = fac1*lhs(5,i,k)
+                do    m = 1, 3
+                   rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+                end do
+                lhs(3,i,k1) = lhs(3,i,k1) -
+     >                         lhs(2,i,k1)*lhs(4,i,k)
+                lhs(4,i,k1) = lhs(4,i,k1) -
+     >                         lhs(2,i,k1)*lhs(5,i,k)
+                do    m = 1, 3
+                   rhs(m,i,j,k1) = rhs(m,i,j,k1) -
+     >                         lhs(2,i,k1)*rhs(m,i,j,k)
+                end do
+                lhs(2,i,k2) = lhs(2,i,k2) -
+     >                         lhs(1,i,k2)*lhs(4,i,k)
+                lhs(3,i,k2) = lhs(3,i,k2) -
+     >                         lhs(1,i,k2)*lhs(5,i,k)
+                do    m = 1, 3
+                   rhs(m,i,j,k2) = rhs(m,i,j,k2) -
+     >                         lhs(1,i,k2)*rhs(m,i,j,k)
+                end do
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c      The last two rows in this grid block are a bit different, 
+c      since they do not have two more rows available for the
+c      elimination of off-diagonal entries
+c---------------------------------------------------------------------
+          k  = grid_points(3)-2
+          k1 = grid_points(3)-1
+          do   i = 1, nx2
+             fac1      = 1.d0/lhs(3,i,k)
+             lhs(4,i,k)  = fac1*lhs(4,i,k)
+             lhs(5,i,k)  = fac1*lhs(5,i,k)
+             do    m = 1, 3
+                rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+             end do
+             lhs(3,i,k1) = lhs(3,i,k1) -
+     >                      lhs(2,i,k1)*lhs(4,i,k)
+             lhs(4,i,k1) = lhs(4,i,k1) -
+     >                      lhs(2,i,k1)*lhs(5,i,k)
+             do    m = 1, 3
+                rhs(m,i,j,k1) = rhs(m,i,j,k1) -
+     >                      lhs(2,i,k1)*rhs(m,i,j,k)
+             end do
+c---------------------------------------------------------------------
+c               scale the last row immediately
+c---------------------------------------------------------------------
+             fac2      = 1.d0/lhs(3,i,k1)
+             do    m = 1, 3
+                rhs(m,i,j,k1) = fac2*rhs(m,i,j,k1)
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c      do the u+c and the u-c factors               
+c---------------------------------------------------------------------
+          do    k = 0, grid_points(3)-3
+             k1 = k  + 1
+             k2 = k  + 2
+             do   i = 1, nx2
+                m = 4
+                fac1       = 1.d0/lhsp(3,i,k)
+                lhsp(4,i,k)  = fac1*lhsp(4,i,k)
+                lhsp(5,i,k)  = fac1*lhsp(5,i,k)
+                rhs(m,i,j,k)  = fac1*rhs(m,i,j,k)
+                lhsp(3,i,k1) = lhsp(3,i,k1) -
+     >                       lhsp(2,i,k1)*lhsp(4,i,k)
+                lhsp(4,i,k1) = lhsp(4,i,k1) -
+     >                       lhsp(2,i,k1)*lhsp(5,i,k)
+                rhs(m,i,j,k1) = rhs(m,i,j,k1) -
+     >                       lhsp(2,i,k1)*rhs(m,i,j,k)
+                lhsp(2,i,k2) = lhsp(2,i,k2) -
+     >                       lhsp(1,i,k2)*lhsp(4,i,k)
+                lhsp(3,i,k2) = lhsp(3,i,k2) -
+     >                       lhsp(1,i,k2)*lhsp(5,i,k)
+                rhs(m,i,j,k2) = rhs(m,i,j,k2) -
+     >                       lhsp(1,i,k2)*rhs(m,i,j,k)
+                m = 5
+                fac1       = 1.d0/lhsm(3,i,k)
+                lhsm(4,i,k)  = fac1*lhsm(4,i,k)
+                lhsm(5,i,k)  = fac1*lhsm(5,i,k)
+                rhs(m,i,j,k)  = fac1*rhs(m,i,j,k)
+                lhsm(3,i,k1) = lhsm(3,i,k1) -
+     >                       lhsm(2,i,k1)*lhsm(4,i,k)
+                lhsm(4,i,k1) = lhsm(4,i,k1) -
+     >                       lhsm(2,i,k1)*lhsm(5,i,k)
+                rhs(m,i,j,k1) = rhs(m,i,j,k1) -
+     >                       lhsm(2,i,k1)*rhs(m,i,j,k)
+                lhsm(2,i,k2) = lhsm(2,i,k2) -
+     >                       lhsm(1,i,k2)*lhsm(4,i,k)
+                lhsm(3,i,k2) = lhsm(3,i,k2) -
+     >                       lhsm(1,i,k2)*lhsm(5,i,k)
+                rhs(m,i,j,k2) = rhs(m,i,j,k2) -
+     >                       lhsm(1,i,k2)*rhs(m,i,j,k)
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c         And again the last two rows separately
+c---------------------------------------------------------------------
+          k  = grid_points(3)-2
+          k1 = grid_points(3)-1
+          do   i = 1, nx2
+             m = 4
+             fac1       = 1.d0/lhsp(3,i,k)
+             lhsp(4,i,k)  = fac1*lhsp(4,i,k)
+             lhsp(5,i,k)  = fac1*lhsp(5,i,k)
+             rhs(m,i,j,k)  = fac1*rhs(m,i,j,k)
+             lhsp(3,i,k1) = lhsp(3,i,k1) -
+     >                    lhsp(2,i,k1)*lhsp(4,i,k)
+             lhsp(4,i,k1) = lhsp(4,i,k1) -
+     >                    lhsp(2,i,k1)*lhsp(5,i,k)
+             rhs(m,i,j,k1) = rhs(m,i,j,k1) -
+     >                    lhsp(2,i,k1)*rhs(m,i,j,k)
+             m = 5
+             fac1       = 1.d0/lhsm(3,i,k)
+             lhsm(4,i,k)  = fac1*lhsm(4,i,k)
+             lhsm(5,i,k)  = fac1*lhsm(5,i,k)
+             rhs(m,i,j,k)  = fac1*rhs(m,i,j,k)
+             lhsm(3,i,k1) = lhsm(3,i,k1) -
+     >                    lhsm(2,i,k1)*lhsm(4,i,k)
+             lhsm(4,i,k1) = lhsm(4,i,k1) -
+     >                    lhsm(2,i,k1)*lhsm(5,i,k)
+             rhs(m,i,j,k1) = rhs(m,i,j,k1) -
+     >                    lhsm(2,i,k1)*rhs(m,i,j,k)
+c---------------------------------------------------------------------
+c               Scale the last row immediately (some of this is overkill
+c               if this is the last cell)
+c---------------------------------------------------------------------
+             rhs(4,i,j,k1) = rhs(4,i,j,k1)/lhsp(3,i,k1)
+             rhs(5,i,j,k1) = rhs(5,i,j,k1)/lhsm(3,i,k1)
+          end do
+
+
+c---------------------------------------------------------------------
+c                         BACKSUBSTITUTION 
+c---------------------------------------------------------------------
+
+          k  = grid_points(3)-2
+          k1 = grid_points(3)-1
+          do   i = 1, nx2
+             do   m = 1, 3
+                rhs(m,i,j,k) = rhs(m,i,j,k) -
+     >                             lhs(4,i,k)*rhs(m,i,j,k1)
+             end do
+
+             rhs(4,i,j,k) = rhs(4,i,j,k) -
+     >                             lhsp(4,i,k)*rhs(4,i,j,k1)
+             rhs(5,i,j,k) = rhs(5,i,j,k) -
+     >                             lhsm(4,i,k)*rhs(5,i,j,k1)
+          end do
+
+c---------------------------------------------------------------------
+c      Whether or not this is the last processor, we always have
+c      to complete the back-substitution 
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c      The first three factors
+c---------------------------------------------------------------------
+          do   k = grid_points(3)-3, 0, -1
+             k1 = k  + 1
+             k2 = k  + 2
+             do   i = 1, nx2
+                do   m = 1, 3
+                   rhs(m,i,j,k) = rhs(m,i,j,k) - 
+     >                          lhs(4,i,k)*rhs(m,i,j,k1) -
+     >                          lhs(5,i,k)*rhs(m,i,j,k2)
+                end do
+
+c---------------------------------------------------------------------
+c      And the remaining two
+c---------------------------------------------------------------------
+                rhs(4,i,j,k) = rhs(4,i,j,k) - 
+     >                          lhsp(4,i,k)*rhs(4,i,j,k1) -
+     >                          lhsp(5,i,k)*rhs(4,i,j,k2)
+                rhs(5,i,j,k) = rhs(5,i,j,k) - 
+     >                          lhsm(4,i,k)*rhs(5,i,j,k1) -
+     >                          lhsm(5,i,k)*rhs(5,i,j,k2)
+             end do
+          end do
+
+       end do
+       if (timeron) call timer_stop(t_zsolve)
+
+       call tzetar
+
+       return
+       end
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/Makefile b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/Makefile
new file mode 100644
index 0000000..f996f20
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/Makefile
@@ -0,0 +1,62 @@
+SHELL=/bin/sh
+BENCHMARK=ua
+BENCHMARKU=UA
+
+include ../config/make.def
+
+
+OBJS = ua.o convect.o diffuse.o adapt.o move.o mason.o \
+       precond.o utils.o verify.o setup.o \
+       ${COMMON}/print_results.o ${COMMON}/timers.o ${COMMON}/wtime.o
+ifeq (${HOOKS}, 1)
+        OBJS += ${COMMON}/hooks.o ${COMMON}/m5op_x86.o ${COMMON}/m5_mmap.o
+endif
+
+include ../sys/make.common
+
+# npbparams.h is included by header.h
+# The following rule should do the trick but many make programs (not gmake)
+# will do the wrong thing and rebuild the world every time (because the
+# mod time on header.h is not changed). One solution would be to
+# touch header.h but this might cause confusion if someone has
+# accidentally deleted it. Instead, make the dependency on npbparams.h
+# explicit in all the lines below (even though dependence is indirect).
+
+# header.h: npbparams.h
+
+${PROGRAM}: config ${OBJS}
+	@if [ x$(UPDATE) = xau ] ; then			\
+		${MAKE} PROGRAM=${PROGRAM} ua-au;	\
+	else						\
+		${MAKE} PROGRAM=${PROGRAM} ua-lk;	\
+	fi
+
+ua-lk: ${OBJS} transfer.o
+	${FLINK} ${FLINKFLAGS} -no-pie -o ${PROGRAM} ${OBJS} transfer.o ${F_LIB}
+
+ua-au: ${OBJS} transfer_au.o
+	${FLINK} ${FLINKFLAGS} -no-pie -o ${PROGRAM}.au ${OBJS} transfer_au.o ${F_LIB}
+
+.f.o:
+ifeq (${HOOKS}, 1)
+	${FCOMPILE} -DHOOKS $<
+else
+	${FCOMPILE} $<
+endif
+
+ua.o:        ua.f       header.h npbparams.h
+setup.o:     setup.f    header.h npbparams.h
+convect.o:   convect.f  header.h npbparams.h
+adapt.o:     adapt.f    header.h npbparams.h
+move.o:      move.f     header.h npbparams.h
+diffuse.o:   diffuse.f  header.h npbparams.h
+mason.o:     mason.f    header.h npbparams.h
+precond.o:   precond.f  header.h npbparams.h
+transfer.o:  transfer.f header.h npbparams.h
+transfer_au.o:  transfer_au.f header.h npbparams.h
+utils.o:     utils.f    header.h npbparams.h
+verify.o:    verify.f   header.h npbparams.h
+
+clean:
+	- rm -f *.o *~ mputil*
+	- rm -f npbparams.h core
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/README b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/README
new file mode 100644
index 0000000..1072796
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/README
@@ -0,0 +1,25 @@
+Note on the parallelization of transfer.f
+-----------------------------------------
+
+The file contains three major loops that update sparsely overlapped
+mortar points.  Parallelization of these loops requires atomic update
+of memory references by mortar points.  The first implementation
+uses the ATOMIC directive to perform the job.  However, on some systems
+where atomic update of memory references is not available, the ATOMIC
+directive could be implemented as a critical section, which would be 
+very expensive.  An alternative approach is to use locks to guard
+atomic updates.  The second implementation scales reasonably well.
+However, the overhead associated with calling lock routines deep 
+inside loop nests could be large.
+
+Two implementations:
+   - transfer_au.f: use ATOMIC directive for atomic updates
+   - transfer.f: use locks for atomic updates, as the default
+
+To use the first approach, one can either rename 'transfer_au.f'
+to 'transfer.f' before compilation or use the suboption "UPDATE"
+in make:
+
+   % make CLASS=<class> UPDATE=au
+
+where <class> is one of [S,W,A,B,C,D].
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/adapt.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/adapt.f
new file mode 100644
index 0000000..4c08f17
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/adapt.f
@@ -0,0 +1,1211 @@
+c-----------------------------------------------------------
+      subroutine adaptation (ifmortar,step)
+c-----------------------------------------------------------
+c     For 3-D mesh adaptation (refinement + coarsening)
+c-----------------------------------------------------------
+      include 'header.h'
+      
+      logical if_coarsen,if_refine,ifmortar,ifrepeat
+      integer iel,miel,irefine,icoarsen,neltold,step
+
+      if (timeron) call timer_start(t_adaptation)
+      ifmortar=.false.
+c.....compute heat source center(x0,y0,z0)
+      x0=x00+velx*time
+      y0=y00+vely*time
+      z0=z00+velz*time
+
+c.....Search elements to be refined. Check with restrictions. Perform
+c     refinement repeatedly until all desired refinements are done.
+
+c.....ich(iel)=0 no grid change on element iel
+c.....ich(iel)=2 iel is marked to be coarsened
+c.....ich(iel)=4 iel is marked to be refined
+
+c.....irefine records how many elements got refined
+      irefine=0
+
+c.....check whether elements need to be refined because they have overlap
+c     with the  heat source
+4     call find_refine(if_refine)
+
+      if(if_refine) then
+        ifrepeat=.true.
+2       if(ifrepeat) then
+c.........Check with restriction, unmark elements that cannot be refined.
+c         Elements preventing desired refinement will be marked to be refined.
+          call check_refine(ifrepeat) 
+          go to 2
+        end if
+c.......perform refinement
+        call do_refine(ifmortar,irefine)
+        goto 4
+      endif
+
+c.....Search for elements to be coarsened. Check with restrictions,
+c     Perform coarsening repeatedly until all possible coarsening
+c     is done.
+
+c.....icoarsen records how many elements got coarsened 
+      icoarsen=0
+
+c.....skip(iel)=.true. indicates an element no longer exists (because it
+c     got merged)
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(iel)
+      do iel=1,nelt
+        skip(iel)=.false.
+      end do
+c$OMP END PARALLEL DO
+
+      neltold=nelt
+
+c.....Check whether elements need to be coarsened because they don't have
+c     overlap with the heat source. Only elements that don't have a larger 
+c     size neighbor can be marked to be coarsened
+
+5     call find_coarsen(if_coarsen,neltold)
+
+      if(if_coarsen) then
+c.......Perform coarsening, however subject to restriction. Only possible 
+c       coarsening will be performed. if_coarsen=.true. indicates that
+c       actual coarsening happened
+        call do_coarsen(if_coarsen,icoarsen,neltold)
+        if(if_coarsen) then
+c.........ifmortar=.true. indicates the grid changed, i.e. the mortar point
+c         indices need to be regenerated on the new grid.
+          ifmortar=.true.
+          go to 5
+        end if 
+      end if
+
+      write(*,1000) step, irefine, icoarsen, nelt
+ 1000 format('Step ',i4, ': elements refined, merged, total:',
+     &       i6, 1X , i6, 1X, i6)
+
+c.....mt_to_id(miel) takes as argument the morton index  and returns the actual 
+c                    element index
+c.....id_to_mt(iel)  takes as argument the actual element index and returns the 
+c                    morton index
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(miel,iel)
+      do miel=1,nelt
+        iel=mt_to_id(miel)
+        id_to_mt(iel)=miel
+      end do 
+c$OMP END PARALLEL DO
+
+c.....Reorder the elements in the order of the morton curve. After the move 
+c     subroutine the element indices are  the same as the morton indices
+      call move
+
+c.....if the grid changed, regenerate mortar indices and update variables
+c     associated with the grid.
+      if (ifmortar) then
+        call mortar
+        call prepwork
+      endif
+      if (timeron) call timer_stop(t_adaptation)
+
+      return
+      end 
+
+
+c-----------------------------------------------------------
+      subroutine do_coarsen(if_coarsen,icoarsen,neltold)
+c---------------------------------------------------------------
+c     Coarsening procedure: 
+c     1) check with restrictions
+c     2) perform coarsening
+c---------------------------------------------------------------
+
+      include 'header.h'
+
+      logical if_coarsen, icheck,test,test1,test2,test3
+      integer iel, ntp(8), ntempmin, ic, parent, mielnew, miel,
+     &        icoarsen, i, index, num_coarsen, ntemp, ii, ntemp1, 
+     &        neltold
+      
+      if_coarsen=.false.
+
+c.....If an element has been merged, it will be skipped afterwards
+c     skip(iel)=.true. for elements that will be skipped.
+c     ifcoa_id(iel)=.true. indicates that element iel will be coarsened
+c     ifcoa(miel)=.true. indicates that element miel (morton index)
+c                        will be coarsened
+
+c$OMP PARALLEL DEFAULT(SHARED) PRIVATE(iel)
+c$OMP DO 
+      do iel=1,nelt
+        mt_to_id_old(iel)=mt_to_id(iel)
+        mt_to_id(iel)=0
+      end do
+c$OMP END DO nowait
+c$OMP DO 
+      do iel=1,neltold 
+        ifcoa_id(iel)=.false.
+      end do
+c$OMP END DO nowait
+c$OMP END PARALLEL
+
+c.....Check whether the potential coarsening would make a neighbor, or
+c     a neighbor's neighbor, and so on, break the grid restriction
+
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(miel,iel,ic,
+c$OMP& ntp,parent,test,test1,i,test2,test3)
+c$OMP& SHARED(if_coarsen)
+      do miel=1,nelt
+        ifcoa(miel)=.false.
+        front(miel)=0
+        iel=mt_to_id_old(miel)
+c.......if an element is marked to be coarsened
+        if(ich(iel).eq.2) then
+
+c.........If the current  element is the "first" child (front-left-
+c         bottom) of its parent (tree(iel) mod 8 equals 0), then 
+c         find all its neighbors. Check whether they are from the same 
+c         parent.
+
+          ic=tree(iel)
+          if(.not.btest(ic,0).and..not.btest(ic,1).and.
+     &       .not.btest(ic,2)) then
+            ntp(1)=iel
+            ntp(2)=sje(1,1,1,iel)
+            ntp(3)=sje(1,1,3,iel)
+            ntp(4)=sje(1,1,1,ntp(3))
+            ntp(5)=sje(1,1,5,iel)
+            ntp(6)=sje(1,1,1,ntp(5))
+            ntp(7)=sje(1,1,3,ntp(5))
+            ntp(8)=sje(1,1,1,ntp(7))
+ 
+            parent=ishft(tree(iel),-3)
+            test=.false.
+
+            test1=.true.
+            do i=1,8
+              if(ishft(tree(ntp(i)),-3).ne.parent)test1=.false.
+            end do
+
+c...........check whether all child elements are marked to be coarsened
+            if(test1)then
+              test2=.true.
+              do i=1,8
+                if(ich(ntp(i)).ne.2)test2=.false.
+              end do
+
+c.............check whether all child elements can be coarsened or not.
+              if(test2)then
+                test3=.true.
+                do i=1,8
+                  if(.not.icheck(ntp(i),i))test3=.false.
+                end do
+                if(test3)test=.true.
+              end if
+            end if
+c...........if the eight child elements are eligible to be coarsened
+c           mark the first child ifcoa(miel)=.true.
+c           mark them all ifcoa_id()=.true.
+c           front(miel) will be used to calculate (potentially in parallel)
+c                       how many elements with sequence numbers less than
+c                       miel will be coarsened.
+c           skip()      marks that an element will no longer exist after merge.
+
+            if(test)then
+
+              ifcoa(miel)=.true.
+              do i=1,8
+                ifcoa_id(ntp(i))=.true.
+              end do
+              front(miel)=1
+              do i=1,7
+                 skip(ntp(i+1))=.true.
+              end do
+              if(.not.if_coarsen) if_coarsen=.true.
+            end if
+          end if 
+        end if 
+      end do 
+c$OMP END PARALLEL DO
+
+c.....compute front(iel), how many elements will be coarsened before iel
+c     (including iel)
+      call parallel_add(front)
+
+c.....num_coarsen is the total number of elements that will be coarsened
+      num_coarsen=front(nelt)
+
+c.....action(i) records the morton index of the i'th element (if it is an
+c     element's front-left-bottom-child) to be coarsened.
+
+c.....create array mt_to_id to convert morton index to actual element index
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(miel,iel,mielnew)
+      do miel=1,nelt
+        iel=mt_to_id_old(miel)
+        if(.not.skip(iel))then
+          if(ifcoa(miel))then
+            action(front(miel))=miel
+            mielnew=miel-(front(miel)-1)*7
+          else 
+            mielnew=miel-front(miel)*7
+          end if
+          mt_to_id(mielnew)=iel
+        end if
+      end do
+c$OMP END PARALLEL DO
+
+c.....perform the coarsening procedure (potentially in parallel)
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(index,miel,iel,ntp)
+      do index=1,num_coarsen
+        miel=action(index)
+        iel=mt_to_id_old(miel)
+c.......find eight child elements to be coarsened
+        ntp(1)=iel
+        ntp(2)=sje(1,1,1,iel)
+        ntp(3)=sje(1,1,3,iel)
+        ntp(4)=sje(1,1,1,ntp(3))
+        ntp(5)=sje(1,1,5,iel)
+        ntp(6)=sje(1,1,1,ntp(5))
+        ntp(7)=sje(1,1,3,ntp(5))
+        ntp(8)=sje(1,1,1,ntp(7))
+c.......merge them to be the parent
+        call merging(ntp)
+      end do
+c$OMP END PARALLEL DO
+      nelt=nelt-num_coarsen*7
+      icoarsen=icoarsen+num_coarsen*8
+
+      return
+      end
+
+c-------------------------------------------------------
+      subroutine do_refine(ifmortar,irefine)
+c-------------------------------------------------------
+c     Refinement procedure
+c--------------------------------------------------------
+
+      include 'header.h'
+
+      logical ifmortar
+      double precision xctemp(8), yctemp(8), zctemp(8), xleft, xright,
+     &       yleft, yright, zleft, zright, ta1temp(lx1,lx1,lx1),
+     &       xhalf, yhalf, zhalf
+      integer iel, i, ii, jj, j, jface, 
+     &        ntemp, ndir, facedir, k, le(4), ne(4), mielnew,
+     &        miel, irefine,ntemp1, num_refine, index, treetemp,
+     &        sjetemp(2,2,6), n1, n2, nelttemp,
+     &        cb, cbctemp(6)
+
+c.....initialize
+
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(miel)
+      do miel=1,nelt
+        mt_to_id_old(miel)=mt_to_id(miel)
+        mt_to_id(miel)=0
+        action(miel)=0
+        if(ich(mt_to_id_old(miel)).ne.4)then
+          front(miel)=0
+        else
+          front(miel)=1
+        end if
+      end do
+c$OMP END PARALLEL DO
+
+c.....front(iel) records how many elements with sequence numbers less than
+c     or equal to iel will be refined
+      call parallel_add(front)
+
+c.....num_refine is the total number of elements that will be refined
+      num_refine=front(nelt)
+
+c.....action(i) records the morton index of the i'th element to be refined
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(miel,iel)
+      do miel=1,nelt
+        iel=mt_to_id_old(miel)
+        if(ich(iel).eq.4)then
+          action(front(miel))=miel
+        end if
+      end do
+c$OMP END PARALLEL DO
+
+c.....Compute array mt_to_id to convert the morton index to element index.
+c     ref_front_id(iel) is nelt plus 7 times the number of elements ahead
+c     of iel in morton order that will be refined (new children's offset).
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(miel,iel,ntemp,mielnew)
+      do miel=1,nelt
+        iel=mt_to_id_old(miel)
+        if(ich(iel).eq.4)then
+          ntemp=(front(miel)-1)*7
+          mielnew=miel+ntemp
+        else
+          ntemp=front(miel)*7
+          mielnew=miel+ntemp
+        end if
+
+        mt_to_id(mielnew)=iel
+        ref_front_id(iel)=nelt+ntemp
+      end do
+c$OMP END PARALLEL DO
+
+
+c.....Perform refinement (potentially in parallel):
+c       - Cut an element into eight children.
+c       - Assign them element indices iel, nelt+1, ..., nelt+7.
+c       - Update neighboring information.
+
+      nelttemp=nelt
+
+      if (num_refine .gt. 0) then
+        ifmortar=.true.
+      endif
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(index,miel,mielnew,iel,nelt,
+c$OMP& treetemp,xctemp,yctemp,zctemp,cbctemp,sjetemp,ta1temp,
+c$OMP& ii,jj,ntemp,xleft,xright,xhalf,yleft,yright,yhalf,zleft,zright,
+c$OMP& zhalf,ndir,facedir,jface,cb,le,ne,n1,n2,i,j,k)
+      do index=1, num_refine  
+c.......miel is old morton index and mielnew is new morton index after refinement.
+        miel=action(index)
+        mielnew=miel+(front(miel)-1)*7
+        iel=mt_to_id_old(miel) 
+        nelt=nelttemp+(front(miel)-1)*7 
+c.......save iel's information in a temporary array
+        treetemp=tree(iel)
+        do i=1,8
+          xctemp(i)=xc(i,iel)
+          yctemp(i)=yc(i,iel)
+          zctemp(i)=zc(i,iel)
+        end do
+        do i=1,6
+          cbctemp(i)=cbc(i,iel)
+          do jj=1,2
+            do ii=1,2
+              sjetemp(ii,jj,i)=sje(ii,jj,i,iel)
+            end do
+          end do
+        end do
+        call copy(ta1temp,ta1(1,1,1,iel),nxyz)
+
+
+c.......zero out iel here
+        tree(iel)=0
+        call nr_init(cbc(1,iel),6,0)
+        call nr_init(sje(1,1,1,iel),24,0)
+        call nr_init(ijel(1,1,iel),12,0)
+        call r_init(ta1(1,1,1,iel),nxyz,0.d0)
+
+
+c.......initialize new child elements:iel and nelt+1~nelt+7
+        do j=1,7 
+          mt_to_id(mielnew+j)=nelt+j
+          tree(nelt+j)=0
+          call nr_init(cbc(1,nelt+j),6,0)
+          call nr_init(sje(1,1,1,nelt+j),24,0)
+          call nr_init(ijel(1,1,nelt+j),12,0)
+          call r_init(ta1(1,1,1,nelt+j),nxyz,0.d0)
+        end do
+          
+c.......update the tree()
+        ntemp=ishft(treetemp,3)
+        tree(iel)=ntemp
+        do i=1,7
+          tree(nelt+i)=ntemp+mod(i,8)
+        end do   
+c.......update the children's vertices' coordinates
+        xhalf=xctemp(1)+(xctemp(2)-xctemp(1))/2.d0
+        xleft=xctemp(1)
+        xright=xctemp(2)
+        yhalf=yctemp(1)+(yctemp(3)-yctemp(1))/2.d0
+        yleft=yctemp(1)
+        yright=yctemp(3)
+        zhalf=zctemp(1)+(zctemp(5)-zctemp(1))/2.d0
+        zleft=zctemp(1)
+        zright=zctemp(5)
+       
+        do j=1,7,2
+          do i=1,7,2
+            xc(i,nelt+j)     = xhalf
+            xc(i+1,nelt+j)   = xright 
+          end do
+        end do
+
+        do j=2,6,2
+          do i=1,7,2
+            xc(i,nelt+j)   = xleft
+            xc(i+1,nelt+j) = xhalf
+          end do
+        end do
+         
+        do i=1,7,2
+          xc(i,iel)=xleft
+          xc(i+1,iel)=xhalf
+        end do
+
+        do i=1,2
+          yc(i,nelt+1)=yleft
+          yc(i,nelt+4)=yleft
+          yc(i,nelt+5)=yleft
+          yc(i+4,nelt+1)=yleft
+          yc(i+4,nelt+4)=yleft
+          yc(i+4,nelt+5)=yleft
+        enddo
+        do i=3,4
+          yc(i,nelt+1)=yhalf
+          yc(i,nelt+4)=yhalf
+          yc(i,nelt+5)=yhalf
+          yc(i+4,nelt+1)=yhalf
+          yc(i+4,nelt+4)=yhalf
+          yc(i+4,nelt+5)=yhalf
+        end do
+        do j=2,3
+          do i=1,2
+            yc(i,nelt+j)=yhalf
+            yc(i,nelt+j+4)=yhalf
+            yc(i+4,nelt+j)=yhalf
+            yc(i+4,nelt+j+4)=yhalf
+          end do
+          do i=3,4
+            yc(i,nelt+j)=yright
+            yc(i,nelt+j+4)=yright
+            yc(i+4,nelt+j)=yright
+            yc(i+4,nelt+j+4)=yright
+          end do
+        end do
+          
+        do i=1,2
+          yc(i,iel)=yleft
+          yc(i+4,iel)=yleft
+        end do
+        do i=3,4
+          yc(i,iel)=yhalf
+          yc(i+4,iel)=yhalf
+        end do
+
+        do j=1,3
+          do i=1,4
+            zc(i,nelt+j)=zleft
+            zc(i+4,nelt+j)=zhalf
+          end do
+        end do
+        do j=4,7
+          do i=1,4
+            zc(i,nelt+j)=zhalf
+            zc(i+4,nelt+j)=zright
+          end do
+        end do
+        do i=1,4
+          zc(i,iel)=zleft
+          zc(i+4,iel)=zhalf
+        end do
+
+c.......update the children's neighbor information
+
+c.......ndir refers to the x,y,z directions, respectively.
+c       facedir refers to the orientation of the face in each direction, 
+c       e.g. ndir=1, facedir=0 refers to face 1,
+c       and ndir =1, facedir=1 refers to face 2.
+
+        do ndir = 1, 3
+          do facedir = 0, 1
+            i=2*ndir-1+facedir
+            jface=jjface(i)
+            cb=cbctemp(i)
+
+c...........find the new element indices of the four children on each
+c           face of the parent element
+            do k = 1, 4
+              le(k) = le_arr(k,facedir,ndir)+nelt
+              ne(k) = le_arr(k,1-facedir,ndir)+nelt
+            end do
+            if(facedir.eq.0)then
+              le(1)=iel
+            else
+              ne(1)=iel
+            end if
+c...........update neighbor information of the four child elements on each 
+c           face of the parent element
+            do k=1,4
+              cbc(i,le(k))=2
+              sje(1,1,i,le(k))=ne(k)
+              ijel(1,i,le(k))=1
+              ijel(2,i,le(k))=1
+            end do
+
+c...........if the face type of the parent element is type 2
+            if(cb.eq.2) then
+              ntemp=sjetemp(1,1,i)
+
+c.............if the neighbor ntemp is not marked to be refined
+              if(ich(ntemp).ne.4)then
+                cbc(jface,ntemp)=3
+                ijel(1,jface,ntemp)=1
+                ijel(2,jface,ntemp)=1
+  
+                do k=1,4
+                  cbc(i,ne(k))=1
+                  sje(1,1,i,ne(k))=ntemp
+                  if(k.eq.1) then
+                    ijel(1,i,ne(k))=1
+                    ijel(2,i,ne(k))=1
+                    sje(1,1,jface,ntemp)=ne(k)
+                  elseif(k.eq.2) then
+                    ijel(1,i,ne(k))=1
+                    ijel(2,i,ne(k))=2
+                    sje(1,2,jface,ntemp)=ne(k)
+                  elseif(k.eq.3) then
+                    ijel(1,i,ne(k))=2
+                    ijel(2,i,ne(k))=1
+                    sje(2,1,jface,ntemp)=ne(k)
+                  elseif(k.eq.4) then
+                    ijel(1,i,ne(k))=2
+                    ijel(2,i,ne(k))=2
+                    sje(2,2,jface,ntemp)=ne(k)
+                  end if
+                end do
+
+c.............if the neighbor ntemp is also marked to be refined
+              else
+                n1=ref_front_id(ntemp)
+                 
+                do k=1,4
+                  cbc(i,ne(k))=2
+                  n2=n1+le_arr(k,facedir,ndir)
+                  if(n2.eq.n1+8)n2=ntemp
+                  sje(1,1,i,ne(k))=n2
+                  ijel(1,i,ne(k))=1
+                end do
+
+              endif
+c...........if the face type of the parent element is type 3
+            elseif(cb.eq.3) then
+              do k=1,4
+                cbc(i,ne(k))=2
+                if(k.eq.1) then
+                  ntemp=sjetemp(1,1,i)
+                elseif (k.eq.2) then
+                  ntemp=sjetemp(1,2,i)
+                elseif(k.eq.3) then
+                  ntemp=sjetemp(2,1,i)
+                elseif(k.eq.4) then
+                  ntemp=sjetemp(2,2,i)
+                end if
+                ijel(1,i,ne(k))=1
+                ijel(2,i,ne(k))=1
+                sje(1,1,i,ne(k))=ntemp
+                cbc(jface,ntemp)=2
+                sje(1,1,jface,ntemp)=ne(k)
+                ijel(1,jface,ntemp)=1
+                ijel(2,jface,ntemp)=1
+              end do
+
+c...........if the face type of the parent element is type 0
+            elseif(cb.eq.0) then
+              do k=1,4
+                cbc(i,ne(k))=cb
+              end do
+            end if
+
+          end do 
+        end do 
+
+c.......map solution from parent element to children
+        call remap(ta1(1,1,1,iel),ta1(1,1,1,ref_front_id(iel)+1),
+     &             ta1temp(1,1,1))
+      end do
+c$OMP END PARALLEL DO
+
+      nelt=nelttemp+num_refine*7
+      irefine=irefine+num_refine
+      ntot=nelt*lx1*lx1*lx1
+      return
+      end
+
+c-----------------------------------------------------------
+       logical function ifcor(n1,n2,i,iface)
+c-----------------------------------------------------------
+c      returns whether element n1's face i and element n2's 
+c      jjface(iface) have intersections, i.e. whether n1 and 
+c      n2 are neighbored by an edge.
+c-----------------------------------------------------------
+
+       include 'header.h'
+
+       integer n1,n2,i,iface
+       logical ifsame
+
+       ifcor=.false.
+
+       if(ifsame(n1,e1v1(iface,i),n2,e2v1(iface,i)).or.
+     &    ifsame(n1,e1v2(iface,i),n2,e2v2(iface,i))) then
+          ifcor=.true.
+       end if
+
+       return
+       end
+
+c-----------------------------------------------------------
+      logical function icheck(ie,n)
+c-----------------------------------------------------------
+c     Check whether element ie's three faces (sharing vertex n)
+c     are nonconforming. This will prevent it from being coarsened.
+c     Also check ie's neighbors on those three faces, whether ie's
+c     neighbors by only an edge have a size smaller than ie's,
+c     which also prevents ie from being coarsened.
+c-----------------------------------------------------------
+      include 'header.h'
+
+      integer ie, n, iside, ntemp1, ntemp2, ntemp3, n1, n2, n3,
+     &cb2_1,cb3_1,cb1_2,cb3_2,cb1_3,cb2_3
+
+      icheck=.true.
+      cb2_1=0
+      cb3_1=0
+      cb1_2=0
+      cb3_2=0
+      cb1_3=0
+      cb2_3=0
+
+      n1=f_c(1,n)
+      n2=f_c(2,n)
+      n3=f_c(3,n)
+      if((cbc(n1,ie).eq.3) .or. (cbc(n2,ie).eq.3) .or.
+     &   (cbc(n3,ie).eq.3)) then
+         icheck=.false.
+      else
+        ntemp1=sje(1,1,n1,ie)
+        ntemp2=sje(1,1,n2,ie)
+        ntemp3=sje(1,1,n3,ie)
+        if(ntemp1.ne.0)then
+           cb2_1=cbc(n2,ntemp1)
+           cb3_1=cbc(n3,ntemp1)
+        end if
+        if(ntemp2.ne.0)then
+           cb3_2=cbc(n3,ntemp2)
+           cb1_2=cbc(n1,ntemp2)
+        end if
+        if(ntemp3.ne.0)then
+           cb1_3=cbc(n1,ntemp3)
+           cb2_3=cbc(n2,ntemp3)
+        end if
+        if((cbc(n1,ie).eq.2.and.(cb2_1.eq.3.or.
+     &                               cb3_1.eq.3)).or.
+     &     (cbc(n2,ie).eq.2.and.(cb3_2.eq.3.or.
+     &                               cb1_2.eq.3)).or.
+     &     (cbc(n3,ie).eq.2.and.(cb1_3.eq.3.or.
+     &                              cb2_3.eq.3)))then
+          icheck=.false.
+        end if
+      end if
+
+      return
+      end 
+
+c-----------------------------------------------------------
+      subroutine find_coarsen(if_coarsen,neltold)
+c-----------------------------------------------------------
+c     Search elements to be coarsened. Check with restrictions.
+c     This subroutine only checks the element itself, not its
+c     neighbors.
+c-----------------------------------------------------------
+      
+      include 'header.h'
+
+      logical if_coarsen, iftemp, iftouch
+      integer iel,i,neltold
+
+      if_coarsen=.false.
+
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(iel,i,iftemp)
+c$OMP& SHARED(if_coarsen)
+      do iel=1,neltold
+        if(.not.skip(iel))then
+          ich(iel)=0
+          if(.not.iftouch(iel)) then
+            iftemp=.false.
+            do i=1,nsides
+c.............if iel has a larger size than its face neighbors, it
+c             can not be coarsened
+              if(cbc(i,iel).eq.3) then
+                iftemp=.true.
+              endif
+            enddo
+            if(.not.iftemp) then
+              if(.not.if_coarsen) if_coarsen=.true.
+              ich(iel)=2
+            end if
+          end if
+        endif
+      enddo
+c$OMP END PARALLEL DO
+
+      return
+      end
+
+c-----------------------------------------------------------
+      subroutine find_refine(if_refine)
+c-----------------------------------------------------------
+c     search elements to be refined based on whether they
+c     have overlap with the heat source
+c-----------------------------------------------------------
+
+      include 'header.h'
+
+      logical if_refine, iftouch
+      integer iel
+
+      if_refine=.false.
+
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(iel)
+c$OMP& SHARED(if_refine)
+      do iel=1,nelt
+        ich(iel)=0
+        if(iftouch(iel)) then
+          if((xc(2,iel)-xc(1,iel)).gt.dlmin) then
+            if(.not.if_refine) if_refine=.true.
+            ich(iel)=4
+          end if
+        end if
+      enddo
+c$OMP END PARALLEL DO
+
+      return
+      end
+
+c-----------------------------------------------------------------
+      subroutine check_refine(ifrepeat)
+c-----------------------------------------------------------------
+c     Check whether the potential refinement will violate the
+c     restriction. If so, mark the neighbor and unmark the
+c     original element, and set ifrepeat true. i.e. this procedure
+c     needs to be repeated until no further check is needed
+c-----------------------------------------------------------------
+
+      include 'header.h'
+ 
+      logical ifrepeat,ifcor
+      integer iel,iface,ntemp,nntemp,i,jface
+
+      ifrepeat=.false.
+
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(iel,i,jface,ntemp,
+c$OMP& iface,nntemp) SHARED(ifrepeat)
+      do iel=1,nelt
+c.......if iel is marked to be refined
+        if(ich(iel).eq.4) then
+c.........check its six faces
+          do i=1,nsides
+            jface=jjface(i)
+            ntemp=sje(1,1,i,iel)
+c...........if one face neighbor is larger in size than iel
+            if(cbc(i,iel).eq.1) then
+c.............unmark iel
+              ich(iel)=0
+c.............the large size neighbor ntemp is marked to be refined
+              if(ich(ntemp).ne.4) then
+                if(.not.ifrepeat) ifrepeat=.true.
+                ich(ntemp)=4
+              end if
+c.............check iel's neighbor, neighbored by an edge on face i, which
+c             must be a face neighbor of ntemp
+              do iface=1,nsides
+                if(iface.ne.i.and.iface.ne.jface) then
+c................if edge neighbors are larger than iel, mark them to be refined
+                  if(cbc(iface,ntemp).eq.2) then
+                    nntemp=sje(1,1,iface,ntemp)
+c..................ifcor makes sure the edge neighbor exists
+                    if(ich(nntemp).ne.4.and.
+     &                 ifcor(iel,nntemp,i,iface))then
+                      ich(nntemp)=4
+                    end if
+                  end if
+                end if
+              end do
+c...........if the face neighbor is the same size as iel, check edge neighbors
+            elseif(cbc(i,iel).eq.2)then
+              do iface=1,nsides
+                if(iface.ne.i.and.iface.ne.jface) then
+                  if(cbc(iface,ntemp).eq.1)then
+                    nntemp=sje(1,1,iface,ntemp)
+                    ich(nntemp)=4
+                    ich(iel)=0
+                    if(.not.ifrepeat) ifrepeat=.true.
+                  end if
+                end if
+              end do
+            end if
+          enddo
+        end if
+      end do
+c$OMP END PARALLEL DO
+
+      return
+      end
+
+c-----------------------------------------------------------------
+      logical function iftouch(iel)
+c-----------------------------------------------------------------
+c     check whether element iel has overlap with the heat source
+c-----------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision dis, dis1, dis2, dis3, alpha2
+      integer iel
+
+      alpha2 = alpha*alpha
+
+      if     (x0 .lt. xc(1,iel)) then
+        dis1 = xc(1,iel) - x0
+      elseif (x0 .gt. xc(2,iel)) then
+        dis1 = x0 - xc(2,iel)
+      else
+        dis1 = 0.d0
+      endif
+
+      if     (y0 .lt. yc(1,iel)) then
+        dis2 = yc(1,iel) - y0
+      elseif (y0 .gt. yc(3,iel)) then
+        dis2 = y0 - yc(3,iel)
+      else
+        dis2 = 0.d0
+      endif
+
+      if     (z0 .lt. zc(1,iel)) then
+        dis3 = zc(1,iel) - z0
+      elseif (z0 .gt. zc(5,iel)) then
+        dis3 = z0 - zc(5,iel)
+      else
+       dis3 = 0.d0
+      endif
+
+      dis = dis1**2+dis2**2+dis3**2
+
+      if (dis .lt. alpha2) then
+       iftouch=.true.
+      else
+       iftouch=.false.
+      end if
+
+      return
+      end
+
+
+c-----------------------------------------------------------------
+      subroutine remap (y,y1,x) 
+c-----------------------------------------------------------------
+c     After a refinement, map the solution from the parent (x) to
+c     the eight children. y is the solution on the first child
+c     (front-bottom-left) and y1 is the solution on the next 7
+c     children.
+c-----------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision x(lx1,lx1,lx1),y(lx1,lx1,lx1),y1(lx1,lx1,lx1,7),
+     &       yone(lx1,lx1,lx1,2), ytwo(lx1,lx1,lx1,4)
+      integer i, iz, ii, jj, kk
+
+      call r_init(y,lx1*lx1*lx1,0.d0)
+      call r_init(y1,lx1*lx1*lx1*7,0.d0)
+      call r_init(yone,lx1*lx1*lx1*2,0.d0)
+      call r_init(ytwo,lx1*lx1*lx1*4,0.d0)
+
+      do  i=1,lx1
+        do kk = 1, lx1
+          do jj = 1, lx1
+            do ii = 1, lx1
+              yone(ii,jj,i,1) = yone(ii,jj,i,1) +ixmc1(ii,kk)*x(kk,jj,i)
+              yone(ii,jj,i,2) = yone(ii,jj,i,2) +ixmc2(ii,kk)*x(kk,jj,i)
+            end do
+          end do
+        end do
+
+        do kk = 1, lx1
+          do jj = 1, lx1
+            do ii = 1, lx1
+              ytwo(ii,i,jj,1) = ytwo(ii,i,jj,1) + 
+     &                          yone(ii,kk,i,1)*ixtmc1(kk,jj)
+              ytwo(ii,i,jj,2) = ytwo(ii,i,jj,2) + 
+     &                          yone(ii,kk,i,1)*ixtmc2(kk,jj)
+              ytwo(ii,i,jj,3) = ytwo(ii,i,jj,3) + 
+     &                          yone(ii,kk,i,2)*ixtmc1(kk,jj)
+              ytwo(ii,i,jj,4) = ytwo(ii,i,jj,4) + 
+     &                          yone(ii,kk,i,2)*ixtmc2(kk,jj)
+            end do
+          end do
+        end do
+      end do
+
+      do  iz=1,lx1
+        do kk = 1, lx1
+          do jj = 1, lx1
+            do ii = 1, lx1
+              y(ii,iz,jj) = y(ii,iz,jj) +
+     &                        ytwo(ii,kk,iz,1)*ixtmc1(kk,jj)
+              y1(ii,iz,jj,1) = y1(ii,iz,jj,1) +
+     &                        ytwo(ii,kk,iz,3)*ixtmc1(kk,jj)
+              y1(ii,iz,jj,2) = y1(ii,iz,jj,2) +
+     &                        ytwo(ii,kk,iz,2)*ixtmc1(kk,jj)
+              y1(ii,iz,jj,3) = y1(ii,iz,jj,3) +
+     &                        ytwo(ii,kk,iz,4)*ixtmc1(kk,jj)
+              y1(ii,iz,jj,4) = y1(ii,iz,jj,4) +
+     &                        ytwo(ii,kk,iz,1)*ixtmc2(kk,jj)
+              y1(ii,iz,jj,5) = y1(ii,iz,jj,5) +
+     &                        ytwo(ii,kk,iz,3)*ixtmc2(kk,jj)
+              y1(ii,iz,jj,6) = y1(ii,iz,jj,6) +
+     &                        ytwo(ii,kk,iz,2)*ixtmc2(kk,jj)
+              y1(ii,iz,jj,7) = y1(ii,iz,jj,7) +
+     &                        ytwo(ii,kk,iz,4)*ixtmc2(kk,jj)            
+            end do
+          end do
+        end do
+      end do
+
+      return
+      end
+
+
+c=======================================================================
+      subroutine merging(iela)
+c-----------------------------------------------------------------------
+c     This subroutine merges the eight child elements and maps
+c     the solution from the eight children to the merged element.
+c     The iela array records the eight elements to be merged.
+c-----------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision x1,x2,y1,y2,z1,z2
+      integer ielnew,i,ntemp,jface,ii,cb,ntempa(4),iela(8),ielold,
+     &        ntema(4)
+
+      ielnew=iela(1)
+
+      tree(ielnew)=ishft(tree(ielnew),-3)   
+
+c.....element vertices 
+      x1=xc(1,iela(1))
+      x2=xc(2,iela(2))
+      y1=yc(1,iela(1))
+      y2=yc(3,iela(3))
+      z1=zc(1,iela(1))
+      z2=zc(5,iela(5))
+
+      do i=1,7,2
+        xc(i,ielnew)=x1
+      end do
+      do i=2,8,2
+        xc(i,ielnew)=x2
+      end do
+      do i=1,2
+        yc(i,ielnew)=y1
+        yc(i+4,ielnew)=y1
+      end do
+      do i=3,4
+        yc(i,ielnew)=y2
+        yc(i+4,ielnew)=y2
+      end do
+      do i=1,4
+        zc(i,ielnew)=z1
+      end do
+      do i=5,8
+        zc(i,ielnew)=z2
+      end do
+
+c.....update neighboring information
+      do i=1,nsides
+        jface=jjface(i)
+        ielold=iela(children(1,i))
+        do ii=1,4
+          ntempa(ii)=iela(children(ii,i))
+        end do
+
+        cb=cbc(i,ielold)
+       
+        if (cb.eq.2) then
+c.........if the neighbor elements also will be coarsened
+          if(ifcoa_id(sje(1,1,i,ielold)))then
+            if (i.eq.2 .or. i.eq. 4 .or. i.eq.6) then
+              ntemp=sje(1,1,i,sje(1,1,i,ntempa(1)))
+            else
+              ntemp=sje(1,1,i,ntempa(1))
+            end if 
+            sje(1,1,i,ielnew)=ntemp
+            ijel(1,i,ielnew)=1
+            ijel(2,i,ielnew)=1
+            cbc(i,ielnew)=2
+
+c.........if the neighbor elements will not be coarsened
+          else
+            do ii=1,4
+              ntema(ii)=sje(1,1,i,ntempa(ii)) 
+              cbc(jface,ntema(ii))=1
+              sje(1,1,jface,ntema(ii))=ielnew
+              ijel(1,jface,ntema(ii))=iijj(1,ii)
+              ijel(2,jface,ntema(ii))=iijj(2,ii)
+              sje(iijj(1,ii),iijj(2,ii),i,ielnew)=ntema(ii)
+              ijel(1,i,ielnew)=1
+              ijel(2,i,ielnew)=1
+            end do
+            cbc(i,ielnew)=3
+          end if       
+
+        else if(cb.eq.1)then
+
+          ntemp=sje(1,1,i,ielold)
+          cbc(jface,ntemp)=2
+          ijel(1,jface,ntemp)=1
+          ijel(2,jface,ntemp)=1
+          sje(1,1,jface,ntemp)=ielnew
+          sje(1,2,jface,ntemp)=0
+          sje(2,1,jface,ntemp)=0
+          sje(2,2,jface,ntemp)=0
+           
+          cbc(i,ielnew)=2
+          ijel(1,i,ielnew)=1
+          ijel(2,i,ielnew)=1
+          sje(1,1,i,ielnew)=ntemp
+         
+        else if(cb.eq.0)then
+          cbc(i,ielnew)=0
+          sje(1,1,i,ielnew)=0
+          sje(1,2,i,ielnew)=0
+          sje(2,1,i,ielnew)=0
+          sje(2,2,i,ielnew)=0
+        endif
+
+      end do
+
+c.....map solution from children to the merged element
+      call remap2(iela, ielnew)
+      
+      return
+      end
+
+c-----------------------------------------------------------------
+      subroutine remap2(iela, ielnew)
+c-----------------------------------------------------------------
+c     Map the solution from the children to the parent.
+c     iela array records the eight elements to be merged.
+c     ielnew is the element index of the merged element.
+c-----------------------------------------------------------------
+
+      include 'header.h'
+      integer iela(8), ielnew
+
+      double precision temp1(lx1,lx1,lx1),
+     &       temp2(lx1,lx1,lx1),temp3(lx1,lx1,lx1),temp4(lx1,lx1,lx1),
+     &       temp5(lx1,lx1,lx1),temp6(lx1,lx1,lx1)
+
+      call remapx(ta1(1,1,1,iela(1)),ta1(1,1,1,iela(2)),temp1)
+      call remapx(ta1(1,1,1,iela(3)),ta1(1,1,1,iela(4)),temp2)
+      call remapx(ta1(1,1,1,iela(5)),ta1(1,1,1,iela(6)),temp3)
+      call remapx(ta1(1,1,1,iela(7)),ta1(1,1,1,iela(8)),temp4)
+      call remapy(temp1,temp2,temp5)
+      call remapy(temp3,temp4,temp6)
+      call remapz(temp5,temp6,ta1(1,1,1,ielnew))
+
+      return
+      end       
+
+c-----------------------------------------------------------------
+      subroutine remapz(x1,x2,y)
+c-----------------------------------------------------------------
+c     z direction mapping after the merge.
+c     Map solution from x1 & x2 to y.
+c-----------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision x1(lx1,lx1,lx1),x2(lx1,lx1,lx1),y(lx1,lx1,lx1)
+      integer ix, iy, ip
+
+      do iy=1,lx1
+        do ix=1,lx1
+          y(ix,iy,1)=x1(ix,iy,1)
+
+          y(ix,iy,2)=0.d0
+          do ip=1,lx1
+            y(ix,iy,2)=y(ix,iy,2)+map2(ip)*x1(ix,iy,ip)
+          end do
+
+          y(ix,iy,3)=x1(ix,iy,lx1)
+
+          y(ix,iy,4)=0.d0
+          do ip=1,lx1
+            y(ix,iy,4)=y(ix,iy,4)+map4(ip)*x2(ix,iy,ip)
+          end do
+
+          y(ix,iy,lx1)=x2(ix,iy,lx1)
+        end do
+      end do
+
+      return
+      end      
+
+c-----------------------------------------------------------------
+      subroutine remapy(x1,x2,y)
+c-----------------------------------------------------------------
+c     y direction mapping after the merge.
+c     Map solution from x1 & x2 to y.
+c-----------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision x1(lx1,lx1,lx1),x2(lx1,lx1,lx1),y(lx1,lx1,lx1)
+      integer ix, iz, ip
+
+      do iz=1,lx1
+        do ix=1,lx1
+          y(ix,1,iz)=x1(ix,1,iz)
+
+          y(ix,2,iz)=0.d0
+          do ip=1,lx1
+            y(ix,2,iz)=y(ix,2,iz)+map2(ip)*x1(ix,ip,iz)
+          end do
+
+          y(ix,3,iz)=x1(ix,lx1,iz)
+
+          y(ix,4,iz)=0.d0
+          do ip=1,lx1
+            y(ix,4,iz)=y(ix,4,iz)+map4(ip)*x2(ix,ip,iz)
+          end do
+
+          y(ix,lx1,iz)=x2(ix,lx1,iz)
+        end do
+      end do
+
+      return
+      end      
+
+c-----------------------------------------------------------------
+      subroutine remapx(x1,x2,y)
+c-----------------------------------------------------------------
+c     x direction mapping after the merge.
+c     Map solution from x1 & x2 to y.
+c-----------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision x1(lx1,lx1,lx1),x2(lx1,lx1,lx1),y(lx1,lx1,lx1)
+      integer iy, iz, ip
+
+      do iz=1,lx1
+        do iy=1,lx1
+          y(1,iy,iz)=x1(1,iy,iz)
+
+          y(2,iy,iz)=0.d0
+          do ip=1,lx1
+            y(2,iy,iz)=y(2,iy,iz)+map2(ip)*x1(ip,iy,iz)
+          end do
+
+          y(3,iy,iz)=x1(lx1,iy,iz)
+
+          y(4,iy,iz)=0.d0
+          do ip=1,lx1
+            y(4,iy,iz)=y(4,iy,iz)+map4(ip)*x2(ip,iy,iz)
+          end do
+
+          y(lx1,iy,iz)=x2(lx1,iy,iz)
+        end do
+      end do
+
+      return
+      end      
+       
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/convect.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/convect.f
new file mode 100644
index 0000000..79a1044
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/convect.f
@@ -0,0 +1,226 @@
+c---------------------------------------------------------
+      subroutine convect(ifmortar)  
+c---------------------------------------------------------
+c     Advance the convection term using 4th order RK
+c     1.ta1 is solution from last time step 
+c     2.the heat source is considered part of d/dx
+c     3.trhs is right hand side for the diffusion equation
+c     4.tmor is solution on mortar points, which will be used
+c       as the initial guess when advancing the diffusion term 
+c---------------------------------------------------------
+
+      include 'header.h'
+
+      double precision alpha2, tempa(lx1,lx1,lx1), 
+     &       rdtime, pidivalpha, sixth,
+     &       dtx1, dtx2, dtx3, src, rk1(lx1,lx1,lx1), rk2(lx1,lx1,lx1),
+     &       rk3(lx1,lx1,lx1), rk4(lx1,lx1,lx1), temp(lx1,lx1,lx1), 
+     &       subtime(3), xx0(3), yy0(3), zz0(3), dtime2, r2, sum,
+     &       xloc(lx1), yloc(lx1), zloc(lx1)
+      integer k,iel,i,j,iside,isize, substep, ip
+      logical ifmortar
+      parameter (sixth=1.d0/6.d0)
+
+      if (timeron) call timer_start(t_convect)
+      pidivalpha = dacos(-1.d0)/alpha
+      alpha2     = alpha*alpha
+      dtime2     = dtime/2.d0 
+      rdtime     = 1.d0/dtime
+      subtime(1) = time
+      subtime(2) = time+dtime2
+      subtime(3) = time+dtime
+      do substep = 1, 3
+        xx0(substep) = x00+velx*subtime(substep)
+        yy0(substep) = y00+vely*subtime(substep)
+        zz0(substep) = z00+velz*subtime(substep)
+      end do
+
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(rk4,rk3,rk2,temp,rk1,dtx3,
+c$OMP$ dtx2,dtx1,iside,ip,sum,src,r2,i,j,k,isize,iel,tempa,
+c$OMP$ xloc,yloc,zloc)
+
+      do iel = 1, nelt
+        isize=size_e(iel)
+c.......xloc(i) is the location of i'th collocation point in x direction in an element.
+c       yloc(j) is the location of j'th collocation point in y direction in an element.
+c       zloc(k) is the location of k'th collocation point in z direction in an element.
+        do i = 1, lx1
+          xloc(i) = xfrac(i)*(xc(2,iel)-xc(1,iel))+xc(1,iel)
+        end do
+        do j = 1, lx1
+          yloc(j) = xfrac(j)*(yc(4,iel)-yc(1,iel))+yc(1,iel)
+        end do
+        do k = 1, lx1
+          zloc(k) = xfrac(k)*(zc(5,iel)-zc(1,iel))+zc(1,iel)
+        end do
+
+        do k = 1, lx1
+          do j = 1, lx1
+            do i = 1, lx1
+              r2 = (xloc(i)-xx0(1))**2+(yloc(j)-yy0(1))**2+
+     &             (zloc(k)-zz0(1))**2
+              if (r2.le.alpha2) then
+                src = dcos(dsqrt(r2)*pidivalpha)+1.d0
+              else
+                src = 0.d0
+              endif
+              sum = 0.d0
+              do ip = 1, lx1
+                sum = sum + dxm1(i,ip) * ta1(ip,j,k,iel)
+              end do
+              dtx1 = -velx*sum*xrm1_s(i,j,k,isize)
+              sum  = 0.d0
+              do ip = 1, lx1
+                sum = sum + dxm1(j,ip) * ta1(i,ip,k,iel)
+              end do
+              dtx2=-vely*sum*xrm1_s(i,j,k,isize)
+              sum = 0.d0
+              do ip = 1, lx1
+                sum = sum + dxm1(k,ip) * ta1(i,j,ip,iel)
+              end do
+              dtx3=-velz*sum*xrm1_s(i,j,k,isize)
+
+              rk1(i,j,k)= dtx1 + dtx2 + dtx3 + src
+              temp(i,j,k)=ta1(i,j,k,iel)+dtime2*rk1(i,j,k)
+
+            end do
+          end do
+        end do        
+
+        do k = 1, lx1
+          do j = 1, lx1
+            do i = 1, lx1
+              r2 = (xloc(i)-xx0(2))**2 + (yloc(j)-yy0(2))**2 +
+     &             (zloc(k)-zz0(2))**2
+              if (r2.le.alpha2) then
+                src = dcos(dsqrt(r2)*pidivalpha)+1.d0
+              else
+                src = 0.d0
+              endif
+              sum = 0.d0
+              do ip = 1, lx1
+                sum = sum + dxm1(i,ip) * temp(ip,j,k)
+              end do
+              dtx1 =-velx*sum*xrm1_s(i,j,k,isize)
+              sum = 0.d0
+              do ip = 1, lx1
+                sum = sum + dxm1(j,ip) * temp(i,ip,k)
+              end do
+              dtx2 =-vely*sum*xrm1_s(i,j,k,isize)
+              sum = 0.d0
+              do ip = 1, lx1
+                sum = sum + dxm1(k,ip) * temp(i,j,ip)
+              end do
+              dtx3 =-velz*sum*xrm1_s(i,j,k,isize)
+
+              rk2(i,j,k)= dtx1 + dtx2 + dtx3 + src
+              tempa(i,j,k)=ta1(i,j,k,iel)+dtime2*rk2(i,j,k)
+            end do
+          end do
+        end do        
+
+        do k = 1, lx1
+          do j = 1, lx1
+            do i = 1, lx1
+              r2 = (xloc(i)-xx0(2))**2 + (yloc(j)-yy0(2))**2 +
+     &             (zloc(k)-zz0(2))**2
+              if (r2.le.alpha2) then
+                src = dcos(dsqrt(r2)*pidivalpha)+1.d0
+              else
+                src = 0.d0
+              endif
+              sum = 0.d0
+              do ip = 1, lx1
+                sum = sum + dxm1(i,ip) * tempa(ip,j,k)
+              end do
+              dtx1 =-velx*sum*xrm1_s(i,j,k,isize)
+              sum = 0.d0
+              do ip = 1, lx1
+                sum = sum + dxm1(j,ip) * tempa(i,ip,k)
+              end do
+              dtx2 =-vely*sum*xrm1_s(i,j,k,isize)
+              sum = 0.d0
+              do ip = 1, lx1
+                sum = sum + dxm1(k,ip) * tempa(i,j,ip)
+              end do
+              dtx3 =-velz*sum*xrm1_s(i,j,k,isize)
+
+              rk3(i,j,k)= dtx1 + dtx2 + dtx3 + src
+              temp(i,j,k)=ta1(i,j,k,iel)+dtime*rk3(i,j,k)
+            end do
+          end do
+        end do        
+
+        do k = 1, lx1
+          do j = 1, lx1
+            do i = 1, lx1
+              r2 = (xloc(i)-xx0(3))**2 + (yloc(j)-yy0(3))**2 +
+     &             (zloc(k)-zz0(3))**2
+              if (r2.le.alpha2) then
+                src = dcos(dsqrt(r2)*pidivalpha)+1.d0
+              else
+                src = 0.d0
+              endif
+              sum = 0.d0
+              do ip = 1, lx1
+                sum = sum + dxm1(i,ip) * temp(ip,j,k)
+              end do
+              dtx1 =-velx*sum*xrm1_s(i,j,k,isize)
+              sum = 0.d0
+              do ip = 1, lx1
+                sum = sum + dxm1(j,ip) * temp(i,ip,k)
+              end do
+              dtx2 =-vely*sum*xrm1_s(i,j,k,isize)
+              sum = 0.d0
+              do ip = 1, lx1
+                sum = sum + dxm1(k,ip) * temp(i,j,ip)
+              end do
+              dtx3 =-velz*sum*xrm1_s(i,j,k,isize)
+
+              rk4(i,j,k)= dtx1 + dtx2 + dtx3 + src
+              tempa(i,j,k)=sixth*(rk1(i,j,k)+2.d0*
+     &                   rk2(i,j,k)+2.d0*rk3(i,j,k)+rk4(i,j,k))
+            end do
+          end do
+        end do        
+
+c.......apply boundary condition
+        do iside=1,nsides
+          if(cbc(iside,iel).eq.0)then
+            call facev(tempa,iside,0.0d0)
+          end if
+        end do
+          
+        do k=1,lx1
+          do j=1,lx1
+            do i=1,lx1
+              trhs(i,j,k,iel)=bm1_s(i,j,k,isize)*(ta1(i,j,k,iel)*rdtime+
+     &                        tempa(i,j,k))
+              ta1(i,j,k,iel)=ta1(i,j,k,iel)+tempa(i,j,k)*dtime
+            end do
+          end do
+        end do
+
+      end do 
+c$OMP END PARALLEL DO
+
+c.....get mortar for initial guess for CG
+
+      if (timeron) call timer_start(t_transfb_c)
+      if(ifmortar)then
+        call transfb_c_2(ta1)
+      else
+        call transfb_c(ta1)
+      end if
+      if (timeron) call timer_stop(t_transfb_c)
+
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i)
+      do i=1,nmor
+       tmort(i)=tmort(i)/mormult(i)
+      end do
+c$OMP END PARALLEL DO
+      if (timeron) call timer_stop(t_convect)
+
+      return
+      end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/diffuse.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/diffuse.f
new file mode 100644
index 0000000..718a1dd
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/diffuse.f
@@ -0,0 +1,243 @@
+c---------------------------------------------------------------------
+      subroutine diffusion(ifmortar)      
+c---------------------------------------------------------------------
+c     advance the diffusion term using CG iterations
+c---------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision  rho_aux, rho1, rho2, beta, cona
+      logical ifmortar
+      integer iter,ie, im,iside,i,j,k
+
+      if (timeron) call timer_start(t_diffusion)
+c.....set up diagonal preconditioner
+      if (ifmortar) then
+        call setuppc
+        call setpcmo
+      end if
+
+c.....arrays t and umor are accumulators of (am pm) in the CG algorithm
+c     (see the specification)
+
+      call r_init_omp(t,ntot,0.d0)
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i)
+      do i=1,nmor
+        umor(i)=0.d0
+      end do
+c$OMP END PARALLEL DO
+
+c.....calculate initial am (see specification) in CG algorithm
+
+c.....trhs and rmor are combined to generate r0 in CG algorithm.
+c     pdiff and pmorx are combined to generate q0 in the CG algorithm.
+c     rho1 is  (qm,rm) in the CG algorithm.
+
+      rho1 = 0.d0
+c$OMP PARALLEL DEFAULT(SHARED) PRIVATE(im,ie,i,j,k) REDUCTION(+:rho1)
+c$OMP DO
+       do ie=1,nelt
+         do k=1,lx1
+           do j=1,lx1
+             do i=1,lx1
+               pdiff(i,j,k,ie) = dpcelm(i,j,k,ie)*trhs(i,j,k,ie)
+               rho1            = rho1 + trhs(i,j,k,ie)*pdiff(i,j,k,ie)*
+     &                                          tmult(i,j,k,ie)
+             end do
+           end do
+         end do
+       end do
+c$OMP END DO nowait
+
+c$OMP DO
+      do im = 1, nmor
+        pmorx(im) = dpcmor(im)*rmor(im)
+        rho1      = rho1 + rmor(im)*pmorx(im)
+      end do
+c$OMP END DO nowait
+c$OMP END PARALLEL
+
+c.................................................................
+c     commence conjugate gradient iteration
+c.................................................................
+
+      do iter=1, nmxh
+        if(iter.gt.1) then 
+          rho_aux = 0.d0
+c$OMP PARALLEL DEFAULT(SHARED) PRIVATE(im,ie,i,j,k) REDUCTION(+:rho_aux)
+c$OMP DO
+c.........pdiffp and ppmor are combined to generate q_m+1 in the specification
+c         rho_aux is (q_m+1,r_m+1)
+          do ie = 1, nelt
+            do k=1,lx1
+              do j=1,lx1
+                do i=1,lx1
+                  pdiffp(i,j,k,ie) = dpcelm(i,j,k,ie)*trhs(i,j,k,ie)
+                  rho_aux =rho_aux+trhs(i,j,k,ie)*pdiffp(i,j,k,ie)*
+     &                                            tmult(i,j,k,ie)
+                end do
+              end do
+            end do
+          end do
+c$OMP END DO nowait
+c$OMP DO
+          do im = 1, nmor
+            ppmor(im) = dpcmor(im)*rmor(im)
+            rho_aux = rho_aux + rmor(im)*ppmor(im)
+          end do
+c$OMP END DO nowait
+c$OMP END PARALLEL
+
+c.........compute bm (beta) in the specification
+          rho2 = rho1
+          rho1 = rho_aux
+          beta = rho1/rho2
+c.........update p_m+1 in the specification
+          call adds1m1(pdiff, pdiffp, beta,ntot)
+          call adds1m1(pmorx, ppmor,  beta, nmor)  
+        end if
+ 
+c.......compute matrix vector product: (theta pm) in the specification
+
+        if (timeron) call timer_start(t_transf)
+        call transf(pmorx,pdiff) 
+        if (timeron) call timer_stop(t_transf)
+
+c.......compute pdiffp which is (A theta pm) in the specification
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(ie) 
+        do ie=1, nelt
+          call laplacian(pdiffp(1,1,1,ie),pdiff(1,1,1,ie),size_e(ie))
+        end do
+c$OMP END PARALLEL DO
+
+c.......compute ppmor which will be used to compute (thetaT A theta pm) 
+c       in the specification
+        if (timeron) call timer_start(t_transfb)
+        call transfb(ppmor,pdiffp) 
+        if (timeron) call timer_stop(t_transfb)
+ 
+c.......apply boundary condition
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(ie,iside)
+        do ie=1,nelt
+          do iside=1,nsides
+            if(cbc(iside,ie).eq.0)then
+              call facev(pdiffp(1,1,1,ie),iside,0.d0)
+            end if
+          end do
+        end do
+c$OMP END PARALLEL DO
+
+c.......compute cona which is (pm,theta T A theta pm)
+        cona = 0.d0
+c$OMP PARALLEL DEFAULT(SHARED) PRIVATE(im,ie,i,j,k) REDUCTION(+:cona)
+c$OMP DO
+        do ie = 1, nelt
+          do k=1,lx1
+            do j=1,lx1
+              do i=1,lx1
+                cona = cona + 
+     &          pdiff(i,j,k,ie)*pdiffp(i,j,k,ie)*tmult(i,j,k,ie)
+              end do 
+             end do 
+          end do 
+        end do 
+c$OMP END DO nowait
+c$OMP DO
+        do im = 1, nmor
+          ppmor(im) = ppmor(im)*tmmor(im)
+          cona = cona + pmorx(im)*ppmor(im)
+        end do
+c$OMP END DO nowait
+c$OMP END PARALLEL
+
+c.......compute am
+        cona = rho1/cona
+c.......compute (am pm)
+        call adds2m1(t,    pdiff,   cona, ntot)
+        call adds2m1(umor, pmorx,   cona, nmor) 
+c.......compute r_m+1
+        call adds2m1(trhs, pdiffp, -cona, ntot)
+        call adds2m1(rmor, ppmor,  -cona, nmor) 
+ 
+      end do
+
+      if (timeron) call timer_start(t_transf)
+      call transf(umor,t)  
+      if (timeron) call timer_stop(t_transf)
+      if (timeron) call timer_stop(t_diffusion)
+
+      return
+      end
+
+
+c------------------------------------------------------------------
+      subroutine laplacian(r,u,sizei)
+c------------------------------------------------------------------
+c     compute  r = visc*[A]x +[B]x on a given element.
+c------------------------------------------------------------------
+      include 'header.h'
+
+      double precision r(lx1,lx1,lx1), u(lx1,lx1,lx1), rdtime
+      integer i,j,k, ix,iz, sizei
+
+      double precision tm1(lx1,lx1,lx1),tm2(lx1,lx1,lx1)                     
+
+      rdtime = 1.d0/dtime
+
+      call r_init(tm1,nxyz,0.d0)
+      do iz=1,lx1                     
+        do k = 1, lx1
+          do j = 1, lx1
+            do i = 1, lx1
+              tm1(i,j,iz) = tm1(i,j,iz)+wdtdr(i,k)*u(k,j,iz)
+            end do
+          end do
+        end do                           
+      end do
+              
+      call r_init(tm2,nxyz,0.d0)                                                   
+      do iz=1,lx1                                            
+        do k = 1, lx1
+          do j = 1, lx1
+            do i = 1, lx1
+              tm2(i,j,iz) = tm2(i,j,iz)+u(i,k,iz)*wdtdr(k,j)
+            end do
+          end do
+        end do
+      end do
+                                                            
+      call r_init(r,nxyz,0.d0)   
+      do k = 1, lx1
+        do iz=1, lx1    
+          do j = 1, lx1
+            do i = 1, lx1
+              r(i,j,iz) = r(i,j,iz)+u(i,j,k)*wdtdr(k,iz)
+            end do
+          end do
+        end do
+      end do
+
+c.....collocate with remaining weights and sum to complete factorization.                   
+                                                      
+c      do ix=1,nxyz                                            
+c         r(ix,1,1)=visc*(tm1(ix,1,1)*g4m1_s(ix,1,1,sizei)+
+c     &                   tm2(ix,1,1)*g5m1_s(ix,1,1,sizei)+
+c     &                     r(ix,1,1)*g6m1_s(ix,1,1,sizei))+
+c     &               bm1_s(ix,1,1,sizei)*rdtime*u(ix,1,1)             
+c      end do
+      do k=1,lx1
+        do j=1,lx1
+          do i=1,lx1
+            r(i,j,k)=visc*(tm1(i,j,k)*g4m1_s(i,j,k,sizei)+
+     &                   tm2(i,j,k)*g5m1_s(i,j,k,sizei)+
+     &                    r(i,j,k)*g6m1_s(i,j,k,sizei))+
+     &               bm1_s(i,j,k,sizei)*rdtime*u(i,j,k)             
+          end do
+        end do
+      end do
+
+      return                                                  
+      end                                                    
+
+
+ 
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/header.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/header.h
new file mode 100644
index 0000000..562cbb6
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/header.h
@@ -0,0 +1,188 @@
+      implicit none
+
+      include 'npbparams.h'
+
+c.....Array dimensions     
+      integer lx1, lnje, nsides, nxyz
+      parameter(lx1=5, lnje=2, nsides=6,  nxyz=lx1*lx1*lx1)
+
+      integer fre, niter, nmxh
+      double precision alpha, dlmin, dtime
+      common /usrdati/ fre, niter, nmxh
+      common /usrdatr/ alpha, dlmin, dtime
+
+      integer nelt, ntot, nmor, nvertex
+      common /dimn/ nelt,ntot, nmor, nvertex
+
+      double precision x0, y0, z0, time
+      common /bench1/time, x0, y0, z0
+
+      double precision velx, vely, velz, visc, x00, y00, z00
+      parameter(velx=3.d0, vely=3.d0, velz=3.d0)
+      parameter(visc=0.005d0)
+      parameter(x00=3.d0/7.d0, y00=2.d0/7.d0, z00=2.d0/7.d0)
+
+c.....double precision arrays associated with collocation points
+      double precision
+     &       ta1  (lx1,lx1,lx1,lelt), ta2   (lx1,lx1,lx1,lelt),
+     &       trhs (lx1,lx1,lx1,lelt), t     (lx1,lx1,lx1,lelt), 
+     &       tmult(lx1,lx1,lx1,lelt), dpcelm(lx1,lx1,lx1,lelt), 
+     &       pdiff(lx1,lx1,lx1,lelt), pdiffp(lx1,lx1,lx1,lelt)
+      common /colldp/ ta1, ta2, trhs, t, tmult, dpcelm, pdiff, pdiffp
+
+c.....double precision arrays associated with mortar points
+      double precision
+     &       umor(lmor), mormult(lmor), tmort(lmor), tmmor(lmor), 
+     &       rmor(lmor), dpcmor (lmor), pmorx(lmor), ppmor(lmor) 
+      common /mortdp/ umor, mormult, tmort,tmmor, rmor, dpcmor, 
+     &                pmorx, ppmor
+
+c.... integer arrays associated with element faces
+      integer idmo    (lx1,lx1,lnje,lnje,nsides,lelt), 
+     &        idel    (lx1,lx1,          nsides,lelt), 
+     &        sje     (2,2,              nsides,lelt), 
+     &        sje_new (2,2,              nsides,lelt),
+     &        ijel    (2,                nsides,lelt), 
+     &        ijel_new(2,                nsides,lelt),
+     &        cbc     (                  nsides,lelt), 
+     &        cbc_new (                  nsides,lelt) 
+      common /facein/ idmo, ijel, idel, ijel_new, sje, sje_new, cbc,
+     &               cbc_new
+
+c.....integer array associated with vertices
+      integer vassign  (8,lelt),      emo(2,8,8*lelt),   
+     &        nemo (8*    lelt)
+      common /vin/vassign, emo, nemo
+
+c.....integer array associated with element edges
+      integer diagn  (2,12,lelt) 
+      common /edgein/diagn 
+
+c.... integer arrays associated with elements
+      integer tree (      lelt), mt_to_id    (     lelt),                   
+     &        newc (      lelt), mt_to_id_old(     lelt),
+     &        newi (      lelt), id_to_mt    (     lelt), 
+     &        newe (      lelt), ref_front_id(     lelt),
+     &        front(      lelt), action      (     lelt), 
+     &        ich  (      lelt), size_e      (     lelt),
+     &        treenew     (     lelt)
+      common /eltin/ tree, treenew,mt_to_id,mt_to_id_old,
+     &               id_to_mt, newc, newi, newe, ref_front_id, 
+     &               ich, size_e, front, action
+
+c.....logical arrays associated with vertices
+      logical ifpcmor  (8* lelt)
+      common /vlg/ ifpcmor
+
+c.....logical arrays associated with edge
+      logical eassign  (12,lelt),  if_1_edge(12,lelt), 
+     &        ncon_edge(12,lelt)
+      common /edgelg/ eassign,  ncon_edge, if_1_edge
+
+c.....logical arrays associated with elements
+      logical skip (lelt), ifcoa   (lelt), ifcoa_id(lelt)
+      common /facelg/ skip, ifcoa, ifcoa_id
+
+c.....logical arrays associated with element faces
+      logical fassign(nsides,lelt), edgevis(4,nsides,lelt)      
+      common /masonl/ fassign, edgevis
+
+c.....small arrays
+      double precision qbnew(lx1-2,lx1,2), bqnew(lx1-2,lx1-2,2)
+      common /transr/ qbnew,bqnew
+
+      double precision
+     &       pcmor_nc1(lx1,lx1,2,2,refine_max),
+     $       pcmor_nc2(lx1,lx1,2,2,refine_max),
+     $       pcmor_nc0(lx1,lx1,2,2,refine_max),
+     $       pcmor_c(lx1,lx1,refine_max), tcpre(lx1,lx1),
+     $       pcmor_cor(8,refine_max)
+      common /pcr/ pcmor_nc1,pcmor_c,pcmor_nc0,pcmor_nc2,tcpre, 
+     $             pcmor_cor
+
+c.....Gauss-Lobatto and Gauss points
+      double precision zgm1(lx1)
+      common /gauss/ zgm1
+
+c.....weights
+      double precision wxm1(lx1),w3m1(lx1,lx1,lx1)
+      common /wxyz/ wxm1,w3m1
+
+c.....coordinate of element vertices
+      double precision xc(8,lelt),yc(8,lelt),zc(8,lelt),
+     $       xc_new(8,lelt),yc_new(8,lelt),zc_new(8,lelt)
+      common /coord/ xc,yc,zc,xc_new,yc_new,zc_new
+
+c.....dr/dx, dx/dr  and Jacobian
+      double precision jacm1_s(lx1,lx1,lx1,refine_max), 
+     $       rxm1_s(lx1,lx1,lx1,refine_max),
+     $       xrm1_s(lx1,lx1,lx1,refine_max)
+      common /giso/ jacm1_s,xrm1_s, rxm1_s 
+
+c.....mass matrices (diagonal)
+      double precision bm1_s(lx1,lx1,lx1,refine_max)
+      common /mass/ bm1_s
+
+c.....derivative matrices d/dr
+      double precision dxm1(lx1,lx1), dxtm1(lx1,lx1), wdtdr(lx1,lx1)
+      common /dxyz/ dxm1,dxtm1,wdtdr
+
+c.....interpolation operators
+      double precision
+     $       ixm31(lx1,lx1*2-1), ixtm31(lx1*2-1,lx1), ixmc1(lx1,lx1),  
+     $       ixtmc1(lx1,lx1), ixmc2(lx1,lx1),  ixtmc2(lx1,lx1),
+     $       map2(lx1),map4(lx1)
+      common /ixyz/ ixmc1,ixtmc1,ixmc2,ixtmc2,ixm31,ixtm31,map2,map4
+
+c.....collocation location within an element
+      double precision xfrac
+      common /xfracs/xfrac(lx1)
+
+c.....used in laplacian operator
+      double precision g1m1_s(lx1,lx1,lx1,refine_max), 
+     $       g4m1_s(lx1,lx1,lx1,refine_max),
+     $       g5m1_s(lx1,lx1,lx1,refine_max),
+     $       g6m1_s(lx1,lx1,lx1,refine_max)
+      common /gmfact/ g1m1_s,g4m1_s,g5m1_s, g6m1_s
+      
+c.....We store some tables of useful topological constants
+c     These constants are initialized in a block data 'top_constants'
+      integer f_e_ef(4,6)
+      integer e_c(3,8)
+      integer local_corner(8,6)
+      integer cal_nnb(3,8)
+      integer oplc(4)
+      integer cal_iijj(2,4)
+      integer cal_intempx(4,6)
+      integer c_f(4,6)
+      integer le_arr(4,0:1,3)
+      integer jjface(6)
+      integer e_face2(4,6)
+      integer op(4)
+      integer localedgenumber(6,12)
+      integer edgenumber(4,6)
+      integer f_c(3,8)
+      integer e1v1(6,6),e2v1(6,6),e1v2(6,6),e2v2(6,6)
+      integer children(4,6)
+      integer iijj(2,4)
+      integer v_end(2)
+      integer face_l1(3),face_l2(3),face_ld(3)
+      common /top_consts/ f_e_ef,e_c,local_corner,cal_nnb,oplc,
+     $       cal_iijj,cal_intempx,c_f,le_arr,jjface,e_face2,op,
+     $       localedgenumber,edgenumber,f_c,e1v1,e2v1,e1v2,e2v2,
+     $       children,iijj,v_end,face_l1,face_l2,face_ld
+
+c ... Timer parameters
+      integer t_total,t_init,t_convect,t_transfb_c,
+     &        t_diffusion,t_transf,t_transfb,t_adaptation,
+     &        t_transf2,t_add2,t_last
+      parameter (t_total=1,t_init=2,t_convect=3,t_transfb_c=4,
+     &        t_diffusion=5,t_transf=6,t_transfb=7,t_adaptation=8,
+     &        t_transf2=9,t_add2=10,t_last=10)
+      logical timeron
+      common /timing/timeron
+
+c.....Locks used for atomic updates
+cc    integer (kind=omp_lock_kind) tlock(lmor)
+c$    integer*8 tlock(lmor)
+c$    common /sync_cmn/ tlock
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/mason.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/mason.f
new file mode 100644
index 0000000..058f0ca
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/mason.f
@@ -0,0 +1,2272 @@
+c-----------------------------------------------------------------
+      subroutine mortar
+c-----------------------------------------------------------------
+c     generate mortar point index number 
+c-----------------------------------------------------------------
+      include 'header.h'
+
+      integer count, iel, jface, ntemp, i, ii, jj, ntemp1,
+     &        iii, jjj, face2, ne, ie, edge_g, ie2,
+     &        mor_v(3), cb, cb1, cb2, cb3, cb4, cb5, cb6,
+     &        space, sumcb, ij1, ij2, n1, n2, n3, n4, n5
+
+      n1=lx1*lx1*6*4*nelt
+      n2=8*nelt
+      n3=2*64*nelt
+      n4=12*nelt
+      n5=2*12*nelt
+
+      call nr_init_omp(idmo,n1,0)
+      call nr_init_omp(nemo,n2,0)
+      call nr_init_omp(vassign,n2,0)
+      call nr_init_omp(emo,n3,0)
+      call  l_init_omp(if_1_edge,n4,.false.)
+      call nr_init_omp(diagn,n5,0)
+c.....Mortar points indices are generated in two steps: first generate 
+c     them for all element vertices (corner points), then for conforming 
+c     edge and conforming face interiors. Each time a new mortar index 
+c     is generated for a mortar point, it is broadcast to all elements 
+c     sharing this mortar point. 
+
+c.....VERTICES
+      count=0
+
+c.....assign mortar point indices to element vertices
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(iel,sumcb,ij1,ij2,
+c$OMP& cb,cb1,cb2,ntemp,ntemp1)
+
+      do iel=1,nelt
+
+c.......first calculate how many new mortar indices will be generated for 
+c       each element.
+
+c.......For each element, at least one vertex (vertex 8) will be a new mortar
+c       point. All possible new mortar points will be on face 2,4 or 6. By
+c       checking the type of these three faces, we are able to tell
+c       how many new mortar vertex points will be generated in each element.
+
+        cb=cbc(6,iel)
+        cb1=cbc(4,iel)
+        cb2=cbc(2,iel)
+
+c.......For different combinations of the type of these three faces,
+c       we group them into 27 configurations.
+c       For different face types we assign the following integers:
+c              1 for type 2 or 3
+c              2 for type 0
+c              5 for type 1
+c       By summing these integers for faces 2,4 and 6, sumcb will have 
+c       10 different numbers indicating 10 different combinations. 
+
+        sumcb=0
+        if(cb.eq.2.or.cb.eq.3)then
+          sumcb=sumcb+1
+        elseif(cb.eq.0)then
+          sumcb=sumcb+2
+        elseif(cb.eq.1)then
+          sumcb=sumcb+5
+        end if
+        if(cb1.eq.2.or.cb1.eq.3)then
+          sumcb=sumcb+1
+        elseif(cb1.eq.0)then
+          sumcb=sumcb+2
+        elseif(cb1.eq.1)then
+          sumcb=sumcb+5
+        end if
+        if(cb2.eq.2.or.cb2.eq.3)then
+          sumcb=sumcb+1
+        elseif(cb2.eq.0)then
+          sumcb=sumcb+2
+        elseif(cb2.eq.1)then
+          sumcb=sumcb+5
+        end if
+
+c.......compute newc(iel)
+c       newc(iel) records how many new mortar indices will be generated
+c                 for element iel
+c       vassign(i,iel) records the element vertex of the i'th new mortar 
+c                 vertex point for element iel. e.g. vassign(2,iel)=8 means
+c                 the 2nd new mortar vertex point generated on element
+c                 iel is iel's 8th vertex.
+ 
+        if(sumcb.eq.3)then
+c.......the three face types for face 2,4, and 6 are 2 2 2
+          newc(iel)=1
+          vassign(1,iel)=8
+          
+        elseif(sumcb.eq.4)then
+c.......the three face types for face 2,4 and 6 are 2 2 0 (not 
+c       necessarily in this order)
+          newc(iel)=2
+          if(cb.eq.0)then
+            vassign(1,iel)=4
+          elseif(cb1.eq.0)then
+            vassign(1,iel)=6
+          elseif(cb2.eq.0)then
+            vassign(1,iel)=7
+          end if
+          vassign(2,iel)=8
+
+        elseif(sumcb.eq.7)then
+c.......the three face types for face 2,4 and 6 are 2 2 1 (not 
+c       necessarily in this order)
+          if(cb.eq.1)then
+            ij1=ijel(1,6,iel)
+            ij2=ijel(2,6,iel)
+            if(ij1.eq.1.and.ij2.eq.1)then
+              newc(iel)=2
+              vassign(1,iel)=4
+              vassign(2,iel)=8
+            elseif(ij1.eq.1.and.ij2.eq.2)then
+              ntemp=sje(1,1,6,iel)
+              if(cbc(1,ntemp).eq.3.and.sje(1,1,1,ntemp).lt.iel)then
+                newc(iel)=1
+                vassign(1,iel)=8
+              else
+                newc(iel)=2
+                vassign(1,iel)=4
+                vassign(2,iel)=8
+              end if
+            elseif(ij1.eq.2.and.ij2.eq.1)then
+              ntemp=sje(1,1,6,iel)
+              if(cbc(3,ntemp).eq.3.and.sje(1,1,3,ntemp).lt.iel)then
+                newc(iel)=1
+                vassign(1,iel)=8
+              else
+                newc(iel)=2
+                vassign(1,iel)=4
+                vassign(2,iel)=8
+              endif
+            else
+              newc(iel)=1
+              vassign(1,iel)=8
+            end if
+          elseif(cb1.eq.1)then
+            ij1=ijel(1,4,iel)
+            ij2=ijel(2,4,iel)
+            if(ij1.eq.1.and.ij2.eq.1)then
+              newc(iel)=2
+              vassign(1,iel)=6
+              vassign(2,iel)=8
+            elseif(ij1.eq.1.and.ij2.eq.2)then
+              ntemp=sje(1,1,4,iel)
+              if(cbc(1,ntemp).eq.3.and.sje(1,1,1,ntemp).lt.iel)then
+                newc(iel)=1
+                vassign(1,iel)=8
+              else
+                newc(iel)=2
+                vassign(1,iel)=6
+                vassign(2,iel)=8
+              endif
+            elseif(ij1.eq.2.and.ij2.eq.1)then
+              ntemp=sje(1,1,4,iel)
+              if(cbc(5,ntemp).eq.3.and.sje(1,1,5,ntemp).lt.iel)then
+                newc(iel)=1
+                vassign(1,iel)=8
+              else
+                newc(iel)=2
+                vassign(1,iel)=6
+                vassign(2,iel)=8
+              endif
+            else
+              newc(iel)=1
+              vassign(1,iel)=8
+            end if
+
+          elseif(cb2.eq.1)then
+            ij1=ijel(1,2,iel)
+            ij2=ijel(2,2,iel)
+            if(ij1.eq.1.and.ij2.eq.1)then
+              newc(iel)=2
+              vassign(1,iel)=7
+              vassign(2,iel)=8
+            elseif(ij1.eq.1.and.ij2.eq.2)then
+              ntemp=sje(1,1,2,iel)
+              if(cbc(3,ntemp).eq.3.and.sje(1,1,3,ntemp).lt.iel)then
+                newc(iel)=1
+                vassign(1,iel)=8
+              else
+                newc(iel)=2
+                vassign(1,iel)=7
+                vassign(2,iel)=8
+              end if
+
+            elseif(ij1.eq.2.and.ij2.eq.1)then
+              ntemp=sje(1,1,2,iel)
+              if(cbc(5,ntemp).eq.3.and.sje(1,1,5,ntemp).lt.iel)then
+                newc(iel)=1
+                vassign(1,iel)=8
+              else
+                newc(iel)=2
+                vassign(1,iel)=7
+                vassign(2,iel)=8
+              end if
+            else
+              newc(iel)=1
+              vassign(1,iel)=8
+            end if
+          end if
+
+        elseif(sumcb.eq.5)then
+c.......the three face types for face 2,4 and 6 are 2/3 0 0 (not 
+c       necessarily in this order)
+          newc(iel)=4
+          if(cb.eq.2.or.cb.eq.3)then
+            vassign(1,iel)=5
+            vassign(2,iel)=6
+            vassign(3,iel)=7
+            vassign(4,iel)=8
+          elseif(cb1.eq.2.or.cb1.eq.3)then
+            vassign(1,iel)=3
+            vassign(2,iel)=4
+            vassign(3,iel)=7
+            vassign(4,iel)=8
+          elseif(cb2.eq.2.or.cb2.eq.3)then
+            vassign(1,iel)=2
+            vassign(2,iel)=4
+            vassign(3,iel)=6
+            vassign(4,iel)=8
+          end if
+
+        elseif(sumcb.eq.8)then
+c.......the three face types for face 2,4 and 6 are 2 0 1 (not 
+c       necessarily in this order)
+
+c.........if face 6 of type 1
+          if(cb.eq.1)then
+            if(cb1.eq.2.or.cb1.eq.3)then
+              ij1=ijel(1,6,iel)
+              if(ij1.eq.1)then
+                newc(iel)=4
+                vassign(1,iel)=3
+                vassign(2,iel)=4
+                vassign(3,iel)=7
+                vassign(4,iel)=8
+              else 
+                ntemp=sje(1,1,6,iel)
+                if(cbc(3,ntemp).eq.3.and.sje(1,1,3,ntemp).lt.iel)then
+                  newc(iel)=2
+                  vassign(1,iel)=7
+                  vassign(2,iel)=8
+                else
+                  newc(iel)=3
+                  vassign(1,iel)=4
+                  vassign(2,iel)=7
+                  vassign(3,iel)=8
+                end if
+              end if
+
+            elseif(cb2.eq.2.or.cb2.eq.3)then
+              if(ijel(2,6,iel).eq.1)then
+                newc(iel)=4
+                vassign(1,iel)=2
+                vassign(2,iel)=4
+                vassign(3,iel)=6
+                vassign(4,iel)=8
+              else
+                ntemp=sje(1,1,6,iel)
+                if(cbc(1,ntemp).eq.3.and.sje(1,1,1,ntemp).lt.iel)then
+                  newc(iel)=2
+                  vassign(1,iel)=6
+                  vassign(2,iel)=8
+                else
+                  newc(iel)=3
+                  vassign(1,iel)=4
+                  vassign(2,iel)=6
+                  vassign(3,iel)=8
+                end if
+              end if
+            end if
+
+c.........if face 4 of type 1
+          elseif(cb1.eq.1)then
+            if(cb.eq.2.or.cb.eq.3)then
+              ij1=ijel(1,4,iel)
+              ij2=ijel(2,4,iel)
+
+              if(ij1.eq.1.and.ij2.eq.1)then
+                ntemp=sje(1,1,4,iel)
+                if(cbc(2,ntemp).eq.3.and.sje(1,1,2,ntemp).lt.iel)then
+                  newc(iel)=3
+                  vassign(1,iel)=6
+                  vassign(2,iel)=7
+                  vassign(3,iel)=8
+                else
+                  newc(iel)=4
+                  vassign(1,iel)=5
+                  vassign(2,iel)=6
+                  vassign(3,iel)=7
+                  vassign(4,iel)=8
+                end if
+              elseif(ij1.eq.1.and.ij2.eq.2)then
+                ntemp=sje(1,1,4,iel)
+                if(cbc(1,ntemp).eq.3.and.sje(1,1,1,ntemp).lt.iel)then
+                  newc(iel)=3
+                  vassign(1,iel)=5
+                  vassign(2,iel)=7
+                  vassign(3,iel)=8
+                else
+                  newc(iel)=4
+                  vassign(1,iel)=5
+                  vassign(2,iel)=6
+                  vassign(3,iel)=7
+                  vassign(4,iel)=8
+                end if
+              elseif(ij1.eq.2.and.ij2.eq.1)then
+                ntemp=sje(1,1,4,iel)
+                if(cbc(5,ntemp).eq.3.and.sje(1,1,5,ntemp).lt.iel)then
+                  newc(iel)=2
+                  vassign(1,iel)=7
+                  vassign(2,iel)=8
+                else
+                  newc(iel)=3
+                  vassign(1,iel)=6
+                  vassign(2,iel)=7
+                  vassign(3,iel)=8
+                end if
+              elseif(ij1.eq.2.and.ij2.eq.2)then
+                ntemp=sje(1,1,4,iel)
+                if(cbc(5,ntemp).eq.3.and.sje(1,1,5,ntemp).lt.iel)then
+                  newc(iel)=2
+                  vassign(1,iel)=7
+                  vassign(2,iel)=8
+                else
+                  newc(iel)=3
+                  vassign(1,iel)=5
+                  vassign(2,iel)=7
+                  vassign(3,iel)=8
+                end if
+              end if
+            else 
+              if(ijel(2,4,iel).eq.1)then
+                newc(iel)=4
+                vassign(1,iel)=2
+                vassign(2,iel)=4
+                vassign(3,iel)=6
+                vassign(4,iel)=8
+              else
+                ntemp=sje(1,1,4,iel)
+                if(cbc(1,ntemp).eq.3.and.sje(1,1,1,ntemp).lt.iel)then
+                  newc(iel)=2
+                  vassign(1,iel)=4
+                  vassign(2,iel)=8
+                else
+                  newc(iel)=3
+                  vassign(1,iel)=4
+                  vassign(2,iel)=6
+                  vassign(3,iel)=8
+                end if
+              end if
+            endif
+c.........if face 6 of type 1
+          elseif(cb2.eq.1)then
+            if(cb.eq.2.or.cb.eq.3)then
+              if(ijel(1,2,iel).eq.1)then
+                newc(iel)=4
+                vassign(1,iel)=5
+                vassign(2,iel)=6
+                vassign(3,iel)=7
+                vassign(4,iel)=8
+              else
+                ntemp=sje(1,1,2,iel)
+                if(cbc(5,ntemp).eq.3.and.sje(1,1,5,ntemp).lt.iel)then
+                  newc(iel)=2
+                  vassign(1,iel)=6
+                  vassign(2,iel)=8
+                else
+                  newc(iel)=3
+                  vassign(1,iel)=6
+                  vassign(2,iel)=7
+                  vassign(3,iel)=8
+                end if
+              end if
+            else 
+              if(ijel(2,2,iel).eq.1)then
+                newc(iel)=4
+                vassign(1,iel)=3
+                vassign(2,iel)=4
+                vassign(3,iel)=7
+                vassign(4,iel)=8
+              else
+                ntemp=sje(1,1,2,iel)
+                if(cbc(3,ntemp).eq.3.and.sje(1,1,3,ntemp).lt.iel)then
+                  newc(iel)=2
+                  vassign(1,iel)=4
+                  vassign(2,iel)=8
+                else
+                  newc(iel)=3
+                  vassign(1,iel)=4
+                  vassign(2,iel)=7
+                  vassign(3,iel)=8
+                end if
+              end if
+            end if
+          end if
+
+        elseif(sumcb.eq.11)then
+c.......the three face types for face 2,4 and 6 are 2 1 1 (not
+c       necessarily in this order)
+          if(cb.eq.2.or.cb.eq.3)then
+            if(ijel(1,4,iel).eq.1)then
+              ntemp=sje(1,1,4,iel)
+              if(cbc(2,ntemp).eq.3.and.sje(1,1,2,ntemp).lt.iel)then
+                newc(iel)=3
+                vassign(1,iel)=6
+                vassign(2,iel)=7
+                vassign(3,iel)=8
+              else
+                newc(iel)=4
+                vassign(1,iel)=5
+                vassign(2,iel)=6
+                vassign(3,iel)=7
+                vassign(4,iel)=8
+              end if
+
+c...........if ijel(1,4,iel)=2
+            else
+              ntemp=sje(1,1,2,iel)
+              if(cbc(5,ntemp).eq.3.and.sje(1,1,5,ntemp).lt.iel)then
+                ntemp1=sje(1,1,4,iel)
+                if(cbc(5,ntemp1).eq.3.and.
+     &             sje(1,1,5,ntemp1).lt.iel)then
+                  newc(iel)=1
+                  vassign(1,iel)=8
+                else
+                  newc(iel)=2
+                  vassign(1,iel)=6
+                  vassign(2,iel)=8
+                end if
+              else
+                ntemp1=sje(1,1,4,iel)
+                if(cbc(5,ntemp1).eq.3.and.
+     &             sje(1,1,5,ntemp1).lt.iel)then
+                  newc(iel)=2
+                  vassign(1,iel)=7
+                  vassign(2,iel)=8
+                else
+                  newc(iel)=3
+                  vassign(1,iel)=6
+                  vassign(2,iel)=7
+                  vassign(3,iel)=8
+                end if
+              end if
+            end if
+          elseif(cb1.eq.2.or.cb1.eq.3)then
+            if(ijel(2,2,iel).eq.1)then
+              ntemp=sje(1,1,2,iel)
+              if(cbc(6,ntemp).eq.3.and.sje(1,1,6,ntemp).lt.iel)then
+                newc(iel)=3
+                vassign(1,iel)=4
+                vassign(2,iel)=7
+                vassign(3,iel)=8
+              else
+                newc(iel)=4
+                vassign(1,iel)=3
+                vassign(2,iel)=4
+                vassign(3,iel)=7
+                vassign(4,iel)=8
+              end if
+c...........if ijel(2,2,iel)=2
+            else
+              ntemp=sje(1,1,2,iel)
+              if(cbc(3,ntemp).eq.3.and.sje(1,1,3,ntemp).lt.iel)then
+                ntemp1=sje(1,1,6,iel)
+                if(cbc(3,ntemp1).eq.3.and.
+     &            sje(1,1,3,ntemp1).lt.iel)then
+                  newc(iel)=1
+                  vassign(1,iel)=8
+                else
+                  newc(iel)=2
+                  vassign(1,iel)=4
+                  vassign(2,iel)=8
+                end if
+              else
+                ntemp1=sje(1,1,6,iel)
+                if(cbc(3,ntemp1).eq.3.and.
+     &            sje(1,1,3,ntemp1).lt.iel)then
+                  newc(iel)=2
+                  vassign(1,iel)=7
+                  vassign(2,iel)=8
+                else
+                  newc(iel)=3
+                  vassign(1,iel)=4
+                  vassign(2,iel)=7
+                  vassign(3,iel)=8
+                end if
+              end if
+            end if
+          elseif(cb2.eq.2.or.cb2.eq.3)then
+            if(ijel(2,6,iel).eq.1)then
+              ntemp=sje(1,1,4,iel)
+              if(cbc(6,ntemp).eq.3.and.sje(1,1,6,ntemp).lt.iel)then
+                newc(iel)=3
+                vassign(1,iel)=4
+                vassign(2,iel)=6
+                vassign(3,iel)=8
+              else
+                newc(iel)=4
+                vassign(1,iel)=2
+                vassign(2,iel)=4
+                vassign(3,iel)=6
+                vassign(4,iel)=8
+              end if
+c...........if ijel(2,6,iel)=2
+            else
+              ntemp=sje(1,1,4,iel)
+              if(cbc(1,ntemp).eq.3.and.sje(1,1,1,ntemp).lt.iel)then
+                ntemp1=sje(1,1,6,iel)
+                if(cbc(1,ntemp1).eq.3.and.
+     &            sje(1,1,1,ntemp1).lt.iel)then
+                  newc(iel)=1
+                  vassign(1,iel)=8
+                else
+                  newc(iel)=2
+                  vassign(1,iel)=4
+                  vassign(2,iel)=8
+                end if
+              else
+                ntemp1=sje(1,1,6,iel)
+                if(cbc(1,ntemp1).eq.3.and.
+     &              sje(1,1,1,ntemp1).lt.iel)then
+                  newc(iel)=2
+                  vassign(1,iel)=6
+                  vassign(2,iel)=8
+                else
+                  newc(iel)=3
+                  vassign(1,iel)=4
+                  vassign(2,iel)=6
+                  vassign(3,iel)=8
+                end if
+              end if
+            end if
+
+          end if
+          
+        elseif(sumcb.eq.6)then
+c.......the three face types for face 2,4 and 6 are 0 0 0 (not
+c       necessarily in this order)
+          newc(iel)=8
+          vassign(1,iel)=1
+          vassign(2,iel)=2
+          vassign(3,iel)=3
+          vassign(4,iel)=4
+          vassign(5,iel)=5
+          vassign(6,iel)=6
+          vassign(7,iel)=7
+          vassign(8,iel)=8
+
+        elseif(sumcb.eq.9)then
+c.......the three face types for face 2,4 and 6 are 0 0 1 (not
+c       necessarily in this order)
+          newc(iel)=7
+          vassign(1,iel)=2
+          vassign(2,iel)=3
+          vassign(3,iel)=4
+          vassign(4,iel)=5
+          vassign(5,iel)=6
+          vassign(6,iel)=7
+          vassign(7,iel)=8
+
+        elseif(sumcb.eq.12)then
+c.......the three face types for face 2,4 and 6 are 0 1 1 (not
+c       necessarily in this order)
+          if(cb.eq.0)then
+            ntemp=sje(1,1,2,iel)
+            if(cbc(4,ntemp).eq.3.and.sje(1,1,4,ntemp).lt.iel)then
+              newc(iel)=6
+              vassign(1,iel)=2
+              vassign(2,iel)=3
+              vassign(3,iel)=4
+              vassign(4,iel)=6
+              vassign(5,iel)=7
+              vassign(6,iel)=8
+            else
+              newc(iel)=7
+              vassign(1,iel)=2
+              vassign(2,iel)=3
+              vassign(3,iel)=4
+              vassign(4,iel)=5
+              vassign(5,iel)=6
+              vassign(6,iel)=7
+              vassign(7,iel)=8
+            end if
+          elseif(cb1.eq.0)then
+            newc(iel)=7
+            vassign(1,iel)=2
+            vassign(2,iel)=3
+            vassign(3,iel)=4
+            vassign(4,iel)=5
+            vassign(5,iel)=6
+            vassign(6,iel)=7
+            vassign(7,iel)=8
+          elseif(cb2.eq.0)then
+            ntemp=sje(1,1,4,iel)
+            if(cbc(6,ntemp).eq.3.and.sje(1,1,6,ntemp).lt.iel)then
+              newc(iel)=6
+              vassign(1,iel)=3
+              vassign(2,iel)=4
+              vassign(3,iel)=5
+              vassign(4,iel)=6
+              vassign(5,iel)=7
+              vassign(6,iel)=8
+            else
+              newc(iel)=7
+              vassign(1,iel)=2
+              vassign(2,iel)=3
+              vassign(3,iel)=4
+              vassign(4,iel)=5
+              vassign(5,iel)=6
+              vassign(6,iel)=7
+              vassign(7,iel)=8
+            end if
+          end if
+        
+        elseif(sumcb.eq.15)then
+c.......the three face types for face 2,4 and 6 are 1 1 1 (not
+c       necessarily in this order)
+          ntemp=sje(1,1,4,iel)
+          ntemp1=sje(1,1,2,iel)
+          if(cbc(6,ntemp).eq.3.and.sje(1,1,6,ntemp).lt.iel)then
+            if(cbc(2,ntemp).eq.3.and.sje(1,1,2,ntemp).lt.iel)then
+              if(cbc(6,ntemp1).eq.3.and.sje(1,1,6,ntemp1).lt.iel)then
+                newc(iel)=4
+                vassign(1,iel)=4
+                vassign(2,iel)=6
+                vassign(3,iel)=7
+                vassign(4,iel)=8
+              else
+                newc(iel)=5
+                vassign(1,iel)=3
+                vassign(2,iel)=4
+                vassign(3,iel)=6
+                vassign(4,iel)=7
+                vassign(5,iel)=8
+              end if
+            else
+              if(cbc(6,ntemp1).eq.3.and.sje(1,1,6,ntemp1).lt.iel)then
+                newc(iel)=5
+                vassign(1,iel)=4
+                vassign(2,iel)=5
+                vassign(3,iel)=6
+                vassign(4,iel)=7
+                vassign(5,iel)=8
+              else
+                newc(iel)=6
+                vassign(1,iel)=3
+                vassign(2,iel)=4
+                vassign(3,iel)=5
+                vassign(4,iel)=6
+                vassign(5,iel)=7
+                vassign(6,iel)=8
+              end if
+            end if
+          else
+            if(cbc(2,ntemp).eq.3.and.sje(1,1,2,ntemp).lt.iel)then
+              if(cbc(6,ntemp1).eq.3.and.sje(1,1,6,ntemp1).lt.iel)then
+                newc(iel)=5
+                vassign(1,iel)=2
+                vassign(2,iel)=4
+                vassign(3,iel)=6
+                vassign(4,iel)=7
+                vassign(5,iel)=8
+              else
+                newc(iel)=6
+                vassign(1,iel)=2
+                vassign(2,iel)=3
+                vassign(3,iel)=4
+                vassign(4,iel)=6
+                vassign(5,iel)=7
+                vassign(6,iel)=8
+              end if
+            else
+              if(cbc(6,ntemp1).eq.3.and.sje(1,1,6,ntemp1).lt.iel)then
+                newc(iel)=6
+                vassign(1,iel)=2
+                vassign(2,iel)=4
+                vassign(3,iel)=5
+                vassign(4,iel)=6
+                vassign(5,iel)=7
+                vassign(6,iel)=8
+
+              else
+                newc(iel)=7
+                vassign(1,iel)=2 
+                vassign(2,iel)=3 
+                vassign(3,iel)=4 
+                vassign(4,iel)=5
+                vassign(5,iel)=6
+                vassign(6,iel)=7
+                vassign(7,iel)=8
+              end if
+            end if
+          end if
+        end if
+      end do
+c$OMP END PARALLEL DO
+c.....end computing how many new mortar vertex points will be generated
+c     on each element.
+
+c.....Compute (potentially in parallel) front(iel), which records how many 
+c     new mortar point indices are to be generated from element 1 to iel.
+c     front(iel)=newc(1)+newc(2)+...+newc(iel)
+
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(iel)
+      do iel=1,nelt
+        front(iel)=newc(iel)
+      end do
+c$OMP END PARALLEL DO
+
+      call parallel_add(front)
+
+c.....On each element, generate new mortar point indices and assign them
+c     to all elements sharing this mortar point. Note, if a mortar point 
+c     is shared by several elements, the mortar point index of it will only
+c     be generated on the element with the lowest element index. 
+
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(iel,i,count)
+      do iel=1,nelt
+
+c.......compute the starting vertex mortar point index in element iel
+        front(iel)=front(iel)-newc(iel)
+
+        do i=1,newc(iel)
+c.........count is the new mortar index number, which will be assigned
+c         to a vertex of iel and broadcast to all other elements sharing
+c         this vertex point.
+          count=front(iel)+i
+          call mortar_vertex(vassign(i,iel),iel,count) 
+        end do
+      end do
+c$OMP END PARALLEL DO
+
+c.....nvertex records how many mortar indices are for element vertices.
+c     It is used in the computation of the preconditioner.
+      count=front(nelt)+newc(nelt)
+      nvertex=count
+
+c.....CONFORMING EDGE AND FACE INTERIOR
+
+c.....find out how many new mortar point indices will be assigned to all
+c.....conforming edges and all conforming face interiors on each element
+
+      n1=12*nelt
+      n2=6*nelt
+
+c.....eassign(i,iel)=.true.   indicates that the i'th edge on iel will 
+c                             generate new mortar points. 
+c     ncon_edge(i,iel)=.true. indicates that the i'th edge on iel is 
+c                             nonconforming
+      call l_init_omp(ncon_edge,n1,.false.)
+      call l_init_omp(eassign,n1,.false.)
+c.....fassign(i,iel)=.true. indicates that the i'th face of iel will 
+c                           generate new mortar points
+      call l_init_omp(fassign,n2,.false.)
+
+c.....newe records how many new edges are to be assigned
+c     diagn(1,n,iel) records the element index of the neighbor element of
+c                    iel that shares edge n of iel
+c     diagn(2,n,iel) records which part of edge n of iel is shared with the
+c                    neighbor element diagn(1,n,iel). diagn(2,n,iel)=1 refers
+c                    to the left or bottom half of edge n, diagn(2,n,iel)=2
+c                    refers to the right or top half of edge n.
+c     if_1_edge(n,iel)=.true. indicates that iel is smaller than its neighbor
+c                    element that is connected to iel by edge n only
+
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(iel,cb1,cb2,cb3,cb4,cb5
+c$OMP& ,cb6,ntemp)
+
+      do iel=1,nelt
+        newc(iel)=0
+        newe(iel)=0
+        newi(iel)=0
+        cb1=cbc(1,iel)
+        cb2=cbc(2,iel)
+        cb3=cbc(3,iel)
+        cb4=cbc(4,iel)
+        cb5=cbc(5,iel)
+        cb6=cbc(6,iel)
+
+c.......on face 6
+
+        if(cb6.eq.0)then
+          if(cb4.eq.0.or.cb4.eq.1)then
+c...........if face 6 is of type 0 and face 4 is of type 0 or type 1, the edge
+c           shared by face 4 and 6 (edge 11) will generate new mortar point
+c           indices.
+            newe(iel)=newe(iel)+1
+            eassign(11,iel)=.true.
+          end if
+          if(cb1.ne.3)then
+c...........if face 1 is not of type 3, the edge shared by face 6 and 1
+c           (edge 1) will generate new mortar point indices.
+            newe(iel)=newe(iel)+1
+            eassign(1,iel)=.true.
+          end if
+          if(cb3.ne.3)then
+            newe(iel)=newe(iel)+1
+            eassign(9,iel)=.true.
+          end if
+          if(cb2.eq.0.or.cb2.eq.1)then
+            newe(iel)=newe(iel)+1
+            eassign(5,iel)=.true.
+          end if
+        elseif(cb6.eq.1)then
+          if(cb4.eq.0)then
+            newe(iel)=newe(iel)+1
+            eassign(11,iel)=.true.
+          elseif(cb4.eq.1)then
+
+c...........If face 6 and face 4 both are of type 1, ntemp is the neighbor
+c           element on face 4.
+            ntemp=sje(1,1,4,iel)
+
+c...........if ntemp's face 6 is not nonconforming or the neighbor element
+c           of ntemp on face 6 has an element index larger than iel, the 
+c           edge shared by face 6 and 4 (edge 11) will generate new mortar
+c           point indices.
+            if(cbc(6,ntemp).ne.3.or.sje(1,1,6,ntemp).gt.iel)then
+
+              newe(iel)=newe(iel)+1
+              eassign(11,iel)=.true.
+c.............if the face 6 of ntemp is of type 2
+              if(cbc(6,ntemp).eq.2)then
+c...............The neighbor element of iel, neighbored by edge 11, is 
+c               sje(1,1,6,ntemp) (the neighbor element of ntemp on ntemp's
+c               face 6).
+                diagn(1,11,iel)=sje(1,1,6,ntemp)
+c...............The neighbor element of iel, neighbored by edge 11 shares
+c               the ijel(2,6,iel) part of edge 11 of iel
+                diagn(2,11,iel)=ijel(2,6,iel)
+c...............edge 10 of element sje(1,1,6,ntemp) (the neighbor element of 
+c               ntemp on ntemp's face 6) is a nonconforming edge
+                ncon_edge(10,sje(1,1,6,ntemp))=.true.
+c...............if_1_edge(n,iel)=.true. indicates that iel is of a smaller
+c               size than its neighbor element, neighbored by edge n of iel only.
+                if_1_edge(11,iel)=.true.
+              endif
+              if(cbc(6,ntemp).eq.3.and.
+     &          sje(1,1,6,ntemp).gt.iel)then
+                diagn(1,11,iel)=sje(2,ijel(2,6,iel),6,ntemp)
+              endif
+            end if
+          endif
+
+          if(cb1.eq.0)then
+            newe(iel)=newe(iel)+1
+            eassign(1,iel)=.true.
+          elseif(cb1.eq.1)then
+            ntemp=sje(1,1,1,iel)
+            if(cbc(6,ntemp).ne.3.or.sje(1,1,6,ntemp).gt.iel)then
+              newe(iel)=newe(iel)+1
+              eassign(1,iel)=.true.
+              if(cbc(6,ntemp).eq.2)then
+                diagn(1,1,iel)=sje(1,1,6,ntemp)
+                diagn(2,1,iel)=ijel(1,6,iel)
+                ncon_edge(7,sje(1,1,6,ntemp))=.true.
+                if_1_edge(1,iel)=.true.
+              endif
+              if(cbc(6,ntemp).eq.3.and.
+     &          sje(1,1,6,ntemp).gt.iel)then
+                diagn(1,1,iel)=sje(ijel(1,6,iel),1,6,ntemp)
+              endif
+            end if
+          elseif(cb1.eq.2)then
+            if(ijel(2,6,iel).eq.2)then
+              ntemp=sje(1,1,1,iel)
+              if(cbc(6,ntemp).eq.1)then
+                newe(iel)=newe(iel)+1
+                eassign(1,iel)=.true.
+c.............if cbc(6,ntemp)=2
+              else
+                if(sje(1,1,6,ntemp).gt.iel)then
+                  newe(iel)=newe(iel)+1
+                  eassign(1,iel)=.true.
+                  diagn(1,1,iel)=sje(1,1,6,ntemp)
+                end if
+              end if
+            else
+              newe(iel)=newe(iel)+1
+              eassign(1,iel)=.true.
+            end if
+          end if
+
+          if(cb3.eq.0)then
+            newe(iel)=newe(iel)+1
+            eassign(9,iel)=.true.
+          elseif(cb3.eq.1)then
+            ntemp=sje(1,1,3,iel)
+            if(cbc(6,ntemp).ne.3.or.sje(1,1,6,ntemp).gt.iel)then
+              newe(iel)=newe(iel)+1
+              eassign(9,iel)=.true.
+              if(cbc(6,ntemp).eq.2)then
+                diagn(1,9,iel)=sje(1,1,6,ntemp)
+                diagn(2,9,iel)=ijel(2,6,iel)
+                ncon_edge(12,sje(1,1,6,ntemp))=.true.
+                if_1_edge(9,iel)=.true.
+              endif
+              if(cbc(6,ntemp).eq.3.and.
+     &           sje(1,1,6,ntemp).gt.iel)then
+                diagn(1,9,iel)=sje(2,ijel(2,6,iel),6,ntemp)
+              endif
+            end if
+          elseif(cb3.eq.2)then
+            if(ijel(1,6,iel).eq.2)then
+              ntemp=sje(1,1,3,iel)
+              if(cbc(6,ntemp).eq.1)then
+                newe(iel)=newe(iel)+1
+                eassign(9,iel)=.true.
+c.............if cbc(6,ntemp)=2
+              else
+                if(sje(1,1,6,ntemp).gt.iel)then
+                  newe(iel)=newe(iel)+1
+                  eassign(9,iel)=.true.
+                  diagn(1,9,iel)=sje(1,1,6,ntemp)
+                end if
+              end if
+            else
+              newe(iel)=newe(iel)+1
+              eassign(9,iel)=.true.
+            end if
+          end if
+
+          if(cb2.eq.0)then
+            newe(iel)=newe(iel)+1
+            eassign(5,iel)=.true.
+          elseif(cb2.eq.1)then
+            ntemp=sje(1,1,2,iel)
+            if(cbc(6,ntemp).ne.3.or.sje(1,1,6,ntemp).gt.iel)then
+              newe(iel)=newe(iel)+1
+              eassign(5,iel)=.true.
+              if(cbc(6,ntemp).eq.2)then
+                diagn(1,5,iel)=sje(1,1,6,ntemp)
+                diagn(2,5,iel)=ijel(1,6,iel)
+                ncon_edge(3,sje(1,1,6,ntemp))=.true.
+                if_1_edge(5,iel)=.true.
+              endif
+              if(cbc(6,ntemp).eq.3.and.
+     &          sje(1,1,6,ntemp).gt.iel)then
+                diagn(1,9,iel)=sje(2,ijel(2,6,iel),6,ntemp)
+              endif
+            endif
+          end if
+        end if
+
+c.......on face 4
+        if(cb4.eq.0)then
+          if(cb1.ne.3)then
+            newe(iel)=newe(iel)+1
+            eassign(4,iel)=.true.
+          endif
+          if(cb5.ne.3)then
+            newe(iel)=newe(iel)+1
+            eassign(12,iel)=.true.
+          endif
+          if(cb2.eq.0.or.cb2.eq.1)then
+            newe(iel)=newe(iel)+1
+            eassign(8,iel)=.true.
+          end if 
+           
+        elseif(cb4.eq.1)then
+          if(cb1.eq.2)then
+            if(ijel(2,4,iel).eq.1)then
+              newe(iel)=newe(iel)+1
+              eassign(4,iel)=.true.
+            else
+              ntemp=sje(1,1,4,iel)
+              if(cbc(1,ntemp).ne.3.or.sje(1,1,1,ntemp).gt.iel)then
+                newe(iel)=newe(iel)+1
+                eassign(4,iel)=.true.
+                if(cbc(1,ntemp).eq.3.and.
+     &            sje(1,1,1,ntemp).gt.iel)then
+                  diagn(1,4,iel)=sje(ijel(1,4,iel),2,1,ntemp) 
+                endif
+              endif
+            end if
+          elseif(cb1.eq.0)then
+            newe(iel)=newe(iel)+1
+            eassign(4,iel)=.true.
+          elseif(cb1.eq.1)then
+            ntemp=sje(1,1,4,iel)
+            if(cbc(1,ntemp).ne.3.or.sje(1,1,1,ntemp).gt.iel)then
+              newe(iel)=newe(iel)+1
+              eassign(4,iel)=.true.
+              if(cbc(1,ntemp).eq.2)then
+                diagn(1,4,iel)=sje(1,1,1,ntemp)
+                diagn(2,4,iel)=ijel(1,4,iel)
+                ncon_edge(6,sje(1,1,1,ntemp))=.true.
+                if_1_edge(4,iel)=.true.
+              endif
+              if(cbc(1,ntemp).eq.3.and.
+     &          sje(1,1,1,ntemp).gt.iel)then
+                diagn(1,4,iel)=sje(ijel(1,4,iel),2,1,ntemp)
+              endif
+            end if
+          end if
+          if(cb5.eq.2)then
+            if(ijel(1,4,iel).eq.1)then
+              newe(iel)=newe(iel)+1
+              eassign(12,iel)=.true.
+            else
+              ntemp=sje(1,1,4,iel)
+              if(cbc(5,ntemp).ne.3.or.sje(1,1,5,ntemp).gt.iel)then
+                newe(iel)=newe(iel)+1
+                eassign(12,iel)=.true.
+                if(cbc(5,ntemp).eq.3.and.
+     &            sje(1,1,5,ntemp).gt.iel)then
+                  diagn(1,12,iel)=sje(2,ijel(2,4,iel),5,ntemp)
+                endif
+              endif
+            end if
+          elseif(cb5.eq.0)then
+            newe(iel)=newe(iel)+1
+            eassign(12,iel)=.true.
+          elseif(cb5.eq.1)then
+            ntemp=sje(1,1,4,iel)
+            if(cbc(5,ntemp).ne.3.or.sje(1,1,5,ntemp).gt.iel)then
+              newe(iel)=newe(iel)+1
+              eassign(12,iel)=.true.
+              if(cbc(5,ntemp).eq.2)then
+                diagn(1,12,iel)=sje(1,1,5,ntemp)
+                diagn(2,12,iel)=ijel(2,4,iel)
+                ncon_edge(9,sje(1,1,5,ntemp))=.true.
+                if_1_edge(12,iel)=.true.
+              endif
+              if(cbc(5,ntemp).eq.3.and.
+     &          sje(1,1,5,ntemp).gt.iel)then
+                diagn(1,12,iel)=sje(2,ijel(2,4,iel),5,ntemp)
+              endif
+            end if
+          end if
+          if(cb2.eq.0)then
+            newe(iel)=newe(iel)+1
+            eassign(8,iel)=.true.
+          elseif(cb2.eq.1)then
+            ntemp=sje(1,1,4,iel)
+            if(cbc(2,ntemp).ne.3.or.sje(1,1,2,ntemp).gt.iel)then
+              newe(iel)=newe(iel)+1
+              eassign(8,iel)=.true.
+              if(cbc(2,ntemp).eq.2)then
+                diagn(1,8,iel)=sje(1,1,2,ntemp)
+                diagn(2,8,iel)=ijel(1,4,iel)
+                ncon_edge(2,sje(1,1,2,ntemp))=.true.
+                if_1_edge(8,iel)=.true.
+              endif
+              if(cbc(2,ntemp).eq.3.and.
+     &          sje(1,1,2,ntemp).gt.iel)then
+                diagn(1,8,iel)=sje(ijel(1,4,iel),2,3,ntemp)
+              endif
+            endif
+          end if
+        end if
+
+c.......on face 2
+        if(cb2.eq.0)then
+          if(cb3.ne.3)then
+            newe(iel)=newe(iel)+1
+            eassign(6,iel)=.true.
+          endif
+          if(cb5.ne.3)then
+            newe(iel)=newe(iel)+1
+            eassign(7,iel)=.true.
+          endif
+        elseif(cb2.eq.1)then
+          if(cb3.eq.2)then
+            if(ijel(2,2,iel).eq.1)then
+              newe(iel)=newe(iel)+1
+              eassign(6,iel)=.true.
+            else
+              ntemp=sje(1,1,2,iel)
+              if(cbc(3,ntemp).ne.3.or.
+     &          sje(1,1,3,ntemp).gt.iel)then
+                newe(iel)=newe(iel)+1
+                eassign(6,iel)=.true.
+                if(cbc(3,ntemp).eq.3.and.
+     &            sje(1,1,3,ntemp).gt.iel)then
+                  diagn(1,6,iel)=sje(ijel(1,2,iel),2,3,ntemp)
+                endif
+              endif
+            endif
+          elseif(cb3.eq.0)then
+            newe(iel)=newe(iel)+1
+            eassign(6,iel)=.true.
+          elseif(cb3.eq.1)then
+            ntemp=sje(1,1,2,iel)
+            if(cbc(3,ntemp).ne.3.or.sje(1,1,3,ntemp).gt.iel)then
+              newe(iel)=newe(iel)+1
+              eassign(6,iel)=.true.
+              if(cbc(3,ntemp).eq.2)then
+                diagn(1,6,iel)=sje(1,1,3,ntemp)
+                diagn(2,6,iel)=ijel(1,2,iel)
+                ncon_edge(4,sje(1,1,3,ntemp))=.true.
+                if_1_edge(6,iel)=.true.
+              endif
+              if(cbc(3,ntemp).eq.3.and.
+     &          sje(1,1,3,ntemp).gt.iel)then
+                diagn(1,6,iel)=sje(ijel(1,4,iel),2,3,ntemp)
+              endif
+            endif
+          endif
+          if(cb5.eq.2)then
+            if(ijel(1,2,iel).eq.1)then
+              newe(iel)=newe(iel)+1
+              eassign(7,iel)=.true.
+            else
+              ntemp=sje(1,1,2,iel)
+              if(cbc(5,ntemp).ne.3.or.sje(1,1,5,ntemp).gt.iel)then
+                newe(iel)=newe(iel)+1
+                eassign(7,iel)=.true.
+                if(cbc(5,ntemp).eq.3.and.
+     &            sje(1,1,5,ntemp).gt.iel)then
+                  diagn(1,7,iel)=sje(ijel(2,2,iel),2,5,ntemp)
+                endif
+              endif
+            endif
+          elseif(cb5.eq.0)then
+            newe(iel)=newe(iel)+1
+            eassign(7,iel)=.true.
+          elseif(cb5.eq.1)then
+            ntemp=sje(1,1,2,iel)
+            if(cbc(5,ntemp).ne.3.or.sje(1,1,5,ntemp).gt.iel)then
+              newe(iel)=newe(iel)+1
+              eassign(7,iel)=.true.
+              if(cbc(5,ntemp).eq.2)then
+                diagn(1,7,iel)=sje(1,1,5,ntemp)
+                diagn(2,7,iel)=ijel(2,2,iel)
+                ncon_edge(1,sje(1,1,5,ntemp))=.true.
+                if_1_edge(7,iel)=.true.
+              endif
+              if(cbc(5,ntemp).eq.3.and.
+     &          sje(1,1,5,ntemp).gt.iel)then
+                diagn(1,7,iel)=sje(2,ijel(2,4,iel),5,ntemp)
+              endif
+            endif
+          endif
+        end if
+
+c.......on face 1
+        if(cb1.eq.1)then
+          newe(iel)=newe(iel)+2
+          eassign(2,iel)=.true.
+          if(cb3.eq.1)then
+            ntemp=sje(1,1,1,iel)
+            if(cbc(3,ntemp).eq.2)then
+              diagn(1,2,iel)=sje(1,1,3,ntemp)
+              diagn(2,2,iel)=ijel(1,1,iel)
+              ncon_edge(8,sje(1,1,3,ntemp))=.true.
+              if_1_edge(2,iel)=.true.
+            elseif(cbc(3,ntemp).eq.3)then
+              diagn(1,2,iel)=sje(ijel(1,1,iel),1,3,ntemp)
+            endif
+          elseif(cb3.eq.2)then
+            ntemp=sje(1,1,3,iel)
+            if(ijel(2,1,iel).eq.2)then
+              if(cbc(1,ntemp).eq.2)then
+                diagn(1,2,iel)=sje(1,1,1,ntemp)
+              end if
+            endif
+          end if
+
+          eassign(3,iel)=.true.
+          if(cb5.eq.1)then
+            ntemp=sje(1,1,1,iel)
+            if(cbc(5,ntemp).eq.2)then
+              diagn(1,3,iel)=sje(1,1,5,ntemp)
+              diagn(2,3,iel)=ijel(2,1,iel)
+              ncon_edge(5,sje(1,1,5,ntemp))=.true.
+              if_1_edge(3,iel)=.true.
+            elseif(cbc(5,ntemp).eq.3)then
+              diagn(1,3,iel)=sje(ijel(2,1,iel),1,5,ntemp)
+            endif
+          elseif(cb5.eq.2)then
+            ntemp=sje(1,1,5,iel)
+            if(ijel(1,1,iel).eq.2)then
+              if(cbc(1,ntemp).eq.2)then
+                diagn(1,3,iel)=sje(1,1,1,ntemp)
+              end if
+            endif
+            
+          end if
+        elseif(cb1.eq.2)then
+          if(cb3.eq.2)then
+            ntemp=sje(1,1,1,iel)
+            if(cbc(3,ntemp).ne.3)then
+              newe(iel)=newe(iel)+1
+              eassign(2,iel)=.true.
+              if(cbc(3,ntemp).eq.2)then
+                diagn(1,2,iel)=sje(1,1,3,ntemp)
+              endif 
+            endif
+          elseif(cb3.eq.0.or.cb3.eq.1)then
+            newe(iel)=newe(iel)+1
+            eassign(2,iel)=.true.
+            if(cb3.eq.1)then
+              ntemp=sje(1,1,1,iel)
+              if(cbc(3,ntemp).eq.2)then
+                diagn(1,2,iel)=sje(1,1,3,ntemp)
+              endif
+            endif
+          end if
+          if(cb5.eq.2)then
+            ntemp=sje(1,1,1,iel)
+            if(cbc(5,ntemp).ne.3)then
+              newe(iel)=newe(iel)+1
+              eassign(3,iel)=.true.
+              if(cbc(5,ntemp).eq.2)then
+                diagn(1,3,iel)=sje(1,1,5,ntemp)
+              endif
+            endif
+          elseif(cb5.eq.0.or.cb5.eq.1)then
+            newe(iel)=newe(iel)+1
+            eassign(3,iel)=.true.
+            if(cb5.eq.1)then
+              ntemp=sje(1,1,1,iel)
+              if(cbc(5,ntemp).eq.2)then
+                diagn(1,3,iel)=sje(1,1,5,ntemp)
+              endif
+            endif
+          end if
+        elseif(cb1.eq.0)then
+          if(cb3.ne.3)then
+            newe(iel)=newe(iel)+1
+            eassign(2,iel)=.true.
+          endif
+          if(cb5.ne.3)then
+            newe(iel)=newe(iel)+1
+            eassign(3,iel)=.true.
+          endif
+        endif
+
+c.......on face 3
+        if(cb3.eq.1)then
+          newe(iel)=newe(iel)+1
+          eassign(10,iel)=.true.
+          if(cb5.eq.1)then
+            ntemp=sje(1,1,3,iel)
+            if(cbc(5,ntemp).eq.2)then
+              diagn(1,10,iel)=sje(1,1,5,ntemp)
+              diagn(2,10,iel)=ijel(2,3,iel)
+              ncon_edge(11,sje(1,1,5,ntemp))=.true.
+              if_1_edge(10,iel)=.true.
+            endif
+          endif
+          if(ijel(1,3,iel).eq.2)then
+            ntemp=sje(1,1,3,iel)
+            if(cbc(5,ntemp).eq.3)then
+              diagn(1,10,iel)=sje(1,ijel(2,3,iel),5,ntemp)
+            endif
+          endif
+        elseif(cb3.eq.2)then
+          if(cb5.eq.2)then
+            ntemp=sje(1,1,3,iel)
+            if(cbc(5,ntemp).ne.3)then
+              newe(iel)=newe(iel)+1
+              eassign(10,iel)=.true.
+              if(cbc(5,ntemp).eq.2)then
+                diagn(1,10,iel)=sje(1,1,5,ntemp)
+              endif
+            endif
+          elseif(cb5.eq.0.or.cb5.eq.1)then
+            newe(iel)=newe(iel)+1
+            eassign(10,iel)=.true.
+            if(cb5.eq.1)then
+              ntemp=sje(1,1,3,iel)
+              if(cbc(5,ntemp).eq.2)then
+                diagn(1,10,iel)=sje(1,1,5,ntemp)
+              endif 
+            endif
+          end if
+        elseif(cb3.eq.0)then
+          if(cb5.ne.3)then
+            newe(iel)=newe(iel)+1
+            eassign(10,iel)=.true.
+          endif
+        endif
+
+c       CONFORMING FACE INTERIOR
+
+c.......find how many new mortar point indices will be assigned
+c       to face interiors on all faces on each element
+
+c.......newi record how many new face interior points will be assigned
+
+c.......on face 6
+        if(cb6.eq.1.or.cb6.eq.0)then
+          newi(iel)=newi(iel)+9
+          fassign(6,iel)=.true.
+        end if
+c.......on face 4
+        if(cb4.eq.1.or.cb4.eq.0)then
+          newi(iel)=newi(iel)+9
+          fassign(4,iel)=.true.
+        end if
+c.......on face 2
+        if(cb2.eq.1.or.cb2.eq.0)then
+          newi(iel)=newi(iel)+9
+          fassign(2,iel)=.true.
+        end if
+c.......on face 1
+        if(cb1.ne.3)then
+          newi(iel)=newi(iel)+9
+          fassign(1,iel)=.true.
+        end if
+c.......on face 3
+        if(cb3.ne.3)then
+          newi(iel)=newi(iel)+9
+          fassign(3,iel)=.true.
+        endif
+c.......on face 5
+        if(cb5.ne.3)then
+          newi(iel)=newi(iel)+9
+          fassign(5,iel)=.true.
+        endif
+
+c.......newc is the total number of new mortar point indices
+c       to be assigned to each element.
+        newc(iel)=newe(iel)*3+newi(iel)
+      end do
+c$OMP END PARALLEL DO
+
+c.....Compute (potentially in parallel) front(iel), which records how 
+c     many new mortar point indices are to be assigned (to conforming 
+c     edges and conforming face interiors) from element 1 to iel.
+c     front(iel)=newc(1)+newc(2)+...+newc(iel)
+
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(iel)
+      do iel=1,nelt
+        front(iel)=newc(iel)
+      end do
+c$OMP END PARALLEL DO
+
+      call parallel_add(front)
+
+c.....nmor is the total number of mortar points
+      nmor=nvertex+front(nelt)
+
+c.....Generate (potentially in parallel) new mortar point indices on 
+c     each conforming element face. On each face, first visit all 
+c     conforming edges, and then the face interior.
+
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(iel,count,i,cb1,ne,
+c$OMP& space,ie,edge_g,face2,ie2,ntemp,ii,jj,jface,cb,mor_v)
+      do iel=1,nelt
+        front(iel)=front(iel)-newc(iel)
+        count=nvertex+front(iel)
+        do i=1,6
+          cb1=cbc(i,iel)
+          if (i.le.2) then
+            ne=4
+            space=1
+          elseif (i.le.4)then
+            ne=3
+            space=2
+
+c.........i loops over faces. Only 4 faces need to be examined for edge visits.
+c         On face 1, edges 1,2,3 and 4 will be visited. On face 2, edges 5,6,7
+c         and 8 will be visited. On face 3, edges 9 and 10 will be visited and
+c         on face 4, edges 11 and 12 will be visited. Since the 12 edges are
+c         covered by these four faces, there is no need to visit edges on faces
+c         5 and 6, so ne is set to 0.
+c         However, i still needs to loop over 5 and 6, since the interiors
+c         of face 5 and 6 still need to be visited.
+
+          else
+            ne=0
+            space=1
+          end if
+
+          do ie=1,ne,space
+            edge_g=edgenumber(ie,i)
+            if(eassign(edge_g,iel))then
+c.............generate the new mortar points index, mor_v
+              call mor_assign(mor_v,count)
+c.............assign mor_v to local edge ie of face i on element iel
+              call mor_edge(ie,i,iel,mor_v)
+
+c.............Since this edge is shared by another face of element 
+c             iel, assign mor_v to the corresponding edge on the other 
+c             face also.
+
+c.............find the other face
+              face2=f_e_ef(ie,i)
+c.............find the local edge index of this edge on the other face
+              ie2=localedgenumber(face2,edge_g)
+c.............assign mor_v to local edge ie2 of face face2 on element iel
+              call mor_edge(ie2,face2,iel,mor_v)
+
+c.............There are some neighbor elements also sharing this edge. Assign
+c             mor_v to neighbor element, neighbored by face i.
+              if (cbc(i,iel).eq.2)then
+                ntemp=sje(1,1,i,iel)
+                call mor_edge(ie,jjface(i),ntemp,mor_v)
+                call mor_edge(op(ie2),face2,ntemp,mor_v)
+              end if
+
+c.............assign mor_v  to neighbor element neighbored by face face2
+              if (cbc(face2,iel).eq.2)then
+                ntemp=sje(1,1,face2,iel)
+                call mor_edge(ie2,jjface(face2),ntemp,mor_v)
+                call mor_edge(op(ie),i,ntemp,mor_v)
+              end if
+
+c.............assign mor_v to neighbor element sharing this edge
+
+c.............if the neighbor is of the same size as iel
+              if(.not.if_1_edge(edgenumber(ie,i),iel))then
+                if(diagn(1,edgenumber(ie,i),iel).ne.0)then
+                  ntemp=diagn(1,edgenumber(ie,i),iel)
+                  call mor_edge(op(ie2),jjface(face2),ntemp,mor_v)
+                  call mor_edge(op(ie),jjface(i),ntemp,mor_v)
+                endif
+
+c.............if the neighbor has a size larger than iel's
+              else
+                if(diagn(1,edgenumber(ie,i),iel).ne.0)then
+                  ntemp=diagn(1,edgenumber(ie,i),iel)
+                  call mor_ne(mor_v,diagn(2,edgenumber(ie,i),iel),
+     &            ie,i,ie2,face2,iel,ntemp)
+                end if
+              endif
+ 
+            endif
+          end do 
+
+          if(fassign(i,iel))then
+c...........generate new mortar point indices in the face interior.
+c           if face i is of type 1 or iel doesn't have a neighbor element,
+c           assign new mortar point indices to interior mortar points
+c           of face i of iel.
+            cb=cbc(i,iel)
+            if (cb.eq.1.or.cb.eq.0) then
+              do jj =2,lx1-1
+                do ii=2,lx1-1
+                  count=count+1
+                  idmo(ii,jj,1,1,i,iel)=count
+                end do
+              end do
+
+c...........if face i is of type 2, assign new mortar point indices
+c           to iel as well as to the neighboring element on face i
+            elseif (cb.eq.2) then
+              if (idmo(2,2,1,1,i,iel).eq.0) then
+                ntemp=sje(1,1,i,iel)
+                jface = jjface(i)
+                do jj =2,lx1-1
+                  do ii=2,lx1-1
+                    count=count+1
+                    idmo(ii,jj,1,1,i,iel)=count
+                    idmo(ii,jj,1,1,jface,ntemp)=count
+                  end do
+                end do
+              end if 
+            end if
+          end if
+        end do
+      end do 
+c$OMP END  PARALLEL DO
+
+ 
+c.....for edges on nonconforming faces, copy the mortar point indices
+c     from neighbors.
+c$OMP PARALLEL DO DEFAULT(SHARED)
+c$OMP& PRIVATE(iel,i,cb,jface,iii,jjj,ntemp,ii,jj)
+      do iel=1,nelt
+        do i=1,6
+          cb=cbc(i,iel)
+          if (cb.eq.3) then
+c...........edges 
+            call edgecopy_s(i,iel)
+          end if 
+
+c.........face interior 
+
+          jface = jjface(i)
+          if (cb.eq.3) then
+            do iii=1,2
+              do jjj=1,2
+                ntemp=sje(iii,jjj,i,iel) 
+                do jj =1,lx1
+                  do ii=1,lx1
+                    idmo(ii,jj,iii,jjj,i,iel)=
+     &                         idmo(ii,jj,1,1,jface,ntemp)
+                  end do
+                end do
+                idmo(1,1,iii,jjj,i,iel)=idmo(1,1,1,1,jface,ntemp)
+                idmo(lx1,1,iii,jjj,i,iel)=idmo(lx1,1,1,2,jface,ntemp)
+                idmo(1,lx1,iii,jjj,i,iel)=idmo(1,lx1,2,1,jface,ntemp)
+                idmo(lx1,lx1,iii,jjj,i,iel)=
+     &                         idmo(lx1,lx1,2,2,jface,ntemp)
+              end do
+            end do
+          end if
+        end do
+      end do
+c$OMP END PARALLEL DO
+      return
+      end
+       
+c-----------------------------------------------------------------
+       subroutine get_emo(ie,n,ng)
+c-----------------------------------------------------------------
+c      This subroutine fills array emo.
+c      emo  records all elements sharing the same mortar point 
+c                 (only applies to element vertices) .
+c      emo(1,i,n) gives the element ID of the i'th element sharing
+c                 mortar point n. (emo(1,i,n)=ie), ie is element
+c                 index.
+c      emo(2,i,n) gives the vertex index of mortar point n on this
+c                 element (emo(2,i,n)=ng), ng is the vertex index.
+c      nemo(n) records the total number of elements sharing mortar 
+c                 point n.
+c-----------------------------------------------------------------
+ 
+       include 'header.h'
+
+       integer ie, n, ntemp, i,ng
+       logical L1
+
+       L1=.false.
+       do i=1,nemo(n)
+         if (emo(1,i,n).eq.ie) L1=.true.
+       end do
+       if (.not.L1) then
+c$       call omp_set_lock(tlock(n))
+         ntemp=nemo(n)+1
+         nemo(n)=ntemp
+         emo(1,ntemp,n)=ie
+         emo(2,ntemp,n)=ng
+c$       call omp_unset_lock(tlock(n))
+       end if
+
+       return
+       end 
+
+c-----------------------------------------------------------------
+      logical function ifsame(ntemp,j,iel,i)
+c-----------------------------------------------------------------
+c     Check whether the i'th vertex of element iel is at the same
+c     location as the j'th vertex of element ntemp.
+c-----------------------------------------------------------------
+
+      include 'header.h'
+      integer iel, i, ntemp, j
+
+      ifsame=.false.
+      if (ntemp.eq.0 .or. iel.eq.0) return
+      if (xc(i,iel).eq.xc(j,ntemp).and.
+     &    yc(i,iel).eq.yc(j,ntemp).and.
+     &    zc(i,iel).eq.zc(j,ntemp)) then
+        ifsame=.true.
+      end if
+
+      return
+      end
+
+c-----------------------------------------------------------------
+      subroutine mor_assign(mor_v,count)
+c-----------------------------------------------------------------
+c     Assign three consecutive numbers for mor_v, which will
+c     be assigned to the three interior points of an edge as the 
+c     mortar point indices.
+c-----------------------------------------------------------------
+      
+      implicit none
+      integer mor_v(3),count,i
+   
+      do i=1,3 
+        count=count+1
+        mor_v(i)=count
+      end do
+
+      return
+      end  
+     
+c-----------------------------------------------------------------
+      subroutine mor_edge(ie,face,iel,mor_v)
+c-----------------------------------------------------------------
+c     Copy the mortar points index from mor_v to local 
+c     edge ie of the face'th face on element iel.
+c     The edge is conforming.
+c-----------------------------------------------------------------
+
+      include 'header.h'
+
+      integer ie,i,iel,mor_v(3),j,nn,face
+
+      if (ie.eq.1) then
+        j=1
+        do nn=2,lx1-1
+          idmo(nn,j,1,1,face,iel)=mor_v(nn-1)
+        end do
+      elseif (ie.eq.2) then 
+        i=lx1
+        do nn=2,lx1-1
+          idmo(i,nn,1,1,face,iel)=mor_v(nn-1)
+        end do
+      elseif (ie.eq.3) then 
+        j=lx1
+        do nn=2,lx1-1
+          idmo(nn,j,1,1,face,iel)=mor_v(nn-1)
+        end do
+      elseif (ie.eq.4) then 
+        i=1
+        do nn=2,lx1-1
+          idmo(i,nn,1,1,face,iel)=mor_v(nn-1)
+        end do
+      end if
+
+      return
+      end 
+
+c------------------------------------------------------------
+      subroutine edgecopy_s(face,iel)
+c------------------------------------------------------------
+c     Copy mortar points index on edges from neighbor elements 
+c     to an element face of the 3rd type.
+c------------------------------------------------------------
+
+       include 'header.h'
+
+       integer face, iel, ntemp1, ntemp2, ntemp3, ntemp4, 
+     &         edge_g, edge_l, face2, mor_s_v(4,2), i
+
+c......find four neighbors on this face (3rd type)
+       ntemp1=sje(1,1,face,iel)
+       ntemp2=sje(1,2,face,iel)
+       ntemp3=sje(2,1,face,iel)
+       ntemp4=sje(2,2,face,iel)
+
+c......local edge 1
+
+c......mor_s_v is the array of mortar indices to  be copied.
+       call nrzero(mor_s_v,4*2)
+       do i=2,lx1-1
+          mor_s_v(i-1,1)=idmo(i,1,1,1,jjface(face),ntemp1)
+       end do
+       mor_s_v(lx1-1,1)=idmo(lx1,1,1,2,jjface(face),ntemp1)
+       do i=1,lx1-1
+          mor_s_v(i,2)=idmo(i,1,1,1,jjface(face),ntemp2)
+       end do
+
+c......copy mor_s_v to local edge 1 on this face
+       call mor_s_e(1,face,iel,mor_s_v)
+
+c......copy mor_s_v to the corresponding edge on the other face sharing
+c      local edge 1
+       face2=f_e_ef(1,face)
+       edge_g=edgenumber(1,face)
+       edge_l=localedgenumber(face2,edge_g)
+       call mor_s_e(edge_l,face2,iel,mor_s_v)
+
+c......local edge 2
+       do i=2,lx1-1
+          mor_s_v(i-1,1)=idmo(lx1,i,1,1,jjface(face),ntemp2)
+       end do
+       mor_s_v(lx1-1,1)=idmo(lx1,lx1,2,2,jjface(face),ntemp2)
+
+       mor_s_v(1,2)=idmo(lx1,1,1,2,jjface(face),ntemp4)
+       do i=2,lx1-1
+          mor_s_v(i,2)=idmo(lx1,i,1,1,jjface(face),ntemp4)
+       end do
+
+       call mor_s_e(2,face,iel,mor_s_v)
+       face2=f_e_ef(2,face)
+       edge_g=edgenumber(2,face)
+       edge_l=localedgenumber(face2,edge_g)
+       call mor_s_e(edge_l,face2,iel,mor_s_v)
+
+c......local edge 3
+       do i=2,lx1-1
+          mor_s_v(i-1,1)=idmo(i,lx1,1,1,jjface(face),ntemp3)
+       end do
+       mor_s_v(lx1-1,1)=idmo(lx1,lx1,2,2,jjface(face),ntemp3)
+
+       mor_s_v(1,2)=idmo(1,lx1,2,1,jjface(face),ntemp4)
+       do i=2,lx1-1
+          mor_s_v(i,2)=idmo(i,lx1,1,1,jjface(face),ntemp4)
+       end do
+
+       call mor_s_e(3,face,iel,mor_s_v)
+       face2=f_e_ef(3,face)
+       edge_g=edgenumber(3,face)
+       edge_l=localedgenumber(face2,edge_g)
+       call mor_s_e(edge_l,face2,iel,mor_s_v)
+
+c......local edge 4
+       do i=2,lx1-1
+          mor_s_v(i-1,1)=idmo(1,i,1,1,jjface(face),ntemp1)
+       end do
+       mor_s_v(lx1-1,1)=idmo(1,lx1,2,1,jjface(face),ntemp1)
+
+       do i=1,lx1-1
+          mor_s_v(i,2)=idmo(1,i,1,1,jjface(face),ntemp3)
+       end do
+
+       call mor_s_e(4,face,iel,mor_s_v)
+       face2=f_e_ef(4,face)
+       edge_g=edgenumber(4,face)
+       edge_l=localedgenumber(face2,edge_g)
+       call mor_s_e(edge_l,face2,iel,mor_s_v)
+
+       return
+       end
+
+c------------------------------------------------------------
+       subroutine mor_s_e(n,face,iel,mor_s_v)
+c------------------------------------------------------------
+c      Copy mortar points index from mor_s_v to local edge n
+c      on face "face" of element iel. The edge is nonconforming. 
+c------------------------------------------------------------
+
+       include 'header.h'
+
+       integer n,face,iel,mor_s_v(4,2), i
+
+       if (n.eq.1) then
+         do i=2,lx1
+           idmo(i,1,1,1,face,iel)=mor_s_v(i-1,1)
+         end do
+         do i=1,lx1-1
+           idmo(i,1,1,2,face,iel)=mor_s_v(i,2)
+         end do
+       else if (n.eq.2) then
+         do i=2,lx1
+          idmo(lx1,i,1,2,face,iel)=mor_s_v(i-1,1)
+         end do
+         do i=1,lx1-1
+          idmo(lx1,i,2,2,face,iel)=mor_s_v(i,2)
+         end do
+       else if (n.eq.3) then
+         do i=2,lx1
+           idmo(i,lx1,2,1,face,iel)=mor_s_v(i-1,1)
+         end do
+         do i=1,lx1-1
+           idmo(i,lx1,2,2,face,iel)=mor_s_v(i,2)
+         end do
+       else if (n.eq.4) then
+         do i=2,lx1
+           idmo(1,i,1,1,face,iel)=mor_s_v(i-1,1)
+         end do
+         do i=1,lx1-1
+           idmo(1,i,2,1,face,iel)=mor_s_v(i,2)
+         end do
+       end if
+       return
+       end
+
+c------------------------------------------------------------
+       subroutine mor_s_e_nn(n,face,iel,mor_s_v,nn)
+c------------------------------------------------------------
+c      Copy mortar point indices from mor_s_v to local edge n
+c      on face "face" of element iel. nn is the edge mortar index,
+c      which indicates that mor_s_v  corresponds to left/bottom or 
+c      right/top part of the edge.
+c------------------------------------------------------------
+
+       include 'header.h'
+
+       integer n,face,iel,mor_s_v(4), i,nn
+
+       if (n.eq.1) then
+         if(nn.eq.1)then
+            do i=2,lx1
+              idmo(i,1,1,1,face,iel)=mor_s_v(i-1)
+            end do
+         else
+           do i=1,lx1-1
+             idmo(i,1,1,2,face,iel)=mor_s_v(i)
+           end do
+         endif
+       else if (n.eq.2) then
+         if(nn.eq.1)then
+           do i=2,lx1
+            idmo(lx1,i,1,2,face,iel)=mor_s_v(i-1)
+           end do
+         else
+           do i=1,lx1-1
+            idmo(lx1,i,2,2,face,iel)=mor_s_v(i)
+           end do
+         endif
+       else if (n.eq.3) then
+         if(nn.eq.1)then
+           do i=2,lx1
+             idmo(i,lx1,2,1,face,iel)=mor_s_v(i-1)
+           end do
+         else
+           do i=1,lx1-1
+            idmo(i,lx1,2,2,face,iel)=mor_s_v(i)
+           end do
+         endif
+       else if (n.eq.4) then
+         if(nn.eq.1)then
+           do i=2,lx1
+            idmo(1,i,1,1,face,iel)=mor_s_v(i-1)
+           end do
+         else
+           do i=1,lx1-1
+            idmo(1,i,2,1,face,iel)=mor_s_v(i)
+           end do
+         endif
+       end if
+       return
+       end
+
+
+c---------------------------------------------------------------
+      subroutine mortar_vertex(i,iel,count)
+c---------------------------------------------------------------
+c     Assign mortar point index "count" to iel's i'th vertex
+c     and also to all elements sharing this vertex.
+c---------------------------------------------------------------
+
+      include 'header.h'
+
+      integer i,iel,count,ntempx(8),ifntempx(8),lc_a(3),nnb(3),
+     &        face_a(3),itemp,ntemp,ii, jj,j(3),
+     &        iintempx(3),l,nbe, lc, temp
+      logical ifsame,if_temp
+
+      do l= 1,8
+        ntempx(l)=0
+        ifntempx(l)=0
+      end do
+
+c.....face_a records the three faces sharing this vertex on iel.
+c     lc_a gives the local corner number of this vertex on each 
+c     face in face_a.
+
+      do l=1,3
+        face_a(l)=f_c(l,i)
+        lc_a(l)=local_corner(i,face_a(l))
+      end do
+
+c.....each vertex is shared by at most 8 elements. 
+c     ntempx(j) gives the element index of a POSSIBLE element whose
+c               j'th vertex is iel's i'th vertex
+c     ifntempx(i)=ntempx(i) means ntempx(i) exists
+c     ifntempx(i)=0 means ntempx(i) does not exist.
+
+      ntempx(9-i)=iel
+      ifntempx(9-i)=iel
+
+c.....first find all elements sharing this vertex, ifntempx
+
+c.....find the three possible neighbors of iel, neighbored by faces 
+c     listed in array face_a
+
+      do itemp= 1, 3
+
+c.......j(itemp) is the local corner number of this vertex on the 
+c       neighbor element on the corresponding face.
+        j(itemp)=c_f(lc_a(itemp),jjface(face_a(itemp)))
+
+c.......iintempx(itemp) records the vertex index of i on the
+c       neighbor element, neighbored by face_a(itemp)
+        iintempx(itemp)=cal_intempx(lc_a(itemp),face_a(itemp))
+
+c.......ntemp refers to the neighbor element
+        ntemp=0
+
+c.......if the face is nonconforming, find out in which piece of the 
+c       mortar the vertex is located
+        ii=cal_iijj(1,lc_a(itemp))
+        jj=cal_iijj(2,lc_a(itemp))
+        ntemp=sje(ii,jj,face_a(itemp),iel)
+
+c.......if the face is conforming
+        if(ntemp.eq.0)then
+          ntemp=sje(1,1,face_a(itemp),iel)
+c.........find the possible neighbor        
+          ntempx(iintempx(itemp))=ntemp
+c.........check whether this possible neighbor is a real neighbor or not
+          if(ntemp.ne.0)then
+            if(ifsame(ntemp,j(itemp),iel,i))then
+              ifntempx(iintempx(itemp))=ntemp
+            end if
+          end if
+
+c.......if the face is nonconforming
+        else
+          if(ntemp.ne.0)then
+            if(ifsame(ntemp,j(itemp),iel,i))then
+              ifntempx(iintempx(itemp))=ntemp
+              ntempx(iintempx(itemp))=ntemp
+            end if
+          end if
+        end if 
+      end do 
+
+c.....find the possible three neighbors, neighbored by an edge only
+      do l=1,3
+
+c.....find first existing neighbor of any of the faces in array face_a
+        if_temp=.false.
+        if(l.eq.1)then
+          if_temp=.true.
+        elseif(l.eq.2)then
+          if(ifntempx(iintempx(l-1)).eq.0)then
+            if_temp=.true.
+          end if
+        elseif(l.eq.3)then
+          if(ifntempx(iintempx(l-1)).eq.0
+     &       .and.ifntempx(iintempx(l-2)).eq.0) then
+            if_temp=.true.
+          end if
+        end if
+           
+        if(if_temp)then
+          if (ifntempx(iintempx(l)).ne.0) then
+            nbe=ifntempx(iintempx(l))
+c...........if 1st neighbor exists, check the neighbor's two neighbors in
+c           the other two directions. 
+c           e.g. if l=1, check directions 2 and 3,i.e. itemp=2,3,1
+c           if l=2, itemp=3,1,-2
+c           if l=3, itemp=1,2,1
+c
+            do itemp=face_l1(l),face_l2(l),face_ld(l)
+c.............lc is the local corner number of this vertex on face face_a(itemp)
+c             on the neighbor element of iel, neighbored by a face face_a(l)
+              lc=local_corner(j(l),face_a(itemp))
+c.............temp is the vertex index of this vertex on the neighbor element
+c             neighbored by an edge
+              temp=cal_intempx(lc,face_a(itemp))
+              ii=cal_iijj(1,lc)
+              jj=cal_iijj(2,lc)
+              ntemp=sje(ii,jj,face_a(itemp),nbe)
+
+c.............if the face face_a(itemp) is conforming
+              if(ntemp.eq.0)then
+                ntemp=sje(1,1,face_a(itemp),nbe)
+                if(ntemp.ne.0)then
+                  if(ifsame(ntemp,c_f(lc,jjface(face_a(itemp))),
+     &            nbe,j(l)))then
+                    ntempx(temp)=ntemp
+                    ifntempx(temp)=ntemp
+c...................nnb(itemp) records the neighbor element neighbored by an
+c                   edge only
+                    nnb(itemp)=ntemp
+                  end if
+                end if
+
+c.............if the face face_a(itemp) is nonconforming
+              else
+                if(ntemp.ne.0)then
+                  if(ifsame(ntemp,c_f(lc,jjface(face_a(itemp))),
+     &              nbe,j(l)))then
+                    ntempx(temp)=ntemp
+                    ifntempx(temp)=ntemp
+                    nnb(itemp)=ntemp
+                  end if
+                end if
+              end if
+            end do
+
+c...........check the last neighbor element, neighbored by an edge
+
+c...........ifntempx(iintempx(l)) has been visited in the above, now 
+c           check another neighbor element(nbe) neighbored by a face 
+
+c...........if the neighbor element neighbored by face
+c           face_a(face_l1(l)) exists
+            if(ifntempx(iintempx(face_l1(l))).ne.0)then
+              nbe=ifntempx(iintempx(face_l1(l)))
+c.............itemp is the last direction other than l and face_l1(l)
+              itemp=face_l2(l)
+              lc=local_corner(j(face_l1(l)),face_a(itemp))
+              temp=cal_intempx(lc,face_a(itemp))
+              ii=cal_iijj(1,lc)
+              jj=cal_iijj(2,lc)
+
+c.............ntemp records the last neighbor element neighbored by an edge
+c             with element iel
+              ntemp=sje(ii,jj,face_a(itemp),nbe)
+c.............if conforming
+              if(ntemp.eq.0)then
+                ntemp=sje(1,1,face_a(itemp),nbe)
+                if(ntemp.ne.0)then
+                  if(ifsame(ntemp,c_f(lc,jjface(face_a(itemp))),
+     &              nbe,j(face_l1(l))))then
+                    ntempx(temp)=ntemp
+                    ifntempx(temp)=ntemp
+                    nnb(l)=ntemp
+                  end if
+                end if
+c.............if nonconforming
+              else
+                if(ntemp.ne.0)then
+                  if(ifsame(ntemp,c_f(lc,jjface(face_a(itemp))),
+     &            nbe,j(face_l1(l))))then
+                    ntempx(temp)=ntemp
+                    ifntempx(temp)=ntemp
+                    nnb(l)=ntemp
+                  end if
+                end if
+              end if
+
+c...........otherwise, if the neighbor element neighbored by face
+c           face_a(face_l2(l)) exists
+            elseif(ifntempx(iintempx(face_l2(l))).ne.0)then
+              nbe=ifntempx(iintempx(face_l2(l)))
+              itemp=face_l1(l)
+              lc=local_corner(j(face_l2(l)),face_a(itemp))
+              temp=cal_intempx(lc,face_a(itemp))
+              ii=cal_iijj(1,lc)
+              jj=cal_iijj(2,lc)
+              ntemp=sje(ii,jj,face_a(itemp),nbe)
+              if(ntemp.eq.0)then
+                ntemp=sje(1,1,face_a(itemp),nbe)
+                if(ntemp.ne.0)then
+                  if(ifsame(ntemp,c_f(lc,jjface(face_a(itemp))),
+     &            nbe,j(face_l2(l))))then
+                    ntempx(temp)=ntemp
+                    ifntempx(temp)=ntemp
+                    nnb(l)=ntemp
+                  end if
+                end if
+              else
+                if(ntemp.ne.0)then
+                  if(ifsame(ntemp,c_f(lc,jjface(face_a(itemp))),
+     &            nbe,j(face_l2(l))))then
+                    ntempx(temp)=ntemp
+                    ifntempx(temp)=ntemp
+                    nnb(l)=ntemp
+                  end if
+                end if
+              end if
+            endif
+          endif
+        end if
+      end do
+
+c.....check the neighbor element, neighbored by a vertex only
+
+c.....nnb are the three possible neighbor elements neighbored by an edge
+
+      nnb(1)=ifntempx(cal_nnb(1,i))
+      nnb(2)=ifntempx(cal_nnb(2,i))
+      nnb(3)=ifntempx(cal_nnb(3,i))
+      ntemp=0
+
+c.....the neighbor element neighbored by a vertex must be a neighbor of
+c     a valid(nonzero) nnb(i), neighbored by a face 
+
+      if(nnb(1).ne.0)then
+        lc=oplc(local_corner(i,face_a(3)))
+        ii=cal_iijj(1,lc)
+        jj=cal_iijj(2,lc)
+c.......ntemp records the neighbor of iel, neighbored by vertex i 
+        ntemp=sje(ii,jj,face_a(3),nnb(1))
+c.......temp is the vertex index of i on ntemp
+        temp=cal_intempx(lc,face_a(3))
+        if(ntemp.eq.0)then
+          ntemp=sje(1,1,face_a(3),nnb(1))
+          if(ntemp.ne.0)then
+            if(ifsame(ntemp,c_f(lc,jjface(face_a(3))),
+     &         iel,i))then
+              ntempx(temp)=ntemp
+              ifntempx(temp)=ntemp
+            end if
+          end if
+        else
+          if(ntemp.ne.0)then
+            if(ifsame(ntemp,c_f(lc,jjface(face_a(3))),
+     &         iel,i))then
+              ntempx(temp)=ntemp
+              ifntempx(temp)=ntemp
+            end if
+          end if
+        end if
+      elseif(nnb(2).ne.0)then
+        lc=oplc(local_corner(i,face_a(1)))
+        ii=cal_iijj(1,lc)
+        jj=cal_iijj(2,lc)
+        ntemp=sje(ii,jj,face_a(1),nnb(2))
+        temp=cal_intempx(lc,face_a(1))
+        if(ntemp.eq.0)then
+          ntemp=sje(1,1,face_a(1),nnb(2))
+          if(ntemp.ne.0)then
+            if(ifsame(ntemp,
+     &        c_f(lc,jjface(face_a(1))),iel,i))then
+              ntempx(temp)=ntemp
+              ifntempx(temp)=ntemp
+            end if
+          end if
+        else
+          if(ntemp.ne.0)then
+            if(ifsame(ntemp,
+     &      c_f(lc,jjface(face_a(1))),iel,i))then
+              ntempx(temp)=ntemp
+              ifntempx(temp)=ntemp
+            end if
+          end if
+        end if
+      elseif(nnb(3).ne.0)then
+        lc=oplc(local_corner(i,face_a(2)))
+        ii=cal_iijj(1,lc)
+        jj=cal_iijj(2,lc)
+        ntemp=sje(ii,jj,face_a(2),nnb(3))
+        temp=cal_intempx(lc, face_a(2))
+        if(ntemp.eq.0)then
+          ntemp=sje(1,1,face_a(2),nnb(3))
+          if(ntemp.ne.0)then
+            if(ifsame(ntemp,
+     &         c_f(lc,jjface(face_a(2))),iel,i))then
+              ifntempx(temp)=ntemp
+              ntempx(temp)=ntemp
+            end if
+          end if
+        else
+          if(ntemp.ne.0)then
+            if(ifsame(ntemp,
+     &        c_f(lc,jjface(face_a(2))),iel,i))then
+              ifntempx(temp)=ntemp
+              ntempx(temp)=ntemp
+            end if
+          end if
+        end if
+      end if
+
+c.....ifntempx records all elements sharing this vertex, assign count
+c     to all these elements.
+
+      if (ifntempx(1).ne.0) then
+        idmo(lx1,lx1,2,2,1,ntempx(1))=count
+        idmo(lx1,lx1,2,2,3,ntempx(1))=count
+        idmo(lx1,lx1,2,2,5,ntempx(1))=count
+        call get_emo(ntempx(1),count,8)
+      end if
+
+      if (ifntempx(2).ne.0) then
+        idmo(lx1,lx1,2,2,2,ntempx(2))=count
+        idmo(1,lx1,2,1,3,ntempx(2))=count
+        idmo(1,lx1,2,1,5,ntempx(2))=count
+        call get_emo(ntempx(2),count,7)
+      end if
+
+      if (ifntempx(3).ne.0) then
+        idmo(1,lx1,2,1,1,ntempx(3))=count
+        idmo(lx1,lx1,2,2,4,ntempx(3))=count
+        idmo(lx1,1,1,2,5,ntempx(3))=count
+        call get_emo(ntempx(3),count,6)
+      end if
+      if (ifntempx(4).ne.0) then
+        idmo(1,lx1,2,1,2,ntempx(4))=count
+        idmo(1,lx1,2,1,4,ntempx(4))=count
+        idmo(1,1,1,1,5,ntempx(4))=count
+        call get_emo(ntempx(4),count,5)
+      end if
+
+      if (ifntempx(5).ne.0) then
+        idmo(lx1,1,1,2,1,ntempx(5))=count
+        idmo(lx1,1,1,2,3,ntempx(5))=count
+        idmo(lx1,lx1,2,2,6,ntempx(5))=count
+        call get_emo(ntempx(5),count,4)
+      end if
+
+
+      if (ifntempx(6).ne.0) then
+        idmo(lx1,1,1,2,2,ntempx(6))=count
+        idmo(1,1,1,1,3,ntempx(6))=count
+        idmo(1,lx1,2,1,6,ntempx(6))=count
+        call get_emo(ntempx(6),count,3)
+      end if
+
+      if (ifntempx(7).ne.0) then
+        idmo(1,1,1,1,1,ntempx(7))=count
+        idmo(lx1,1,1,2,4,ntempx(7))=count
+        idmo(lx1,1,1,2,6,ntempx(7))=count
+        call get_emo(ntempx(7),count,2)
+      end if
+
+      if (ifntempx(8).ne.0) then
+        idmo(1,1,1,1,2,ntempx(8))=count
+        idmo(1,1,1,1,4,ntempx(8))=count
+        idmo(1,1,1,1,6,ntempx(8))=count
+        call get_emo(ntempx(8),count,1)
+      end if
+
+      return
+      end
+
+     
+c---------------------------------------------------------------
+      subroutine mor_ne(mor_v,nn,edge,face,edge2,face2,ntemp,iel)
+c---------------------------------------------------------------
+c     Copy the mortar points index  (mor_v + vertex mortar point) from
+c     edge'th local edge on face'th face on element ntemp to iel.
+c     ntemp is iel's neighbor, neighbored by this edge only. 
+c     This subroutine is for the situation that iel is of larger
+c     size than ntemp.  
+c     face, face2 are face indices
+c     edge and edge2 are local edge numbers of this edge on face and face2
+c     nn is edge mortar index, which indicates whether this edge
+c     corresponds to the left/bottom or right/top part of the edge
+c     on iel.
+c---------------------------------------------------------------
+      include 'header.h'
+
+      integer mor_v(3),nn,edge,face,edge2,face2,ntemp,iel, i, 
+     &mor_s_v(4)
+
+c.....get mor_s_v which is the mor_v + vertex mortar
+      if (edge.eq.3) then
+        if(nn.eq.1)then
+          do i=2,lx1-1
+            mor_s_v(i-1)=mor_v(i-1)
+          end do
+          mor_s_v(4)=idmo(lx1,lx1,2,2,face,ntemp)
+        else
+          mor_s_v(1)=idmo(1,lx1,2,1,face,ntemp)
+          do i=2,lx1-1
+            mor_s_v(i)=mor_v(i-1)
+          end do
+        endif
+      
+      elseif (edge.eq.4) then
+        if(nn.eq.1)then
+          do i=2,lx1-1
+            mor_s_v(i-1)=mor_v(i-1)
+          end do
+          mor_s_v(4)=idmo(1,lx1,2,1,face,ntemp)
+        else
+          mor_s_v(1)=idmo(1,1,1,1,face,ntemp)
+          do i=2,lx1-1
+            mor_s_v(i)=mor_v(i-1)
+          end do
+        endif
+
+      elseif (edge.eq.1) then
+        if(nn.eq.1)then
+          do i=2,lx1-1
+            mor_s_v(i-1)=mor_v(i-1)
+          end do
+          mor_s_v(4)=idmo(lx1,1,1,2,face,ntemp)
+        else
+          mor_s_v(1)=idmo(1,1,1,1,face,ntemp)
+          do i=2,lx1-1
+            mor_s_v(i)=mor_v(i-1)
+          end do
+         endif
+
+      else if (edge.eq.2) then
+        if(nn.eq.1)then
+          do i=2,lx1-1
+             mor_s_v(i-1)=mor_v(i-1)
+          end do
+          mor_s_v(4)=idmo(lx1,lx1,2,2,face,ntemp)
+        else
+          mor_s_v(1)=idmo(lx1,1,1,2,face,ntemp)
+          do i=2,lx1-1
+             mor_s_v(i)=mor_v(i-1)
+          end do
+        endif
+      end if
+
+c.....copy mor_s_v to iel's local edge(op(edge)), on face jjface(face)
+      call mor_s_e_nn(op(edge),jjface(face),iel,mor_s_v,nn)
+c.....copy mor_s_v to iel's local edge(op(edge2)), on face jjface(face2)
+c     since this edge is shared by two faces on iel
+      call mor_s_e_nn(op(edge2),jjface(face2),iel,mor_s_v,nn)
+
+      return
+      end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/move.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/move.f
new file mode 100644
index 0000000..0d388f7
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/move.f
@@ -0,0 +1,88 @@
+c---------------------------------------------------------------
+      subroutine move
+c---------------------------------------------------------------
+c     move element to proper location in morton space filling curve
+c---------------------------------------------------------------
+
+      include 'header.h'
+
+      integer i,iside,jface,iel,ntemp,ii1,ii2,n1,n2,cb
+
+      n2=2*6*nelt
+      n1=n2*2
+
+      call nr_init_omp(sje_new,n1,0)
+      call nr_init_omp(ijel_new,n2,0)
+
+c$OMP PARALLEL DEFAULT(SHARED) PRIVATE(iel,i,iside,jface,cb,ntemp,
+c$OMP& ii1,ii2) 
+c$OMP DO
+      do iel=1,nelt
+        i=mt_to_id(iel)
+        treenew(iel)=tree(i)
+        call copy(xc_new(1,iel),xc(1,i),8)
+        call copy(yc_new(1,iel),yc(1,i),8)
+        call copy(zc_new(1,iel),zc(1,i),8)
+
+        do iside=1,nsides
+          jface = jjface(iside)
+          cb=cbc(iside,i)
+          xc_new(iside,iel)=xc(iside,i)
+          yc_new(iside,iel)=yc(iside,i)
+          zc_new(iside,iel)=zc(iside,i)
+          cbc_new(iside,iel)=cb
+
+          if(cb.eq.2)then
+            ntemp=sje(1,1,iside,i)
+            ijel_new(1,iside,iel)=1
+            ijel_new(2,iside,iel)=1
+            sje_new(1,1,iside,iel)=id_to_mt(ntemp)
+
+          else if(cb.eq.1) then
+            ntemp=sje(1,1,iside,i)
+            ijel_new(1,iside,iel)=ijel(1,iside,i)
+            ijel_new(2,iside,iel)=ijel(2,iside,i)
+            sje_new(1,1,iside,iel)=id_to_mt(ntemp)
+         
+          else if(cb.eq.3) then
+            do ii2=1,2
+              do ii1=1,2
+                ntemp=sje(ii1,ii2,iside,i)
+                ijel_new(1,iside,iel)=1
+                ijel_new(2,iside,iel)=1
+                sje_new(ii1,ii2,iside,iel)=id_to_mt(ntemp)
+              end do
+            end do
+
+          else if(cb.eq.0)then
+            sje_new(1,1,iside,iel)=0
+            sje_new(1,2,iside,iel)=0
+            sje_new(2,1,iside,iel)=0
+            sje_new(2,2,iside,iel)=0       
+          end if 
+
+        end do
+
+        call copy(ta2(1,1,1,iel),ta1(1,1,1,i),nxyz)
+
+      end do
+c$OMP ENDDO
+
+c$OMP DO
+      do iel=1,nelt
+        call copy(xc(1,iel),xc_new(1,iel),8)
+        call copy(yc(1,iel),yc_new(1,iel),8)
+        call copy(zc(1,iel),zc_new(1,iel),8)
+        call copy(ta1(1,1,1,iel),ta2(1,1,1,iel),nxyz)
+        call ncopy(sje(1,1,1,iel),sje_new(1,1,1,iel),4*6)
+        call ncopy(ijel(1,1,iel),ijel_new(1,1,iel),2*6)
+        call ncopy(cbc(1,iel),cbc_new(1,iel),6)
+        mt_to_id(iel)=iel
+        id_to_mt(iel)=iel
+        tree(iel)=treenew(iel)
+      end do
+c$OMP ENDDO 
+c$OMP END PARALLEL
+
+      return
+      end 
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/precond.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/precond.f
new file mode 100644
index 0000000..ba643cc
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/precond.f
@@ -0,0 +1,792 @@
+c------------------------------------------------------------------
+      subroutine setuppc
+c------------------------------------------------------------------
+c     Generate diagonal preconditioner for CG.
+c     Preconditioner computed in this subroutine is correct only
+c     for collocation points in element interiors, on conforming face
+c     interiors and on conforming edges.
+c------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision dxtm1_2(lx1,lx1), rdtime
+      integer ie,k,i,j,q,isize
+
+      do j=1,lx1
+        do i=1,lx1
+          dxtm1_2(i,j)=dxtm1(i,j)**2
+        end do
+      end do
+
+      rdtime=1.d0/dtime
+
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(ie,isize,i,j,k,q) 
+      do ie = 1, nelt
+        call r_init(dpcelm(1,1,1,ie),nxyz,0.d0)
+        isize=size_e(ie)
+        do k = 1, lx1
+          do j = 1, lx1
+            do i = 1, lx1
+              do q = 1, lx1
+                dpcelm(i,j,k,ie) = dpcelm(i,j,k,ie) + 
+     &                        g1m1_s(q,j,k,isize) * dxtm1_2(i,q) +
+     &                        g1m1_s(i,q,k,isize) * dxtm1_2(j,q) +
+     &                        g1m1_s(i,j,q,isize) * dxtm1_2(k,q)
+              end do
+              dpcelm(i,j,k,ie)=visc*dpcelm(i,j,k,ie)+
+     &                      rdtime*bm1_s(i,j,k,isize)
+            end do
+          end do
+        end do
+      end do
+c$OMP END PARALLEL DO
+
+c.....do the stiffness summation
+      call dssum
+
+c.....take inverse.
+
+      call reciprocal(dpcelm,ntot)
+
+c.....compute preconditioner on mortar points. NOTE:  dpcmor for 
+c     nonconforming cases will be corrected in subroutine setpcmo 
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(I)
+      do i=1,nmor
+        dpcmor(i)=1.d0/dpcmor(i)
+      end do
+c$OMP END PARALLEL DO
+
+      return
+      end
+
+
+c--------------------------------------------------------------
+      subroutine setpcmo_pre
+c--------------------------------------------------------------
+c     pre-compute elemental contribution to preconditioner  
+c     for all situations
+c--------------------------------------------------------------
+      
+      include 'header.h'
+
+      integer element_size, i, j, ii, jj, col
+      double precision
+     &       p(lx1,lx1,lx1), p0(lx1,lx1,lx1), mtemp(lx1,lx1), 
+     &       temp(lx1,lx1,lx1), temp1(lx1,lx1), tmp(lx1,lx1),tig(lx1)
+
+c.....corners on face of type 3 
+
+      call r_init(tcpre,lx1*lx1,0.d0)
+      call r_init(tmp,lx1*lx1,0.d0)
+      call r_init(tig,5,0.d0)
+      tig(1)   =1.d0
+      tmp(1,1) =1.d0 
+
+c.....tcpre results from mapping a unit spike field (unity at 
+c     collocation point (1,1), zero elsewhere) on an entire element
+c     face to the (1,1) segment of a nonconforming face
+      do i=2,lx1-1
+        do j=1,lx1
+          tmp(i,1) = tmp(i,1)+ qbnew(i-1,j,1)*tig(j)
+        end do
+      end do
+ 
+      do col=1,lx1
+        tcpre(col,1)=tmp(col,1)
+
+        do j=2,lx1-1
+          do i=1,lx1
+            tcpre(col,j) = tcpre(col,j) + qbnew(j-1,i,1)*
+     &                                     tmp(col,i)
+          end do
+        end do
+      end do
+
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(element_size,i,j,p,temp,
+c$OMP& mtemp,temp1,p0,ii,jj)
+      do element_size=1,refine_max
+
+c.......for conforming cases
+
+c.......pcmor_c (i,j,element_size) records the intermediate value 
+c       (preconditioner=1/pcmor_c) of the preconditioner on collocation
+c       point (i,j) on a conforming face of an element of size 
+c       element_size.
+
+        do j=1,lx1/2+1
+          do i=j,lx1/2+1
+            call r_init(p,nxyz,0.d0)
+            p(i,j,1)=1.d0
+            call laplacian(temp,p,element_size)
+            pcmor_c(i,j,element_size)=temp(i,j,1)
+            pcmor_c(lx1+1-i,j,element_size)=temp(i,j,1)
+            pcmor_c(j,i,element_size)=temp(i,j,1)
+            pcmor_c(lx1+1-j,i,element_size)=temp(i,j,1)
+            pcmor_c(j,lx1+1-i,element_size)=temp(i,j,1)
+            pcmor_c(lx1+1-j,lx1+1-i,element_size)=temp(i,j,1)
+            pcmor_c(i,lx1+1-j,element_size)=temp(i,j,1)
+            pcmor_c(lx1+1-i,lx1+1-j,element_size)=temp(i,j,1)
+          end do
+        end do
+
+c.......for nonconforming cases 
+
+c.......nonconforming face interior
+
+c.......pcmor_nc1(i,j,ii,jj,element_size) records the intermediate 
+c       preconditioner value on collocation point (i,j) on mortar 
+c       (ii,jj) on a nonconforming face of an element of size
+c       element_size
+        do j=2,lx1
+          do i=j,lx1
+            call r_init(mtemp,lx1*lx1,0.d0)
+            call r_init(p,nxyz,0.d0)
+            mtemp(i,j)=1.d0
+c...........when i, j=lx1, mortar points are duplicated, so mtemp needs
+c           to be doubled.
+            if(i.eq.lx1)mtemp(i,j)=mtemp(i,j)*2.d0
+            if(j.eq.lx1)mtemp(i,j)=mtemp(i,j)*2.d0
+            call transf_nc(mtemp,p)
+            call laplacian(temp,p,element_size)
+            call transfb_nc1(temp1,temp)
+
+c...........values at points (i,j) and (j,i) are the same
+            pcmor_nc1(i,j,1,1,element_size)=temp1(i,j)
+            pcmor_nc1(j,i,1,1,element_size)=temp1(i,j)
+          end do
+
+c.........when i, j=lx1, mortar points are duplicated, so pcmor_nc1 needs
+c         to be doubled on those points
+          pcmor_nc1(lx1,j,1,1,element_size)=
+     &          pcmor_nc1(lx1,j,1,1,element_size)*2.d0
+          pcmor_nc1(j,lx1,1,1,element_size)=
+     &          pcmor_nc1(lx1,j,1,1,element_size)
+
+        end do
+        pcmor_nc1(lx1,lx1,1,1,element_size)=
+     &      pcmor_nc1(lx1,lx1,1,1,element_size)*2.d0
+
+c.......nonconforming edges
+        j=1
+        do i=2,lx1
+          call r_init(mtemp,lx1*lx1,0.d0)
+          call r_init(p,nxyz,0.d0)
+          call r_init(p0,nxyz,0.d0)
+          mtemp(i,j)=1.d0
+          if(i.eq.lx1)mtemp(i,j)=2.d0
+          call transf_nc(mtemp,p)
+          call laplacian(temp,p,element_size)                          
+          call transfb_nc1(temp1,temp)                   
+          pcmor_nc1(i,j,1,1,element_size)=temp1(i,j)      
+          pcmor_nc1(j,i,1,1,element_size)=temp1(i,j)                              
+          do ii=1,lx1
+c...........p0 is for the case that a nonconforming edge is shared by
+c           two conforming faces
+            p0(ii,1,1)=p(ii,1,1)
+            do jj=1,lx1 
+c.............now p is for the case that a nonconforming edge is shared
+c             by nonconforming faces
+              p(ii,1,jj)=p(ii,jj,1)
+            end do
+          end do
+
+          call laplacian(temp,p,element_size)
+          call transfb_nc2(temp1,temp)                
+
+c.........pcmor_nc2(i,j,ii,jj,element_size) gives the intermediate
+c         preconditioner value on collocation point (i,j) on a 
+c         nonconforming face of an element with size element_size
+
+          pcmor_nc2(i,j,1,1,element_size)=temp1(i,j)*2.d0 
+          pcmor_nc2(j,i,1,1,element_size)=
+     &          pcmor_nc2(i,j,1,1,element_size)
+
+          call laplacian(temp,p0,element_size) 
+          call transfb_nc0(temp1,temp)                  
+
+c.........pcmor_nc0(i,j,ii,jj,element_size) gives the intermediate
+c         preconditioner value on collocation point (i,j) on a 
+c         conforming face of an element, which shares a nonconforming 
+c         edge with another conforming face
+          pcmor_nc0(i,j,1,1,element_size)=temp1(i,j)
+          pcmor_nc0(j,i,1,1,element_size)=temp1(i,j)
+        end do
+        pcmor_nc1(lx1,j,1,1,element_size)=
+     &        pcmor_nc1(lx1,j,1,1,element_size)*2.d0
+        pcmor_nc1(j,lx1,1,1,element_size)=
+     &        pcmor_nc1(lx1,j,1,1,element_size)
+        pcmor_nc2(lx1,j,1,1,element_size)=
+     &        pcmor_nc2(lx1,j,1,1,element_size)*2.d0
+        pcmor_nc2(j,lx1,1,1,element_size)=
+     &        pcmor_nc2(lx1,j,1,1,element_size)
+        pcmor_nc0(lx1,j,1,1,element_size)=
+     &        pcmor_nc0(lx1,j,1,1,element_size)*2.d0
+        pcmor_nc0(j,lx1,1,1,element_size)=
+     &        pcmor_nc0(lx1,j,1,1,element_size)
+
+c.......symmetrical copy
+        do i=1,lx1-1
+          pcmor_nc1(i,j,1,2,element_size)=
+     &          pcmor_nc1(lx1+1-i,j,1,1,element_size)
+          pcmor_nc0(i,j,1,2,element_size)=                                           
+     &          pcmor_nc0(lx1+1-i,j,1,1,element_size)                                      
+          pcmor_nc2(i,j,1,2,element_size)=                                           
+     &          pcmor_nc2(lx1+1-i,j,1,1,element_size)                                      
+        end do
+
+        do j=2,lx1                                            
+          do i=1,lx1-1
+            pcmor_nc1(i,j,1,2,element_size)=
+     &            pcmor_nc1(lx1+1-i,j,1,1,element_size)
+          end do
+          i=lx1
+          pcmor_nc1(i,j,1,2,element_size)=
+     &          pcmor_nc1(lx1+1-i,j,1,1,element_size)
+          pcmor_nc0(i,j,1,2,element_size)=                                           
+     &          pcmor_nc0(lx1+1-i,j,1,1,element_size)                                      
+          pcmor_nc2(i,j,1,2,element_size)=                                           
+     &          pcmor_nc2(lx1+1-i,j,1,1,element_size)                                      
+        end do                                                
+
+        j=1
+        i=1
+        pcmor_nc1(i,j,2,1,element_size)=
+     &        pcmor_nc1(i,lx1+1-j,1,1,element_size)
+        pcmor_nc0(i,j,2,1,element_size)=
+     &        pcmor_nc0(i,lx1+1-j,1,1,element_size)
+        pcmor_nc2(i,j,2,1,element_size)=
+     &        pcmor_nc2(i,lx1+1-j,1,1,element_size)
+        do j=2,lx1-1
+          i=1
+          pcmor_nc1(i,j,2,1,element_size)=
+     &          pcmor_nc1(i,lx1+1-j,1,1,element_size)
+          pcmor_nc0(i,j,2,1,element_size)=
+     &          pcmor_nc0(i,lx1+1-j,1,1,element_size)
+          pcmor_nc2(i,j,2,1,element_size)=
+     &          pcmor_nc2(i,lx1+1-j,1,1,element_size)
+          do i=2,lx1
+            pcmor_nc1(i,j,2,1,element_size)=
+     &            pcmor_nc1(i,lx1+1-j,1,1,element_size)
+          end do
+        end do
+
+        j=lx1
+        do i=2,lx1
+          pcmor_nc1(i,j,2,1,element_size)=
+     &          pcmor_nc1(i,lx1+1-j,1,1,element_size)
+          pcmor_nc0(i,j,2,1,element_size)=
+     &          pcmor_nc0(i,lx1+1-j,1,1,element_size)
+          pcmor_nc2(i,j,2,1,element_size)=
+     &          pcmor_nc2(i,lx1+1-j,1,1,element_size)
+        end do
+
+        j=1
+        i=lx1
+        pcmor_nc1(i,j,2,2,element_size)=
+     &        pcmor_nc1(lx1+1-i,lx1+1-j,1,1,element_size)
+        pcmor_nc0(i,j,2,2,element_size)=
+     &        pcmor_nc0(lx1+1-i,lx1+1-j,1,1,element_size)
+        pcmor_nc2(i,j,2,2,element_size)=
+     &        pcmor_nc2(lx1+1-i,lx1+1-j,1,1,element_size)
+          
+        do j=2,lx1-1                                            
+          do i=2,lx1-1
+            pcmor_nc1(i,j,2,2,element_size)=
+     &            pcmor_nc1(lx1+1-i,lx1+1-j,1,1,element_size)
+          end do
+          i=lx1
+          pcmor_nc1(i,j,2,2,element_size)=                                       
+     &          pcmor_nc1(lx1+1-i,lx1+1-j,1,1,element_size)                               
+          pcmor_nc0(i,j,2,2,element_size)=                                       
+     &          pcmor_nc0(lx1+1-i,lx1+1-j,1,1,element_size)   
+          pcmor_nc2(i,j,2,2,element_size)=                                       
+     &          pcmor_nc2(lx1+1-i,lx1+1-j,1,1,element_size)                     
+        end do                                                
+        j=lx1
+        do i=2,lx1-1
+          pcmor_nc1(i,j,2,2,element_size)=                                       
+     &          pcmor_nc1(lx1+1-i,lx1+1-j,1,1,element_size)          
+          pcmor_nc0(i,j,2,2,element_size)=
+     &          pcmor_nc0(lx1+1-i,lx1+1-j,1,1,element_size)          
+          pcmor_nc2(i,j,2,2,element_size)=                                       
+     &          pcmor_nc2(lx1+1-i,lx1+1-j,1,1,element_size)    
+        end do
+
+
+c.......vertices shared by at least one nonconforming face or edge
+
+c.......Among three edges and three faces sharing a vertex on an element
+c       situation 1: only one edge is nonconforming
+c       situation 2: two edges are nonconforming
+c       situation 3: three edges are nonconforming
+c       situation 4: one face is nonconforming 
+c       situation 5: one face and one edge are nonconforming 
+c       situation 6: two faces are nonconforming
+c       situation 7: three faces are nonconforming
+
+        call r_init(p0,nxyz,0.d0)
+        p0(1,1,1)=1.d0
+        call laplacian(temp,p0,element_size)
+        pcmor_cor(8,element_size)=temp(1,1,1)
+
+c.......situation 1
+        call r_init(p0,nxyz,0.d0)
+        do i=1,lx1
+           p0(i,1,1)=tcpre(i,1)
+        end do
+        call laplacian(temp,p0,element_size) 
+        call transfb_cor_e(1,pcmor_cor(1,element_size),temp)                  
+
+c.......situation 2
+        call r_init(p0,nxyz,0.d0)
+        do i=1,lx1
+           p0(i,1,1)=tcpre(i,1)
+           p0(1,i,1)=tcpre(i,1)
+        end do
+        call laplacian(temp,p0,element_size)
+        call transfb_cor_e(2,pcmor_cor(2,element_size),temp)                  
+
+c.......situation 3
+        call r_init(p0,nxyz,0.d0)
+        do i=1,lx1
+           p0(i,1,1)=tcpre(i,1)
+           p0(1,i,1)=tcpre(i,1)
+           p0(1,1,i)=tcpre(i,1)
+        end do
+        call laplacian(temp,p0,element_size)
+        call transfb_cor_e(3,pcmor_cor(3,element_size),temp)                  
+
+c.......situation 4
+        call r_init(p0,nxyz,0.d0)
+        do j=1,lx1
+          do i=1,lx1
+            p0(i,j,1)=tcpre(i,j)
+          end do
+        end do
+        call laplacian(temp,p0,element_size)
+        call transfb_cor_f(4,pcmor_cor(4,element_size),temp)
+
+c.......situation 5
+        call r_init(p0,nxyz,0.d0)
+        do j=1,lx1
+          do i=1,lx1
+            p0(i,j,1)=tcpre(i,j)
+          end do
+        end do
+        do i=1,lx1
+           p0(1,1,i)=tcpre(i,1)
+        end do
+        call laplacian(temp,p0,element_size)
+        call transfb_cor_f(5,pcmor_cor(5,element_size),temp)
+ 
+c.......situation 6
+        call r_init(p0,nxyz,0.d0)
+        do j=1,lx1
+          do i=1,lx1
+            p0(i,j,1)=tcpre(i,j)
+            p0(i,1,j)=tcpre(i,j)
+          end do
+        end do
+        call laplacian(temp,p0,element_size)
+        call transfb_cor_f(6,pcmor_cor(6,element_size),temp)
+
+c.......situation 7
+        do j=1,lx1
+          do i=1,lx1
+            p0(i,j,1)=tcpre(i,j)
+            p0(i,1,j)=tcpre(i,j)
+            p0(1,i,j)=tcpre(i,j)
+          end do
+        end do
+        call laplacian(temp,p0,element_size)
+        call transfb_cor_f(7,pcmor_cor(7,element_size),temp)
+
+      end do    
+c$OMP END PARALLEL DO     
+      return
+      end 
+
+
+c------------------------------------------------------------------------
+      subroutine setpcmo
+c------------------------------------------------------------------------
+c     compute the preconditioner by identifying its geometry configuration
+c     and sum the values from the precomputed elemental contributions
+c------------------------------------------------------------------------
+      
+      include 'header.h'
+
+      integer face2, nb1, nb2, sizei, imor, enum, i,j, 
+     &        iel, iside, nn1, nn2
+
+c$OMP PARALLEL DEFAULT(SHARED) PRIVATE(IMOR,IEL,ISIDE,I) 
+c$OMP DO
+      do imor=1,nvertex
+       ifpcmor(imor)=.false.
+      end do
+c$OMP END DO nowait
+   
+c$OMP DO 
+      do iel=1,nelt
+        do iside=1,nsides
+          do i=1,4
+            edgevis(i,iside,iel)=.false.
+          end do 
+        end do 
+      end do 
+c$OMP END DO 
+c$OMP END PARALLEL
+
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(IEL,iside,sizei,
+c$OMP& imor,enum,face2,nb1,nb2,i,j,nn1,nn2) 
+
+      do iel=1,nelt
+        do iside=1,nsides
+c.........for nonconforming faces
+          if(cbc(iside,iel).eq.3)then
+            sizei=size_e(iel)
+
+c...........vertices
+
+c...........ifpcmor(imor)=.true. indicates that mortar point imor has 
+c           been visited
+            imor=idmo(1,1,1,1,iside,iel)
+            if(.not.ifpcmor(imor))then
+c.............compute the preconditioner on mortar point imor
+              call pc_corner(imor)
+              ifpcmor(imor)=.true.
+            end if
+
+            imor=idmo(lx1,1,1,2,iside,iel)
+            if(.not.ifpcmor(imor))then
+              call pc_corner(imor)
+              ifpcmor(imor)=.true.
+            end if
+
+            imor=idmo(1,lx1,2,1,iside,iel)
+            if(.not.ifpcmor(imor))then
+              call pc_corner(imor)
+              ifpcmor(imor)=.true.
+            end if
+
+            imor=idmo(lx1,lx1,2,2,iside,iel)
+            if(.not.ifpcmor(imor))then
+              call pc_corner(imor)
+              ifpcmor(imor)=.true.
+            end if
+
+c...........edges on nonconforming faces, enum is local edge number
+            do enum=1,4
+
+c.............edgevis(enum,iside,iel)=.true. indicates that local edge 
+c             enum of face iside of iel has been visited
+              if(.not.edgevis(enum,iside,iel))then
+                edgevis(enum,iside,iel)=.true.
+
+c...............Examining neighbor element information,
+c               calculating the preconditioner value.
+                face2= f_e_ef(enum,iside)
+                if(cbc(face2,iel).eq.2)then
+                  nb1=sje(1,1,face2,iel)
+                  if(cbc(iside,nb1).eq.2)then
+
+c...................Compute the preconditioner on local edge enum on face
+c                   iside of element iel, 1 is neighborhood information
+c                   obtained by examining neighbors(nb1). For detailed meaning of 1,
+c                   see subroutine com_dpc.
+
+                    call com_dpc(iside,iel,enum,1,sizei)
+                    nb2=sje(1,1,iside,nb1)
+                    edgevis(op(e_face2(enum,iside)),
+     &                      jjface(face2),nb2)=.true.
+
+                  elseif(cbc(iside,nb1).eq.3)then
+                    call com_dpc(iside,iel,enum,2,sizei)
+                    edgevis(op(enum),iside,nb1)=.true.
+                  end if
+
+                elseif(cbc(face2,iel).eq.3)then
+                  edgevis(e_face2(enum,iside),face2,iel)=.true.
+                  nb1=sje(1,2,face2,iel)
+                  if(cbc(iside,nb1).eq.1)then
+                    call com_dpc(iside,iel,enum,3,sizei)
+                    nb2=sje(1,1,iside,nb1)
+                    edgevis(op(enum),jjface(iside),nb2)=.true.
+                    edgevis(op(e_face2(enum,iside)),
+     &                      jjface(face2),nb2)=.true.
+                  elseif(cbc(iside,nb1).eq.2)then
+                    call com_dpc(iside,iel,enum,4,sizei)
+                  end if
+                else if (cbc(face2,iel).eq.0)then
+                  call com_dpc(iside,iel,enum,0,sizei)
+                end if
+              end if
+            end do
+
+c...........mortar element interior (not edge of mortar) 
+
+            do nn1=1,2
+              do nn2=1,2
+                do j=2,lx1-1
+                  do i=2,lx1-1
+                    imor=idmo(i,j,nn1,nn2,iside,iel) 
+                    dpcmor(imor) = 1.d0/(pcmor_nc1(i,j,nn1,nn2,sizei)+
+     &                                pcmor_c(i,j,sizei+1))
+                  end do
+                end do
+              end do
+            end do
+
+c...........for i,j=lx1 there are duplicated mortar points, so 
+c           pcmor_c needs to be doubled or quadrupled
+            i=lx1
+            do j=2,lx1-1
+              imor=idmo(i,j,1,1,iside,iel)            
+              dpcmor(imor) = 1.d0/(pcmor_nc1(i,j,1,1,sizei)+
+     &                          pcmor_c(i,j,sizei+1)*2.d0)
+              imor=idmo(i,j,2,1,iside,iel)                
+              dpcmor(imor) = 1.d0/(pcmor_nc1(i,j,2,1,sizei)+
+     &                          pcmor_c(i,j,sizei+1)*2.d0)
+            end do      
+
+            j=lx1
+            imor=idmo(i,j,1,1,iside,iel)                                         
+            dpcmor(imor) = 1.d0/(pcmor_nc1(i,j,1,1,sizei)+
+     &                        pcmor_c(i,j,sizei+1)*4.d0)
+            do i=2,lx1-1
+              imor=idmo(i,j,1,1,iside,iel)  
+              dpcmor(imor) = 1.d0/(pcmor_nc1(i,j,1,1,sizei)+
+     &                          pcmor_c(i,j,sizei+1)*2.d0)
+              imor=idmo(i,j,1,2,iside,iel) 
+              dpcmor(imor) = 1.d0/(pcmor_nc1(i,j,1,2,sizei)+
+     &                          pcmor_c(i,j,sizei+1)*2.d0)
+            end do
+
+          end if 
+        end do
+      end do
+c$OMP END PARALLEL DO
+
+      return
+      end
+
+c--------------------------------------------------------------------------
+      subroutine pc_corner(imor)
+c------------------------------------------------------------------------
+c     calculate preconditioner value for vertex with mortar index imor
+c------------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision tmortemp
+      integer imor, inemo,ie, sizei,cornernumber,
+     &        sface,sedge,iiface,iface,iiedge,iedge,n
+
+      tmortemp=0.d0
+c.....loop over all elements sharing this vertex
+      do inemo=1,nemo(imor)
+        ie=emo(1,inemo,imor)
+        sizei=size_e(ie)
+        cornernumber=emo(2,inemo,imor)
+        sface=0
+        sedge=0
+        do iiface=1,3
+          iface=f_c(iiface,cornernumber)
+c.........sface sums the number of nonconforming faces sharing this vertex on
+c         one element
+          if(cbc(iface,ie).eq.3)then
+            sface=sface+1
+          end if
+        end do
+c.......sedge sums the number of nonconforming edges sharing this vertex on
+c       one element
+        do iiedge=1,3
+          iedge=e_c(iiedge,cornernumber)
+          if(ncon_edge(iedge,ie))sedge=sedge+1
+        end do
+
+c.......each n indicates how many nonconforming faces and nonconforming
+c       edges share this vertex on an element, 
+
+        if(sface.eq.0)then
+          if(sedge.eq.0)then
+             n=8
+          elseif(sedge.eq.1)then
+             n=1
+          elseif(sedge.eq.2)then
+             n=2
+          elseif(sedge.eq.3)then
+             n=3
+          end if 
+        elseif (sface.eq.1)then
+          if (sedge.eq.1)then
+           n=5
+          else
+           n=4
+          end if
+        else if (sface.eq.2)then
+           n=6
+        else if(sface.eq.3)then
+           n=7
+        end if
+          
+c.......sum the intermediate pre-computed preconditioner values for 
+c       all elements
+        tmortemp=tmortemp+pcmor_cor(n,sizei)
+
+      end do
+
+c.....dpcmor(imor) is the value of the preconditioner on mortar point imor
+      dpcmor(imor)=1.d0/tmortemp
+
+      return
+      end 
+
+c------------------------------------------------------------------------
+      subroutine com_dpc(iside,iel,enumber,n,isize)
+c------------------------------------------------------------------------
+c     Compute preconditioner for local edge enumber of face iside 
+c     on element iel.
+c     isize is element size,
+c     n is one of five different configurations
+c     anc1, ac, anc2, anc0 are coefficients for different edges. 
+c     nc0 refers to nonconforming edge shared by two conforming faces
+c     nc1 refers to nonconforming edge shared by one nonconforming face
+c     nc2 refers to nonconforming edges shared by two nonconforming faces
+c     c refers to conforming edge
+c------------------------------------------------------------------------
+
+      include 'header.h'
+
+      integer n, isize,iside,iel, enumber, nn1start, nn1end, nn2start, 
+     &        nn2end, jstart, jend, istart, iend, i, j, nn1, nn2, imor
+      double precision anc1,ac,anc2,anc0,temp
+
+c.....different local edges have different loop ranges 
+      if(enumber.eq.1)then
+        nn1start=1
+        nn1end=1
+        nn2start=1
+        nn2end=2
+        jstart=1
+        jend=1
+        istart=2
+        iend=lx1-1
+      elseif (enumber.eq.2) then
+        nn1start=1
+        nn1end=2
+        nn2start=2
+        nn2end=2
+        jstart=2
+        jend=lx1-1
+        istart=lx1
+        iend=lx1
+      elseif (enumber.eq.3) then
+        nn1start=2
+        nn1end=2
+        nn2start=1
+        nn2end=2
+        jstart=lx1
+        jend=lx1
+        istart=2
+        iend=lx1-1
+      elseif (enumber.eq.4) then
+        nn1start=1
+        nn1end=2
+        nn2start=1
+        nn2end=1
+        jstart=2
+        jend=lx1-1
+        istart=1
+        iend=1
+      end if
+
+c.....among the four elements sharing this edge
+
+c.....one has a smaller size
+      if(n.eq.1)then
+        anc1=2.d0
+        ac=1.d0
+        anc0=1.d0
+        anc2=0.d0
+
+c.....two (neighbored by a face) are of smaller size
+      else if (n.eq.2)then
+        anc1=2.d0
+        ac=2.d0
+        anc0=0.d0
+        anc2=0.d0
+
+c.....two (neighbored by an edge) are of smaller size
+      else if (n.eq.3)then
+        anc2=2.d0
+        ac=2.d0
+        anc1=0.d0
+        anc0=0.d0
+
+c.....three are of smaller size
+      else if (n.eq.4)then
+        anc1=0.d0
+        ac=3.d0
+        anc2=1.d0
+        anc0=0.d0
+
+c.....on the boundary
+      else if (n.eq.0)then
+        anc1=1.d0
+        ac=1.d0
+        anc2=0.d0
+        anc0=0.d0
+      end if
+
+c.....edge interior
+      do nn2=nn2start,nn2end
+        do nn1=nn1start,nn1end
+          do j=jstart,jend
+            do i=istart,iend
+              imor=idmo(i,j,nn1,nn2,iside,iel)
+              temp=anc1* pcmor_nc1(i,j,nn1,nn2,isize) +
+     &             ac*  pcmor_c(i,j,isize+1)+
+     &             anc0*  pcmor_nc0(i,j,nn1,nn2,isize)+
+     &             anc2*pcmor_nc2(i,j,nn1,nn2,isize)
+              dpcmor(imor)=1.d0/temp
+            end do
+          end do
+        end do
+      end do
+
+c.....local edge 1
+      if (enumber.eq.1) then
+        imor=idmo(lx1,1,1,1,iside,iel)
+        temp=anc1* pcmor_nc1(lx1,1,1,1,isize) +
+     &       ac*  pcmor_c(lx1,1,isize+1)*2.d0+
+     &       anc0*  pcmor_nc0(lx1,1,1,1,isize)+
+     &       anc2*pcmor_nc2(lx1,1,1,1,isize)
+c.....local edge 2
+      elseif (enumber.eq.2) then
+        imor=idmo(lx1,lx1,1,2,iside,iel)
+        temp=anc1* pcmor_nc1(lx1,lx1,1,2,isize) +
+     &       ac*  pcmor_c(lx1,lx1,isize+1)*2.d0+
+     &       anc0*  pcmor_nc0(lx1,lx1,1,2,isize)+
+     &       anc2*pcmor_nc2(lx1,lx1,1,2,isize)
+c.....local edge 3
+      elseif (enumber.eq.3) then
+        imor=idmo(lx1,lx1,2,1,iside,iel)
+        temp=anc1* pcmor_nc1(lx1,lx1,2,1,isize) +
+     &       ac*  pcmor_c(lx1,lx1,isize+1)*2.d0+
+     &       anc0*  pcmor_nc0(lx1,lx1,2,1,isize)+
+     &       anc2*pcmor_nc2(lx1,lx1,2,1,isize)
+c.....local edge 4
+      elseif (enumber.eq.4) then
+        imor=idmo(1,lx1,1,1,iside,iel)
+        temp=anc1* pcmor_nc1(1,lx1,1,1,isize) +
+     &       ac*  pcmor_c(1,lx1,isize+1)*2.d0+
+     &       anc0*  pcmor_nc0(1,lx1,1,1,isize)+
+     &       anc2*pcmor_nc2(1,lx1,1,1,isize)
+      end if
+
+      dpcmor(imor)=1.d0/temp
+
+      return
+      end 
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/setup.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/setup.f
new file mode 100644
index 0000000..f2d5ce3
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/setup.f
@@ -0,0 +1,533 @@
+c-----------------------------------------------------------------
+      subroutine create_initial_grid        
+c------------------------------------------------------------------
+    
+      include 'header.h'
+
+      integer i
+
+      nelt=1
+      ntot=nelt*lx1*lx1*lx1 
+      tree(1)=1
+      mt_to_id(1)=1
+      do i=1,7,2
+        xc(i,1)=0.d0
+        xc(i+1,1)=1.d0
+      end do
+
+      do i=1,2
+        yc(i,1)=0.d0
+        yc(2+i,1)=1.d0
+        yc(4+i,1)=0.d0
+        yc(6+i,1)=1.d0
+      end do
+     
+      do i=1,4
+        zc(i,1)=0.d0
+        zc(4+i,1)=1.d0
+      end do
+  
+      do i=1,6
+        cbc(i,1)=0
+      end do
+
+      return
+
+      end
+
+c-----------------------------------------------------------------
+      subroutine coef
+c-----------------------------------------------------------------
+c
+c     generate 
+c
+c            - collocation points
+c            - weights
+c            - derivative matrices 
+c            - projection matrices
+c            - interpolation matrices 
+c
+c     associated with the 
+c
+c            - gauss-legendre lobatto mesh (suffix m1)
+c
+c----------------------------------------------------------------
+
+      include 'header.h'
+
+      integer i,j,k
+
+c.....for gauss-legendre lobatto mesh (suffix m1)
+c.....generate collocation points and weights 
+
+      zgm1(1)=-1.d0
+      zgm1(2)=-0.6546536707079771d0
+      zgm1(3)=0.d0
+      zgm1(4)= 0.6546536707079771d0
+      zgm1(5)=1.d0
+      wxm1(1)=0.1d0
+      wxm1(2)=49.d0/90.d0
+      wxm1(3)=32.d0/45.d0
+      wxm1(4)=wxm1(2)
+      wxm1(5)=0.1d0 
+
+      do k=1,lx1
+        do j=1,lx1
+          do i=1,lx1
+            w3m1(i,j,k)=wxm1(i)*wxm1(j)*wxm1(k)
+          end do
+        end do
+      end do
+
+c.....generate derivative matrices
+
+      dxm1(1,1)=-5.0d0
+      dxm1(2,1)=-1.240990253030982d0
+      dxm1(3,1)= 0.375d0
+      dxm1(4,1)=-0.2590097469690172d0
+      dxm1(5,1)= 0.5d0
+      dxm1(1,2)= 6.756502488724238d0
+      dxm1(2,2)= 0.d0
+      dxm1(3,2)=-1.336584577695453d0
+      dxm1(4,2)= 0.7637626158259734d0
+      dxm1(5,2)=-1.410164177942427d0
+      dxm1(1,3)=-2.666666666666667d0
+      dxm1(2,3)= 1.745743121887939d0
+      dxm1(3,3)= 0.d0
+      dxm1(4,3)=-dxm1(2,3)
+      dxm1(5,3)=-dxm1(1,3)
+      do j=4,lx1
+        do i=1,lx1
+          dxm1(i,j)=-dxm1(lx1+1-i,lx1+1-j)
+        end do
+      end do
+      do j=1,lx1
+        do i=1,lx1
+          dxtm1(i,j)=dxm1(j,i)
+        end do
+      end do
+
+c.....generate projection (mapping) matrices
+
+      qbnew(1,1,1)=-0.1772843218615690d0
+      qbnew(2,1,1)=9.375d-02
+      qbnew(3,1,1)=-3.700139242414530d-02
+      qbnew(1,2,1)= 0.7152146412463197d0
+      qbnew(2,2,1)=-0.2285757930375471d0
+      qbnew(3,2,1)= 8.333333333333333d-02
+      qbnew(1,3,1)= 0.4398680650316104d0
+      qbnew(2,3,1)= 0.2083333333333333d0
+      qbnew(3,3,1)=-5.891568407922938d-02
+      qbnew(1,4,1)= 8.333333333333333d-02
+      qbnew(2,4,1)= 0.3561799597042137d0
+      qbnew(3,4,1)=-4.854797457965334d-02
+      qbnew(1,5,1)= 0.d0
+      qbnew(2,5,1)=7.03125d-02
+      qbnew(3,5,1)=0.d0
+      
+      do j=1,lx1
+        do i=1,3
+          qbnew(i,j,2)=qbnew(4-i,lx1+1-j,1)
+        end do
+      end do 
+
+c.....generate interpolation matrices for mesh refinement
+
+      ixtmc1(1,1)=1.d0
+      ixtmc1(2,1)=0.d0
+      ixtmc1(3,1)=0.d0
+      ixtmc1(4,1)=0.d0
+      ixtmc1(5,1)=0.d0 
+      ixtmc1(1,2)= 0.3385078435248143d0
+      ixtmc1(2,2)= 0.7898516348912331d0
+      ixtmc1(3,2)=-0.1884018684471238d0
+      ixtmc1(4,2)= 9.202967302175333d-02
+      ixtmc1(5,2)=-3.198728299067715d-02
+      ixtmc1(1,3)=-0.1171875d0
+      ixtmc1(2,3)= 0.8840317166357952d0
+      ixtmc1(3,3)= 0.3125d0    
+      ixtmc1(4,3)=-0.118406716635795d0 
+      ixtmc1(5,3)= 0.0390625d0   
+      ixtmc1(1,4)=-7.065070066767144d-02
+      ixtmc1(2,4)= 0.2829703269782467d0 
+      ixtmc1(3,4)= 0.902687582732838d0
+      ixtmc1(4,4)=-0.1648516348912333d0 
+      ixtmc1(5,4)= 4.984442584781999d-02
+      ixtmc1(1,5)=0.d0
+      ixtmc1(2,5)=0.d0
+      ixtmc1(3,5)=1.d0 
+      ixtmc1(4,5)=0.d0
+      ixtmc1(5,5)=0.d0  
+      do j=1,lx1
+        do i=1,lx1
+          ixmc1(i,j)=ixtmc1(j,i)
+        end do
+      end do
+
+      do j=1,lx1
+        do i=1,lx1
+          ixtmc2(i,j)=ixtmc1(lx1+1-i,lx1+1-j)
+        end do
+      end do
+
+      do j=1,lx1
+        do i=1,lx1
+          ixmc2(i,j)=ixtmc2(j,i)
+        end do
+      end do
+
+c.....solution interpolation matrix for mesh coarsening
+
+      map2(1)=-0.1179652785083428d0
+      map2(2)= 0.5505046330389332d0
+      map2(3)= 0.7024534364259963d0
+      map2(4)=-0.1972224518285866d0
+      map2(5)= 6.222966087199998d-02
+
+      do i=1,lx1
+        map4(i)=map2(lx1+1-i)
+      end do
+
+      return
+      end
+
+c-------------------------------------------------------------------
+      subroutine geom1
+c-------------------------------------------------------------------
+c
+c     routine to generate elemental geometry information on mesh m1,
+c     (gauss-legendre lobatto mesh).
+c
+c         xrm1_s   -   dx/dr, dy/dr, dz/dr
+c         rxm1_s   -   dr/dx, dr/dy, dr/dz
+c         g1m1_s  geometric factors used in preconditioner computation
+c         g4m1_s  g5m1_s  g6m1_s :
+c         geometric factors used in laplacian operator
+c         jacm1    -   jacobian
+c         bm1      -   mass matrix
+c         xfrac    -   will be used in prepwork for calculating collocation
+c                      coordinates
+c         idel     -   collocation points index on element boundaries 
+c------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision temp,temp1,temp2,dtemp
+      integer isize,i,j,k,ntemp,iel
+ 
+      do i=1,lx1
+        xfrac(i)=zgm1(i)*0.5d0 + 0.5d0
+      end do
+
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(ISIZE,TEMP,TEMP1,TEMP2,
+c$OMP&  K,J,I,dtemp)
+      do isize=1,refine_max
+        temp=2.d0**(-isize-1)
+        dtemp=1.d0/temp
+        temp1=temp**3
+        temp2=temp**2
+        do k=1,lx1
+          do j=1,lx1
+            do i=1,lx1
+              xrm1_s(i,j,k,isize)=dtemp
+              jacm1_s(i,j,k,isize)=temp1
+              rxm1_s(i,j,k,isize)=temp2
+              g1m1_s(i,j,k,isize)=w3m1(i,j,k)*temp
+              bm1_s(i,j,k,isize)=w3m1(i,j,k)*temp1
+              g4m1_s(i,j,k,isize)=g1m1_s(i,j,k,isize)/wxm1(i)
+              g5m1_s(i,j,k,isize)=g1m1_s(i,j,k,isize)/wxm1(j)
+              g6m1_s(i,j,k,isize)=g1m1_s(i,j,k,isize)/wxm1(k)
+            end do
+          end do
+        end do
+      end do
+c$OMP END PARALLEL DO
+
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(ntemp,i,j,iel)
+      do iel = 1, lelt
+        ntemp=lx1*lx1*lx1*(iel-1)
+        do j = 1, lx1
+          do i = 1, lx1
+            idel(i,j,1,iel)=ntemp+(i-1)*lx1 + (j-1)*lx1*lx1+lx1
+            idel(i,j,2,iel)=ntemp+(i-1)*lx1 + (j-1)*lx1*lx1+1
+            idel(i,j,3,iel)=ntemp+(i-1)*1 + (j-1)*lx1*lx1+lx1*(lx1-1)+1
+            idel(i,j,4,iel)=ntemp+(i-1)*1 + (j-1)*lx1*lx1+1
+            idel(i,j,5,iel)=ntemp+(i-1)*1 + (j-1)*lx1+lx1*lx1*(lx1-1)+1
+            idel(i,j,6,iel)=ntemp+(i-1)*1 + (j-1)*lx1+1
+          end do
+        end do
+      end do
+c$OMP END PARALLEL DO
+
+      return
+      end
+
+c------------------------------------------------------------------
+      subroutine setdef
+c------------------------------------------------------------------
+c     compute the discrete laplacian operators
+c------------------------------------------------------------------
+
+      include 'header.h'
+
+      integer i,j,ip
+ 
+      call r_init(wdtdr(1,1),lx1*lx1,0.d0)
+
+      do i=1,lx1
+        do j=1,lx1
+          do ip=1,lx1
+            wdtdr(i,j) = wdtdr(i,j) + wxm1(ip)*dxm1(ip,i)*dxm1(ip,j)
+          end do
+        end do
+      end do
+
+      return 
+      end
+
+
+c------------------------------------------------------------------
+      subroutine prepwork
+c------------------------------------------------------------------
+c     mesh information preparations: calculate refinement levels of
+c     each element, mask matrix for domain boundary and element 
+c     boundaries
+c------------------------------------------------------------------
+
+      include 'header.h'
+
+      integer i, j, iel, iface, cb
+      double precision rdlog2
+
+      ntot = nelt*nxyz
+      rdlog2 = 1.d0/dlog(2.d0)
+
+c$OMP PARALLEL DEFAULT(SHARED) PRIVATE(I,J,IEL,IFACE,CB)
+
+c.....calculate the refinement levels of each element
+
+c$OMP DO 
+      do iel = 1, nelt
+        size_e(iel)=-dlog(xc(2,iel)-xc(1,iel))*rdlog2+1.d-8
+      end do
+c$OMP END DO nowait
+
+c.....mask matrix for element boundary
+
+c$OMP DO
+      do iel = 1, nelt
+        call r_init(tmult(1,1,1,iel),nxyz,1.d0)   
+        do iface=1,nsides
+          call facev(tmult(1,1,1,iel),iface,0.0d0)
+        end do
+      end do
+c$OMP END DO nowait
+
+c.....masks for domain boundary at mortar 
+
+c$OMP DO
+      do iel=1,nmor
+        tmmor(iel)=1.d0
+      end do
+c$OMP END DO
+
+c$OMP DO
+      do iel = 1, nelt
+        do iface = 1,nsides
+          cb=cbc(iface,iel)
+          if(cb.eq.0) then
+            do j=2,lx1-1
+              do i=2,lx1-1
+               tmmor(idmo(i,j,1,1,iface,iel))=0.d0
+              end do
+            end do
+
+            j=1
+            do i = 1, lx1-1
+               tmmor(idmo(i,j,1,1,iface,iel))=0.d0
+            end do
+
+            if(idmo(lx1,1,1,1,iface,iel).eq.0)then
+              tmmor(idmo(lx1,1,1,2,iface,iel))=0.d0
+            else
+              tmmor(idmo(lx1,1,1,1,iface,iel))=0.d0
+              do i=1,lx1
+                tmmor(idmo(i,j,1,2,iface,iel))=0.d0
+              end do
+            end if
+
+            i=lx1
+            if(idmo(lx1,2,1,2,iface,iel).eq.0)then
+              do j=2,lx1-1
+                tmmor(idmo(i,j,1,1,iface,iel))=0.d0
+              end do
+              tmmor(idmo(lx1,lx1,2,2,iface,iel))=0.d0
+            else
+              do j=2,lx1
+                tmmor(idmo(i,j,1,2,iface,iel))=0.d0
+              end do
+              do j=1,lx1
+                tmmor(idmo(i,j,2,2,iface,iel))=0.d0
+              end do
+            end if
+            
+            j=lx1
+            tmmor(idmo(1,lx1,2,1,iface,iel))=0.d0
+            if(idmo(2,lx1,2,1,iface,iel).eq.0)then
+              do i=2,lx1-1
+                tmmor(idmo(i,j,1,1,iface,iel))=0.d0
+              end do
+            else
+              do i=2,lx1
+                tmmor(idmo(i,j,2,1,iface,iel))=0.d0
+              end do
+              do i=1,lx1-1
+                tmmor(idmo(i,j,2,2,iface,iel))=0.d0
+              end do
+            end if
+
+            i=1
+            do j=2,lx1-1
+             tmmor(idmo(i,j,1,1,iface,iel))=0.d0
+            end do
+            if(idmo(1,lx1,1,1,iface,iel).ne.0)then
+              tmmor(idmo(i,lx1,1,1,iface,iel))=0.d0
+              do j=1,lx1-1
+               tmmor(idmo(i,j,2,1,iface,iel))=0.d0
+              end do
+            end if
+
+          endif
+        end do
+      end do
+c$OMP END DO nowait
+            
+c$OMP END PARALLEL
+      return
+      end 
+    
+
+c------------------------------------------------------------------
+      block data top_constants
+
+c------------------------------------------------------------------
+c.....We store some tables of useful topological constants
+c------------------------------------------------------------------
+      include 'header.h'
+
+c     f_e_ef(e,f) returns the other face sharing the e'th local edge of face f.
+      data f_e_ef/6,3,5,4, 6,3,5,4, 6,1,5,2, 6,1,5,2, 4,1,3,2, 4,1,3,2/
+
+c.....e_c(n,j) returns n'th edge sharing the vertex j of an element
+      data e_c /5,8,11, 1,4,11,  5,6,9, 1,2,9,
+     &          7,8,12, 3,4,12, 6,7,10, 2,3,10/
+
+c.....local_corner(n,i) returns the local corner index of vertex n on face i
+      data local_corner /0,1,0,2,0,3,0,4, 1,0,2,0,3,0,4,0,
+     &                   0,0,1,2,0,0,3,4, 1,2,0,0,3,4,0,0,
+     &                   0,0,0,0,1,2,3,4, 1,2,3,4,0,0,0,0/
+
+c.....cal_nnb(n,i) returns the neighbor elements neighbored by n'th edge
+c     among the three edges sharing vertex i
+c     the elements are the eight children elements ordered as 1 to 8.
+      data cal_nnb/5,2,3, 6,1,4, 7,4,1, 8,3,2,
+     &             1,6,7, 2,5,8, 3,8,5, 4,7,6/
+
+c.....oplc(n) returns the opposite local corner index: 1-4,2-3
+      data oplc /4,3,2,1/
+
+c.....cal_iijj(i,n) returns the location of local corner number n on a face 
+c     i =1  to get ii, i=2 to get jj
+c     (ii,jj) is defined the same as in mortar location (ii,jj)
+      data cal_iijj /1,1, 1,2, 2,1, 2,2/
+
+c.....returns the adjacent (neighbored by a face) element's children,
+c     assuming a vertex is shared by eight child elements 1-8.
+c     index n is local corner number on the face which is being 
+c     assigned the mortar index number
+      data cal_intempx /8,6,4,2, 7,5,3,1, 8,7,4,3, 
+     $                  6,5,2,1, 8,7,6,5, 4,3,2,1/
+
+c.....c_f(i,f) returns the vertex number of i'th local corner on face f
+      data c_f /2,4,6,8, 1,3,5,7, 3,4,7,8, 1,2,5,6, 5,6,7,8, 1,2,3,4/
+
+c.....on each face of the parent element, there are four children elements.
+c     le_arr(i,j,n) returns the i'th element among the four children elements
+c     n refers to the direction: 1 for x, 2 for y and 3 for z direction.
+c     j refers to positive (0) or negative (1) direction along x, y or z.
+c     n=1,j=0 refers to face 1 and n=1, j=1 refers to face 2, n=2,j=0 refers to
+c     face 3.... 
+c     The current eight children are ordered as 8,1,2,3,4,5,6,7 
+      data    le_arr/8,2,4,6, 1,3,5,7, 
+     $               8,1,4,5, 2,3,6,7, 
+     $               8,1,2,3, 4,5,6,7/
+
+c.....jjface(n) returns the face opposite to face n
+      data jjface /2,1,4,3,6,5/
+
+cc.....edgeface(n,f) returns OTHER face which shares local edge n on face f
+c      integer edgeface(4,6)
+c      data edgeface /6,3,5,4, 6,3,5,4, 6,1,5,2, 
+c     $               6,1,5,2, 4,1,3,2, 4,1,3,2/
+
+c.....e_face2(n,f) returns the local edge number of edge n on the
+c     other face sharing local edge n on face f
+      data e_face2 /2,2,2,2, 4,4,4,4, 3,2,3,2, 
+     $              1,4,1,4, 3,3,3,3, 1,1,1,1/
+
+c.....op(n) returns the local edge number of the edge which 
+c     is opposite to local edge n on the same face
+      data op /3,4,1,2/
+
+c.....localedgenumber(f,e) returns the local edge number for edge e
+c     on face f. A zero result value signifies illegal input
+      data localedgenumber /1,0,0,0,0,2, 2,0,2,0,0,0, 3,0,0,0,2,0, 
+     $                      4,0,0,2,0,0, 0,1,0,0,0,4, 0,2,4,0,0,0, 
+     $                      0,3,0,0,4,0, 0,4,0,4,0,0, 0,0,1,0,0,3, 
+     $                      0,0,3,0,3,0, 0,0,0,1,0,1, 0,0,0,3,1,0/
+
+c.....edgenumber(e,f) returns the edge index of local edge e on face f
+      data edgenumber / 1,2, 3,4,  5,6, 7,8,  9,2,10,6, 
+     $                 11,4,12,8, 12,3,10,7, 11,1, 9,5/
+
+c.....f_c(c,n) returns the face index of c'th face sharing vertex n
+      data f_c /2,4,6, 1,4,6, 2,3,6, 1,3,6,
+     &          2,4,5, 1,4,5, 2,3,5, 1,3,5/
+
+c.....if two elements are neighbored by one edge,
+c     e1v1(f1,f2) returns the smaller index of the two vertices on this
+c     edge on one element
+c     e1v2 returns the larger index of the two vertices on this edge on
+c     one element
+c     e2v1 returns the smaller index of the two vertices on this edge on
+c     another element
+c     e2v2 returns the larger index of the two vertices on this edge on
+c     another element
+      data e1v1/0,0,4,2,6,2, 0,0,3,1,5,1, 4,3,0,0,7,3,
+     &          2,1,0,0,5,1, 6,5,7,5,0,0, 2,1,3,1,0,0/
+      data e2v1/0,0,1,3,1,5, 0,0,2,4,2,6, 1,2,0,0,1,5,
+     &          3,4,0,0,3,7, 1,2,1,3,0,0, 5,6,5,7,0,0/
+      data e1v2/0,0,8,6,8,4, 0,0,7,5,7,3, 8,7,0,0,8,4,
+     &          6,5,0,0,6,2, 8,7,8,6,0,0, 4,3,4,2,0,0/
+      data e2v2/0,0,5,7,3,7, 0,0,6,8,4,8, 5,6,0,0,2,6,
+     &          7,8,0,0,4,8, 3,4,2,4,0,0, 7,8,6,8,0,0/
+
+c.....children(n1,n) returns the four elements among the eight children
+c     elements to be merged on face n of the parent element
+c     the IDs for the eight children are 1,2,3,4,5,6,7,8
+      data children/2,4,6,8, 1,3,5,7, 3,4,7,8, 
+     &              1,2,5,6, 5,6,7,8, 1,2,3,4/
+
+c.....iijj(n1,n) returns the location of n'th mortar on an element face
+c     n1=1 refers to x direction location and n1=2 refers to y direction
+      data iijj/1,1,1,2,2,1,2,2/
+
+c.....v_end(n) returns the index of collocation points at two ends of each
+c     direction
+      data v_end /1,lx1/
+
+c.....face_l1,face_l2,face_ld return the start,end,stride for a loop over faces
+c     used in subroutine mortar_vertex
+      data face_l1 /2,3,1/, face_l2 /3,1,2/, face_ld /1,-2,1/
+
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/transfer.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/transfer.f
new file mode 100644
index 0000000..d5cc875
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/transfer.f
@@ -0,0 +1,1102 @@
+c------------------------------------------------------------------
+      subroutine init_locks
+c------------------------------------------------------------------
+c     Initialize locks to be used for atomic updates
+c------------------------------------------------------------------
+
+      include 'header.h'
+
+      integer i
+
+c.....initialize locks in parallel
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i)
+c$    do i=1,lmor
+c$      call omp_init_lock(tlock(i))
+c$    end do
+
+      return
+      end
+
+
+c------------------------------------------------------------------
+      subroutine transf(tmor,tx)
+c------------------------------------------------------------------
+c     Map values from mortar(tmor) to element(tx)
+c------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision tmor(*),tx(*), tmp(lx1,lx1,2)
+      integer ig1,ig2,ig3,ig4,ie,iface,il1,il2,il3,il4,
+     &        nnje,ije1,ije2,col,i,j,ig,il
+
+
+c.....zero out tx on element boundaries
+      call col2(tx,tmult,ntot)     
+
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(il,j,ig,i,col,ije2,ije1,ig4,
+c$OMP& ig3,ig2,ig1,nnje,il4,il3,il2,il1,iface,ie,tmp)
+      do ie=1,nelt
+        do iface=1,nsides
+
+c.........get the collocation point index of the four local corners on the
+c         face iface of element ie
+          il1=idel(1,1,iface,ie)
+          il2=idel(lx1,1,iface,ie)
+          il3=idel(1,lx1,iface,ie)
+          il4=idel(lx1,lx1,iface,ie)
+
+c.........get the mortar indices of the four local corners
+          ig1= idmo(1,  1  ,1,1,iface,ie)
+          ig2= idmo(lx1,1  ,1,2,iface,ie)
+          ig3= idmo(1,  lx1,2,1,iface,ie)
+          ig4= idmo(lx1,lx1,2,2,iface,ie)
+  
+c.........copy the value from tmor to tx for these four local corners
+          tx(il1) = tmor(ig1)
+          tx(il2) = tmor(ig2)
+          tx(il3) = tmor(ig3)
+          tx(il4) = tmor(ig4)
+ 
+c.........nnje=1 for conforming faces, nnje=2 for nonconforming faces
+          if(cbc(iface,ie).eq.3) then
+            nnje=2
+          else
+            nnje=1 
+          end if
+
+c.........for nonconforming faces
+          if(nnje.eq.2) then
+
+c...........nonconforming faces have four pieces of mortar, first map them to
+c           two intermediate mortars, stored in tmp
+            call r_init(tmp,lx1*lx1*2,0.d0)
+   
+            do ije1=1,nnje
+              do ije2=1,nnje
+                do col=1,lx1
+
+c.................in each row col, when column i=1 or lx1, the value
+c                 in tmor is copied to tmp
+                  i = v_end(ije2)
+                  ig=idmo(i,col,ije1,ije2,iface,ie)
+                  tmp(i,col,ije1)=tmor(ig)
+
+c.................in each row col, value in the interior three collocation
+c                 points is computed by applying mapping matrix qbnew to tmor
+                  do i=2,lx1-1
+                    il= idel(i,col,iface,ie)
+                    do j=1,lx1
+                      ig=idmo(j,col,ije1,ije2,iface,ie)
+                      tmp(i,col,ije1) = tmp(i,col,ije1) + 
+     &                qbnew(i-1,j,ije2)*tmor(ig)
+                    end do
+                  end do
+
+                end do
+              end do
+            end do
+      
+c...........mapping from two pieces of intermediate mortar tmp to element 
+c           face tx
+
+            do ije1=1, nnje
+
+c.............the first column, col=1, is an edge of face iface.
+c             the value on the three interior collocation points, tx, is 
+c             computed by applying mapping matrices qbnew to tmp.
+c             the mapping result is divided by 2, because there will be 
+c             duplicated contribution from another face sharing this edge.
+              col=1
+              do i=2,lx1-1
+                il= idel(col,i,iface,ie)
+                do j=1,lx1
+                    tx(il) = tx(il) + qbnew(i-1,j,ije1)*
+     &                       tmp(col,j,ije1)*0.5d0
+                end do 
+              end do 
+
+c.............for column 2 ~ lx1-1
+              do col=2,lx1-1
+
+c...............when i=1 or lx1, the collocation points are also on an edge of
+c               the face, so the mapping result also needs to be divided by 2
+                i = v_end(ije1)
+                il= idel(col,i,iface,ie)
+                tx(il)=tx(il)+tmp(col,i,ije1)*0.5d0
+
+c...............compute the value at interior collocation points in 
+c               columns 2 ~ lx1-1
+                do i=2,lx1-1
+                  il= idel(col,i,iface,ie)
+                  do j=1,lx1
+                    tx(il) = tx(il) + qbnew(i-1,j,ije1)* tmp(col,j,ije1)
+                  end do 
+                end do
+              end do
+
+c.............same as col=1
+              col=lx1
+              do  i=2,lx1-1
+                il= idel(col,i,iface,ie)
+                do j=1,lx1
+                  tx(il) = tx(il) + qbnew(i-1,j,ije1)*
+     &                     tmp(col,j,ije1)*0.5d0
+                end do 
+              end do
+            end do
+
+c.........for conforming faces
+          else
+
+c.........face interior
+            do col=2,lx1-1
+              do i=2,lx1-1  
+                il= idel(i,col,iface,ie)
+                ig= idmo(i,col,1,1,iface,ie)
+                tx(il)=tmor(ig)
+              end do
+            end do
+
+        
+c...........edges of conforming faces
+
+c...........if local edge 1 is a nonconforming edge
+            if(idmo(lx1,1,1,1,iface,ie).ne.0)then
+              do i=2,lx1-1               
+                il= idel(i,1,iface,ie)
+                do ije1=1,2
+                  do j=1,lx1
+                    ig=idmo(j,1,1,ije1,iface,ie)
+                    tx(il) = tx(il) + qbnew(i-1,j,ije1)*tmor(ig)*0.5d0
+                  end do
+                end do
+              end do
+
+c...........if local edge 1 is a conforming edge
+            else
+              do i=2,lx1-1
+                il= idel(i,1,iface,ie)
+                ig= idmo(i,1,1,1,iface,ie)
+                tx(il)=tmor(ig)
+              end do
+            end if 
+
+c...........if local edge 2 is a nonconforming edge
+            if(idmo(lx1,2,1,2,iface,ie).ne.0)then
+              do i=2,lx1-1               
+                il= idel(lx1,i,iface,ie)
+                do ije1=1,2
+                  do j=1,lx1
+                    ig=idmo(lx1,j,ije1,2,iface,ie)
+                    tx(il) = tx(il) + qbnew(i-1,j,ije1)*tmor(ig)*0.5d0
+                  end do
+                end do
+              end do
+
+c...........if local edge 2 is a conforming edge
+            else
+              do i=2,lx1-1
+                il= idel(lx1,i,iface,ie)
+                ig= idmo(lx1,i,1,1,iface,ie)
+                tx(il)=tmor(ig)
+              end do
+            end if 
+
+c...........if local edge 3 is a nonconforming edge
+            if(idmo(2,lx1,2,1,iface,ie).ne.0)then
+              do  i=2,lx1-1               
+                il= idel(i,lx1,iface,ie)
+                do ije1=1,2
+                  do j=1,lx1
+                    ig=idmo(j,lx1,2,ije1,iface,ie)
+                    tx(il) = tx(il) + qbnew(i-1,j,ije1)*tmor(ig)*0.5d0
+                  end do
+                end do
+              end do
+
+c...........if local edge 3 is a conforming edge
+            else
+              do i=2,lx1-1
+                il= idel(i,lx1,iface,ie)
+                ig= idmo(i,lx1,1,1,iface,ie)
+                tx(il)=tmor(ig)
+              end do
+            end if 
+
+c...........if local edge 4 is a nonconforming edge
+            if(idmo(1,lx1,1,1,iface,ie).ne.0)then
+              do i=2,lx1-1               
+                il= idel(1,i,iface,ie)
+                do ije1=1,2
+                  do j=1,lx1
+                    ig=idmo(1,j,ije1,1,iface,ie)
+                    tx(il) = tx(il) + qbnew(i-1,j,ije1)*tmor(ig)*0.5d0
+                  end do
+                end do
+              end do
+c...........if local edge 4 is a conforming edge
+            else
+              do i=2,lx1-1
+                il= idel(1,i,iface,ie)
+                ig= idmo(1,i,1,1,iface,ie)
+                tx(il)=tmor(ig)
+              end do
+            end if 
+          end if
+          
+        end do
+      end do
+c$OMP END PARALLEL DO
+
+      return
+      end
+
+
+c------------------------------------------------------------------
+      subroutine transfb(tmor,tx)
+c------------------------------------------------------------------
+c     Map from element(tx) to mortar(tmor).
+c     tmor sums contributions from all elements.
+c------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision third
+      parameter (third=1.d0/3.d0)
+      integer shift
+
+      double precision tmp,tmp1,tx(*),tmor(*),temp(lx1,lx1,2),
+     &                 top(lx1,2)
+      integer il1,il2,il3,il4,ig1,ig2,ig3,ig4,ie,iface,nnje,
+     &        ije1,ije2,col,i,j,ije,ig,il
+
+c$OMP PARALLEL DEFAULT(SHARED) PRIVATE(il,j,ig,i,col,ije2,ije1,ig4,
+c$OMP& ig3,ig2,ig1,nnje,il4,il3,il2,il1,iface,ie,ije,
+c$OMP& tmp,shift,temp,top,tmp1)
+
+c$OMP DO
+      do ie=1,nmor
+        tmor(ie)=0.d0
+      end do
+c$OMP END DO
+
+c$OMP DO
+      do ie=1,nelt
+        do iface=1,nsides
+c.........nnje=1 for conforming faces, nnje=2 for nonconforming faces
+          if(cbc(iface,ie).eq.3) then
+            nnje=2
+          else
+            nnje=1 
+          end if
+
+c.........get collocation point index of four local corners on the face
+          il1 = idel(1,  1,  iface,ie)
+          il2 = idel(lx1,1,  iface,ie)
+          il3 = idel(1,  lx1,iface,ie)
+          il4 = idel(lx1,lx1,iface,ie)
+
+c.........get the mortar indices of the four local corners
+          ig1 = idmo(1,  1,  1,1,iface,ie)
+          ig2 = idmo(lx1,1,  1,2,iface,ie)
+          ig3 = idmo(1,  lx1,2,1,iface,ie )
+          ig4 = idmo(lx1,lx1,2,2,iface,ie)
+
+c.........sum the values from tx to tmor for these four local corners
+c         only 1/3 of the value is summed, since there will be two duplicated
+c         contributions from the other two faces sharing this vertex 
+c
+c$        call omp_set_lock(tlock(ig1))
+          tmor(ig1) = tmor(ig1)+tx(il1)*third
+c$        call omp_unset_lock(tlock(ig1))
+c
+c$        call omp_set_lock(tlock(ig2))
+          tmor(ig2) = tmor(ig2)+tx(il2)*third
+c$        call omp_unset_lock(tlock(ig2))
+c
+c$        call omp_set_lock(tlock(ig3))
+          tmor(ig3) = tmor(ig3)+tx(il3)*third
+c$        call omp_unset_lock(tlock(ig3))
+c
+c$        call omp_set_lock(tlock(ig4))
+          tmor(ig4) = tmor(ig4)+tx(il4)*third
+c$        call omp_unset_lock(tlock(ig4))
+
+c.........for nonconforming faces
+          if(nnje.eq.2) then       
+            call r_init(temp,lx1*lx1*2,0.d0)
+
+c...........nonconforming faces have four pieces of mortar, first map tx to
+c           two intermediate mortars stored in temp
+
+            do ije2 = 1, nnje
+              shift = ije2-1
+              do col=1,lx1
+c...............For mortar points on face edge (top and bottom), copy the 
+c               value from tx to temp
+                il=idel(col,v_end(ije2),iface,ie)
+                temp(col,v_end(ije2),ije2)=tx(il)
+
+c...............For mortar points on face edge (top and bottom), calculate 
+c               the interior points' contribution to them, i.e. top()
+                j = v_end(ije2)
+                tmp=0.d0
+                do i=2,lx1-1 
+                  il=idel(col,i,iface,ie)
+                  tmp = tmp + qbnew(i-1,j,ije2)*tx(il)
+                end do
+
+                top(col,ije2)=tmp
+
+c...............Use mapping matrices qbnew to map the value from tx to temp 
+c               for mortar points not on the top or bottom face edge.
+                do j=2-shift,lx1-shift
+                  tmp=0.d0
+                  do i=2,lx1-1 
+                    il=idel(col,i,iface,ie)
+                    tmp = tmp + qbnew(i-1,j,ije2)*tx(il)
+                  end do
+                  temp(col,j,ije2) = tmp + temp(col,j,ije2)
+                end do
+              end do
+            end do
+
+c...........mapping from temp to tmor
+
+            do ije1=1, nnje
+              shift = ije1-1
+              do ije2=1,nnje
+
+c...............for each column of collocation points on a piece of mortar
+                do col=2-shift,lx1-shift
+
+c.................For the end point, which is on an edge (local edge 2,4), 
+c                 the contribution is halved since there will be duplicated 
+c                 contribution from another face sharing this edge.
+
+                  ig=idmo(v_end(ije2),col,ije1,ije2,iface,ie)
+c
+c$                call omp_set_lock(tlock(ig))
+                  tmor(ig)=tmor(ig)+temp(v_end(ije2),col,ije1)*0.5d0
+c$                call omp_unset_lock(tlock(ig))
+
+c.................In each row of collocation points on a piece of mortar, 
+c                 sum the contributions from interior collocation points 
+c                 (i=2,lx1-1)
+
+                  do  j=1,lx1
+                    tmp=0.d0
+                    do i=2,lx1-1
+                      tmp = tmp + qbnew(i-1,j,ije2) * temp(i,col,ije1)
+                    end do
+                    ig=idmo(j,col,ije1,ije2,iface,ie)
+c
+c$                  call omp_set_lock(tlock(ig))
+                    tmor(ig)=tmor(ig)+tmp
+c$                  call omp_unset_lock(tlock(ig))
+                  end do
+                end do
+
+c...............For tmor on local edge 1 and 3, tmp is the contribution from
+c               an edge, so it is halved because of duplicated contribution
+c               from another face sharing this edge. tmp1 is contribution 
+c               from face interior. 
+
+                col = v_end(ije1)
+                ig=idmo(v_end(ije2),col,ije1,ije2,iface,ie)
+c
+c$              call omp_set_lock(tlock(ig))
+                tmor(ig)=tmor(ig)+top(v_end(ije2),ije1)*0.5d0
+c$              call omp_unset_lock(tlock(ig))
+                do  j=1,lx1
+                  tmp=0.d0
+                  tmp1=0.d0
+                  do i=2,lx1-1
+                    tmp  = tmp  + qbnew(i-1,j,ije2) * temp(i,col,ije1)
+                    tmp1 = tmp1 + qbnew(i-1,j,ije2) * top(i,ije1)
+                  end do
+                  ig=idmo(j,col,ije1,ije2,iface,ie)
+c
+c$                call omp_set_lock(tlock(ig))
+                  tmor(ig)=tmor(ig)+tmp*0.5d0+tmp1 
+c$                call omp_unset_lock(tlock(ig))
+                end do
+              end do
+            end do
+
+c.........for conforming faces
+          else
+
+c.........face interior
+            do col=2,lx1-1
+              do j=2,lx1-1
+                il=idel(j,col,iface,ie)
+                ig=idmo(j,col,1,1,iface,ie)
+c
+c$              call omp_set_lock(tlock(ig))
+                tmor(ig)=tmor(ig)+tx(il)
+c$              call omp_unset_lock(tlock(ig))
+              end do
+            end do
+
+c...........edges of conforming faces
+
+c...........if local edge 1 is a nonconforming edge
+            if(idmo(lx1,1,1,1,iface,ie).ne.0)then
+              do ije=1,2
+                do j=1,lx1
+                  tmp=0.d0
+                  do i=2,lx1-1
+                    il=idel(i,1,iface,ie)
+                    tmp= tmp + qbnew(i-1,j,ije)*tx(il)
+                  end do
+                  ig=idmo(j,1,1,ije,iface,ie)
+c
+c$                call omp_set_lock(tlock(ig))
+                  tmor(ig)=tmor(ig)+tmp*0.5d0
+c$                call omp_unset_lock(tlock(ig))
+                end do
+              end do
+
+c...........if local edge 1 is a conforming edge
+            else
+              do j=2,lx1-1
+                il=idel(j,1,iface,ie)
+                ig=idmo(j,1,1,1,iface,ie)
+c
+c$              call omp_set_lock(tlock(ig))
+                tmor(ig)=tmor(ig)+tx(il)*0.5d0
+c$              call omp_unset_lock(tlock(ig))
+              end do
+            end if 
+
+c...........if local edge 2 is a nonconforming edge
+            if(idmo(lx1,2,1,2,iface,ie).ne.0)then
+              do ije=1,2
+                do j=1,lx1
+                  tmp=0.d0
+                  do i=2,lx1-1
+                    il=idel(lx1,i,iface,ie)
+                    tmp = tmp + qbnew(i-1,j,ije)*tx(il)
+                  end do
+                  ig=idmo(lx1,j,ije,2,iface,ie)
+c
+c$                call omp_set_lock(tlock(ig))
+                  tmor(ig)=tmor(ig)+tmp*0.5d0
+c$                call omp_unset_lock(tlock(ig))
+                end do
+              end do
+
+c...........if local edge 2 is a conforming edge
+            else
+              do j=2,lx1-1
+                il=idel(lx1,j,iface,ie)
+                ig=idmo(lx1,j,1,1,iface,ie)
+c
+c$              call omp_set_lock(tlock(ig))
+                tmor(ig)=tmor(ig)+tx(il)*0.5d0
+c$              call omp_unset_lock(tlock(ig))
+              end do
+            end if 
+
+c...........if local edge 3 is a nonconforming edge
+            if(idmo(2,lx1,2,1,iface,ie).ne.0)then
+              do ije=1,2
+                do j=1,lx1
+                  tmp=0.d0
+                  do i=2,lx1-1
+                    il=idel(i,lx1,iface,ie)
+                    tmp = tmp + qbnew(i-1,j,ije)*tx(il)
+                  end do
+                  ig=idmo(j,lx1,2,ije,iface,ie)
+c
+c$                call omp_set_lock(tlock(ig))
+                  tmor(ig)=tmor(ig)+tmp*0.5d0
+c$                call omp_unset_lock(tlock(ig))
+                end do
+              end do
+
+c...........if local edge 3 is a conforming edge
+            else
+              do j=2,lx1-1
+                il=idel(j,lx1,iface,ie)
+                ig=idmo(j,lx1,1,1,iface,ie)
+c
+c$              call omp_set_lock(tlock(ig))
+                tmor(ig)=tmor(ig)+tx(il)*0.5d0
+c$              call omp_unset_lock(tlock(ig))
+              end do
+            end if 
+
+c...........if local edge 4 is a nonconforming edge
+            if(idmo(1,lx1,1,1,iface,ie).ne.0)then
+              do ije=1,2
+                do j=1,lx1
+                  tmp=0.d0
+                  do i=2,lx1-1
+                    il=idel(1,i,iface,ie)
+                    tmp = tmp + qbnew(i-1,j,ije)*tx(il)
+                  end do
+                  ig=idmo(1,j,ije,1,iface,ie)
+c
+c$                call omp_set_lock(tlock(ig))
+                  tmor(ig)=tmor(ig)+tmp*0.5d0
+c$                call omp_unset_lock(tlock(ig))
+                end do
+              end do
+
+c...........if local edge 4 is a conforming edge
+            else
+              do j=2,lx1-1
+                il=idel(1,j,iface,ie)
+                ig=idmo(1,j,1,1,iface,ie)
+c
+c$              call omp_set_lock(tlock(ig))
+                tmor(ig)=tmor(ig)+tx(il)*0.5d0
+c$              call omp_unset_lock(tlock(ig))
+              end do
+            end if 
+          end if
+        end do
+      end do
+c$OMP END DO NOWAIT
+c$OMP END PARALLEL 
+
+      return
+      end
+
+
+c--------------------------------------------------------------
+      subroutine transfb_cor_e(n,tmor,tx)
+c--------------------------------------------------------------
+c     This subroutine performs the edge to mortar mapping and
+c     calculates the mapping result on the mortar point at a vertex
+c     under situation 1,2, or 3.
+c     n refers to the configuration of three edges sharing a vertex, 
+c     n = 1: only one edge is nonconforming
+c     n = 2: two edges are nonconforming 
+c     n = 3: three edges are nonconforming 
+c-------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision tmor,tx(lx1,lx1,lx1),tmp
+      integer i,n
+
+      tmor=tx(1,1,1)
+
+      do i=2,lx1-1
+        tmor= tmor + qbnew(i-1,1,1)*tx(i,1,1)
+      end do
+
+      if(n.gt.1)then
+        do i=2,lx1-1
+          tmor= tmor + qbnew(i-1,1,1)*tx(1,i,1)
+        end do
+      end if
+
+      if(n.eq.3)then
+        do i=2,lx1-1
+          tmor= tmor + qbnew(i-1,1,1)*tx(1,1,i)
+        end do
+      end if
+
+      return
+      end
+
+c--------------------------------------------------------------
+      subroutine transfb_cor_f(n,tmor,tx)
+c--------------------------------------------------------------
+c     This subroutine performs the mapping from face to mortar.
+c     Output tmor is the mapping result on a mortar vertex
+c     of situations of three edges and three faces sharing a vertex:
+c     n=4: only one face is nonconforming 
+c     n=5: one face and one edge are nonconforming
+c     n=6: two faces are nonconforming 
+c     n=7: three faces are nonconforming 
+c--------------------------------------------------------------
+      include 'header.h'
+
+      double precision tx(lx1,lx1,lx1),tmor,temp(lx1)
+      integer col,i,n
+
+      call r_init(temp,lx1,0.d0)
+
+      do col=1,lx1
+        temp(col)=tx(col,1,1)
+        do i=2,lx1-1
+          temp(col) = temp(col) + qbnew(i-1,1,1)*tx(col,i,1)
+        end do
+      end do
+      tmor=temp(1)
+
+      do i=2,lx1-1
+        tmor = tmor + qbnew(i-1,1,1) *temp(i)
+      end do
+
+      if(n.eq.5)then
+        do i=2,lx1-1
+          tmor = tmor + qbnew(i-1,1,1) *tx(1,1,i)
+        end do
+      end if
+ 
+      if(n.ge.6)then
+        call r_init(temp,lx1,0.d0)
+        do col=1,lx1
+          do i=2,lx1-1
+            temp(col) = temp(col) + qbnew(i-1,1,1)*tx(col,1,i)
+          end do
+        end do
+        tmor=tmor+temp(1)
+        do i=2,lx1-1
+          tmor = tmor +qbnew(i-1,1,1) *temp(i)
+        end do
+      end if
+        
+      if(n.eq.7)then
+        call r_init(temp,lx1,0.d0)
+        do col=2,lx1-1
+          do i=2,lx1-1
+            temp(col) = temp(col) + qbnew(i-1,1,1)*tx(1,col,i)
+          end do
+        end do
+        do i=2,lx1-1
+          tmor = tmor + qbnew(i-1,1,1) *temp(i)
+        end do
+      end if
+
+      return
+      end
+
+
+c-------------------------------------------------------------------------
+      subroutine transf_nc(tmor,tx)
+c------------------------------------------------------------------------
+c     Perform mortar to element mapping on a nonconforming face. 
+c     This subroutine is used when all entries in tmor are zero except
+c     one tmor(i,j)=1. So this routine is simplified. Only one piece of 
+c     mortar  (tmor only has two indices) and one piece of intermediate 
+c     mortar (tmp) are involved.
+c------------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision tmor(lx1,lx1), tx(lx1,lx1), tmp(lx1,lx1)
+      integer col,i,j
+
+      call r_init(tmp,lx1*lx1,0.d0)
+      do col=1,lx1
+        i = 1
+        tmp(i,col)=tmor(i,col)                           
+        do i=2,lx1-1
+          do j=1,lx1
+            tmp(i,col) = tmp(i,col) + qbnew(i-1,j,1)*tmor(j,col)
+          end do
+        end do
+      end do
+
+      do col=1,lx1
+        i = 1
+        tx(col,i)   = tx(col,i)   + tmp(col,i)
+        do i=2,lx1-1
+          do j=1,lx1
+            tx(col,i) = tx(col,i) + qbnew(i-1,j,1)*tmp(col,j)
+          end do
+        end do
+      end do
+
+      return                                                  
+      end                                                     
+
+c------------------------------------------------------------------------
+      subroutine transfb_nc0(tmor,tx)
+c------------------------------------------------------------------------
+c     Performs mapping from element to mortar when the nonconforming 
+c     edges are shared by two conforming faces of an element.
+c------------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision tmor(lx1,lx1),tx(lx1,lx1,lx1)
+      integer i,j
+
+      call r_init(tmor,lx1*lx1,0.d0)
+      do j=1,lx1
+        do i=2,lx1-1
+          tmor(j,1)= tmor(j,1) + qbnew(i-1,j  ,1)*tx(i,1,1)
+        end do
+      end do
+
+      return
+      end 
+
+c------------------------------------------------------------------------
+      subroutine transfb_nc2(tmor,tx)
+c------------------------------------------------------------------------
+c     Maps values from element to mortar when the nonconforming edges are
+c     shared by two nonconforming faces of an element.
+c     Although each face shall have four pieces of mortar, only value in
+c     one piece (location (1,1)) is used in the calling routine so only
+c     the value in the first mortar is calculated in this subroutine.
+c------------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision tx(lx1,lx1),tmor(lx1,lx1),bottom(lx1),
+     &                 temp(lx1,lx1)
+      integer col,j,i
+
+      call r_init(tmor,lx1*lx1,0.d0)
+      call r_init(temp,lx1*lx1,0.d0)
+      tmor(1,1)=tx(1,1)
+
+c.....mapping from tx to intermediate mortar temp + bottom
+      do col=1,lx1
+        temp(col,1)=tx(col,1)
+        j=1
+        bottom(col)= 0.d0
+        do i=2,lx1-1 
+          bottom(col) = bottom(col) + qbnew(i-1,j,1)*tx(col,i)
+        end do
+
+        do j=2,lx1
+          do i=2,lx1-1 
+            temp(col,j) = temp(col,j) + qbnew(i-1,j,1)*tx(col,i)
+          end do
+        end do
+      end do
+
+c.....from intermediate mortar to mortar
+
+c.....On the nonconforming edge, temp is divided by 2 as there will be
+c     a duplicate contribution from another face sharing this edge
+      col=1
+      do j=1,lx1
+        do i=2,lx1-1
+          tmor(j,col)=tmor(j,col)+ qbnew(i-1,j,1) * bottom(i) +
+     &                             qbnew(i-1,j,1) * temp(i,col) * 0.5d0 
+        end do
+      end do
+
+      do col=2,lx1
+        tmor(1,col)=tmor(1,col)+temp(1,col)
+        do j=1,lx1
+          do i=2,lx1-1
+            tmor(j,col) = tmor(j,col) + qbnew(i-1,j,1) *temp(i,col)
+          end do
+        end do
+      end do
+
+      return
+      end 
+
+
+c------------------------------------------------------------------------
+      subroutine transfb_nc1(tmor,tx)
+c------------------------------------------------------------------------
+c     Maps values from element to mortar when the nonconforming edges are
+c     shared by a nonconforming face and a conforming face of an element
+c------------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision tx(lx1,lx1),tmor(lx1,lx1),bottom(lx1),
+     &                 temp(lx1,lx1)
+      integer col,j,i
+
+      call r_init(tmor,lx1*lx1,0.d0)
+      call r_init(temp,lx1*lx1,0.d0)
+
+      tmor(1,1)=tx(1,1)
+c.....Contribution from the nonconforming faces
+c     Since the calling subroutine is only interested in the value on the
+c     mortar (location (1,1)), only this piece of mortar is calculated.
+
+      do col=1,lx1
+        temp(col,1)=tx(col,1)
+        j = 1
+        bottom(col)= 0.d0
+        do i=2,lx1-1 
+          bottom(col)=bottom(col) + qbnew(i-1,j,1)*tx(col,i)
+        end do
+
+        do j=2,lx1
+          do i=2,lx1-1 
+            temp(col,j) = temp(col,j) + qbnew(i-1,j,1)*tx(col,i)
+          end do
+
+        end do
+      end do
+
+      col=1
+      tmor(1,col)=tmor(1,col)+bottom(1)
+      do j=1,lx1
+        do i=2,lx1-1
+
+c.........temp is not divided by 2 here. It includes the contribution
+c         from the other conforming face.
+
+          tmor(j,col)=tmor(j,col) + qbnew(i-1,j,1) *bottom(i) +
+     &                              qbnew(i-1,j,1) *temp(i,col) 
+        end do
+      end do
+
+      do col=2,lx1
+        tmor(1,col)=tmor(1,col)+temp(1,col)
+        do j=1,lx1
+          do i=2,lx1-1
+            tmor(j,col) = tmor(j,col) + qbnew(i-1,j,1) *temp(i,col)
+          end do
+        end do
+      end do
+
+      return
+      end
+
+
+c-------------------------------------------------------------------
+      subroutine transfb_c(tx)
+c-------------------------------------------------------------------
+c     Prepare initial guess for cg. All values from conforming 
+c     boundary are copied and summed on tmor.
+c-------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision third
+      parameter (third = 1.d0/3.d0)
+      double precision tx(*)
+      integer il1,il2,il3,il4,ig1,ig2,ig3,ig4,ie,iface,col,j,ig,il
+
+c$OMP PARALLEL DEFAULT(SHARED) PRIVATE(IE,IFACE,IL1,IL2,
+c$OMP& IL3,IL4,IG1,IG2,IG3,IG4,COL,J,IG,IL) 
+
+c$OMP DO
+      do j=1,nmor
+        tmort(j)=0.d0
+      end do
+c$OMP END DO
+
+c$OMP DO
+      do ie=1,nelt
+        do iface=1,nsides
+          if(cbc(iface,ie).ne.3)then
+            il1 = idel(1,1,iface,ie)
+            il2 = idel(lx1,1,iface,ie)
+            il3 = idel(1,lx1,iface,ie)
+            il4 = idel(lx1,lx1,iface,ie)
+            ig1 = idmo(1,  1,  1,1,iface,ie)
+            ig2 = idmo(lx1,1,  1,2,iface,ie)
+            ig3 = idmo(1,  lx1,2,1,iface,ie)
+            ig4 = idmo(lx1,lx1,2,2,iface,ie)
+c
+c$          call omp_set_lock(tlock(ig1))
+            tmort(ig1) = tmort(ig1)+tx(il1)*third
+c$          call omp_unset_lock(tlock(ig1))
+c
+c$          call omp_set_lock(tlock(ig2))
+            tmort(ig2) = tmort(ig2)+tx(il2)*third
+c$          call omp_unset_lock(tlock(ig2))
+c
+c$          call omp_set_lock(tlock(ig3))
+            tmort(ig3) = tmort(ig3)+tx(il3)*third
+c$          call omp_unset_lock(tlock(ig3))
+c
+c$          call omp_set_lock(tlock(ig4))
+            tmort(ig4) = tmort(ig4)+tx(il4)*third
+c$          call omp_unset_lock(tlock(ig4))
+
+            do  col=2,lx1-1
+              do j=2,lx1-1
+                il=idel(j,col,iface,ie)
+                ig=idmo(j,col,1,1,iface,ie)
+c
+c$              call omp_set_lock(tlock(ig))
+                tmort(ig)=tmort(ig)+tx(il)
+c$              call omp_unset_lock(tlock(ig))
+              end do
+            end do
+
+            if(idmo(lx1,1,1,1,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(j,1,iface,ie)
+                ig=idmo(j,1,1,1,iface,ie)
+c
+c$              call omp_set_lock(tlock(ig))
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+c$              call omp_unset_lock(tlock(ig))
+              end do
+            end if
+
+            if(idmo(lx1,2,1,2,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(lx1,j,iface,ie)
+                ig=idmo(lx1,j,1,1,iface,ie)
+c
+c$              call omp_set_lock(tlock(ig))
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+c$              call omp_unset_lock(tlock(ig))
+              end do
+            end if
+
+            if(idmo(2,lx1,2,1,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(j,lx1,iface,ie)
+                ig=idmo(j,lx1,1,1,iface,ie)
+c
+c$              call omp_set_lock(tlock(ig))
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+c$              call omp_unset_lock(tlock(ig))
+              end do
+            end if
+
+            if(idmo(1,lx1,1,1,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(1,j,iface,ie)
+                ig=idmo(1,j,1,1,iface,ie)
+c
+c$              call omp_set_lock(tlock(ig))
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+c$              call omp_unset_lock(tlock(ig))
+              end do
+            end if
+          end if!
+        end do
+      end do
+c$OMP END DO NOWAIT
+c$OMP END PARALLEL
+      return
+      end
+
+c-------------------------------------------------------------------
+      subroutine transfb_c_2(tx)
+c-------------------------------------------------------------------
+c     Prepare initial guess for CG. All values from conforming 
+c     boundary are copied and summed in tmort. 
+c     mormult is multiplicity, which is used to average tmort.
+c-------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision third
+      parameter (third = 1.d0/3.d0)
+      double precision tx(*)
+      integer il1,il2,il3,il4,ig1,ig2,ig3,ig4,ie,iface,col,j,ig,il
+
+c$OMP PARALLEL DEFAULT(SHARED) PRIVATE(IE,IFACE,IL1,IL2,
+c$OMP& IL3,IL4,IG1,IG2,IG3,IG4,COL,J,IG,IL)
+
+c$OMP DO     
+      do j=1,nmor
+        tmort(j)=0.d0
+      end do
+c$OMP END DO nowait
+c$OMP DO
+      do j=1,nmor
+        mormult(j)=0.d0
+      end do
+c$OMP END DO
+
+c$OMP DO 
+      do ie=1,nelt
+        do iface=1,nsides
+          
+          if(cbc(iface,ie).ne.3)then
+            il1 = idel(1,  1,  iface,ie)
+            il2 = idel(lx1,1,  iface,ie)
+            il3 = idel(1,  lx1,iface,ie)
+            il4 = idel(lx1,lx1,iface,ie)
+            ig1 = idmo(1,  1,  1,1,iface,ie)
+            ig2 = idmo(lx1,1,  1,2,iface,ie)
+            ig3 = idmo(1,  lx1,2,1,iface,ie)
+            ig4 = idmo(lx1,lx1,2,2,iface,ie)
+c
+c$          call omp_set_lock(tlock(ig1))
+            tmort(ig1) = tmort(ig1)+tx(il1)*third
+            mormult(ig1) = mormult(ig1)+third
+c$          call omp_unset_lock(tlock(ig1))
+c
+c$          call omp_set_lock(tlock(ig2))
+            tmort(ig2) = tmort(ig2)+tx(il2)*third
+            mormult(ig2) = mormult(ig2)+third
+c$          call omp_unset_lock(tlock(ig2))
+c
+c$          call omp_set_lock(tlock(ig3))
+            tmort(ig3) = tmort(ig3)+tx(il3)*third
+            mormult(ig3) = mormult(ig3)+third
+c$          call omp_unset_lock(tlock(ig3))
+c
+c$          call omp_set_lock(tlock(ig4))
+            tmort(ig4) = tmort(ig4)+tx(il4)*third
+            mormult(ig4) = mormult(ig4)+third
+c$          call omp_unset_lock(tlock(ig4))
+
+            do  col=2,lx1-1
+              do j=2,lx1-1
+                il=idel(j,col,iface,ie)
+                ig=idmo(j,col,1,1,iface,ie)
+c
+c$              call omp_set_lock(tlock(ig))
+                tmort(ig)=tmort(ig)+tx(il)
+                mormult(ig)=mormult(ig)+1.d0
+c$              call omp_unset_lock(tlock(ig))
+              end do
+            end do
+
+            if(idmo(lx1,1,1,1,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(j,1,iface,ie)
+                ig=idmo(j,1,1,1,iface,ie)
+c
+c$              call omp_set_lock(tlock(ig))
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+                mormult(ig)=mormult(ig)+0.5d0
+c$              call omp_unset_lock(tlock(ig))
+              end do
+            end if
+
+            if(idmo(lx1,2,1,2,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(lx1,j,iface,ie)
+                ig=idmo(lx1,j,1,1,iface,ie)
+c
+c$              call omp_set_lock(tlock(ig))
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+                mormult(ig)=mormult(ig)+0.5d0
+c$              call omp_unset_lock(tlock(ig))
+              end do
+            end if
+
+            if(idmo(2,lx1,2,1,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(j,lx1,iface,ie)
+                ig=idmo(j,lx1,1,1,iface,ie)
+c
+c$              call omp_set_lock(tlock(ig))
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+                mormult(ig)=mormult(ig)+0.5d0
+c$              call omp_unset_lock(tlock(ig))
+               end do
+            end if
+
+            if(idmo(1,lx1,1,1,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(1,j,iface,ie)
+                ig=idmo(1,j,1,1,iface,ie)
+c
+c$              call omp_set_lock(tlock(ig))
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+                mormult(ig)=mormult(ig)+0.5d0
+c$              call omp_unset_lock(tlock(ig))
+              end do
+            end if
+          end if!nnje=1
+        end do
+      end do
+c$OMP END DO NOWAIT
+c$OMP END PARALLEL
+
+      return
+      end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/transfer_au.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/transfer_au.f
new file mode 100644
index 0000000..d3faf8f
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/transfer_au.f
@@ -0,0 +1,1044 @@
+c------------------------------------------------------------------
+      subroutine init_locks
+c------------------------------------------------------------------
+c     This version uses ATOMIC for atomic updates, 
+c     but locks are still used in get_emo (mason.f).
+c------------------------------------------------------------------
+
+      include 'header.h'
+
+      integer i
+
+c.....initialize locks in parallel
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i)
+c$    do i=1,8*lelt
+c$      call omp_init_lock(tlock(i))
+c$    end do
+
+      return
+      end
+
+
+c------------------------------------------------------------------
+      subroutine transf(tmor,tx)
+c------------------------------------------------------------------
+c     Map values from mortar(tmor) to element(tx)
+c------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision tmor(*),tx(*), tmp(lx1,lx1,2)
+      integer ig1,ig2,ig3,ig4,ie,iface,il1,il2,il3,il4,
+     &        nnje,ije1,ije2,col,i,j,ig,il
+
+
+c.....zero out tx on element boundaries
+      call col2(tx,tmult,ntot)     
+
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(il,j,ig,i,col,ije2,ije1,ig4,
+c$OMP& ig3,ig2,ig1,nnje,il4,il3,il2,il1,iface,ie,tmp)
+      do ie=1,nelt
+        do iface=1,nsides
+
+c.........get the collocation point index of the four local corners on the
+c         face iface of element ie
+          il1=idel(1,1,iface,ie)
+          il2=idel(lx1,1,iface,ie)
+          il3=idel(1,lx1,iface,ie)
+          il4=idel(lx1,lx1,iface,ie)
+
+c.........get the mortar indices of the four local corners
+          ig1= idmo(1,  1  ,1,1,iface,ie)
+          ig2= idmo(lx1,1  ,1,2,iface,ie)
+          ig3= idmo(1,  lx1,2,1,iface,ie)
+          ig4= idmo(lx1,lx1,2,2,iface,ie)
+  
+c.........copy the value from tmor to tx for these four local corners
+          tx(il1) = tmor(ig1)
+          tx(il2) = tmor(ig2)
+          tx(il3) = tmor(ig3)
+          tx(il4) = tmor(ig4)
+ 
+c.........nnje=1 for conforming faces, nnje=2 for nonconforming faces
+          if(cbc(iface,ie).eq.3) then
+            nnje=2
+          else
+            nnje=1 
+          end if
+
+c.........for nonconforming faces
+          if(nnje.eq.2) then
+
+c...........nonconforming faces have four pieces of mortar, first map them to
+c           two intermediate mortars, stored in tmp
+            call r_init(tmp,lx1*lx1*2,0.d0)
+   
+            do ije1=1,nnje
+              do ije2=1,nnje
+                do col=1,lx1
+
+c.................in each row col, when column i=1 or lx1, the value
+c                 in tmor is copied to tmp
+                  i = v_end(ije2)
+                  ig=idmo(i,col,ije1,ije2,iface,ie)
+                  tmp(i,col,ije1)=tmor(ig)
+
+c.................in each row col, the values at the three interior collocation
+c                 points are computed by applying mapping matrix qbnew to tmor
+                  do i=2,lx1-1
+                    il= idel(i,col,iface,ie)
+                    do j=1,lx1
+                      ig=idmo(j,col,ije1,ije2,iface,ie)
+                      tmp(i,col,ije1) = tmp(i,col,ije1) + 
+     &                qbnew(i-1,j,ije2)*tmor(ig)
+                    end do
+                  end do
+
+                end do
+              end do
+            end do
+      
+c...........mapping from two pieces of intermediate mortar tmp to element 
+c           face tx
+
+            do ije1=1, nnje
+
+c.............the first column, col=1, is an edge of face iface.
+c             the value on the three interior collocation points, tx, is 
+c             computed by applying mapping matrices qbnew to tmp.
+c             the mapping result is divided by 2, because there will be 
+c             duplicated contribution from another face sharing this edge.
+              col=1
+              do i=2,lx1-1
+                il= idel(col,i,iface,ie)
+                do j=1,lx1
+                    tx(il) = tx(il) + qbnew(i-1,j,ije1)*
+     &                       tmp(col,j,ije1)*0.5d0
+                end do 
+              end do 
+
+c.............for columns 2 ~ lx1-1
+              do col=2,lx1-1
+
+c...............when i=1 or lx1, the collocation points are also on an edge of
+c               the face, so the mapping result also needs to be divided by 2
+                i = v_end(ije1)
+                il= idel(col,i,iface,ie)
+                tx(il)=tx(il)+tmp(col,i,ije1)*0.5d0
+
+c...............compute the value at interior collocation points in 
+c               columns 2 ~ lx1-1
+                do i=2,lx1-1
+                  il= idel(col,i,iface,ie)
+                  do j=1,lx1
+                    tx(il) = tx(il) + qbnew(i-1,j,ije1)* tmp(col,j,ije1)
+                  end do 
+                end do
+              end do
+
+c.............same as col=1
+              col=lx1
+              do  i=2,lx1-1
+                il= idel(col,i,iface,ie)
+                do j=1,lx1
+                  tx(il) = tx(il) + qbnew(i-1,j,ije1)*
+     &                     tmp(col,j,ije1)*0.5d0
+                end do 
+              end do
+            end do
+
+c.........for conforming faces
+          else
+
+c.........face interior
+            do col=2,lx1-1
+              do i=2,lx1-1  
+                il= idel(i,col,iface,ie)
+                ig= idmo(i,col,1,1,iface,ie)
+                tx(il)=tmor(ig)
+              end do
+            end do
+
+        
+c...........edges of conforming faces
+
+c...........if local edge 1 is a nonconforming edge
+            if(idmo(lx1,1,1,1,iface,ie).ne.0)then
+              do i=2,lx1-1               
+                il= idel(i,1,iface,ie)
+                do ije1=1,2
+                  do j=1,lx1
+                    ig=idmo(j,1,1,ije1,iface,ie)
+                    tx(il) = tx(il) + qbnew(i-1,j,ije1)*tmor(ig)*0.5d0
+                  end do
+                end do
+              end do
+
+c...........if local edge 1 is a conforming edge
+            else
+              do i=2,lx1-1
+                il= idel(i,1,iface,ie)
+                ig= idmo(i,1,1,1,iface,ie)
+                tx(il)=tmor(ig)
+              end do
+            end if 
+
+c...........if local edge 2 is a nonconforming edge
+            if(idmo(lx1,2,1,2,iface,ie).ne.0)then
+              do i=2,lx1-1               
+                il= idel(lx1,i,iface,ie)
+                do ije1=1,2
+                  do j=1,lx1
+                    ig=idmo(lx1,j,ije1,2,iface,ie)
+                    tx(il) = tx(il) + qbnew(i-1,j,ije1)*tmor(ig)*0.5d0
+                  end do
+                end do
+              end do
+
+c...........if local edge 2 is a conforming edge
+            else
+              do i=2,lx1-1
+                il= idel(lx1,i,iface,ie)
+                ig= idmo(lx1,i,1,1,iface,ie)
+                tx(il)=tmor(ig)
+              end do
+            end if 
+
+c...........if local edge 3 is a nonconforming edge
+            if(idmo(2,lx1,2,1,iface,ie).ne.0)then
+              do  i=2,lx1-1               
+                il= idel(i,lx1,iface,ie)
+                do ije1=1,2
+                  do j=1,lx1
+                    ig=idmo(j,lx1,2,ije1,iface,ie)
+                    tx(il) = tx(il) + qbnew(i-1,j,ije1)*tmor(ig)*0.5d0
+                  end do
+                end do
+              end do
+
+c...........if local edge 3 is a conforming edge
+            else
+              do i=2,lx1-1
+                il= idel(i,lx1,iface,ie)
+                ig= idmo(i,lx1,1,1,iface,ie)
+                tx(il)=tmor(ig)
+              end do
+            end if 
+
+c...........if local edge 4 is a nonconforming edge
+            if(idmo(1,lx1,1,1,iface,ie).ne.0)then
+              do i=2,lx1-1               
+                il= idel(1,i,iface,ie)
+                do ije1=1,2
+                  do j=1,lx1
+                    ig=idmo(1,j,ije1,1,iface,ie)
+                    tx(il) = tx(il) + qbnew(i-1,j,ije1)*tmor(ig)*0.5d0
+                  end do
+                end do
+              end do
+c...........if local edge 4 is a conforming edge
+            else
+              do i=2,lx1-1
+                il= idel(1,i,iface,ie)
+                ig= idmo(1,i,1,1,iface,ie)
+                tx(il)=tmor(ig)
+              end do
+            end if 
+          end if
+          
+        end do
+      end do
+c$OMP END PARALLEL DO
+
+      return
+      end
+
+
+c------------------------------------------------------------------
+      subroutine transfb(tmor,tx)
+c------------------------------------------------------------------
+c     Map from element(tx) to mortar(tmor).
+c     tmor sums contributions from all elements.
+c------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision third
+      parameter (third=1.d0/3.d0)
+      integer shift
+
+      double precision tmp,tmp1,tx(*),tmor(*),temp(lx1,lx1,2),
+     &                 top(lx1,2)
+      integer il1,il2,il3,il4,ig1,ig2,ig3,ig4,ie,iface,nnje,
+     &        ije1,ije2,col,i,j,ije,ig,il
+
+c$OMP PARALLEL DEFAULT(SHARED) PRIVATE(il,j,ig,i,col,ije2,ije1,ig4,
+c$OMP& ig3,ig2,ig1,nnje,il4,il3,il2,il1,iface,ie,ije,
+c$OMP& tmp,shift,temp,top,tmp1)
+
+c$OMP DO
+      do ie=1,nmor
+        tmor(ie)=0.d0
+      end do
+c$OMP END DO
+
+c$OMP DO
+      do ie=1,nelt
+        do iface=1,nsides
+c.........nnje=1 for conforming faces, nnje=2 for nonconforming faces
+          if(cbc(iface,ie).eq.3) then
+            nnje=2
+          else
+            nnje=1 
+          end if
+
+c.........get collocation point index of four local corners on the face
+          il1 = idel(1,  1,  iface,ie)
+          il2 = idel(lx1,1,  iface,ie)
+          il3 = idel(1,  lx1,iface,ie)
+          il4 = idel(lx1,lx1,iface,ie)
+
+c.........get the mortar indices of the four local corners
+          ig1 = idmo(1,  1,  1,1,iface,ie)
+          ig2 = idmo(lx1,1,  1,2,iface,ie)
+          ig3 = idmo(1,  lx1,2,1,iface,ie )
+          ig4 = idmo(lx1,lx1,2,2,iface,ie)
+
+c.........sum the values from tx to tmor for these four local corners
+c         only 1/3 of the value is summed, since there will be two duplicated
+c         contributions from the other two faces sharing this vertex 
+c$OMP ATOMIC
+          tmor(ig1) = tmor(ig1)+tx(il1)*third
+c$OMP ATOMIC
+          tmor(ig2) = tmor(ig2)+tx(il2)*third
+c$OMP ATOMIC
+          tmor(ig3) = tmor(ig3)+tx(il3)*third
+c$OMP ATOMIC
+          tmor(ig4) = tmor(ig4)+tx(il4)*third
+
+c.........for nonconforming faces
+          if(nnje.eq.2) then       
+            call r_init(temp,lx1*lx1*2,0.d0)
+
+c...........nonconforming faces have four pieces of mortar, first map tx to
+c           two intermediate mortars stored in temp
+
+            do ije2 = 1, nnje
+              shift = ije2-1
+              do col=1,lx1
+c...............For mortar points on face edge (top and bottom), copy the 
+c               value from tx to temp
+                il=idel(col,v_end(ije2),iface,ie)
+                temp(col,v_end(ije2),ije2)=tx(il)
+
+c...............For mortar points on face edge (top and bottom), calculate 
+c               the interior points' contribution to them, i.e. top()
+                j = v_end(ije2)
+                tmp=0.d0
+                do i=2,lx1-1 
+                  il=idel(col,i,iface,ie)
+                  tmp = tmp + qbnew(i-1,j,ije2)*tx(il)
+                end do
+
+                top(col,ije2)=tmp
+
+c...............Use mapping matrices qbnew to map the value from tx to temp 
+c               for mortar points not on the top or bottom face edge.
+                do j=2-shift,lx1-shift
+                  tmp=0.d0
+                  do i=2,lx1-1 
+                    il=idel(col,i,iface,ie)
+                    tmp = tmp + qbnew(i-1,j,ije2)*tx(il)
+                  end do
+                  temp(col,j,ije2) = tmp + temp(col,j,ije2)
+                end do
+              end do
+            end do
+
+c...........mapping from temp to tmor
+
+            do ije1=1, nnje
+              shift = ije1-1
+              do ije2=1,nnje
+
+c...............for each column of collocation points on a piece of mortar
+                do col=2-shift,lx1-shift
+
+c.................For the end point, which is on an edge (local edge 2,4), 
+c                 the contribution is halved since there will be duplicated 
+c                 contribution from another face sharing this edge.
+
+                  ig=idmo(v_end(ije2),col,ije1,ije2,iface,ie)
+c$OMP ATOMIC
+                  tmor(ig)=tmor(ig)+temp(v_end(ije2),col,ije1)*0.5d0
+
+c.................In each row of collocation points on a piece of mortar, 
+c                 sum the contributions from interior collocation points 
+c                 (i=2,lx1-1)
+
+                  do  j=1,lx1
+                    tmp=0.d0
+                    do i=2,lx1-1
+                      tmp = tmp + qbnew(i-1,j,ije2) * temp(i,col,ije1)
+                    end do
+                    ig=idmo(j,col,ije1,ije2,iface,ie)
+c$OMP ATOMIC
+                    tmor(ig)=tmor(ig)+tmp
+                  end do
+                end do
+
+c...............For tmor on local edge 1 and 3, tmp is the contribution from
+c               an edge, so it is halved because of duplicated contribution
+c               from another face sharing this edge. tmp1 is contribution 
+c               from face interior. 
+
+                col = v_end(ije1)
+                ig=idmo(v_end(ije2),col,ije1,ije2,iface,ie)
+c$OMP ATOMIC
+                tmor(ig)=tmor(ig)+top(v_end(ije2),ije1)*0.5d0
+                do  j=1,lx1
+                  tmp=0.d0
+                  tmp1=0.d0
+                  do i=2,lx1-1
+                    tmp  = tmp  + qbnew(i-1,j,ije2) * temp(i,col,ije1)
+                    tmp1 = tmp1 + qbnew(i-1,j,ije2) * top(i,ije1)
+                  end do
+                  ig=idmo(j,col,ije1,ije2,iface,ie)
+c$OMP ATOMIC
+                  tmor(ig)=tmor(ig)+tmp*0.5d0+tmp1 
+                end do
+              end do
+            end do
+
+c.........for conforming faces
+          else
+
+c.........face interior
+            do col=2,lx1-1
+              do j=2,lx1-1
+                il=idel(j,col,iface,ie)
+                ig=idmo(j,col,1,1,iface,ie)
+c$OMP ATOMIC
+                tmor(ig)=tmor(ig)+tx(il)
+              end do
+            end do
+
+c...........edges of conforming faces
+
+c...........if local edge 1 is a nonconforming edge
+            if(idmo(lx1,1,1,1,iface,ie).ne.0)then
+              do ije=1,2
+                do j=1,lx1
+                  tmp=0.d0
+                  do i=2,lx1-1
+                    il=idel(i,1,iface,ie)
+                    tmp= tmp + qbnew(i-1,j,ije)*tx(il)
+                  end do
+                  ig=idmo(j,1,1,ije,iface,ie)
+c$OMP ATOMIC
+                  tmor(ig)=tmor(ig)+tmp*0.5d0
+                end do
+              end do
+
+c...........if local edge 1 is a conforming edge
+            else
+              do j=2,lx1-1
+                il=idel(j,1,iface,ie)
+                ig=idmo(j,1,1,1,iface,ie)
+c$OMP ATOMIC
+                tmor(ig)=tmor(ig)+tx(il)*0.5d0
+              end do
+            end if 
+
+c...........if local edge 2 is a nonconforming edge
+            if(idmo(lx1,2,1,2,iface,ie).ne.0)then
+              do ije=1,2
+                do j=1,lx1
+                  tmp=0.d0
+                  do i=2,lx1-1
+                    il=idel(lx1,i,iface,ie)
+                    tmp = tmp + qbnew(i-1,j,ije)*tx(il)
+                  end do
+                  ig=idmo(lx1,j,ije,2,iface,ie)
+c$OMP ATOMIC
+                  tmor(ig)=tmor(ig)+tmp*0.5d0
+                end do
+              end do
+
+c...........if local edge 2 is a conforming edge
+            else
+              do j=2,lx1-1
+                il=idel(lx1,j,iface,ie)
+                ig=idmo(lx1,j,1,1,iface,ie)
+c$OMP ATOMIC
+                tmor(ig)=tmor(ig)+tx(il)*0.5d0
+              end do
+            end if 
+
+c...........if local edge 3 is a nonconforming edge
+            if(idmo(2,lx1,2,1,iface,ie).ne.0)then
+              do ije=1,2
+                do j=1,lx1
+                  tmp=0.d0
+                  do i=2,lx1-1
+                    il=idel(i,lx1,iface,ie)
+                    tmp = tmp + qbnew(i-1,j,ije)*tx(il)
+                  end do
+                  ig=idmo(j,lx1,2,ije,iface,ie)
+c$OMP ATOMIC
+                  tmor(ig)=tmor(ig)+tmp*0.5d0
+                end do
+              end do
+
+c...........if local edge 3 is a conforming edge
+            else
+              do j=2,lx1-1
+                il=idel(j,lx1,iface,ie)
+                ig=idmo(j,lx1,1,1,iface,ie)
+c$OMP ATOMIC
+                tmor(ig)=tmor(ig)+tx(il)*0.5d0
+              end do
+            end if 
+
+c...........if local edge 4 is a nonconforming edge
+            if(idmo(1,lx1,1,1,iface,ie).ne.0)then
+              do ije=1,2
+                do j=1,lx1
+                  tmp=0.d0
+                  do i=2,lx1-1
+                    il=idel(1,i,iface,ie)
+                    tmp = tmp + qbnew(i-1,j,ije)*tx(il)
+                  end do
+                  ig=idmo(1,j,ije,1,iface,ie)
+c$OMP ATOMIC
+                  tmor(ig)=tmor(ig)+tmp*0.5d0
+                end do
+              end do
+
+c...........if local edge 4 is a conforming edge
+            else
+              do j=2,lx1-1
+                il=idel(1,j,iface,ie)
+                ig=idmo(1,j,1,1,iface,ie)
+c$OMP ATOMIC
+                tmor(ig)=tmor(ig)+tx(il)*0.5d0
+              end do
+            end if 
+          end if
+        end do
+      end do
+c$OMP END DO NOWAIT
+c$OMP END PARALLEL 
+
+      return
+      end
+
+
+c--------------------------------------------------------------
+      subroutine transfb_cor_e(n,tmor,tx)
+c--------------------------------------------------------------
+c     This subroutine performs the edge to mortar mapping and
+c     calculates the mapping result on the mortar point at a vertex
+c     under situation 1, 2, or 3.
+c     n refers to the configuration of three edges sharing a vertex, 
+c     n = 1: only one edge is nonconforming
+c     n = 2: two edges are nonconforming 
+c     n = 3: three edges are nonconforming 
+c-------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision tmor,tx(lx1,lx1,lx1),tmp
+      integer i,n
+
+      tmor=tx(1,1,1)
+
+      do i=2,lx1-1
+        tmor= tmor + qbnew(i-1,1,1)*tx(i,1,1)
+      end do
+
+      if(n.gt.1)then
+        do i=2,lx1-1
+          tmor= tmor + qbnew(i-1,1,1)*tx(1,i,1)
+        end do
+      end if
+
+      if(n.eq.3)then
+        do i=2,lx1-1
+          tmor= tmor + qbnew(i-1,1,1)*tx(1,1,i)
+        end do
+      end if
+
+      return
+      end
+
+c--------------------------------------------------------------
+      subroutine transfb_cor_f(n,tmor,tx)
+c--------------------------------------------------------------
+c     This subroutine performs the mapping from face to mortar.
+c     Output tmor is the mapping result on a mortar vertex, for these
+c     configurations of three edges and three faces sharing a vertex:
+c     n=4: only one face is nonconforming 
+c     n=5: one face and one edge are nonconforming
+c     n=6: two faces are nonconforming 
+c     n=7: three faces are nonconforming 
+c--------------------------------------------------------------
+      include 'header.h'
+
+      double precision tx(lx1,lx1,lx1),tmor,temp(lx1)
+      integer col,i,n
+
+      call r_init(temp,lx1,0.d0)
+
+      do col=1,lx1
+        temp(col)=tx(col,1,1)
+        do i=2,lx1-1
+          temp(col) = temp(col) + qbnew(i-1,1,1)*tx(col,i,1)
+        end do
+      end do
+      tmor=temp(1)
+
+      do i=2,lx1-1
+        tmor = tmor + qbnew(i-1,1,1) *temp(i)
+      end do
+
+      if(n.eq.5)then
+        do i=2,lx1-1
+          tmor = tmor + qbnew(i-1,1,1) *tx(1,1,i)
+        end do
+      end if
+ 
+      if(n.ge.6)then
+        call r_init(temp,lx1,0.d0)
+        do col=1,lx1
+          do i=2,lx1-1
+            temp(col) = temp(col) + qbnew(i-1,1,1)*tx(col,1,i)
+          end do
+        end do
+        tmor=tmor+temp(1)
+        do i=2,lx1-1
+          tmor = tmor +qbnew(i-1,1,1) *temp(i)
+        end do
+      end if
+        
+      if(n.eq.7)then
+        call r_init(temp,lx1,0.d0)
+        do col=2,lx1-1
+          do i=2,lx1-1
+            temp(col) = temp(col) + qbnew(i-1,1,1)*tx(1,col,i)
+          end do
+        end do
+        do i=2,lx1-1
+          tmor = tmor + qbnew(i-1,1,1) *temp(i)
+        end do
+      end if
+
+      return
+      end
+
+
+c-------------------------------------------------------------------------
+      subroutine transf_nc(tmor,tx)
+c------------------------------------------------------------------------
+c     Perform mortar to element mapping on a nonconforming face. 
+c     This subroutine is used when all entries in tmor are zero except
+c     one tmor(i,j)=1. So this routine is simplified. Only one piece of 
+c     mortar (tmor only has two indices) and one piece of intermediate 
+c     mortar (tmp) are involved.
+c------------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision tmor(lx1,lx1), tx(lx1,lx1), tmp(lx1,lx1)
+      integer col,i,j
+
+      call r_init(tmp,lx1*lx1,0.d0)
+      do col=1,lx1
+        i = 1
+        tmp(i,col)=tmor(i,col)                           
+        do i=2,lx1-1
+          do j=1,lx1
+            tmp(i,col) = tmp(i,col) + qbnew(i-1,j,1)*tmor(j,col)
+          end do
+        end do
+      end do
+
+      do col=1,lx1
+        i = 1
+        tx(col,i)   = tx(col,i)   + tmp(col,i)
+        do i=2,lx1-1
+          do j=1,lx1
+            tx(col,i) = tx(col,i) + qbnew(i-1,j,1)*tmp(col,j)
+          end do
+        end do
+      end do
+
+      return                                                  
+      end                                                     
+
+c------------------------------------------------------------------------
+      subroutine transfb_nc0(tmor,tx)
+c------------------------------------------------------------------------
+c     Performs mapping from element to mortar when the nonconforming 
+c     edges are shared by two conforming faces of an element.
+c------------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision tmor(lx1,lx1),tx(lx1,lx1,lx1)
+      integer i,j
+
+      call r_init(tmor,lx1*lx1,0.d0)
+      do j=1,lx1
+        do i=2,lx1-1
+          tmor(j,1)= tmor(j,1) + qbnew(i-1,j  ,1)*tx(i,1,1)
+        end do
+      end do
+
+      return
+      end 
+
+c------------------------------------------------------------------------
+      subroutine transfb_nc2(tmor,tx)
+c------------------------------------------------------------------------
+c     Maps values from element to mortar when the nonconforming edges are
+c     shared by two nonconforming faces of an element.
+c     Although each face has four pieces of mortar, only the value in
+c     one piece (location (1,1)) is used in the calling routine, so only
+c     the value in the first mortar is calculated in this subroutine.
+c------------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision tx(lx1,lx1),tmor(lx1,lx1),bottom(lx1),
+     &                 temp(lx1,lx1)
+      integer col,j,i
+
+      call r_init(tmor,lx1*lx1,0.d0)
+      call r_init(temp,lx1*lx1,0.d0)
+      tmor(1,1)=tx(1,1)
+
+c.....mapping from tx to intermediate mortar temp + bottom
+      do col=1,lx1
+        temp(col,1)=tx(col,1)
+        j=1
+        bottom(col)= 0.d0
+        do i=2,lx1-1 
+          bottom(col) = bottom(col) + qbnew(i-1,j,1)*tx(col,i)
+        end do
+
+        do j=2,lx1
+          do i=2,lx1-1 
+            temp(col,j) = temp(col,j) + qbnew(i-1,j,1)*tx(col,i)
+          end do
+        end do
+      end do
+
+c.....from intermediate mortar to mortar
+
+c.....On the nonconforming edge, temp is divided by 2 as there will be
+c     a duplicate contribution from another face sharing this edge
+      col=1
+      do j=1,lx1
+        do i=2,lx1-1
+          tmor(j,col)=tmor(j,col)+ qbnew(i-1,j,1) * bottom(i) +
+     &                             qbnew(i-1,j,1) * temp(i,col) * 0.5d0 
+        end do
+      end do
+
+      do col=2,lx1
+        tmor(1,col)=tmor(1,col)+temp(1,col)
+        do j=1,lx1
+          do i=2,lx1-1
+            tmor(j,col) = tmor(j,col) + qbnew(i-1,j,1) *temp(i,col)
+          end do
+        end do
+      end do
+
+      return
+      end 
+
+
+c------------------------------------------------------------------------
+      subroutine transfb_nc1(tmor,tx)
+c------------------------------------------------------------------------
+c     Maps values from element to mortar when the nonconforming edges are
+c     shared by a nonconforming face and a conforming face of an element
+c------------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision tx(lx1,lx1),tmor(lx1,lx1),bottom(lx1),
+     &                 temp(lx1,lx1)
+      integer col,j,i
+
+      call r_init(tmor,lx1*lx1,0.d0)
+      call r_init(temp,lx1*lx1,0.d0)
+
+      tmor(1,1)=tx(1,1)
+c.....Contribution from the nonconforming faces
+c     Since the calling subroutine is only interested in the value on the
+c     mortar (location (1,1)), only this piece of mortar is calculated.
+
+      do col=1,lx1
+        temp(col,1)=tx(col,1)
+        j = 1
+        bottom(col)= 0.d0
+        do i=2,lx1-1 
+          bottom(col)=bottom(col) + qbnew(i-1,j,1)*tx(col,i)
+        end do
+
+        do j=2,lx1
+          do i=2,lx1-1 
+            temp(col,j) = temp(col,j) + qbnew(i-1,j,1)*tx(col,i)
+          end do
+
+        end do
+      end do
+
+      col=1
+      tmor(1,col)=tmor(1,col)+bottom(1)
+      do j=1,lx1
+        do i=2,lx1-1
+
+c.........temp is not divided by 2 here. It includes the contribution
+c         from the other conforming face.
+
+          tmor(j,col)=tmor(j,col) + qbnew(i-1,j,1) *bottom(i) +
+     &                              qbnew(i-1,j,1) *temp(i,col) 
+        end do
+      end do
+
+      do col=2,lx1
+        tmor(1,col)=tmor(1,col)+temp(1,col)
+        do j=1,lx1
+          do i=2,lx1-1
+            tmor(j,col) = tmor(j,col) + qbnew(i-1,j,1) *temp(i,col)
+          end do
+        end do
+      end do
+
+      return
+      end
+
+
+c-------------------------------------------------------------------
+      subroutine transfb_c(tx)
+c-------------------------------------------------------------------
+c     Prepare initial guess for cg. All values from conforming 
+c     boundary are copied and summed on tmor.
+c-------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision third
+      parameter (third = 1.d0/3.d0)
+      double precision tx(*)
+      integer il1,il2,il3,il4,ig1,ig2,ig3,ig4,ie,iface,col,j,ig,il
+
+c$OMP PARALLEL DEFAULT(SHARED) PRIVATE(IE,IFACE,IL1,IL2,
+c$OMP& IL3,IL4,IG1,IG2,IG3,IG4,COL,J,IG,IL) 
+
+c$OMP DO
+      do j=1,nmor
+        tmort(j)=0.d0
+      end do
+c$OMP END DO
+
+c$OMP DO
+      do ie=1,nelt
+        do iface=1,nsides
+          if(cbc(iface,ie).ne.3)then
+            il1 = idel(1,1,iface,ie)
+            il2 = idel(lx1,1,iface,ie)
+            il3 = idel(1,lx1,iface,ie)
+            il4 = idel(lx1,lx1,iface,ie)
+            ig1 = idmo(1,  1,  1,1,iface,ie)
+            ig2 = idmo(lx1,1,  1,2,iface,ie)
+            ig3 = idmo(1,  lx1,2,1,iface,ie)
+            ig4 = idmo(lx1,lx1,2,2,iface,ie)
+
+c$OMP ATOMIC
+            tmort(ig1) = tmort(ig1)+tx(il1)*third
+c$OMP ATOMIC
+            tmort(ig2) = tmort(ig2)+tx(il2)*third
+c$OMP ATOMIC
+            tmort(ig3) = tmort(ig3)+tx(il3)*third
+c$OMP ATOMIC
+            tmort(ig4) = tmort(ig4)+tx(il4)*third
+
+            do  col=2,lx1-1
+              do j=2,lx1-1
+                il=idel(j,col,iface,ie)
+                ig=idmo(j,col,1,1,iface,ie)
+c$OMP ATOMIC
+                tmort(ig)=tmort(ig)+tx(il)
+              end do
+            end do
+
+            if(idmo(lx1,1,1,1,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(j,1,iface,ie)
+                ig=idmo(j,1,1,1,iface,ie)
+c$OMP ATOMIC
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+              end do
+            end if
+
+            if(idmo(lx1,2,1,2,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(lx1,j,iface,ie)
+                ig=idmo(lx1,j,1,1,iface,ie)
+c$OMP ATOMIC
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+              end do
+            end if
+
+            if(idmo(2,lx1,2,1,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(j,lx1,iface,ie)
+                ig=idmo(j,lx1,1,1,iface,ie)
+c$OMP ATOMIC
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+              end do
+            end if
+
+            if(idmo(1,lx1,1,1,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(1,j,iface,ie)
+                ig=idmo(1,j,1,1,iface,ie)
+c$OMP ATOMIC
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+              end do
+            end if
+          end if!
+        end do
+      end do
+c$OMP END DO NOWAIT
+c$OMP END PARALLEL
+      return
+      end
+
+c-------------------------------------------------------------------
+      subroutine transfb_c_2(tx)
+c-------------------------------------------------------------------
+c     Prepare initial guess for CG. All values from conforming 
+c     boundary are copied and summed in tmort. 
+c     mormult is multiplicity, which is used to average tmort.
+c-------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision third
+      parameter (third = 1.d0/3.d0)
+      double precision tx(*)
+      integer il1,il2,il3,il4,ig1,ig2,ig3,ig4,ie,iface,col,j,ig,il
+
+c$OMP PARALLEL DEFAULT(SHARED) PRIVATE(IE,IFACE,IL1,IL2,
+c$OMP& IL3,IL4,IG1,IG2,IG3,IG4,COL,J,IG,IL)
+
+c$OMP DO     
+      do j=1,nmor
+        tmort(j)=0.d0
+      end do
+c$OMP END DO nowait
+c$OMP DO
+      do j=1,nmor
+        mormult(j)=0.d0
+      end do
+c$OMP END DO
+
+c$OMP DO 
+      do ie=1,nelt
+        do iface=1,nsides
+          
+          if(cbc(iface,ie).ne.3)then
+            il1 = idel(1,  1,  iface,ie)
+            il2 = idel(lx1,1,  iface,ie)
+            il3 = idel(1,  lx1,iface,ie)
+            il4 = idel(lx1,lx1,iface,ie)
+            ig1 = idmo(1,  1,  1,1,iface,ie)
+            ig2 = idmo(lx1,1,  1,2,iface,ie)
+            ig3 = idmo(1,  lx1,2,1,iface,ie)
+            ig4 = idmo(lx1,lx1,2,2,iface,ie)
+
+c$OMP ATOMIC
+            tmort(ig1) = tmort(ig1)+tx(il1)*third
+c$OMP ATOMIC
+            tmort(ig2) = tmort(ig2)+tx(il2)*third
+c$OMP ATOMIC
+            tmort(ig3) = tmort(ig3)+tx(il3)*third
+c$OMP ATOMIC
+            tmort(ig4) = tmort(ig4)+tx(il4)*third
+c$OMP ATOMIC
+            mormult(ig1) = mormult(ig1)+third
+c$OMP ATOMIC
+            mormult(ig2) = mormult(ig2)+third
+c$OMP ATOMIC
+            mormult(ig3) = mormult(ig3)+third
+c$OMP ATOMIC
+            mormult(ig4) = mormult(ig4)+third
+
+            do  col=2,lx1-1
+              do j=2,lx1-1
+                il=idel(j,col,iface,ie)
+                ig=idmo(j,col,1,1,iface,ie)
+c$OMP ATOMIC
+                tmort(ig)=tmort(ig)+tx(il)
+c$OMP ATOMIC
+                mormult(ig)=mormult(ig)+1.d0
+              end do
+            end do
+
+            if(idmo(lx1,1,1,1,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(j,1,iface,ie)
+                ig=idmo(j,1,1,1,iface,ie)
+c$OMP ATOMIC
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+c$OMP ATOMIC
+                mormult(ig)=mormult(ig)+0.5d0
+              end do
+            end if
+
+            if(idmo(lx1,2,1,2,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(lx1,j,iface,ie)
+                ig=idmo(lx1,j,1,1,iface,ie)
+c$OMP ATOMIC
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+c$OMP ATOMIC
+                mormult(ig)=mormult(ig)+0.5d0
+              end do
+            end if
+
+            if(idmo(2,lx1,2,1,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(j,lx1,iface,ie)
+                ig=idmo(j,lx1,1,1,iface,ie)
+c$OMP ATOMIC
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+c$OMP ATOMIC
+                mormult(ig)=mormult(ig)+0.5d0
+               end do
+            end if
+
+            if(idmo(1,lx1,1,1,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(1,j,iface,ie)
+                ig=idmo(1,j,1,1,iface,ie)
+c$OMP ATOMIC
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+c$OMP ATOMIC
+                mormult(ig)=mormult(ig)+0.5d0
+              end do
+            end if
+          end if!nnje=1
+        end do
+      end do
+c$OMP END DO NOWAIT
+c$OMP END PARALLEL
+
+      return
+      end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/ua.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/ua.f
new file mode 100644
index 0000000..25c063b
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/ua.f
@@ -0,0 +1,291 @@
+c-------------------------------------------------------------------------c
+c                                                                         c
+c        N  A  S     P A R A L L E L     B E N C H M A R K S  3.3         c
+c                                                                         c
+c                      O p e n M P     V E R S I O N                      c
+c                                                                         c
+c                                   U A                                   c
+c                                                                         c
+c-------------------------------------------------------------------------c
+c                                                                         c
+c    This benchmark is the OpenMP version of the NPB UA code.             c
+c    Refer to NAS Technical Report NAS-04-006 for details                 c
+c                                                                         c
+c    Permission to use, copy, distribute and modify this software         c
+c    for any purpose with or without fee is hereby granted.  We           c
+c    request, however, that all derived work reference the NAS            c
+c    Parallel Benchmarks 3.3. This software is provided "as is"           c
+c    without express or implied warranty.                                 c
+c                                                                         c
+c    Information on NPB 3.3, including the technical report, the          c
+c    original specifications, source code, results and information        c
+c    on how to submit new results, is available at:                       c
+c                                                                         c
+c           http://www.nas.nasa.gov/Software/NPB/                         c
+c                                                                         c
+c    Send comments or suggestions to  npb@nas.nasa.gov                    c
+c                                                                         c
+c          NAS Parallel Benchmarks Group                                  c
+c          NASA Ames Research Center                                      c
+c          Mail Stop: T27A-1                                              c
+c          Moffett Field, CA   94035-1000                                 c
+c                                                                         c
+c          E-mail:  npb@nas.nasa.gov                                      c
+c          Fax:     (650) 604-3957                                        c
+c                                                                         c
+c-------------------------------------------------------------------------c
+
+c---------------------------------------------------------------------
+c
+c Author: H. Feng
+c         R. Van der Wijngaart
+c---------------------------------------------------------------------
+
+      program ua
+      include 'header.h'
+
+      integer          step, ie,iside,i,j, fstatus,k
+      external         timer_read
+      double precision timer_read, mflops, tmax, nelt_tot
+      character        class
+      logical          ifmortar, verified
+!$    integer          omp_get_max_threads
+!$    external         omp_get_max_threads
+
+      double precision t2, trecs(t_last)
+      character t_names(t_last)*10
+
+c---------------------------------------------------------------------
+c     Read input file (if it exists), else take
+c     defaults from parameters
+c---------------------------------------------------------------------
+          
+      open (unit=2,file='timer.flag',status='old', iostat=fstatus)
+      if (fstatus .eq. 0) then
+         timeron = .true.
+         t_names(t_total) = 'total'
+         t_names(t_init) = 'init'
+         t_names(t_convect) = 'convect'
+         t_names(t_transfb_c) = 'transfb_c'
+         t_names(t_diffusion) = 'diffusion'
+         t_names(t_transf) = 'transf'
+         t_names(t_transfb) = 'transfb'
+         t_names(t_adaptation) = 'adaptation'
+         t_names(t_transf2) = 'transf+b'
+         t_names(t_add2) = 'add2'
+         close(2)
+      else
+         timeron = .false.
+      endif
+
+      write (*,1000) 
+      open (unit=2,file='inputua.data',status='old', iostat=fstatus)
+
+      if (fstatus .eq. 0) then
+        write(*,233) 
+ 233    format(' Reading from input file inputua.data')
+        read (2,*) fre
+        read (2,*) niter
+        read (2,*) nmxh
+        read (2,*) alpha
+        class = 'U'
+        close(2)
+      else
+        write(*,234) 
+        fre        = fre_default
+        niter      = niter_default
+        nmxh       = nmxh_default
+        alpha      = alpha_default
+        class      = class_default
+      endif
+ 234  format(' No input file inputua.data. Using compiled defaults')
+
+      dlmin = 0.5d0**refine_max
+      dtime = 0.04d0*dlmin
+
+      write (*,1001) refine_max
+      write (*,1002) fre
+      write (*,1003) niter, dtime
+      write (*,1004) nmxh
+      write (*,1005) alpha
+!$    write (*,1006) omp_get_max_threads()
+      write (*,*)
+
+
+ 1000 format(//,' NAS Parallel Benchmarks (NPB3.3-OMP)',
+     >          ' - UA Benchmark', /)
+ 1001 format(' Levels of refinement:        ', i8)
+ 1002 format(' Adaptation frequency:        ', i8)
+ 1003 format(' Time steps:                  ', i8, '    dt: ', g15.6)
+ 1004 format(' CG iterations:               ', i8)
+ 1005 format(' Heat source radius:          ', f8.4)
+ 1006 format(' Number of available threads: ', i8)
+
+      do i = 1, t_last
+         call timer_clear(i)
+      end do
+      if (timeron) call timer_start(t_init)
+
+c.....set up initial mesh (single element) and solution (all zero)
+      call create_initial_grid
+
+      call r_init_omp(ta1,ntot,0.d0)
+      call nr_init_omp(sje,4*6*nelt,0)
+
+      call init_locks
+
+c.....compute tables of coefficients and weights      
+      call coef 
+      call geom1
+
+c.....compute the discrete laplacian operators
+      call setdef
+
+c.....prepare for the preconditioner
+      call setpcmo_pre
+
+c.....refine initial mesh and do some preliminary work
+      time = 0.d0
+      call mortar
+      call prepwork
+      call adaptation(ifmortar,0)
+      if (timeron) call timer_stop(t_init)
+
+      call timer_clear(1)
+
+      time = 0.d0
+      do step= 0, niter
+
+        if (step .eq. 1) then
+c.........reset the solution and start the timer, keep track of total no elms
+
+          call r_init(ta1,ntot,0.d0)
+
+#ifdef HOOKS
+          call roi_begin
+#endif
+
+          time = 0.d0
+          nelt_tot = 0.d0
+          do i = 1, t_last
+             if (i.ne.t_init) call timer_clear(i)
+          end do
+          call timer_start(1)          
+        endif
+
+c.......advance the convection step 
+        call convect(ifmortar)
+
+        if (timeron) call timer_start(t_transf2)
+c.......prepare the initial guess for cg
+        call transf(tmort,ta1)
+
+c.......compute residual for diffusion term based on initial guess
+
+c.......compute the left hand side of equation, laplacian t
+c$OMP PARALLEL DEFAULT(SHARED) PRIVATE(ie,k,j,i) 
+c$OMP DO 
+        do ie = 1,nelt
+          call laplacian(ta2(1,1,1,ie),ta1(1,1,1,ie),size_e(ie))
+        end do
+c$OMP END DO 
+c.......compute the residual 
+c$OMP DO
+        do ie = 1, nelt
+          do k=1,lx1
+            do j=1,lx1
+              do i=1,lx1
+                trhs(i,j,k,ie) = trhs(i,j,k,ie) - ta2(i,j,k,ie)
+              end do
+            end do
+          end do
+        end do
+c$OMP END DO
+c$OMP END PARALLEL
+c.......get the residual on mortar 
+        call transfb(rmor,trhs)
+        if (timeron) call timer_stop(t_transf2)
+
+c.......apply boundary condition: zero out the residual on domain boundaries
+
+c.......apply boundary condition to trhs
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(ie,iside)
+        do ie=1,nelt  
+          do iside=1,nsides
+            if (cbc(iside,ie).eq.0) then
+              call facev(trhs(1,1,1,ie),iside,0.d0)
+            end if
+          end do
+        end do
+c$OMP END PARALLELDO
+c.......apply boundary condition to rmor
+        call col2(rmor,tmmor,nmor)
+
+c.......call the conjugate gradient iterative solver
+        call diffusion(ifmortar)
+
+c.......add convection and diffusion
+        if (timeron) call timer_start(t_add2)
+        call add2(ta1,t,ntot)
+        if (timeron) call timer_stop(t_add2)
+
+        
+c.......perform mesh adaptation
+        time=time+dtime
+        if ((step.ne.0).and.(step/fre*fre .eq. step)) then
+           if (step .ne. niter) then
+             call adaptation(ifmortar,step)
+           end if
+        else
+          ifmortar = .false.
+        end if
+        nelt_tot = nelt_tot + dble(nelt)
+      end do
+
+      call timer_stop(1)
+      tmax = timer_read(1)
+
+#ifdef HOOKS
+      call roi_end
+#endif
+
+      call verify(class, verified)
+
+c.....compute millions of collocation points advanced per second.
+c.....diffusion: nmxh advancements, convection: 1 advancement
+      mflops = nelt_tot*dble(lx1*lx1*lx1*(nmxh+1))/(tmax*1.d6)
+
+      call print_results('UA', class, refine_max, 0, 0, niter, 
+     &     tmax, mflops, '    coll. point advanced', 
+     &     verified, npbversion,compiletime, cs1, cs2, cs3, cs4, cs5, 
+     &     cs6, '(none)')
+
+c---------------------------------------------------------------------
+c      More timers
+c---------------------------------------------------------------------
+      if (.not.timeron) goto 999
+
+      do i=1, t_last
+         trecs(i) = timer_read(i)
+      end do
+      if (tmax .eq. 0.0) tmax = 1.0
+
+      write(*,800)
+ 800  format('  SECTION     Time (secs)')
+      do i=1, t_last
+         write(*,810) t_names(i), trecs(i), trecs(i)*100./tmax
+         if (i.eq.t_transfb_c) then
+            t2 = trecs(t_convect) - trecs(t_transfb_c)
+            write(*,820) 'sub-convect', t2, t2*100./tmax
+         else if (i.eq.t_transfb) then
+            t2 = trecs(t_diffusion) - trecs(t_transf) - trecs(t_transfb)
+            write(*,820) 'sub-diffuse', t2, t2*100./tmax
+         endif
+ 810     format(2x,a10,':',f9.3,'  (',f6.2,'%)')
+ 820     format('    --> ',a11,':',f9.3,'  (',f6.2,'%)')
+      end do
+
+ 999  continue
+
+      end 
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/utils.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/utils.f
new file mode 100644
index 0000000..28c98ef
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/utils.f
@@ -0,0 +1,366 @@
+c------------------------------------------------------------------
+      subroutine reciprocal (a, n)
+c------------------------------------------------------------------
+c     replace each element of double precision array a (length n)
+c------------------------------------------------------------------
+
+      implicit none
+
+      integer n, i
+      double precision a(n)
+
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(I)
+      do i = 1, n
+        a(i) = 1.d0/a(i)
+      end do
+c$OMP END PARALLEL DO
+
+      return
+      end
+c------------------------------------------------------------------
+      subroutine r_init_omp (a, n, const)
+c------------------------------------------------------------------
+c     initialize double precision array a with length of n
+c------------------------------------------------------------------
+
+      implicit none
+
+      integer n, i
+      double precision a(n), const
+
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(I)
+      do i = 1, n
+        a(i) = const
+      end do
+c$OMP END PARALLEL DO
+
+      return
+      end
+c------------------------------------------------------------------
+      subroutine r_init (a, n, const)
+c------------------------------------------------------------------
+c     initialize double precision array a with length of n
+c------------------------------------------------------------------
+
+      implicit none
+
+      integer n, i
+      double precision a(n), const
+
+      do i = 1, n
+        a(i) = const
+      end do
+
+      return
+      end
+c------------------------------------------------------------------
+      subroutine nr_init_omp (a, n, const)
+c------------------------------------------------------------------
+c     initialize integer array a with length of n
+c------------------------------------------------------------------
+
+      implicit none
+
+      integer n, i, a(n), const
+
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(I)
+      do i = 1, n
+        a(i) = const
+      end do
+c$OMP END PARALLEL DO
+
+      return
+      end
+
+c------------------------------------------------------------------
+      subroutine nr_init (a, n, const)
+c------------------------------------------------------------------
+c     initialize integer array a with length of n
+c------------------------------------------------------------------
+
+      implicit none
+
+      integer n, i, a(n), const
+
+      do i = 1, n
+        a(i) = const
+      end do
+
+      return
+      end
+c------------------------------------------------------------------
+      subroutine l_init_omp (a, n, const)
+c------------------------------------------------------------------
+c     initialize logical array a with length of n
+c------------------------------------------------------------------
+
+      implicit none
+      integer n, i
+      logical a(n), const
+
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(I)
+      do i = 1, n
+        a(i) = const
+      end do
+c$OMP END PARALLEL DO
+
+      return
+      end
+
+c-----------------------------------------------------------------
+      subroutine ncopy (a,b,n)
+c------------------------------------------------------------------
+c     copy array of integers b to a, the length of array is n
+c------------------------------------------------------------------
+
+      implicit none
+
+      integer n,i
+      integer a(n), b(n)
+
+      do i = 1, n
+        a(i) = b(i)
+      end do
+
+      return
+      end
+
+c-----------------------------------------------------------------
+      subroutine copy (a,b,n)
+c------------------------------------------------------------------
+c     copy double precision array b to a, the length of array is n
+c------------------------------------------------------------------
+
+      implicit none
+
+      integer n,i
+      double precision a(n), b(n)
+
+      do i = 1, n
+         a(i) = b(i)
+      end do
+
+      return
+      end
+
+c-----------------------------------------------------------------
+      subroutine adds2m1(a,b,c1,n)
+c-----------------------------------------------------------------
+c     a=a+c1*b
+c-----------------------------------------------------------------
+      implicit none
+
+      integer n,i
+      double precision a(n),b(n),c1
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i)
+      do i=1,n
+        a(i)=a(i)+c1*b(i)
+      end do
+c$OMP END PARALLEL DO
+
+      return
+      end
+
+c-----------------------------------------------------------------
+      subroutine adds1m1(a,b,c1,n )
+c-----------------------------------------------------------------
+c     a=c1*a+b
+c-----------------------------------------------------------------
+
+      implicit none
+
+      integer n,i
+      double precision a(n),b(n),c1
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i)
+      do i=1,n
+        a(i)=c1*a(i)+b(i)
+      end do
+c$OMP END PARALLEL DO
+
+      return
+      end
+
+c-----------------------------------------------------------------
+      subroutine col2(a,b,n)
+c------------------------------------------------------------------
+c     a=a*b
+c------------------------------------------------------------------
+
+      implicit none
+
+      integer n,i
+      double precision a(n),b(n)
+
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i)
+      do i=1,n
+        a(i)=a(i)*b(i)
+      end do
+c$OMP END PARALLEL DO
+
+      return
+      end
+
+c-----------------------------------------------------------------
+      subroutine nrzero (na,n)
+c------------------------------------------------------------------
+c     zero out array of integers 
+c------------------------------------------------------------------
+
+      implicit none
+
+      integer n,i,na(n)
+
+      do i = 1, n
+        na(i ) = 0
+      end do
+
+      return
+      end
+
+c-----------------------------------------------------------------
+      subroutine add2(a,b,n)
+c------------------------------------------------------------------
+c     a=a+b
+c------------------------------------------------------------------
+
+      implicit none
+
+      integer n,i
+      double precision  a(n),b(n)
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i)
+      do i=1,n
+        a(i)=a(i)+b(i)
+      end do
+c$OMP END PARALLEL DO
+
+      return
+      end
+
+c-----------------------------------------------------------------
+      double precision function calc_norm()
+c------------------------------------------------------------------
+c     calculate the integral of ta1 over the whole domain
+c------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision total,ieltotal
+      integer iel,k,j,i,isize
+
+      total=0.d0
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i,j,k,isize,ieltotal,iel)
+c$OMP& REDUCTION(+:total)
+
+      do iel=1,nelt
+        ieltotal=0.d0
+        isize=size_e(iel)
+        do k=1,lx1
+          do j=1,lx1
+            do i=1,lx1
+              ieltotal=ieltotal+ta1(i,j,k,iel)*w3m1(i,j,k)
+     &                               *jacm1_s(i,j,k,isize)
+            end do
+          end do
+        end do
+        total=total+ieltotal
+      end do
+c$OMP END PARALLEL DO
+
+      calc_norm = total
+
+      return
+      end
+c-----------------------------------------------------------------
+      subroutine parallel_add(frontier)
+c-----------------------------------------------------------------
+c     input array frontier, perform (potentially) parallel add so that
+c     the output frontier(i) has sum of frontier(1)+frontier(2)+...+frontier(i)
+c-----------------------------------------------------------------
+      include 'header.h'
+      integer nellog,i,ahead,ii,ntemp,n1,ntemp1,frontier(lelt),iel
+
+      nellog=0
+      iel=1
+   10 iel=iel*2
+      nellog=nellog+1
+      if (iel.lt.nelt) goto 10
+
+      ntemp=1
+      do i=1,nellog
+        n1=ntemp*2
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(ahead,ii,iel)
+        do iel=n1, nelt,n1
+          ahead=frontier(iel-ntemp)
+          do ii=ntemp-1,0,-1
+            frontier(iel-ii)=frontier(iel-ii)+ahead
+          end do
+        end do
+c$OMP END PARALLEL DO
+
+        iel=(nelt/n1+1)*n1
+        ntemp1=iel-nelt
+        if(ntemp1.lt.ntemp)then
+          ahead=frontier(iel-ntemp)
+c$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(ii)
+          do ii=ntemp-1,ntemp1,-1
+            frontier(iel-ii)=frontier(iel-ii)+ahead
+          end do
+c$OMP END PARALLEL DO
+        end if
+
+        ntemp=n1
+      end do
+
+      return
+      end 
+
+c------------------------------------------------------------------
+      subroutine dssum
+
+c------------------------------------------------------------------
+c     Perform stiffness summation: element-mortar-element mapping
+c------------------------------------------------------------------
+
+      include 'header.h'
+
+      call transfb(dpcmor,dpcelm)
+      call transf (dpcmor,dpcelm)
+
+      return
+      end
+
+c------------------------------------------------------------------
+      subroutine facev(a,iface,val)
+c------------------------------------------------------------------
+c     assign the value val to face(iface,iel) of array a.
+c------------------------------------------------------------------
+      include 'header.h'
+
+      double precision a(lx1,lx1,lx1), val
+      integer iface, kx1, kx2, ky1, ky2, kz1, kz2, ix, iy, iz
+
+      kx1=1
+      ky1=1
+      kz1=1
+      kx2=lx1
+      ky2=lx1
+      kz2=lx1
+      if (iface.eq.1) kx1=lx1
+      if (iface.eq.2) kx2=1
+      if (iface.eq.3) ky1=lx1
+      if (iface.eq.4) ky2=1
+      if (iface.eq.5) kz1=lx1
+      if (iface.eq.6) kz2=1
+
+      do ix = kx1, kx2
+        do iy = ky1, ky2
+          do iz = kz1, kz2
+            a(ix,iy,iz)=val
+          end do
+        end do
+      end do
+
+      return
+      end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/verify.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/verify.f
new file mode 100644
index 0000000..189080a
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/UA/verify.f
@@ -0,0 +1,88 @@
+      subroutine verify(class, verified)
+
+      include 'header.h'
+
+      double precision norm, calc_norm, epsilon, norm_dif, norm_ref
+      external         calc_norm
+      character        class
+      logical          verified
+       
+c.....tolerance level
+      epsilon = 1.0d-08
+
+c.....compute the temperature integral over the whole domain
+      norm = calc_norm()
+
+      verified = .true.
+      if     ( class .eq. 'S' ) then
+        norm_ref = 0.1890013110962D-02
+      elseif ( class .eq. 'W' ) then
+        norm_ref = 0.2569794837076D-04
+      elseif ( class .eq. 'A' ) then
+        norm_ref = 0.8939996281443D-04
+      elseif ( class .eq. 'B' ) then
+        norm_ref = 0.4507561922901D-04
+      elseif ( class .eq. 'C' ) then
+        norm_ref = 0.1544736587100D-04
+      elseif ( class .eq. 'D' ) then
+        norm_ref = 0.1577586272355D-05
+      else
+        class = 'U'
+        norm_ref = 1.d0
+        verified = .false.
+      endif         
+
+      norm_dif = dabs((norm - norm_ref)/norm_ref)
+
+c---------------------------------------------------------------------
+c    Output the comparison of computed results to known cases.
+c---------------------------------------------------------------------
+
+      print *
+
+      if (class .ne. 'U') then
+         write(*, 1990) class
+ 1990    format(' Verification being performed for class ', a)
+         write (*,2000) epsilon
+ 2000    format(' accuracy setting for epsilon = ', E20.13)
+      else 
+         write(*, 1995)
+ 1995    format(' Unknown class')
+      endif
+
+      if (class .ne. 'U') then
+         write (*,2001) 
+      else
+         write (*, 2005)
+      endif
+
+ 2001 format(' Comparison of temperature integrals')
+ 2005 format(' Temperature integral')
+      if (class .eq. 'U') then
+         write(*, 2015) norm
+      else if (norm_dif .le. epsilon) then
+         write (*,2011) norm, norm_ref, norm_dif
+      else 
+         verified = .false.
+         write (*,2010) norm, norm_ref, norm_dif
+      endif
+
+ 2010 format(' FAILURE: ', E20.13, E20.13, E20.13)
+ 2011 format('          ', E20.13, E20.13, E20.13)
+ 2015 format('          ', E20.13)
+        
+      if (class .eq. 'U') then
+        write(*, 2022)
+        write(*, 2023)
+ 2022   format(' No reference values provided')
+ 2023   format(' No verification performed')
+      else if (verified) then
+        write(*, 2020)
+ 2020   format(' Verification Successful')
+      else
+        write(*, 2021)
+ 2021   format(' Verification failed')
+      endif
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/c_print_results.c b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/c_print_results.c
new file mode 100644
index 0000000..b8a38ea
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/c_print_results.c
@@ -0,0 +1,114 @@
+/*****************************************************************/
+/******     C  _  P  R  I  N  T  _  R  E  S  U  L  T  S     ******/
+/*****************************************************************/
+#include <stdlib.h>
+#include <stdio.h>
+#ifdef _OPENMP
+#include <omp.h>
+#endif
+
+void c_print_results( char   *name,
+                      char   class,
+                      int    n1, 
+                      int    n2,
+                      int    n3,
+                      int    niter,
+                      double t,
+                      double mops,
+                      char   *optype,
+                      int    passed_verification,
+                      char   *npbversion,
+                      char   *compiletime,
+                      char   *cc,
+                      char   *clink,
+                      char   *c_lib,
+                      char   *c_inc,
+                      char   *cflags,
+                      char   *clinkflags )
+{
+    int num_threads, max_threads;
+
+
+    max_threads = 1;
+    num_threads = 1;
+
+/*   figure out number of threads used */
+#ifdef _OPENMP
+    max_threads = omp_get_max_threads();
+#pragma omp parallel shared(num_threads)
+{
+    #pragma omp master
+    num_threads = omp_get_num_threads();
+}
+#endif
+
+
+    printf( "\n\n %s Benchmark Completed\n", name ); 
+
+    printf( " Class           =                        %c\n", class );
+
+    if( n3 == 0 ) {
+        long nn = n1;
+        if ( n2 != 0 ) nn *= n2;
+        printf( " Size            =             %12ld\n", nn );   /* as in IS */
+    }
+    else
+        printf( " Size            =             %4dx%4dx%4d\n", n1,n2,n3 );
+
+    printf( " Iterations      =             %12d\n", niter );
+ 
+    printf( " Time in seconds =             %12.2f\n", t );
+
+    printf( " Total threads   =             %12d\n", num_threads);
+
+    printf( " Avail threads   =             %12d\n", max_threads);
+
+    if (num_threads != max_threads) 
+        printf( " Warning: Threads used differ from threads available\n");
+
+    printf( " Mop/s total     =             %12.2f\n", mops );
+
+    printf( " Mop/s/thread    =             %12.2f\n",
+           mops/(double)num_threads );
+
+    printf( " Operation type  = %24s\n", optype);
+
+    if( passed_verification < 0 )
+        printf( " Verification    =            NOT PERFORMED\n" );
+    else if( passed_verification )
+        printf( " Verification    =               SUCCESSFUL\n" );
+    else
+        printf( " Verification    =             UNSUCCESSFUL\n" );
+
+    printf( " Version         =             %12s\n", npbversion );
+
+    printf( " Compile date    =             %12s\n", compiletime );
+
+    printf( "\n Compile options:\n" );
+
+    printf( "    CC           = %s\n", cc );
+
+    printf( "    CLINK        = %s\n", clink );
+
+    printf( "    C_LIB        = %s\n", c_lib );
+
+    printf( "    C_INC        = %s\n", c_inc );
+
+    printf( "    CFLAGS       = %s\n", cflags );
+
+    printf( "    CLINKFLAGS   = %s\n", clinkflags );
+
+    printf( "\n\n" );
+    printf( " Please send all errors/feedbacks to:\n\n" );
+    printf( " NPB Development Team\n" );
+    printf( " npb@nas.nasa.gov\n\n\n" );
+/*    printf( " Please send the results of this run to:\n\n" );
+    printf( " NPB Development Team\n" );
+    printf( " Internet: npb@nas.nasa.gov\n \n" );
+    printf( " If email is not available, send this to:\n\n" );
+    printf( " MS T27A-1\n" );
+    printf( " NASA Ames Research Center\n" );
+    printf( " Moffett Field, CA  94035-1000\n\n" );
+    printf( " Fax: 650-604-3957\n\n" ); */
+}
+ 
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/c_timers.c b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/c_timers.c
new file mode 100644
index 0000000..dd770af
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/c_timers.c
@@ -0,0 +1,72 @@
+#include "wtime.h"
+#include <stdlib.h>
+#ifdef _OPENMP
+#include <omp.h>
+#endif
+
+/*  Prototype  */
+void wtime( double * );
+
+
+/*****************************************************************/
+/******         E  L  A  P  S  E  D  _  T  I  M  E          ******/
+/*****************************************************************/
+double elapsed_time( void )
+{
+    double t;
+
+#if defined(_OPENMP) && (_OPENMP > 200010)
+/*  Use the OpenMP timer if we can */
+    t = omp_get_wtime();
+#else
+    wtime( &t );
+#endif
+    return( t );
+}
+
+
+static double start[64], elapsed[64];
+#ifdef _OPENMP
+#pragma omp threadprivate(start, elapsed)
+#endif
+
+/*****************************************************************/
+/******            T  I  M  E  R  _  C  L  E  A  R          ******/
+/*****************************************************************/
+void timer_clear( int n )
+{
+    elapsed[n] = 0.0;
+}
+
+
+/*****************************************************************/
+/******            T  I  M  E  R  _  S  T  A  R  T          ******/
+/*****************************************************************/
+void timer_start( int n )
+{
+    start[n] = elapsed_time();
+}
+
+
+/*****************************************************************/
+/******            T  I  M  E  R  _  S  T  O  P             ******/
+/*****************************************************************/
+void timer_stop( int n )
+{
+    double t, now;
+
+    now = elapsed_time();
+    t = now - start[n];
+    elapsed[n] += t;
+
+}
+
+
+/*****************************************************************/
+/******            T  I  M  E  R  _  R  E  A  D             ******/
+/*****************************************************************/
+double timer_read( int n )
+{
+    return( elapsed[n] );
+}
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/hooks.c b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/hooks.c
new file mode 100644
index 0000000..b5c91d5
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/hooks.c
@@ -0,0 +1,23 @@
+#include <stdio.h>
+#include "m5_mmap.h"
+#include <gem5/m5ops.h>
+
+void init() __attribute__((constructor));
+
+void init() {
+
+	// __attribute__((constructor)) makes this function run before main();
+	// it must mmap /dev/mem so the memory-mapped m5 ops can be used
+	map_m5_mem();
+}
+
+void roi_begin_(){
+
+	printf(" -------------------- ROI BEGIN -------------------- \n");
+	m5_work_begin(0,0);
+}
+
+void roi_end_(){
+	printf(" -------------------- ROI END -------------------- \n");
+	m5_work_end(0,0);
+}
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/m5_mmap.c b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/m5_mmap.c
new file mode 100644
index 0000000..79de59b
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/m5_mmap.c
@@ -0,0 +1,71 @@
+/*
+ * Copyright (c) 2011, 2017 ARM Limited
+ * All rights reserved
+ *
+ * The license below extends only to copyright in the software and shall
+ * not be construed as granting a license to any other intellectual
+ * property including but not limited to intellectual property relating
+ * to a hardware implementation of the functionality of the software
+ * licensed hereunder.  You may use the software subject to the license
+ * terms below provided that you ensure that this notice is replicated
+ * unmodified and in its entirety in all distributions of the software,
+ * modified or unmodified, in source code or in binary form.
+ *
+ * Copyright (c) 2003-2005 The Regents of The University of Michigan
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met: redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer;
+ * redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution;
+ * neither the name of the copyright holders nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+
+#include "m5_mmap.h"
+
+void *m5_mem = NULL;
+
+void
+map_m5_mem()
+{
+#ifdef M5OP_ADDR
+    int fd;
+
+    fd = open("/dev/mem", O_RDWR | O_SYNC);
+    if (fd == -1) {
+        perror("Can't open /dev/mem");
+        exit(1);
+    }
+
+    m5_mem = mmap(NULL, 0x10000, PROT_READ | PROT_WRITE, MAP_SHARED, fd,
+                  M5OP_ADDR);
+    if (m5_mem == MAP_FAILED) {
+        perror("Can't mmap /dev/mem");
+        exit(1);
+    }
+#endif
+}
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/m5_mmap.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/m5_mmap.h
new file mode 100644
index 0000000..d32857f
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/m5_mmap.h
@@ -0,0 +1,51 @@
+/*
+ * Copyright (c) 2011, 2017 ARM Limited
+ * All rights reserved
+ *
+ * The license below extends only to copyright in the software and shall
+ * not be construed as granting a license to any other intellectual
+ * property including but not limited to intellectual property relating
+ * to a hardware implementation of the functionality of the software
+ * licensed hereunder.  You may use the software subject to the license
+ * terms below provided that you ensure that this notice is replicated
+ * unmodified and in its entirety in all distributions of the software,
+ * modified or unmodified, in source code or in binary form.
+ *
+ * Copyright (c) 2003-2005 The Regents of The University of Michigan
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met: redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer;
+ * redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution;
+ * neither the name of the copyright holders nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef __UTIL_M5_MMAP_H__
+#define __UTIL_M5_MMAP_H__
+
+#include <fcntl.h>
+#include <sys/mman.h>
+
+extern void *m5_mem;
+
+void map_m5_mem();
+
+#endif
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/m5op_x86.S b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/m5op_x86.S
new file mode 100644
index 0000000..2a8abbb
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/m5op_x86.S
@@ -0,0 +1,101 @@
+/*
+ * Copyright (c) 2003-2006 The Regents of The University of Michigan
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met: redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer;
+ * redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution;
+ * neither the name of the copyright holders nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * Authors: Gabe Black
+ *          Nathan Binkert
+ *          Ali Saidi
+ */
+
+#include <gem5/asm/generic/m5ops.h>
+
+#if defined(M5OP_ADDR) && defined(M5OP_PIC)
+/* Use the memory mapped m5op interface */
+#define TWO_BYTE_OP(name, number)         \
+        .globl name;                      \
+        .func name;                       \
+name:                                     \
+        mov m5_mem@gotpcrel(%rip), %r11;  \
+        mov (%r11), %r11;                 \
+        mov $number, %rax;                \
+        shl $8, %rax;                     \
+        mov 0(%r11, %rax, 1), %rax;       \
+        ret;                              \
+        .endfunc;
+
+#elif defined(M5OP_ADDR) && !defined(M5OP_PIC)
+/* Use the memory mapped m5op interface */
+#define TWO_BYTE_OP(name, number)         \
+        .globl name;                      \
+        .func name;                       \
+name:                                     \
+        mov m5_mem, %r11;                 \
+        mov $number, %rax;                \
+        shl $8, %rax;                     \
+        mov 0(%r11, %rax, 1), %rax;       \
+        ret;                              \
+        .endfunc;
+
+#else
+/* Use the magic instruction based m5op interface. This does not work
+ * in virtualized environments.
+ */
+
+#define TWO_BYTE_OP(name, number)         \
+        .globl name;                      \
+        .func name;                       \
+name:                                     \
+        .byte 0x0F, 0x04;                 \
+        .word number;                     \
+        ret;                              \
+        .endfunc;
+
+#endif
+
+TWO_BYTE_OP(m5_arm, M5OP_ARM)
+TWO_BYTE_OP(m5_quiesce, M5OP_QUIESCE)
+TWO_BYTE_OP(m5_quiesce_ns, M5OP_QUIESCE_NS)
+TWO_BYTE_OP(m5_quiesce_cycle, M5OP_QUIESCE_CYCLE)
+TWO_BYTE_OP(m5_quiesce_time, M5OP_QUIESCE_TIME)
+TWO_BYTE_OP(m5_rpns, M5OP_RPNS)
+TWO_BYTE_OP(m5_wake_cpu, M5OP_WAKE_CPU)
+TWO_BYTE_OP(m5_exit, M5OP_EXIT)
+TWO_BYTE_OP(m5_fail, M5OP_FAIL)
+TWO_BYTE_OP(m5_init_param, M5OP_INIT_PARAM)
+TWO_BYTE_OP(m5_load_symbol, M5OP_LOAD_SYMBOL)
+TWO_BYTE_OP(m5_reset_stats, M5OP_RESET_STATS)
+TWO_BYTE_OP(m5_dump_stats, M5OP_DUMP_STATS)
+TWO_BYTE_OP(m5_dump_reset_stats, M5OP_DUMP_RESET_STATS)
+TWO_BYTE_OP(m5_checkpoint, M5OP_CHECKPOINT)
+TWO_BYTE_OP(m5_read_file, M5OP_READ_FILE)
+TWO_BYTE_OP(m5_write_file, M5OP_WRITE_FILE)
+TWO_BYTE_OP(m5_debug_break, M5OP_DEBUG_BREAK)
+TWO_BYTE_OP(m5_switch_cpu, M5OP_SWITCH_CPU)
+TWO_BYTE_OP(m5_add_symbol, M5OP_ADD_SYMBOL)
+TWO_BYTE_OP(m5_panic, M5OP_PANIC)
+TWO_BYTE_OP(m5_work_begin, M5OP_WORK_BEGIN)
+TWO_BYTE_OP(m5_work_end, M5OP_WORK_END)
+TWO_BYTE_OP(m5_dist_toggle_sync, M5OP_DIST_TOGGLE_SYNC)
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/print_results.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/print_results.f
new file mode 100644
index 0000000..0337bf1
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/print_results.f
@@ -0,0 +1,136 @@
+
+      subroutine print_results(name, class, n1, n2, n3, niter, 
+     >               t, mops, optype, verified, npbversion, 
+     >               compiletime, cs1, cs2, cs3, cs4, cs5, cs6, cs7)
+      
+      implicit none
+      character name*(*)
+      character class*1
+      integer   n1, n2, n3, niter, j
+      double precision t, mops
+      character optype*24, size*15
+      logical   verified
+      character*(*) npbversion, compiletime, 
+     >              cs1, cs2, cs3, cs4, cs5, cs6, cs7
+      integer   num_threads, max_threads, i
+c$    integer omp_get_num_threads, omp_get_max_threads
+c$    external omp_get_num_threads, omp_get_max_threads
+
+
+      max_threads = 1
+c$    max_threads = omp_get_max_threads()
+
+c     figure out number of threads used
+      num_threads = 1
+c$omp parallel shared(num_threads)
+c$omp master
+c$    num_threads = omp_get_num_threads()
+c$omp end master
+c$omp end parallel
+
+
+         write (*, 2) name
+ 2       format(//, ' ', A, ' Benchmark Completed.')
+
+         write (*, 3) Class
+ 3       format(' Class           = ', 12x, a12)
+
+c   If this is not a grid-based problem (EP, FT, CG), then
+c   we only print n1, which contains some measure of the
+c   problem size. In that case, n2 and n3 are both zero.
+c   Otherwise, we print the grid size n1xn2xn3
+
+         if ((n2 .eq. 0) .and. (n3 .eq. 0)) then
+            if (name(1:2) .eq. 'EP') then
+               write(size, '(f15.0)' ) 2.d0**n1
+               j = 15
+               if (size(j:j) .eq. '.') j = j - 1
+               write (*,42) size(1:j)
+ 42            format(' Size            = ',9x, a15)
+            else
+               write (*,44) n1
+ 44            format(' Size            = ',12x, i12)
+            endif
+         else
+            write (*, 4) n1,n2,n3
+ 4          format(' Size            =  ',9x, i4,'x',i4,'x',i4)
+         endif
+
+         write (*, 5) niter
+ 5       format(' Iterations      = ', 12x, i12)
+         
+         write (*, 6) t
+ 6       format(' Time in seconds = ',12x, f12.2)
+
+         write (*,7) num_threads
+ 7       format(' Total threads   = ', 12x, i12)
+         
+         write (*,8) max_threads
+ 8       format(' Avail threads   = ', 12x, i12)
+
+         if (num_threads .ne. max_threads) write (*,88) 
+ 88      format(' Warning: Threads used differ from threads available')
+
+         write (*,9) mops
+ 9       format(' Mop/s total     = ',12x, f12.2)
+
+         write (*,10) mops/float( num_threads )
+ 10      format(' Mop/s/thread    = ', 12x, f12.2)        
+
+         write(*, 11) optype
+ 11      format(' Operation type  = ', a24)
+
+         if (verified) then 
+            write(*,12) '  SUCCESSFUL'
+         else
+            write(*,12) 'UNSUCCESSFUL'
+         endif
+ 12      format(' Verification    = ', 12x, a)
+
+         write(*,13) npbversion
+ 13      format(' Version         = ', 12x, a12)
+
+         write(*,14) compiletime
+ 14      format(' Compile date    = ', 12x, a12)
+
+
+         write (*,121) cs1
+ 121     format(/, ' Compile options:', /, 
+     >          '    F77          = ', A)
+
+         write (*,122) cs2
+ 122     format('    FLINK        = ', A)
+
+         write (*,123) cs3
+ 123     format('    F_LIB        = ', A)
+
+         write (*,124) cs4
+ 124     format('    F_INC        = ', A)
+
+         write (*,125) cs5
+ 125     format('    FFLAGS       = ', A)
+
+         write (*,126) cs6
+ 126     format('    FLINKFLAGS   = ', A)
+
+         write(*, 127) cs7
+ 127     format('    RAND         = ', A)
+        
+         write (*,130)
+ 130     format(//' Please send all errors/feedbacks to:'//
+     >            ' NPB Development Team'/
+     >            ' npb@nas.nasa.gov'//)
+c 130     format(//' Please send the results of this run to:'//
+c     >            ' NPB Development Team '/
+c     >            ' Internet: npb@nas.nasa.gov'/
+c     >            ' '/
+c     >            ' If email is not available, send this to:'//
+c     >            ' MS T27A-1'/
+c     >            ' NASA Ames Research Center'/
+c     >            ' Moffett Field, CA  94035-1000'//
+c     >            ' Fax: 650-604-3957'//)
+
+
+         return
+         end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/randdp.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/randdp.f
new file mode 100644
index 0000000..64860d9
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/randdp.f
@@ -0,0 +1,137 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      double precision function randlc (x, a)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c
+c   This routine returns a uniform pseudorandom double precision number in the
+c   range (0, 1) by using the linear congruential generator
+c
+c   x_{k+1} = a x_k  (mod 2^46)
+c
+c   where 0 < x_k < 2^46 and 0 < a < 2^46.  This scheme generates 2^44 numbers
+c   before repeating.  The argument A is the same as 'a' in the above formula,
+c   and X is the same as x_0.  A and X must be odd double precision integers
+c   in the range (1, 2^46).  The returned value RANDLC is normalized to be
+c   between 0 and 1, i.e. RANDLC = 2^(-46) * x_1.  X is updated to contain
+c   the new seed x_1, so that subsequent calls to RANDLC using the same
+c   arguments will generate a continuous sequence.
+c
+c   This routine should produce the same results on any computer with at least
+c   48 mantissa bits in double precision floating point data.  On 64 bit
+c   systems, double precision should be disabled.
+c
+c   David H. Bailey     October 26, 1990
+c
+c---------------------------------------------------------------------
+
+      implicit none
+
+      double precision r23,r46,t23,t46,a,x,t1,t2,t3,t4,a1,a2,x1,x2,z
+      parameter (r23 = 0.5d0 ** 23, r46 = r23 ** 2, t23 = 2.d0 ** 23,
+     >  t46 = t23 ** 2)
+
+c---------------------------------------------------------------------
+c   Break A into two parts such that A = 2^23 * A1 + A2.
+c---------------------------------------------------------------------
+      t1 = r23 * a
+      a1 = int (t1)
+      a2 = a - t23 * a1
+
+c---------------------------------------------------------------------
+c   Break X into two parts such that X = 2^23 * X1 + X2, compute
+c   Z = A1 * X2 + A2 * X1  (mod 2^23), and then
+c   X = 2^23 * Z + A2 * X2  (mod 2^46).
+c---------------------------------------------------------------------
+      t1 = r23 * x
+      x1 = int (t1)
+      x2 = x - t23 * x1
+      t1 = a1 * x2 + a2 * x1
+      t2 = int (r23 * t1)
+      z = t1 - t23 * t2
+      t3 = t23 * z + a2 * x2
+      t4 = int (r46 * t3)
+      x = t3 - t46 * t4
+      randlc = r46 * x
+
+      return
+      end
+
+
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine vranlc (n, x, a, y)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c
+c   This routine generates N uniform pseudorandom double precision numbers in
+c   the range (0, 1) by using the linear congruential generator
+c
+c   x_{k+1} = a x_k  (mod 2^46)
+c
+c   where 0 < x_k < 2^46 and 0 < a < 2^46.  This scheme generates 2^44 numbers
+c   before repeating.  The argument A is the same as 'a' in the above formula,
+c   and X is the same as x_0.  A and X must be odd double precision integers
+c   in the range (1, 2^46).  The N results are placed in Y and are normalized
+c   to be between 0 and 1.  X is updated to contain the new seed, so that
+c   subsequent calls to VRANLC using the same arguments will generate a
+c   continuous sequence.  If N is zero, only initialization is performed, and
+c   the variables X, A and Y are ignored.
+c
+c   This routine is the standard version designed for scalar or RISC systems.
+c   However, it should produce the same results on any single processor
+c   computer with at least 48 mantissa bits in double precision floating point
+c   data.  On 64 bit systems, double precision should be disabled.
+c
+c---------------------------------------------------------------------
+
+      implicit none
+
+      integer i,n
+      double precision y,r23,r46,t23,t46,a,x,t1,t2,t3,t4,a1,a2,x1,x2,z
+      dimension y(*)
+      parameter (r23 = 0.5d0 ** 23, r46 = r23 ** 2, t23 = 2.d0 ** 23,
+     >  t46 = t23 ** 2)
+
+
+c---------------------------------------------------------------------
+c   Break A into two parts such that A = 2^23 * A1 + A2.
+c---------------------------------------------------------------------
+      t1 = r23 * a
+      a1 = int (t1)
+      a2 = a - t23 * a1
+
+c---------------------------------------------------------------------
+c   Generate N results.   This loop is not vectorizable.
+c---------------------------------------------------------------------
+      do i = 1, n
+
+c---------------------------------------------------------------------
+c   Break X into two parts such that X = 2^23 * X1 + X2, compute
+c   Z = A1 * X2 + A2 * X1  (mod 2^23), and then
+c   X = 2^23 * Z + A2 * X2  (mod 2^46).
+c---------------------------------------------------------------------
+        t1 = r23 * x
+        x1 = int (t1)
+        x2 = x - t23 * x1
+        t1 = a1 * x2 + a2 * x1
+        t2 = int (r23 * t1)
+        z = t1 - t23 * t2
+        t3 = t23 * z + a2 * x2
+        t4 = int (r46 * t3)
+        x = t3 - t46 * t4
+        y(i) = r46 * x
+      enddo
+
+      return
+      end
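The split-multiply scheme documented in `randdp.f` above can be cross-checked against plain 64-bit integer arithmetic. The C sketch below is illustrative only and is not part of the patch: the function names (`randlc_split`, `randlc_int`, `lcg_mismatches`) are hypothetical, and the seed and multiplier in the usage note are example values that do not come from this file.

```c
#include <assert.h>
#include <stdint.h>

/* Split-multiply form mirroring the Fortran randlc: every value is an
 * integer held in a double, and no intermediate exceeds 2^47, so each
 * step is exact in IEEE double precision. */
static double randlc_split(double *x, double a)
{
    const double t23 = 8388608.0;        /* 2^23  */
    const double r23 = 1.0 / t23;        /* 2^-23 */
    const double t46 = t23 * t23;        /* 2^46  */
    const double r46 = r23 * r23;        /* 2^-46 */

    /* Break A and X into 23-bit halves: V = 2^23 * V1 + V2. */
    double a1 = (double)(int64_t)(r23 * a);
    double a2 = a - t23 * a1;
    double x1 = (double)(int64_t)(r23 * *x);
    double x2 = *x - t23 * x1;

    /* Z = A1*X2 + A2*X1 (mod 2^23); X = 2^23*Z + A2*X2 (mod 2^46). */
    double t1 = a1 * x2 + a2 * x1;
    double z  = t1 - t23 * (double)(int64_t)(r23 * t1);
    double t3 = t23 * z + a2 * x2;
    *x = t3 - t46 * (double)(int64_t)(r46 * t3);
    return r46 * *x;
}

/* Integer form: unsigned overflow wraps modulo 2^64, so masking the
 * product to its low 46 bits gives a*x (mod 2^46). */
static double randlc_int(uint64_t *x, uint64_t a)
{
    const uint64_t mask = (UINT64_C(1) << 46) - 1;
    *x = (*x * a) & mask;
    return (double)*x / (double)(UINT64_C(1) << 46);
}

/* Count the steps on which the two forms disagree. */
static int lcg_mismatches(uint64_t seed, uint64_t a, int steps)
{
    assert(seed < (UINT64_C(1) << 46) && a < (UINT64_C(1) << 46));
    double xd = (double)seed;
    uint64_t xi = seed;
    int bad = 0;
    for (int i = 0; i < steps; i++) {
        double rd = randlc_split(&xd, (double)a);
        double ri = randlc_int(&xi, a);
        if (rd != ri || xd != (double)xi)
            bad++;
    }
    return bad;
}
```

Calling `lcg_mismatches(314159265, 1220703125, 1000)` from a `main` is expected to return 0, because both forms compute the same integer recurrence exactly. The integer form corresponds to the approach taken in `randi8.f`, while the split form only needs a floating-point type with at least 48 mantissa bits.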
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/randdpvec.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/randdpvec.f
new file mode 100644
index 0000000..c708071
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/randdpvec.f
@@ -0,0 +1,186 @@
+c---------------------------------------------------------------------
+      double precision function randlc (x, a)
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c
+c   This routine returns a uniform pseudorandom double precision number in the
+c   range (0, 1) by using the linear congruential generator
+c
+c   x_{k+1} = a x_k  (mod 2^46)
+c
+c   where 0 < x_k < 2^46 and 0 < a < 2^46.  This scheme generates 2^44 numbers
+c   before repeating.  The argument A is the same as 'a' in the above formula,
+c   and X is the same as x_0.  A and X must be odd double precision integers
+c   in the range (1, 2^46).  The returned value RANDLC is normalized to be
+c   between 0 and 1, i.e. RANDLC = 2^(-46) * x_1.  X is updated to contain
+c   the new seed x_1, so that subsequent calls to RANDLC using the same
+c   arguments will generate a continuous sequence.
+c
+c   This routine should produce the same results on any computer with at least
+c   48 mantissa bits in double precision floating point data.  On 64 bit
+c   systems, double precision should be disabled.
+c
+c   David H. Bailey     October 26, 1990
+c
+c---------------------------------------------------------------------
+
+      implicit none
+
+      double precision r23,r46,t23,t46,a,x,t1,t2,t3,t4,a1,a2,x1,x2,z
+      parameter (r23 = 0.5d0 ** 23, r46 = r23 ** 2, t23 = 2.d0 ** 23,
+     >  t46 = t23 ** 2)
+
+c---------------------------------------------------------------------
+c   Break A into two parts such that A = 2^23 * A1 + A2.
+c---------------------------------------------------------------------
+      t1 = r23 * a
+      a1 = int (t1)
+      a2 = a - t23 * a1
+
+c---------------------------------------------------------------------
+c   Break X into two parts such that X = 2^23 * X1 + X2, compute
+c   Z = A1 * X2 + A2 * X1  (mod 2^23), and then
+c   X = 2^23 * Z + A2 * X2  (mod 2^46).
+c---------------------------------------------------------------------
+      t1 = r23 * x
+      x1 = int (t1)
+      x2 = x - t23 * x1
+
+
+      t1 = a1 * x2 + a2 * x1
+      t2 = int (r23 * t1)
+      z = t1 - t23 * t2
+      t3 = t23 * z + a2 * x2
+      t4 = int (r46 * t3)
+      x = t3 - t46 * t4
+      randlc = r46 * x
+      return
+      end
+
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine vranlc (n, x, a, y)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   This routine generates N uniform pseudorandom double precision numbers in
+c   the range (0, 1) by using the linear congruential generator
+c   
+c   x_{k+1} = a x_k  (mod 2^46)
+c   
+c   where 0 < x_k < 2^46 and 0 < a < 2^46.  This scheme generates 2^44 numbers
+c   before repeating.  The argument A is the same as 'a' in the above formula,
+c   and X is the same as x_0.  A and X must be odd double precision integers
+c   in the range (1, 2^46).  The N results are placed in Y and are normalized
+c   to be between 0 and 1.  X is updated to contain the new seed, so that
+c   subsequent calls to VRANLC using the same arguments will generate a
+c   continuous sequence.
+c   
+c   This routine generates the output sequence in batches of length NV, for
+c   convenience on vector computers.  This routine should produce the same
+c   results on any computer with at least 48 mantissa bits in double precision
+c   floating point data.  On Cray systems, double precision should be disabled.
+c   
+c   David H. Bailey    August 30, 1990
+c---------------------------------------------------------------------
+
+      integer n
+      double precision x, a, y(*)
+      
+      double precision r23, r46, t23, t46
+      integer nv
+      parameter (r23 = 2.d0 ** (-23), r46 = r23 * r23, t23 = 2.d0 ** 23,
+     >     t46 = t23 * t23, nv = 64)
+      double precision  xv(nv), t1, t2, t3, t4, an, a1, a2, x1, x2, yy
+      integer n1, i, j
+      external randlc
+      double precision randlc
+
+c---------------------------------------------------------------------
+c     Compute the first NV elements of the sequence using RANDLC.
+c---------------------------------------------------------------------
+      t1 = x
+      n1 = min (n, nv)
+
+      do  i = 1, n1
+         xv(i) = t46 * randlc (t1, a)
+      enddo
+
+c---------------------------------------------------------------------
+c     It is not necessary to compute AN, A1 or A2 unless N is greater than NV.
+c---------------------------------------------------------------------
+      if (n .gt. nv) then
+
+c---------------------------------------------------------------------
+c     Compute AN = A ^ NV (mod 2^46) using successive calls to RANDLC.
+c---------------------------------------------------------------------
+         t1 = a
+         t2 = r46 * a
+
+         do  i = 1, nv - 1
+            t2 = randlc (t1, a)
+         enddo
+
+         an = t46 * t2
+
+c---------------------------------------------------------------------
+c     Break AN into two parts such that AN = 2^23 * A1 + A2.
+c---------------------------------------------------------------------
+         t1 = r23 * an
+         a1 = aint (t1)
+         a2 = an - t23 * a1
+      endif
+
+c---------------------------------------------------------------------
+c     Compute N pseudorandom results in batches of size NV.
+c---------------------------------------------------------------------
+      do  j = 0, n - 1, nv
+         n1 = min (nv, n - j)
+
+c---------------------------------------------------------------------
+c     Compute up to NV results based on the current seed vector XV.
+c---------------------------------------------------------------------
+         do  i = 1, n1
+            y(i+j) = r46 * xv(i)
+         enddo
+
+c---------------------------------------------------------------------
+c     If this is the last pass through the outer loop over J, it is not
+c     necessary to update the XV vector.
+c---------------------------------------------------------------------
+         if (j + n1 .eq. n) goto 150
+
+c---------------------------------------------------------------------
+c     Update the XV vector by multiplying each element by AN (mod 2^46).
+c---------------------------------------------------------------------
+         do  i = 1, nv
+            t1 = r23 * xv(i)
+            x1 = aint (t1)
+            x2 = xv(i) - t23 * x1
+            t1 = a1 * x2 + a2 * x1
+            t2 = aint (r23 * t1)
+            yy = t1 - t23 * t2
+            t3 = t23 * yy + a2 * x2
+            t4 = aint (r46 * t3)
+            xv(i) = t3 - t46 * t4
+         enddo
+
+      enddo
+
+c---------------------------------------------------------------------
+c     Save the last seed in X so that subsequent calls to VRANLC will generate
+c     a continuous sequence.
+c---------------------------------------------------------------------
+ 150  x = xv(n1)
+
+      return
+      end
+
+c----- end of program ------------------------------------------------
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/randi8.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/randi8.f
new file mode 100644
index 0000000..21ab881
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/randi8.f
@@ -0,0 +1,79 @@
+      double precision function randlc(x, a)
+
+c---------------------------------------------------------------------
+c
+c   This routine returns a uniform pseudorandom double precision number in the
+c   range (0, 1) by using the linear congruential generator
+c
+c   x_{k+1} = a x_k  (mod 2^46)
+c
+c   where 0 < x_k < 2^46 and 0 < a < 2^46.  This scheme generates 2^44 numbers
+c   before repeating.  The argument A is the same as 'a' in the above formula,
+c   and X is the same as x_0.  A and X must be odd double precision integers
+c   in the range (1, 2^46).  The returned value RANDLC is normalized to be
+c   between 0 and 1, i.e. RANDLC = 2^(-46) * x_1.  X is updated to contain
+c   the new seed x_1, so that subsequent calls to RANDLC using the same
+c   arguments will generate a continuous sequence.
+
+      implicit none
+      double precision x, a
+      integer*8 i246m1, Lx, La
+      double precision d2m46
+
+      parameter(d2m46=0.5d0**46)
+
+      save i246m1
+      data i246m1/X'00003FFFFFFFFFFF'/
+
+      Lx = X
+      La = A
+
+      Lx   = iand(Lx*La,i246m1)
+      randlc = d2m46*dble(Lx)
+      x    = dble(Lx)
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+
+      SUBROUTINE VRANLC (N, X, A, Y)
+      implicit none
+      integer n, i
+      double precision x, a, y(*)
+      integer*8 i246m1, Lx, La
+      double precision d2m46
+
+c This doesn't work, because the compiler does the calculation in 32
+c bits and overflows. No standard way (without f90 stuff) to specify
+c that the rhs should be done in 64 bit arithmetic. 
+c      parameter(i246m1=2**46-1)
+
+      parameter(d2m46=0.5d0**46)
+
+      save i246m1
+      data i246m1/X'00003FFFFFFFFFFF'/
+
+c Note that the v6 compiler on an R8000 does something stupid with
+c the above. Using the following instead (or various other things)
+c makes the calculation run almost 10 times as fast. 
+c 
+c      save d2m46
+c      data d2m46/0.0d0/
+c      if (d2m46 .eq. 0.0d0) then
+c         d2m46 = 0.5d0**46
+c      endif
+
+      Lx = X
+      La = A
+      do i = 1, N
+         Lx   = iand(Lx*La,i246m1)
+         y(i) = d2m46*dble(Lx)
+      end do
+      x    = dble(Lx)
+
+      return
+      end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/randi8_safe.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/randi8_safe.f
new file mode 100644
index 0000000..f725b6a
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/randi8_safe.f
@@ -0,0 +1,64 @@
+      double precision function randlc(x, a)
+
+c---------------------------------------------------------------------
+c
+c   This routine returns a uniform pseudorandom double precision number in the
+c   range (0, 1) by using the linear congruential generator
+c
+c   x_{k+1} = a x_k  (mod 2^46)
+c
+c   where 0 < x_k < 2^46 and 0 < a < 2^46.  This scheme generates 2^44 numbers
+c   before repeating.  The argument A is the same as 'a' in the above formula,
+c   and X is the same as x_0.  A and X must be odd double precision integers
+c   in the range (1, 2^46).  The returned value RANDLC is normalized to be
+c   between 0 and 1, i.e. RANDLC = 2^(-46) * x_1.  X is updated to contain
+c   the new seed x_1, so that subsequent calls to RANDLC using the same
+c   arguments will generate a continuous sequence.
+
+      implicit none
+      double precision x, a
+      integer*8 Lx, La, a1, a2, x1, x2, xa
+      double precision d2m46
+      parameter(d2m46=0.5d0**46)
+
+      Lx = x
+      La = A
+      a1 = ibits(La, 23, 23)
+      a2 = ibits(La, 0, 23)
+      x1 = ibits(Lx, 23, 23)
+      x2 = ibits(Lx, 0, 23)
+      xa = ishft(ibits(a1*x2+a2*x1, 0, 23), 23) + a2*x2
+      Lx   = ibits(xa,0, 46)
+      x    = dble(Lx)
+      randlc = d2m46*x
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+
+      SUBROUTINE VRANLC (N, X, A, Y)
+      implicit none
+      integer n, i
+      double precision x, a, y(*)
+      integer*8 Lx, La, a1, a2, x1, x2, xa
+      double precision d2m46
+      parameter(d2m46=0.5d0**46)
+
+      Lx = X
+      La = A
+      a1 = ibits(La, 23, 23)
+      a2 = ibits(La, 0, 23)
+      do i = 1, N
+         x1 = ibits(Lx, 23, 23)
+         x2 = ibits(Lx, 0, 23)
+         xa = ishft(ibits(a1*x2+a2*x1, 0, 23), 23) + a2*x2
+         Lx   = ibits(xa,0, 46)
+         y(i) = d2m46*dble(Lx)
+      end do
+      x = dble(Lx)
+      return
+      end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/timers.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/timers.f
new file mode 100644
index 0000000..6e707a4
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/timers.f
@@ -0,0 +1,122 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      
+      subroutine timer_clear(n)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+      integer n
+      
+      double precision start(64), elapsed(64)
+      common /tt/ start, elapsed
+c$omp threadprivate(/tt/)
+
+      elapsed(n) = 0.0
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine timer_start(n)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+      external         elapsed_time
+      double precision elapsed_time
+      integer n
+      double precision start(64), elapsed(64)
+      common /tt/ start, elapsed
+c$omp threadprivate(/tt/)
+
+      start(n) = elapsed_time()
+
+      return
+      end
+      
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine timer_stop(n)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+      external         elapsed_time
+      double precision elapsed_time
+      integer n
+      double precision start(64), elapsed(64)
+      common /tt/ start, elapsed
+c$omp threadprivate(/tt/)
+      double precision t, now
+      now = elapsed_time()
+      t = now - start(n)
+      elapsed(n) = elapsed(n) + t
+
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      double precision function timer_read(n)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+      integer n
+      double precision start(64), elapsed(64)
+      common /tt/ start, elapsed
+c$omp threadprivate(/tt/)
+      
+      timer_read = elapsed(n)
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      double precision function elapsed_time()
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+c$    external         omp_get_wtime
+c$    double precision omp_get_wtime
+
+      double precision t
+      logical          mp
+
+c ... Use the OpenMP timer if we can (via C$ conditional compilation)
+      mp = .false.
+c$    mp = .true.
+c$    t = omp_get_wtime()
+
+      if (.not.mp) then
+c This function must measure wall clock time, not CPU time. 
+c Since there is no portable timer in Fortran (77)
+c we call a routine compiled in C (though the C source may have
+c to be tweaked). 
+         call wtime(t)
+c The following is not ok for "official" results because it reports
+c CPU time not wall clock time. It may be useful for developing/testing
+c on timeshared Crays, though. 
+c        call second(t)
+      endif
+
+      elapsed_time = t
+
+      return
+      end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/wtime.c b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/wtime.c
new file mode 100644
index 0000000..b5dcdaa
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/wtime.c
@@ -0,0 +1,16 @@
+#include "wtime.h"
+#include <time.h>
+#ifndef DOS
+#include <sys/time.h>
+#endif
+
+void wtime(double *t)
+{
+   /* a generic timer */
+   static int sec = -1;
+   struct timeval tv;
+   gettimeofday(&tv, (void *)0);
+   if (sec < 0) sec = tv.tv_sec;
+   *t = (tv.tv_sec - sec) + 1.0e-6*tv.tv_usec;
+}
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/wtime.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/wtime.h
new file mode 100644
index 0000000..12eb0cb
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/wtime.h
@@ -0,0 +1,12 @@
+/* C/Fortran interface is different on different machines. 
+ * You may need to tweak this.
+ */
+
+
+#if defined(IBM)
+#define wtime wtime
+#elif defined(CRAY)
+#define wtime WTIME
+#else
+#define wtime wtime_
+#endif
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/wtime_sgi64.c b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/wtime_sgi64.c
new file mode 100644
index 0000000..d08d50c
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/common/wtime_sgi64.c
@@ -0,0 +1,74 @@
+#include <sys/types.h>
+#include <fcntl.h>
+#include <sys/mman.h>
+#include <sys/syssgi.h>
+#include <sys/immu.h>
+#include <errno.h>
+#include <stdio.h>
+
+/* The following works on SGI Power Challenge systems */
+
+typedef unsigned long iotimer_t;
+
+unsigned int cycleval;
+volatile iotimer_t *iotimer_addr, base_counter;
+double resolution;
+
+/* address_t is an integer type big enough to hold an address */
+typedef unsigned long address_t;
+
+
+
+void timer_init() 
+{
+  
+  int fd;
+  char *virt_addr;
+  address_t phys_addr, page_offset, pagemask, pagebase_addr;
+  
+  pagemask = getpagesize() - 1;
+  errno = 0;
+  phys_addr = syssgi(SGI_QUERY_CYCLECNTR, &cycleval);
+  if (errno != 0) {
+    perror("SGI_QUERY_CYCLECNTR");
+    exit(1);
+  }
+  /* page_offset = offset of the physical address within its page */
+  page_offset = phys_addr & pagemask;
+  pagebase_addr = phys_addr - page_offset;
+  fd = open("/dev/mmem", O_RDONLY);
+
+  virt_addr = mmap(0, pagemask, PROT_READ, MAP_PRIVATE, fd, pagebase_addr);
+  virt_addr = virt_addr + page_offset;
+  iotimer_addr = (iotimer_t *)virt_addr;
+  /* cycleval is in picoseconds; scaling by 1.0e-12 gives seconds per tick */
+  resolution = 1.0e-12*cycleval; 
+  base_counter = *iotimer_addr;
+}
+
+void wtime_(double *time) 
+{
+  static int initialized = 0;
+  volatile iotimer_t counter_value;
+  if (!initialized) { 
+    timer_init();
+    initialized = 1;
+  }
+  counter_value = *iotimer_addr - base_counter;
+  *time = (double)counter_value * resolution;
+}
+
+
+void wtime(double *time) 
+{
+  static int initialized = 0;
+  volatile iotimer_t counter_value;
+  if (!initialized) { 
+    timer_init();
+    initialized = 1;
+  }
+  counter_value = *iotimer_addr - base_counter;
+  *time = (double)counter_value * resolution;
+}
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/README b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/README
new file mode 100644
index 0000000..ae535e9
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/README
@@ -0,0 +1,7 @@
+This directory contains examples of make.def files that were used 
+by the NPB team in testing the benchmarks on different platforms. 
+They can be used as starting points for make.def files for your 
+own platform, but you may need to tailor them for best performance 
+on your installation. A clean template can be found in directory 
+`config'.
+Some examples of suite.def files are also provided.
\ No newline at end of file
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/make.def.gcc_x86 b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/make.def.gcc_x86
new file mode 100644
index 0000000..eb60541
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/make.def.gcc_x86
@@ -0,0 +1,161 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP, BT and UA, which are in Fortran, the following 
+# must be defined:
+#
+# F77        - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# F_INC      - any -I arguments required for compiling Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# F_LIB      - any -L and -l arguments required for linking Fortran 
+# 
+# compilations are done with $(F77) $(F_INC) $(FFLAGS) or
+#                            $(F77) $(FFLAGS)
+# linking is done with       $(FLINK) $(F_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the fortran compiler used for Fortran programs
+#---------------------------------------------------------------------------
+F77 = gfortran
+# This links fortran programs; usually the same as ${F77}
+FLINK	= $(F77)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+F_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+F_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -O3 -fopenmp -mcmodel=medium
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = -O3 -fopenmp -mcmodel=medium
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS and DC, which are in C, the following must be defined:
+#
+# CC         - C compiler 
+# CFLAGS     - C compilation arguments
+# C_INC      - any -I arguments required for compiling C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# C_LIB      - any -L and -l arguments required for linking C 
+#
+# compilations are done with $(CC) $(C_INC) $(CFLAGS) or
+#                            $(CC) $(CFLAGS)
+# linking is done with       $(CLINK) $(C_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for C programs
+#---------------------------------------------------------------------------
+CC = gcc
+# This links C programs; usually the same as ${CC}
+CLINK	= $(CC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+C_LIB  = -lm
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+C_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+# DC inspects the following flags (preceded by "-D"):
+#
+# IN_CORE - computes all views and checksums in main memory (if there is 
+# enough memory)
+#
+# VIEW_FILE_OUTPUT - forces DC to write the generated views to disk
+#
+# OPTIMIZATION - turns on some nonstandard DC optimizations
+#
+# _FILE_OFFSET_BITS=64 
+# _LARGEFILE64_SOURCE - are standard compiler flags that allow working with 
+# files larger than 2GB.
+#---------------------------------------------------------------------------
+CFLAGS	= -O3 -fopenmp -mcmodel=medium
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = -O3 -fopenmp -mcmodel=medium
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+UCC	= gcc
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
+
+#---------------------------------------------------------------------------
+# The variable WTIME is the name of the wtime source code module in the
+# common directory.  
+# For most machines,       use wtime.c
+# For SGI power challenge: use wtime_sgi64.c
+#---------------------------------------------------------------------------
+WTIME  = wtime.c
+
+
+#---------------------------------------------------------------------------
+# Enable if either Cray (not Cray-X1) or IBM: 
+# (no such flag for most machines: see common/wtime.h)
+# This is used by the C compiler to pass the machine name to common/wtime.h,
+# where the C/Fortran binding interface format is determined
+#---------------------------------------------------------------------------
+# MACHINE	=	-DCRAY
+# MACHINE	=	-DIBM
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/make.def_ibm b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/make.def_ibm
new file mode 100644
index 0000000..7613bd2
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/make.def_ibm
@@ -0,0 +1,152 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP and BT, which are in Fortran, the following must 
+# be defined:
+#
+# F77        - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# F_INC      - any -I arguments required for compiling Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# F_LIB      - any -L and -l arguments required for linking Fortran 
+# 
+# compilations are done with $(F77) $(F_INC) $(FFLAGS) or
+#                            $(F77) $(FFLAGS)
+# linking is done with       $(FLINK) $(F_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the fortran compiler used for Fortran programs
+#---------------------------------------------------------------------------
+F77 = xlf_r
+# This links fortran programs; usually the same as ${F77}
+FLINK	= $(F77)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+#F_LIB  = -lmass
+F_LIB  = 
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+F_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -O3 -qnosave -qsmp=omp
+# FFLAGS = -g
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = -O3 -qsmp=omp -bmaxdata:0x80000000 -bmaxstack:0x10000000
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS, which is in C, the following must be defined:
+#
+# CC         - C compiler 
+# CFLAGS     - C compilation arguments
+# C_INC      - any -I arguments required for compiling C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# C_LIB      - any -L and -l arguments required for linking C 
+#
+# compilations are done with $(CC) $(C_INC) $(CFLAGS) or
+#                            $(CC) $(CFLAGS)
+# linking is done with       $(CLINK) $(C_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for C programs
+#---------------------------------------------------------------------------
+CC = xlc_r
+# This links C programs; usually the same as ${CC}
+CLINK	= $(CC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+C_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+C_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+#---------------------------------------------------------------------------
+CFLAGS	= -O3 -qsmp=omp
+# CFLAGS = -g
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = -O3 -qsmp=omp
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+UCC	= cc
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
+
+#---------------------------------------------------------------------------
+# The variable WTIME is the name of the wtime source code module in the
+# NPB3.x/common directory.  
+# For most machines,       use wtime.c
+# For SGI power challenge: use wtime_sgi64.c
+#---------------------------------------------------------------------------
+WTIME  = wtime.c
+
+
+#---------------------------------------------------------------------------
+# Enable if either Cray or IBM: 
+# (no such flag for most machines: see common/wtime.h)
+# This is used by the C compiler to pass the machine name to common/wtime.h,
+# where the C/Fortran binding interface format is determined
+#---------------------------------------------------------------------------
+# MACHINE	=	-DCRAY
+MACHINE	=	-DIBM
+
+
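Note on `MACHINE`: the comment above says the flag is passed to `common/wtime.h`, where the C/Fortran binding format is determined. That header is not part of this hunk, so the following is only a sketch of how such a selection could work. The mapping (no decoration for IBM, upper case for Cray, trailing underscore otherwise) is an assumption to be checked against the actual header, and `WTIME_SYMBOL` and `wtime_symbol_name` are hypothetical names.

```c
#include <string.h>

/* Assumed mapping from -DIBM / -DCRAY to the symbol that Fortran code
 * would call; verify against common/wtime.h before relying on it. */
#if defined(IBM)
#define WTIME_SYMBOL wtime   /* Fortran name used as-is */
#elif defined(CRAY)
#define WTIME_SYMBOL WTIME   /* Fortran name upper-cased */
#else
#define WTIME_SYMBOL wtime_  /* trailing underscore appended */
#endif

/* Two-level stringification so the macro is expanded before quoting. */
#define STR2(x) #x
#define STR(x) STR2(x)

/* Returns the symbol name selected at compile time. */
const char *wtime_symbol_name(void)
{
    return STR(WTIME_SYMBOL);
}
```

Built without either define, `wtime_symbol_name()` yields the underscore-suffixed name. This also suggests why `wtime_sgi64.c` in this patch defines both `wtime_` and `wtime`: either binding convention resolves at link time.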
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/make.def_ibm64 b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/make.def_ibm64
new file mode 100644
index 0000000..65021fd
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/make.def_ibm64
@@ -0,0 +1,167 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP, BT and UA, which are in Fortran, the following 
+# must be defined:
+#
+# F77        - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# F_INC      - any -I arguments required for compiling Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# F_LIB      - any -L and -l arguments required for linking Fortran 
+# 
+# compilations are done with $(F77) $(F_INC) $(FFLAGS) or
+#                            $(F77) $(FFLAGS)
+# linking is done with       $(FLINK) $(F_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the fortran compiler used for Fortran programs
+#---------------------------------------------------------------------------
+F77 = xlf_r -q64
+
+#---------------------------------------------------------------------------
+# This links fortran programs; usually the same as ${F77}
+#---------------------------------------------------------------------------
+FLINK	= $(F77)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+F_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+F_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -O3 -qsmp=omp -qarch=auto -qtune=auto -qhot -qnosave
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = -O3 -qsmp=omp -qarch=auto
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS and DC, which are in C, the following must be defined:
+#
+# CC         - C compiler 
+# CFLAGS     - C compilation arguments
+# C_INC      - any -I arguments required for compiling C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# C_LIB      - any -L and -l arguments required for linking C 
+#
+# compilations are done with $(CC) $(C_INC) $(CFLAGS) or
+#                            $(CC) $(CFLAGS)
+# linking is done with       $(CLINK) $(C_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for C programs
+#---------------------------------------------------------------------------
+CC = xlc_r -q64
+
+#---------------------------------------------------------------------------
+# This links C programs; usually the same as ${CC}
+#---------------------------------------------------------------------------
+CLINK	= $(CC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+C_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+C_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+# DC inspects the following flags (preceded by "-D"):
+#
+# IN_CORE - computes all views and checksums in main memory (if there is 
+# enough memory)
+#
+# VIEW_FILE_OUTPUT - forces DC to write the generated views to disk
+#
+# OPTIMIZATION - turns on some nonstandard DC optimizations
+#
+# _FILE_OFFSET_BITS=64 
+# _LARGEFILE64_SOURCE - are standard compiler flags that allow working with 
+# files larger than 2GB.
+#---------------------------------------------------------------------------
+CFLAGS	= -O3 -qsmp=omp -qarch=auto -qtune=auto -qhot
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = -O3 -qsmp=omp -qarch=auto
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+UCC	= cc
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
+
+#---------------------------------------------------------------------------
+# The variable WTIME is the name of the wtime source code module in the
+# NPB3.x/common directory.  
+# For most machines,       use wtime.c
+# For SGI power challenge: use wtime_sgi64.c
+#---------------------------------------------------------------------------
+WTIME  = wtime.c
+
+
+#---------------------------------------------------------------------------
+# Enable if either Cray or IBM: 
+# (no such flag for most machines: see common/wtime.h)
+# This is used by the C compiler to pass the machine name to common/wtime.h,
+# where the C/Fortran binding interface format is determined
+#---------------------------------------------------------------------------
+# MACHINE	=	-DCRAY
+MACHINE	=	-DIBM
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/make.def_intel b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/make.def_intel
new file mode 100644
index 0000000..e8a1335
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/make.def_intel
@@ -0,0 +1,149 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP and BT, which are in Fortran, the following must 
+# be defined:
+#
+# F77        - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# F_INC      - any -I arguments required for compiling Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# F_LIB      - any -L and -l arguments required for linking Fortran 
+# 
+# compilations are done with $(F77) $(F_INC) $(FFLAGS) or
+#                            $(F77) $(FFLAGS)
+# linking is done with       $(FLINK) $(F_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the fortran compiler used for Fortran programs
+#---------------------------------------------------------------------------
+F77 = ifort
+# This links fortran programs; usually the same as ${F77}
+FLINK	= $(F77)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+F_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+F_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -O3 -openmp
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = -O3 -openmp
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS, which is in C, the following must be defined:
+#
+# CC         - C compiler 
+# CFLAGS     - C compilation arguments
+# C_INC      - any -I arguments required for compiling C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# C_LIB      - any -L and -l arguments required for linking C 
+#
+# compilations are done with $(CC) $(C_INC) $(CFLAGS) or
+#                            $(CC) $(CFLAGS)
+# linking is done with       $(CLINK) $(C_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for C programs
+#---------------------------------------------------------------------------
+CC = icc
+# This links C programs; usually the same as ${CC}
+CLINK	= $(CC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+C_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+C_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+#---------------------------------------------------------------------------
+CFLAGS	= -O3 -openmp
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = -O3 -openmp
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+UCC	= icc
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
+
+#---------------------------------------------------------------------------
+# The variable WTIME is the name of the wtime source code module in the
+# NPB3.x/common directory.  
+# For most machines,       use wtime.c
+# For SGI power challenge: use wtime_sgi64.c
+#---------------------------------------------------------------------------
+WTIME  = wtime.c
+
+
+#---------------------------------------------------------------------------
+# Enable if either Cray or IBM: 
+# (no such flag for most machines: see common/wtime.h)
+# This is used by the C compiler to pass the machine name to common/wtime.h,
+# where the C/Fortran binding interface format is determined
+#---------------------------------------------------------------------------
+# MACHINE	=	-DCRAY
+# MACHINE	=	-DIBM
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/make.def_omni b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/make.def_omni
new file mode 100644
index 0000000..f9cae7b
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/make.def_omni
@@ -0,0 +1,156 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP and BT, which are in Fortran, the following must 
+# be defined:
+#
+# F77        - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# F_INC      - any -I arguments required for compiling Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# F_LIB      - any -L and -l arguments required for linking Fortran 
+# 
+# compilations are done with $(F77) $(F_INC) $(FFLAGS) or
+#                            $(F77) $(FFLAGS)
+# linking is done with       $(FLINK) $(F_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the fortran compiler used for Fortran programs
+#---------------------------------------------------------------------------
+F77 = ompf77
+# This links fortran programs; usually the same as ${F77}
+FLINK	= $(F77)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+F_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+F_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -xO4 -fast
+#FFLAGS	= -O3
+# FFLAGS = -g
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = -xO4 -fast
+#FLINKFLAGS = -O3
+#FLINKFLAGS =
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS, which is in C, the following must be defined:
+#
+# CC         - C compiler 
+# CFLAGS     - C compilation arguments
+# C_INC      - any -I arguments required for compiling C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# C_LIB      - any -L and -l arguments required for linking C 
+#
+# compilations are done with $(CC) $(C_INC) $(CFLAGS) or
+#                            $(CC) $(CFLAGS)
+# linking is done with       $(CLINK) $(C_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for C programs
+#---------------------------------------------------------------------------
+CC = ompcc
+# This links C programs; usually the same as ${CC}
+CLINK	= $(CC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+C_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+C_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+#---------------------------------------------------------------------------
+CFLAGS	= -xO4 -fast
+#CFLAGS	= -O3
+# CFLAGS = -g
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = -xO4 -fast
+#CLINKFLAGS = -O3
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+UCC	= cc
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
+
+#---------------------------------------------------------------------------
+# The variable WTIME is the name of the wtime source code module in the
+# NPB3.x/common directory.  
+# For most machines,       use wtime.c
+# For SGI power challenge: use wtime_sgi64.c
+#---------------------------------------------------------------------------
+WTIME  = wtime.c
+
+
+#---------------------------------------------------------------------------
+# Enable if either Cray or IBM: 
+# (no such flag for most machines: see common/wtime.h)
+# This is used by the C compiler to pass the machine name to common/wtime.h,
+# where the C/Fortran binding interface format is determined
+#---------------------------------------------------------------------------
+# MACHINE	=	-DCRAY
+# MACHINE	=	-DIBM
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/make.def_pgi b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/make.def_pgi
new file mode 100644
index 0000000..e51e9e4
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/make.def_pgi
@@ -0,0 +1,149 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP and BT, which are in Fortran, the following must 
+# be defined:
+#
+# F77        - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# F_INC      - any -I arguments required for compiling Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# F_LIB      - any -L and -l arguments required for linking Fortran 
+# 
+# compilations are done with $(F77) $(F_INC) $(FFLAGS) or
+#                            $(F77) $(FFLAGS)
+# linking is done with       $(FLINK) $(F_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the Fortran compiler used for Fortran programs
+#---------------------------------------------------------------------------
+F77 = pgf90
+# This links Fortran programs; usually the same as ${F77}
+FLINK	= $(F77)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+F_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+F_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -O3 -mp
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = -O3 -mp
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS, which is in C, the following must be defined:
+#
+# CC         - C compiler 
+# CFLAGS     - C compilation arguments
+# C_INC      - any -I arguments required for compiling C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# C_LIB      - any -L and -l arguments required for linking C 
+#
+# compilations are done with $(CC) $(C_INC) $(CFLAGS) or
+#                            $(CC) $(CFLAGS)
+# linking is done with       $(CLINK) $(C_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for C programs
+#---------------------------------------------------------------------------
+CC = pgcc
+# This links C programs; usually the same as ${CC}
+CLINK	= $(CC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+C_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+C_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+#---------------------------------------------------------------------------
+CFLAGS	= -O3 -mp
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = -O3 -mp
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+UCC	= pgcc
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
+
+#---------------------------------------------------------------------------
+# The variable WTIME is the name of the wtime source code module in the
+# NPB3.x/common directory.  
+# For most machines,       use wtime.c
+# For SGI Power Challenge: use wtime_sgi64.c
+#---------------------------------------------------------------------------
+WTIME  = wtime.c
+
+
+#---------------------------------------------------------------------------
+# Enable if either Cray or IBM: 
+# (no such flag for most machines: see common/wtime.h)
+# This is used by the C compiler to pass the machine name to common/wtime.h,
+# where the C/Fortran binding interface format is determined
+#---------------------------------------------------------------------------
+# MACHINE	=	-DCRAY
+# MACHINE	=	-DIBM
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/make.def_sgi b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/make.def_sgi
new file mode 100644
index 0000000..1269eb4
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/make.def_sgi
@@ -0,0 +1,152 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP and BT, which are in Fortran, the following must 
+# be defined:
+#
+# F77        - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# F_INC      - any -I arguments required for compiling Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# F_LIB      - any -L and -l arguments required for linking Fortran 
+# 
+# compilations are done with $(F77) $(F_INC) $(FFLAGS) or
+#                            $(F77) $(FFLAGS)
+# linking is done with       $(FLINK) $(F_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the Fortran compiler used for Fortran programs
+#---------------------------------------------------------------------------
+F77 = f77
+# This links Fortran programs; usually the same as ${F77}
+FLINK	= $(F77)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+F_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+F_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -O3 -mp
+# FFLAGS = -g
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = -O3 -mp
+#FLINKFLAGS =
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS, which is in C, the following must be defined:
+#
+# CC         - C compiler 
+# CFLAGS     - C compilation arguments
+# C_INC      - any -I arguments required for compiling C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# C_LIB      - any -L and -l arguments required for linking C 
+#
+# compilations are done with $(CC) $(C_INC) $(CFLAGS) or
+#                            $(CC) $(CFLAGS)
+# linking is done with       $(CLINK) $(C_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for C programs
+#---------------------------------------------------------------------------
+CC = cc
+# This links C programs; usually the same as ${CC}
+CLINK	= $(CC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+C_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+C_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+#---------------------------------------------------------------------------
+CFLAGS	= -O3 -mp
+# CFLAGS = -g
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = -O3 -mp
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+UCC	= cc
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
+
+#---------------------------------------------------------------------------
+# The variable WTIME is the name of the wtime source code module in the
+# NPB3.x/common directory.  
+# For most machines,       use wtime.c
+# For SGI Power Challenge: use wtime_sgi64.c
+#---------------------------------------------------------------------------
+WTIME  = wtime.c
+
+
+#---------------------------------------------------------------------------
+# Enable if either Cray or IBM: 
+# (no such flag for most machines: see common/wtime.h)
+# This is used by the C compiler to pass the machine name to common/wtime.h,
+# where the C/Fortran binding interface format is determined
+#---------------------------------------------------------------------------
+# MACHINE	=	-DCRAY
+# MACHINE	=	-DIBM
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/make.def_sgi64 b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/make.def_sgi64
new file mode 100644
index 0000000..54bcc62
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/make.def_sgi64
@@ -0,0 +1,153 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+# (Note these definitions are inconsistent with NPB2.1.)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP and BT, which are in Fortran, the following must 
+# be defined:
+#
+# F77        - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# F_INC      - any -I arguments required for compiling Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# F_LIB      - any -L and -l arguments required for linking Fortran 
+# 
+# compilations are done with $(F77) $(F_INC) $(FFLAGS) or
+#                            $(F77) $(FFLAGS)
+# linking is done with       $(FLINK) $(F_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the Fortran compiler used for Fortran programs
+#---------------------------------------------------------------------------
+F77 = f77 -64
+# This links Fortran programs; usually the same as ${F77}
+FLINK	= $(F77)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+F_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+F_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -O3 -mp
+# FFLAGS = -g
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = -O3 -mp
+#FLINKFLAGS =
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS, which is in C, the following must be defined:
+#
+# CC         - C compiler 
+# CFLAGS     - C compilation arguments
+# C_INC      - any -I arguments required for compiling C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# C_LIB      - any -L and -l arguments required for linking C 
+#
+# compilations are done with $(CC) $(C_INC) $(CFLAGS) or
+#                            $(CC) $(CFLAGS)
+# linking is done with       $(CLINK) $(C_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for C programs
+#---------------------------------------------------------------------------
+CC = cc -64
+# This links C programs; usually the same as ${CC}
+CLINK	= $(CC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+C_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+C_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+#---------------------------------------------------------------------------
+CFLAGS	= -O3 -mp
+# CFLAGS = -g
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = -O3 -mp
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+UCC	= cc
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
+
+#---------------------------------------------------------------------------
+# The variable WTIME is the name of the wtime source code module in the
+# NPB3.x/common directory.  
+# For most machines,       use wtime.c
+# For SGI Power Challenge: use wtime_sgi64.c
+#---------------------------------------------------------------------------
+WTIME  = wtime_sgi64.c
+
+
+#---------------------------------------------------------------------------
+# Enable if either Cray or IBM: 
+# (no such flag for most machines: see common/wtime.h)
+# This is used by the C compiler to pass the machine name to common/wtime.h,
+# where the C/Fortran binding interface format is determined
+#---------------------------------------------------------------------------
+# MACHINE	=	-DCRAY
+# MACHINE	=	-DIBM
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/make.def_sun b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/make.def_sun
new file mode 100644
index 0000000..0e11ab6
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/make.def_sun
@@ -0,0 +1,152 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP and BT, which are in Fortran, the following must 
+# be defined:
+#
+# F77        - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# F_INC      - any -I arguments required for compiling Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# F_LIB      - any -L and -l arguments required for linking Fortran 
+# 
+# compilations are done with $(F77) $(F_INC) $(FFLAGS) or
+#                            $(F77) $(FFLAGS)
+# linking is done with       $(FLINK) $(F_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the Fortran compiler used for Fortran programs
+#---------------------------------------------------------------------------
+F77 = f90
+# This links Fortran programs; usually the same as ${F77}
+FLINK	= $(F77)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+F_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+F_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -fast -openmp
+# FFLAGS = -g
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = -fast -openmp
+#FLINKFLAGS =
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS, which is in C, the following must be defined:
+#
+# CC         - C compiler 
+# CFLAGS     - C compilation arguments
+# C_INC      - any -I arguments required for compiling C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# C_LIB      - any -L and -l arguments required for linking C 
+#
+# compilations are done with $(CC) $(C_INC) $(CFLAGS) or
+#                            $(CC) $(CFLAGS)
+# linking is done with       $(CLINK) $(C_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for C programs
+#---------------------------------------------------------------------------
+CC = cc
+# This links C programs; usually the same as ${CC}
+CLINK	= $(CC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+C_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+C_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+#---------------------------------------------------------------------------
+CFLAGS	= -fast -xopenmp
+# CFLAGS = -g
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = -fast -xopenmp
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+UCC	= cc
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
+
+#---------------------------------------------------------------------------
+# The variable WTIME is the name of the wtime source code module in the
+# NPB3.x/common directory.  
+# For most machines,       use wtime.c
+# For SGI Power Challenge: use wtime_sgi64.c
+#---------------------------------------------------------------------------
+WTIME  = wtime.c
+
+
+#---------------------------------------------------------------------------
+# Enable if either Cray or IBM: 
+# (no such flag for most machines: see common/wtime.h)
+# This is used by the C compiler to pass the machine name to common/wtime.h,
+# where the C/Fortran binding interface format is determined
+#---------------------------------------------------------------------------
+# MACHINE	=	-DCRAY
+# MACHINE	=	-DIBM
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/make.def_sun64 b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/make.def_sun64
new file mode 100644
index 0000000..63c2657
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/make.def_sun64
@@ -0,0 +1,151 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP and BT, which are in Fortran, the following must 
+# be defined:
+#
+# F77        - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# F_INC      - any -I arguments required for compiling Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# F_LIB      - any -L and -l arguments required for linking Fortran 
+# 
+# compilations are done with $(F77) $(F_INC) $(FFLAGS) or
+#                            $(F77) $(FFLAGS)
+# linking is done with       $(FLINK) $(F_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the Fortran compiler used for Fortran programs
+#---------------------------------------------------------------------------
+F77 = f90
+# This links Fortran programs; usually the same as ${F77}
+FLINK	= $(F77)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+F_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+F_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -fast -openmp -xarch=native64
+# FFLAGS = -g
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = -fast -openmp -xarch=native64
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS, which is in C, the following must be defined:
+#
+# CC         - C compiler 
+# CFLAGS     - C compilation arguments
+# C_INC      - any -I arguments required for compiling C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# C_LIB      - any -L and -l arguments required for linking C 
+#
+# compilations are done with $(CC) $(C_INC) $(CFLAGS) or
+#                            $(CC) $(CFLAGS)
+# linking is done with       $(CLINK) $(C_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for C programs
+#---------------------------------------------------------------------------
+CC = cc
+# This links C programs; usually the same as ${CC}
+CLINK	= $(CC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+C_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+C_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+#---------------------------------------------------------------------------
+CFLAGS	= -fast -xopenmp -xarch=native64
+# CFLAGS = -g
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = -fast -xopenmp -xarch=native64
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+UCC	= cc
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
+
+#---------------------------------------------------------------------------
+# The variable WTIME is the name of the wtime source code module in the
+# NPB3.x/common directory.  
+# For most machines,       use wtime.c
+# For SGI power challenge: use wtime_sgi64.c
+#---------------------------------------------------------------------------
+WTIME  = wtime.c
+
+
+#---------------------------------------------------------------------------
+# Enable if either Cray or IBM: 
+# (no such flag for most machines: see common/wtime.h)
+# This is used by the C compiler to pass the machine name to common/wtime.h,
+# where the C/Fortran binding interface format is determined
+#---------------------------------------------------------------------------
+# MACHINE	=	-DCRAY
+# MACHINE	=	-DIBM
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/suite.def.bt b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/suite.def.bt
new file mode 100644
index 0000000..66d59b0
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/suite.def.bt
@@ -0,0 +1,6 @@
+bt	S
+bt	W
+bt	A
+bt	B
+bt	C
+bt	D
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/suite.def.cg b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/suite.def.cg
new file mode 100644
index 0000000..c960817
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/suite.def.cg
@@ -0,0 +1,6 @@
+cg	S
+cg	W
+cg	A
+cg	B
+cg	C
+cg	D
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/suite.def.ep b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/suite.def.ep
new file mode 100644
index 0000000..a0491d3
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/suite.def.ep
@@ -0,0 +1,6 @@
+ep	S
+ep	W
+ep	A
+ep	B
+ep	C
+ep	D
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/suite.def.ft b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/suite.def.ft
new file mode 100644
index 0000000..100ae4f
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/suite.def.ft
@@ -0,0 +1,6 @@
+ft	S
+ft	W
+ft	A
+ft	B
+ft	C
+ft	D
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/suite.def.is b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/suite.def.is
new file mode 100644
index 0000000..3a0b05d
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/suite.def.is
@@ -0,0 +1,5 @@
+is	S
+is	W
+is	A
+is	B
+is	C
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/suite.def.lu b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/suite.def.lu
new file mode 100644
index 0000000..583de7e
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/suite.def.lu
@@ -0,0 +1,6 @@
+lu	S
+lu	W
+lu	A
+lu	B
+lu	C
+lu	D
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/suite.def.mg b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/suite.def.mg
new file mode 100644
index 0000000..1df86a9
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/suite.def.mg
@@ -0,0 +1,6 @@
+mg	S
+mg	W
+mg	A
+mg	B
+mg	C
+mg	D
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/suite.def.sp b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/suite.def.sp
new file mode 100644
index 0000000..8b5a9ba
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/NAS.samples/suite.def.sp
@@ -0,0 +1,6 @@
+sp	S
+sp	W
+sp	A
+sp	B
+sp	C
+sp	D
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/make.def b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/make.def
new file mode 100644
index 0000000..6378b3d
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/make.def
@@ -0,0 +1,165 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS.
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP, BT and UA, which are in Fortran, the following
+# must be defined:
+#
+# F77        - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# F_INC      - any -I arguments required for compiling Fortran
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# F_LIB      - any -L and -l arguments required for linking Fortran
+#
+# compilations are done with $(F77) $(F_INC) $(FFLAGS) or
+#                            $(F77) $(FFLAGS)
+# linking is done with       $(FLINK) $(F_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the fortran compiler used for Fortran programs
+#---------------------------------------------------------------------------
+F77 = gfortran
+# This links fortran programs; usually the same as ${F77}
+FLINK	= $(F77)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker
+#---------------------------------------------------------------------------
+F_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler
+#---------------------------------------------------------------------------
+F_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -O3 -fopenmp -mcmodel=medium -cpp -DM5OP_ADDR=0xFFFF0000
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable
+# size usually go here.
+#---------------------------------------------------------------------------
+FLINKFLAGS = -O3 -fopenmp -mcmodel=medium
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS and DC, which are in C, the following must be defined:
+#
+# CC         - C compiler
+# CFLAGS     - C compilation arguments
+# C_INC      - any -I arguments required for compiling C
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# C_LIB      - any -L and -l arguments required for linking C
+#
+# compilations are done with $(CC) $(C_INC) $(CFLAGS) or
+#                            $(CC) $(CFLAGS)
+# linking is done with       $(CLINK) $(C_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for C programs
+#---------------------------------------------------------------------------
+CC = gcc
+# This links C programs; usually the same as ${CC}
+CLINK	= $(CC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker
+#---------------------------------------------------------------------------
+C_LIB  = -lm
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler
+#---------------------------------------------------------------------------
+C_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+# DC inspects the following flags (preceded by "-D"):
+#
+# IN_CORE - computes all views and checksums in main memory (if there is
+# enough memory)
+#
+# VIEW_FILE_OUTPUT - forces DC to write the generated views to disk
+#
+# OPTIMIZATION - turns on some nonstandard DC optimizations
+#
+# _FILE_OFFSET_BITS=64
+# _LARGEFILE64_SOURCE - are standard compiler flags that allow working with
+# files larger than 2GB.
+#---------------------------------------------------------------------------
+CFLAGS	= -O3 -fopenmp -mcmodel=medium -DM5OP_ADDR=0xFFFF0000
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable
+# size usually go here.
+#---------------------------------------------------------------------------
+CLINKFLAGS = -O3 -fopenmp -mcmodel=medium
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by
+# this compiler go here also; typically there are few flags required; hence
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+UCC	= gcc
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator
+# is used. It is described in detail in README.install.
+# Use "randi8" unless there is a reason to use another one.
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
+
+#---------------------------------------------------------------------------
+# The variable WTIME is the name of the wtime source code module in the
+# common directory.
+# For most machines,       use wtime.c
+# For SGI power challenge: use wtime_sgi64.c
+#---------------------------------------------------------------------------
+WTIME  = wtime.c
+
+
+#---------------------------------------------------------------------------
+# Enable if either Cray (not Cray-X1) or IBM:
+# (no such flag for most machines: see common/wtime.h)
+# This is used by the C compiler to pass the machine name to common/wtime.h,
+# where the C/Fortran binding interface format is determined
+#---------------------------------------------------------------------------
+# MACHINE	=	-DCRAY
+# MACHINE	=	-DIBM
+
+#---------------------------------------------------------------------------
+# Location of the gem5 directory, relative to subdirs of the main directory. TODO: update for the final commit (../../gem5).
+#---------------------------------------------------------------------------
+GEM5DIR	= ../../../../../../gem5-EXP/gem5/
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/make.def.template b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/make.def.template
new file mode 100644
index 0000000..18a753e
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/make.def.template
@@ -0,0 +1,161 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP, BT and UA, which are in Fortran, the following 
+# must be defined:
+#
+# F77        - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# F_INC      - any -I arguments required for compiling Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# F_LIB      - any -L and -l arguments required for linking Fortran 
+# 
+# compilations are done with $(F77) $(F_INC) $(FFLAGS) or
+#                            $(F77) $(FFLAGS)
+# linking is done with       $(FLINK) $(F_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the fortran compiler used for Fortran programs
+#---------------------------------------------------------------------------
+F77 = f77
+# This links fortran programs; usually the same as ${F77}
+FLINK	= $(F77)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+F_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+F_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -O
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = -O
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS and DC, which are in C, the following must be defined:
+#
+# CC         - C compiler 
+# CFLAGS     - C compilation arguments
+# C_INC      - any -I arguments required for compiling C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# C_LIB      - any -L and -l arguments required for linking C 
+#
+# compilations are done with $(CC) $(C_INC) $(CFLAGS) or
+#                            $(CC) $(CFLAGS)
+# linking is done with       $(CLINK) $(C_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for C programs
+#---------------------------------------------------------------------------
+CC = cc
+# This links C programs; usually the same as ${CC}
+CLINK	= $(CC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+C_LIB  = -lm
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+C_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+# DC inspects the following flags (preceded by "-D"):
+#
+# IN_CORE - computes all views and checksums in main memory (if there is 
+# enough memory)
+#
+# VIEW_FILE_OUTPUT - forces DC to write the generated views to disk
+#
+# OPTIMIZATION - turns on some nonstandard DC optimizations
+#
+# _FILE_OFFSET_BITS=64 
+# _LARGEFILE64_SOURCE - are standard compiler flags that allow working with
+# files larger than 2GB.
+#---------------------------------------------------------------------------
+CFLAGS	= -O
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = -O
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+UCC	= cc
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
+
+#---------------------------------------------------------------------------
+# The variable WTIME is the name of the wtime source code module in the
+# common directory.  
+# For most machines,       use wtime.c
+# For SGI power challenge: use wtime_sgi64.c
+#---------------------------------------------------------------------------
+WTIME  = wtime.c
+
+
+#---------------------------------------------------------------------------
+# Enable if either Cray (not Cray-X1) or IBM: 
+# (no such flag for most machines: see common/wtime.h)
+# This is used by the C compiler to pass the machine name to common/wtime.h,
+# where the C/Fortran binding interface format is determined
+#---------------------------------------------------------------------------
+# MACHINE	=	-DCRAY
+# MACHINE	=	-DIBM
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/suite.def b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/suite.def
new file mode 100755
index 0000000..7330ab3
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/suite.def
@@ -0,0 +1,50 @@
+# config/suite.def
+# This file is used to build several benchmarks with a single command.
+# Typing "make suite" in the main directory will build all the benchmarks
+# specified in this file.
+# Each line of this file contains a benchmark name and the class.
+# The name is one of "cg", "is", "dc", "ep", "mg", "ft", "sp",
+#  "bt", "lu", and "ua".
+# The class is one of "S", "W", "A" through "E"
+# (except that no classes C,D,E for DC and no class E for IS and UA).
+# No blank lines.
+# The following example builds sample sizes of all benchmarks.
+ft      A
+mg      A
+sp      A
+lu      A
+bt      A
+is      A
+ep      A
+cg      A
+ua      A
+
+ft      B
+mg      B
+sp      B
+lu      B
+bt      B
+is      B
+ep      B
+cg      B
+ua      B
+
+ft      C
+mg      C
+sp      C
+lu      C
+bt      C
+is      C
+ep      C
+cg      C
+ua      C
+
+ft      D
+mg      D
+sp      D
+lu      D
+bt      D
+is      D
+ep      D
+cg      D
+ua      D
\ No newline at end of file
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/suite.def.template b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/suite.def.template
new file mode 100644
index 0000000..327026b
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/config/suite.def.template
@@ -0,0 +1,21 @@
+# config/suite.def
+# This file is used to build several benchmarks with a single command. 
+# Typing "make suite" in the main directory will build all the benchmarks
+# specified in this file. 
+# Each line of this file contains a benchmark name and the class.
+# The name is one of "cg", "is", "dc", "ep", "mg", "ft", "sp",
+#  "bt", "lu", and "ua". 
+# The class is one of "S", "W", "A" through "E" 
+# (except that no classes C,D,E for DC and no class E for IS and UA).
+# No blank lines. 
+# The following example builds sample sizes of all benchmarks. 
+ft	S
+mg	S
+sp	S
+lu	S
+bt	S
+is	S
+ep	S
+cg	S
+ua	S
+dc      S
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/include-gem5/gem5/asm/generic/m5op_flags.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/include-gem5/gem5/asm/generic/m5op_flags.h
new file mode 100644
index 0000000..de44e00
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/include-gem5/gem5/asm/generic/m5op_flags.h
@@ -0,0 +1,54 @@
+/*
+ * Copyright (c) 2017 ARM Limited
+ * All rights reserved
+ *
+ * The license below extends only to copyright in the software and shall
+ * not be construed as granting a license to any other intellectual
+ * property including but not limited to intellectual property relating
+ * to a hardware implementation of the functionality of the software
+ * licensed hereunder.  You may use the software subject to the license
+ * terms below provided that you ensure that this notice is replicated
+ * unmodified and in its entirety in all distributions of the software,
+ * modified or unmodified, in source code or in binary form.
+ *
+ * Copyright (c) 2003-2006 The Regents of The University of Michigan
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met: redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer;
+ * redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution;
+ * neither the name of the copyright holders nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * Authors: Nathan Binkert
+ *          Ali Saidi
+ *          Andreas Sandberg
+ */
+
+#ifndef __GEM5_ASM_GENERIC_M5OP_FLAGS_H__
+#define __GEM5_ASM_GENERIC_M5OP_FLAGS_H__
+
+/* Flags for annotation calls */
+#define M5_AN_FL_NONE   0x0
+#define M5_AN_FL_BAD    0x2
+#define M5_AN_FL_LINK   0x10
+#define M5_AN_FL_RESET  0x20
+
+#endif //  __GEM5_ASM_GENERIC_M5OP_FLAGS_H__
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/include-gem5/gem5/asm/generic/m5ops.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/include-gem5/gem5/asm/generic/m5ops.h
new file mode 100644
index 0000000..f175596
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/include-gem5/gem5/asm/generic/m5ops.h
@@ -0,0 +1,114 @@
+/*
+ * Copyright (c) 2016 ARM Limited
+ * All rights reserved
+ *
+ * The license below extends only to copyright in the software and shall
+ * not be construed as granting a license to any other intellectual
+ * property including but not limited to intellectual property relating
+ * to a hardware implementation of the functionality of the software
+ * licensed hereunder.  You may use the software subject to the license
+ * terms below provided that you ensure that this notice is replicated
+ * unmodified and in its entirety in all distributions of the software,
+ * modified or unmodified, in source code or in binary form.
+ *
+ * Copyright (c) 2003-2006 The Regents of The University of Michigan
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met: redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer;
+ * redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution;
+ * neither the name of the copyright holders nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef __GEM5_ASM_GENERIC_M5OPS_H__
+#define __GEM5_ASM_GENERIC_M5OPS_H__
+
+#define M5OP_ARM                0x00
+#define M5OP_QUIESCE            0x01
+#define M5OP_QUIESCE_NS         0x02
+#define M5OP_QUIESCE_CYCLE      0x03
+#define M5OP_QUIESCE_TIME       0x04
+#define M5OP_RPNS               0x07
+#define M5OP_WAKE_CPU           0x09
+#define M5OP_DEPRECATED1        0x10 // obsolete ivlb
+#define M5OP_DEPRECATED2        0x11 // obsolete ivle
+#define M5OP_DEPRECATED3        0x20 // deprecated exit function
+#define M5OP_EXIT               0x21
+#define M5OP_FAIL               0x22
+#define M5OP_INIT_PARAM         0x30
+#define M5OP_LOAD_SYMBOL        0x31
+#define M5OP_RESET_STATS        0x40
+#define M5OP_DUMP_STATS         0x41
+#define M5OP_DUMP_RESET_STATS   0x42
+#define M5OP_CHECKPOINT         0x43
+#define M5OP_WRITE_FILE         0x4F
+#define M5OP_READ_FILE          0x50
+#define M5OP_DEBUG_BREAK        0x51
+#define M5OP_SWITCH_CPU         0x52
+#define M5OP_ADD_SYMBOL         0x53
+#define M5OP_PANIC              0x54
+
+#define M5OP_RESERVED1          0x55 // Reserved for user, used to be annotate
+#define M5OP_RESERVED2          0x56 // Reserved for user
+#define M5OP_RESERVED3          0x57 // Reserved for user
+#define M5OP_RESERVED4          0x58 // Reserved for user
+#define M5OP_RESERVED5          0x59 // Reserved for user
+
+#define M5OP_WORK_BEGIN         0x5a
+#define M5OP_WORK_END           0x5b
+
+#define M5OP_SE_SYSCALL         0x60
+#define M5OP_SE_PAGE_FAULT      0x61
+#define M5OP_DIST_TOGGLE_SYNC   0x62
+
+
+#define M5OP_FOREACH                                            \
+    M5OP(m5_arm, M5OP_ARM)                                      \
+    M5OP(m5_quiesce, M5OP_QUIESCE)                              \
+    M5OP(m5_quiesce_ns, M5OP_QUIESCE_NS)                        \
+    M5OP(m5_quiesce_cycle, M5OP_QUIESCE_CYCLE)                  \
+    M5OP(m5_quiesce_time, M5OP_QUIESCE_TIME)                    \
+    M5OP(m5_rpns, M5OP_RPNS)                                    \
+    M5OP(m5_wake_cpu, M5OP_WAKE_CPU)                            \
+    M5OP(m5_exit, M5OP_EXIT)                                    \
+    M5OP(m5_fail, M5OP_FAIL)                                    \
+    M5OP(m5_init_param, M5OP_INIT_PARAM)                        \
+    M5OP(m5_load_symbol, M5OP_LOAD_SYMBOL)                      \
+    M5OP(m5_reset_stats, M5OP_RESET_STATS)                      \
+    M5OP(m5_dump_stats, M5OP_DUMP_STATS)                        \
+    M5OP(m5_dump_reset_stats, M5OP_DUMP_RESET_STATS)            \
+    M5OP(m5_checkpoint, M5OP_CHECKPOINT)                        \
+    M5OP(m5_write_file, M5OP_WRITE_FILE)                        \
+    M5OP(m5_read_file, M5OP_READ_FILE)                          \
+    M5OP(m5_debug_break, M5OP_DEBUG_BREAK)                      \
+    M5OP(m5_switch_cpu, M5OP_SWITCH_CPU)                        \
+    M5OP(m5_add_symbol, M5OP_ADD_SYMBOL)                        \
+    M5OP(m5_panic, M5OP_PANIC)                                  \
+    M5OP(m5_work_begin, M5OP_WORK_BEGIN)                        \
+    M5OP(m5_work_end, M5OP_WORK_END)                            \
+    M5OP(m5_se_syscall, M5OP_SE_SYSCALL)                        \
+    M5OP(m5_se_page_fault, M5OP_SE_PAGE_FAULT)                  \
+    M5OP(m5_dist_toggle_sync, M5OP_DIST_TOGGLE_SYNC)
+
+#define M5OP_MERGE_TOKENS_I(a, b) a##b
+#define M5OP_MERGE_TOKENS(a, b) M5OP_MERGE_TOKENS_I(a, b)
+
+#endif //  __GEM5_ASM_GENERIC_M5OPS_H__
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/include-gem5/gem5/m5ops.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/include-gem5/gem5/m5ops.h
new file mode 100644
index 0000000..3edd4e6
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/include-gem5/gem5/m5ops.h
@@ -0,0 +1,71 @@
+/*
+ * Copyright (c) 2003-2006 The Regents of The University of Michigan
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met: redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer;
+ * redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution;
+ * neither the name of the copyright holders nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef __GEM5_M5OP_H__
+#define __GEM5_M5OP_H__
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <stdint.h>
+
+void m5_arm(uint64_t address);
+void m5_quiesce(void);
+void m5_quiesce_ns(uint64_t ns);
+void m5_quiesce_cycle(uint64_t cycles);
+uint64_t m5_quiesce_time(void);
+uint64_t m5_rpns();
+void m5_wake_cpu(uint64_t cpuid);
+
+void m5_exit(uint64_t ns_delay);
+void m5_fail(uint64_t ns_delay, uint64_t code);
+uint64_t m5_init_param(uint64_t key_str1, uint64_t key_str2);
+void m5_checkpoint(uint64_t ns_delay, uint64_t ns_period);
+void m5_reset_stats(uint64_t ns_delay, uint64_t ns_period);
+void m5_dump_stats(uint64_t ns_delay, uint64_t ns_period);
+void m5_dump_reset_stats(uint64_t ns_delay, uint64_t ns_period);
+uint64_t m5_read_file(void *buffer, uint64_t len, uint64_t offset);
+uint64_t m5_write_file(void *buffer, uint64_t len, uint64_t offset,
+                       const char *filename);
+void m5_debug_break(void);
+void m5_switch_cpu(void);
+void m5_dist_toggle_sync(void);
+void m5_add_symbol(uint64_t addr, const char *symbol);
+void m5_load_symbol();
+void m5_panic(void);
+void m5_work_begin(uint64_t workid, uint64_t threadid);
+void m5_work_end(uint64_t workid, uint64_t threadid);
+
+void m5_se_syscall();
+void m5_se_page_fault();
+
+#ifdef __cplusplus
+}
+#endif
+#endif // __GEM5_M5OP_H__
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/sys/Makefile b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/sys/Makefile
new file mode 100644
index 0000000..b0bf4e9
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/sys/Makefile
@@ -0,0 +1,22 @@
+UCC = cc
+include ../config/make.def
+
+# Note that COMPILE is also defined in make.common and should
+# be the same. We can't include make.common because it has a lot
+# of other garbage. 
+FCOMPILE = $(F77) -c $(F_INC) $(FFLAGS)
+
+all: setparams 
+
+# setparams creates an npbparams.h file for each benchmark
+# configuration. npbparams.h also contains info about how a benchmark
+# was compiled and linked.
+
+setparams: setparams.c ../config/make.def
+	$(UCC) ${CONVERTFLAG} -o setparams setparams.c
+
+
+clean: 
+	-rm -f setparams setparams.h npbparams.h
+	-rm -f *~ *.o
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/sys/README b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/sys/README
new file mode 100644
index 0000000..ede69b5
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/sys/README
@@ -0,0 +1,41 @@
+This directory contains utilities and files used by the 
+build process. You should not need to change anything
+in this directory. 
+
+Original Files
+--------------
+setparams.c:
+        Source for the setparams program. This program is used internally
+        in the build process to create the file "npbparams.h" for each 
+        benchmark. npbparams.h contains Fortran or C parameters to build a 
+        benchmark for a specific class. The setparams program is never run 
+        directly by a user. Its invocation syntax is 
+
+            "setparams benchmark-name class". 
+
+        It examines the file "npbparams.h" in the current directory. If 
+        the specified parameters are the same as those in the npbparams.h 
+        file, nothing is changed. If the file does not exist or corresponds 
+        to a different class/number of nodes, it is (re)built. 
+        One of the more complicated things in npbparams.h is that it 
+        contains, in a Fortran string, the compiler flags used to build a 
+        benchmark, so that a benchmark can print out how it was compiled. 
+
+make.common
+        A makefile segment that is included in each individual benchmark
+        program makefile. It sets up some standard macros (COMPILE, etc) 
+        and makes sure everything is configured correctly (npbparams.h)
+
+Makefile
+        Builds  setparams
+
+README
+        This file. 
+
+
+Created files
+-------------
+
+setparams
+	See descriptions above
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/sys/make.common b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/sys/make.common
new file mode 100644
index 0000000..692597c
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/sys/make.common
@@ -0,0 +1,65 @@
+PROGRAM  = $(BINDIR)/$(BENCHMARK).$(CLASS).x
+FCOMPILE = $(F77) -c $(F_INC) $(FFLAGS)
+CCOMPILE = $(CC)  -c $(C_INC) $(CFLAGS)
+
+# Class "U" is used internally by the setparams program to mean
+# "unknown". This means that if you don't specify CLASS=
+# on the command line, you'll get an error. It would be nice
+# to be able to avoid this, but we'd have to get information
+# from the setparams back to the make program, which isn't easy.
+CLASS=U
+
+default:: ${PROGRAM}
+
+# This makes sure the configuration utility setparams
+# is up to date.
+# Note that this must be run every time, which is why the
+# target does not exist and is not created.
+# If you create a file called "config" you will break things.
+config:
+	@cd ../sys; ${MAKE} all
+	../sys/setparams ${BENCHMARK} ${CLASS}
+
+COMMON=../common
+${COMMON}/${RAND}.o: ${COMMON}/${RAND}.f ../config/make.def
+	cd ${COMMON}; ${FCOMPILE} ${RAND}.f
+
+${COMMON}/print_results.o: ${COMMON}/print_results.f ../config/make.def
+	cd ${COMMON}; ${FCOMPILE} print_results.f
+
+${COMMON}/c_print_results.o: ${COMMON}/c_print_results.c ../config/make.def
+	cd ${COMMON}; ${CCOMPILE} c_print_results.c
+
+${COMMON}/timers.o: ${COMMON}/timers.f ../config/make.def
+	cd ${COMMON}; ${FCOMPILE} timers.f
+
+${COMMON}/c_timers.o: ${COMMON}/c_timers.c ../config/make.def
+	cd ${COMMON}; ${CCOMPILE} c_timers.c
+
+${COMMON}/wtime.o: ${COMMON}/${WTIME} ../config/make.def
+	cd ${COMMON}; ${CCOMPILE} ${MACHINE} -o wtime.o ${COMMON}/${WTIME}
+# Adding ROI hooks
+${COMMON}/hooks.o: ${COMMON}/hooks.c ../config/make.def
+	cd ${COMMON}; ${CCOMPILE} hooks.c m5op_x86.S m5_mmap.c -Wno-implicit-function-declaration -I ../include-gem5/
+# For most machines or CRAY or IBM
+#	cd ${COMMON}; ${CCOMPILE} ${MACHINE} ${COMMON}/wtime.c
+# For a precise timer on an SGI Power Challenge, try:
+#	cd ${COMMON}; ${CCOMPILE} -o wtime.o ${COMMON}/wtime_sgi64.c
+
+${COMMON}/c_wtime.o: ${COMMON}/${WTIME} ../config/make.def
+	cd ${COMMON}; ${CCOMPILE} -o c_wtime.o ${COMMON}/${WTIME}
+
+
+# Normally setparams updates npbparams.h only if the settings (CLASS)
+# have changed. However, we also want to update if the compile options
+# may have changed (set in ../config/make.def).
+npbparams.h: ../config/make.def
+	@ echo make.def modified. Rebuilding npbparams.h just in case
+	rm -f npbparams.h
+	../sys/setparams ${BENCHMARK} ${CLASS}
+
+# So that "make benchmark-name" works
+${BENCHMARK}:  default
+${BENCHMARKU}: default
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/sys/print_header b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/sys/print_header
new file mode 100755
index 0000000..eefb7ee
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/sys/print_header
@@ -0,0 +1,6 @@
+echo '   ============================================'
+echo '   =      NAS PARALLEL BENCHMARKS 3.3         ='
+echo '   =      OpenMP Versions                     ='
+echo '   =      F77/C                               ='
+echo '   ============================================'
+echo ''
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/sys/print_instructions b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/sys/print_instructions
new file mode 100755
index 0000000..ccba261
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/sys/print_instructions
@@ -0,0 +1,19 @@
+echo ''
+echo '   To make a NAS benchmark type '
+echo ''
+echo '         make <benchmark-name> CLASS=<class>'
+echo ''
+echo '   where <benchmark-name> is "bt", "cg", "ep", "ft", "is", "lu",'
+echo '                             "mg", "sp", "ua", or "dc"'
+echo '         <class>          is "S", "W", "A", "B", "C" or "D"'
+echo ''
+echo '   To make a set of benchmarks, create the file config/suite.def'
+echo '   according to the instructions in config/suite.def.template and type'
+echo ''
+echo '         make suite'
+echo ''
+echo ' ***************************************************************'
+echo ' * Remember to edit the file config/make.def for site specific *'
+echo ' * information as described in the README file                 *'
+echo ' ***************************************************************'
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/sys/setparams.c b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/sys/setparams.c
new file mode 100644
index 0000000..37eb0fb
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/sys/setparams.c
@@ -0,0 +1,1055 @@
+/* 
+ * This utility configures a NPB to be built for a specific class. 
+ * It creates a file "npbparams.h" 
+ * in the source directory. This file keeps state information about 
+ * which size of benchmark is currently being built (so that nothing
+ * is unnecessarily rebuilt) and defines (through PARAMETER statements)
+ * the number of nodes and class for which a benchmark is being built. 
+ *
+ * The utility takes 2 arguments: 
+ *       setparams benchmark-name class
+ *    benchmark-name is "sp", "bt", etc
+ *    class is the size of the benchmark
+ * These parameters are checked for the current benchmark. If they
+ * are invalid, this program prints a message and aborts. 
+ * If the parameters are ok, the current npbparams.h (actually just
+ * the first line) is read in. If the new parameters are the same as 
+ * the old, nothing is done, but an exit code is returned to force the
+ * user to specify (otherwise the make procedure succeeds but builds a
+ * binary of the wrong name).  Otherwise the file is rewritten. 
+ * Errors write a message (to stdout) and abort. 
+ * 
+ * This program makes use of two extra benchmark "classes"
+ * class "X" means an invalid specification. It is returned if
+ * there is an error parsing the config file. 
+ * class "U" is an external specification meaning "unknown class"
+ * 
+ * Unfortunately everything has to be case sensitive. This is
+ * because we can always convert lower to upper or v.v. but
+ * can't feed this information back to the makefile, so typing
+ * make CLASS=a and make CLASS=A will produce different binaries.
+ *
+ * 
+ */
+
+#include <sys/types.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <ctype.h>
+#include <string.h>
+#include <time.h>
+
+/*
+ * This is the master version number for this set of 
+ * NPB benchmarks. It is in an obscure place so people
+ * won't accidentally change it. 
+ */
+
+#define VERSION "3.3.1"
+
+/* controls verbose output from setparams */
+/* #define VERBOSE */
+
+#define FILENAME "npbparams.h"
+#define DESC_LINE "c CLASS = %c\n"
+#define DEF_CLASS_LINE     "#define CLASS '%c'\n"
+#define FINDENT  "        "
+#define CONTINUE "     > "
+
+void get_info(char *argv[], int *typep, char *classp);
+void check_info(int type, char class);
+void read_info(int type, char *classp);
+void write_info(int type, char class);
+void write_sp_info(FILE *fp, char class);
+void write_bt_info(FILE *fp, char class);
+void write_dc_info(FILE *fp, char class);
+void write_lu_info(FILE *fp, char class);
+void write_mg_info(FILE *fp, char class);
+void write_cg_info(FILE *fp, char class);
+void write_ft_info(FILE *fp, char class);
+void write_ep_info(FILE *fp, char class);
+void write_is_info(FILE *fp, char class);
+void write_ua_info(FILE *fp, char class);
+void write_compiler_info(int type, FILE *fp);
+void write_convertdouble_info(int type, FILE *fp);
+void check_line(char *line, char *label, char *val);
+int  check_include_line(char *line, char *filename);
+void put_string(FILE *fp, char *name, char *val);
+void put_def_string(FILE *fp, char *name, char *val);
+void put_def_variable(FILE *fp, char *name, char *val);
+int ilog2(int i);
+double power(double base, int i);
+
+enum benchmark_types {SP, BT, LU, MG, FT, IS, EP, CG, UA, DC};
+
+int main(int argc, char *argv[])
+{
+  int type;
+  char class, class_old;
+  
+  if (argc != 3) {
+    printf("Usage: %s benchmark-name class\n", argv[0]);
+    exit(1);
+  }
+
+  /* Get command line arguments. Make sure they're ok. */
+  get_info(argv, &type, &class);
+  if (class != 'U') {
+#ifdef VERBOSE
+    printf("setparams: For benchmark %s: class = %c\n", 
+	   argv[1], class); 
+#endif
+    check_info(type, class);
+  }
+
+  /* Get old information. */
+  read_info(type, &class_old);
+  if (class != 'U') {
+    if (class_old != 'X') {
+#ifdef VERBOSE
+      printf("setparams:     old settings: class = %c\n", 
+	     class_old); 
+#endif
+    }
+  } else {
+    printf("setparams:\n\
+  *********************************************************************\n\
+  * You must specify CLASS to build this benchmark                    *\n\
+  * For example, to build a class A benchmark, type                   *\n\
+  *       make {benchmark-name} CLASS=A                               *\n\
+  *********************************************************************\n\n"); 
+
+    if (class_old != 'X') {
+#ifdef VERBOSE
+      printf("setparams: Previous settings were CLASS=%c \n", class_old); 
+#endif
+    }
+    exit(1); /* exit on class==U */
+  }
+
+  /* Write out new information if it's different. */
+  if (class != class_old) {
+#ifdef VERBOSE
+    printf("setparams: Writing %s\n", FILENAME); 
+#endif
+    write_info(type, class);
+  } else {
+#ifdef VERBOSE
+    printf("setparams: Settings unchanged. %s unmodified\n", FILENAME); 
+#endif
+  }
+
+  return 0;
+}
+
+
+/*
+ *  get_info(): Get parameters from command line 
+ */
+
+void get_info(char *argv[], int *typep, char *classp) 
+{
+
+  *classp = *argv[2];
+
+  if      (!strcmp(argv[1], "sp") || !strcmp(argv[1], "SP")) *typep = SP;
+  else if (!strcmp(argv[1], "bt") || !strcmp(argv[1], "BT")) *typep = BT;
+  else if (!strcmp(argv[1], "ft") || !strcmp(argv[1], "FT")) *typep = FT;
+  else if (!strcmp(argv[1], "lu") || !strcmp(argv[1], "LU")) *typep = LU;
+  else if (!strcmp(argv[1], "mg") || !strcmp(argv[1], "MG")) *typep = MG;
+  else if (!strcmp(argv[1], "is") || !strcmp(argv[1], "IS")) *typep = IS;
+  else if (!strcmp(argv[1], "ep") || !strcmp(argv[1], "EP")) *typep = EP;
+  else if (!strcmp(argv[1], "cg") || !strcmp(argv[1], "CG")) *typep = CG;
+  else if (!strcmp(argv[1], "ua") || !strcmp(argv[1], "UA")) *typep = UA;
+  else if (!strcmp(argv[1], "dc") || !strcmp(argv[1], "DC")) *typep = DC;
+  else {
+    printf("setparams: Error: unknown benchmark type %s\n", argv[1]);
+    exit(1);
+  }
+}
+
+/*
+ *  check_info(): Make sure command line data is ok for this benchmark 
+ */
+
+void check_info(int type, char class) 
+{
+
+  /* check class */
+  if (class != 'S' && 
+      class != 'W' && 
+      class != 'A' && 
+      class != 'B' && 
+      class != 'C' && 
+      class != 'D' && 
+      class != 'E') {
+    printf("setparams: Unknown benchmark class %c\n", class); 
+    printf("setparams: Allowed classes are \"S\", \"W\", and \"A\" through \"E\"\n");
+    exit(1);
+  }
+
+  if (class == 'E' && (type == IS || type == UA || type == DC)) {
+    printf("setparams: Benchmark class %c not defined for IS, UA, or DC\n", class);
+    exit(1);
+  }
+  if ((class == 'C' || class == 'D') && type == DC) {
+    printf("setparams: Benchmark class %c not defined for DC\n", class);
+    exit(1);
+  }
+}
+
+
+/* 
+ * read_info(): Read previous information from file. 
+ *              Not an error if file doesn't exist, because this
+ *              may be the first time we're running. 
+ *              Assumes the first line of the file is in a special
+ *              format that we understand (since we wrote it). 
+ */
+
+void read_info(int type, char *classp)
+{
+  int nread;
+  FILE *fp;
+  fp = fopen(FILENAME, "r");
+  if (fp == NULL) {
+#ifdef VERBOSE
+    printf("setparams: INFO: configuration file %s does not exist (yet)\n", FILENAME); 
+#endif
+    goto abort;
+  }
+  
+  /* first line of file contains info (fortran), first two lines (C) */
+
+  switch(type) {
+      case SP:
+      case BT:
+      case FT:
+      case MG:
+      case LU:
+      case EP:
+      case CG:
+      case UA:
+          nread = fscanf(fp, DESC_LINE, classp);
+          if (nread != 1) {
+            printf("setparams: Error parsing config file %s. Ignoring previous settings\n", FILENAME);
+            goto abort;
+          }
+          break;
+      case IS:
+      case DC:
+          nread = fscanf(fp, DEF_CLASS_LINE, classp);
+          if (nread != 1) {
+            printf("setparams: Error parsing config file %s. Ignoring previous settings\n", FILENAME);
+            goto abort;
+          }
+          break;
+      default:
+        /* never should have gotten this far with a bad name */
+        printf("setparams: (Internal Error) Benchmark type %d unknown to this program\n", type); 
+        exit(1);
+  }
+
+  fclose(fp);
+
+
+  return;
+
+ abort:
+  *classp = 'X';
+  return;
+}
+
+
+/* 
+ * write_info(): Write new information to config file. 
+ *               First line is in a special format so we can read
+ *               it in again. Then comes a warning. The rest is all
+ *               specific to a particular benchmark. 
+ */
+
+void write_info(int type, char class) 
+{
+  FILE *fp;
+  fp = fopen(FILENAME, "w");
+  if (fp == NULL) {
+    printf("setparams: Can't open file %s for writing\n", FILENAME);
+    exit(1);
+  }
+
+  switch(type) {
+      case SP:
+      case BT:
+      case FT:
+      case MG:
+      case LU:
+      case EP:
+      case CG:
+      case UA:
+          /* Write out the header */
+          fprintf(fp, DESC_LINE, class);
+          /* Print out a warning so bozos don't mess with the file */
+          fprintf(fp, "\
+c  \n\
+c  \n\
+c  This file is generated automatically by the setparams utility.\n\
+c  It sets the number of processors and the class of the NPB\n\
+c  in this directory. Do not modify it by hand.\n\
+c  \n");
+
+          break;
+      case IS:
+          fprintf(fp, DEF_CLASS_LINE, class);
+          fprintf(fp, "\
+/*\n\
+   This file is generated automatically by the setparams utility.\n\
+   It sets the number of processors and the class of the NPB\n\
+   in this directory. Do not modify it by hand.   */\n\
+   \n");
+          break;
+      case DC:
+          fprintf(fp, DEF_CLASS_LINE, class);
+          fprintf(fp, "\
+/*\n\
+   This file is generated automatically by the setparams utility.\n\
+   It sets the number of processors and the class of the NPB\n\
+   in this directory. Do not modify it by hand.\n\
+   This file provided for backward compatibility.\n\
+   It is not used in DC benchmark.   */\n\
+   \n");
+          break;
+      default:
+          printf("setparams: (Internal error): Unknown benchmark type %d\n", 
+                                                                         type);
+          exit(1);
+  }
+
+  /* Now do benchmark-specific stuff */
+  switch(type) {
+  case SP:
+    write_sp_info(fp, class);
+    break;	      
+  case BT:	      
+    write_bt_info(fp, class);
+    break;
+  case DC:
+    write_dc_info(fp, class);
+    break;	      
+  case LU:	      
+    write_lu_info(fp, class);
+    break;	      
+  case MG:	      
+    write_mg_info(fp, class);
+    break;	      
+  case IS:	      
+    write_is_info(fp, class);  
+    break;	      
+  case FT:	      
+    write_ft_info(fp, class);
+    break;	      
+  case EP:	      
+    write_ep_info(fp, class);
+    break;	      
+  case CG:	      
+    write_cg_info(fp, class);
+    break;
+  case UA:	      
+    write_ua_info(fp, class);
+    break;
+  default:
+    printf("setparams: (Internal error): Unknown benchmark type %d\n", type);
+    exit(1);
+  }
+  write_convertdouble_info(type, fp);
+  write_compiler_info(type, fp);
+  fclose(fp);
+  return;
+}
+
+
+/* 
+ * write_sp_info(): Write SP specific info to config file
+ */
+
+void write_sp_info(FILE *fp, char class) 
+{
+  int problem_size, niter;
+  char *dt;
+  if      (class == 'S') { problem_size = 12;  dt = "0.015d0";   niter = 100; }
+  else if (class == 'W') { problem_size = 36;  dt = "0.0015d0";  niter = 400; }
+  else if (class == 'A') { problem_size = 64;  dt = "0.0015d0";  niter = 400; }
+  else if (class == 'B') { problem_size = 102; dt = "0.001d0";   niter = 400; }
+  else if (class == 'C') { problem_size = 162; dt = "0.00067d0"; niter = 400; }
+  else if (class == 'D') { problem_size = 408; dt = "0.00030d0"; niter = 500; }
+  else if (class == 'E') { problem_size = 1020; dt = "0.0001d0"; niter = 500; }
+  else {
+    printf("setparams: Internal error: invalid class %c\n", class);
+    exit(1);
+  }
+  fprintf(fp, "%sinteger problem_size, niter_default\n", FINDENT);
+  fprintf(fp, "%sparameter (problem_size=%d, niter_default=%d)\n", 
+	       FINDENT, problem_size, niter);
+  fprintf(fp, "%sdouble precision dt_default\n", FINDENT);
+  fprintf(fp, "%sparameter (dt_default = %s)\n", FINDENT, dt);
+}
+  
+/* 
+ * write_bt_info(): Write BT specific info to config file
+ */
+
+void write_bt_info(FILE *fp, char class) 
+{
+  int problem_size, niter;
+  char *dt;
+  if      (class == 'S') { problem_size = 12;  dt = "0.010d0";   niter = 60; }
+  else if (class == 'W') { problem_size = 24;  dt = "0.0008d0";  niter = 200; }
+  else if (class == 'A') { problem_size = 64;  dt = "0.0008d0";  niter = 200; }
+  else if (class == 'B') { problem_size = 102; dt = "0.0003d0";  niter = 200; }
+  else if (class == 'C') { problem_size = 162; dt = "0.0001d0";  niter = 200; }
+  else if (class == 'D') { problem_size = 408; dt = "0.00002d0";  niter = 250; }
+  else if (class == 'E') { problem_size = 1020; dt = "0.4d-5";    niter = 250; }
+  else {
+    printf("setparams: Internal error: invalid class %c\n", class);
+    exit(1);
+  }
+  fprintf(fp, "%sinteger problem_size, niter_default\n", FINDENT);
+  fprintf(fp, "%sparameter (problem_size=%d, niter_default=%d)\n", 
+	       FINDENT, problem_size, niter);
+  fprintf(fp, "%sdouble precision dt_default\n", FINDENT);
+  fprintf(fp, "%sparameter (dt_default = %s)\n", FINDENT, dt);
+}
+  
+/* 
+ * write_dc_info(): Write DC specific info to config file
+ */
+
+
+void write_dc_info(FILE *fp, char class)
+{
+  long int input_tuples, attrnum;
+  if      (class == 'S') { input_tuples = 1000;     attrnum = 5; }
+  else if (class == 'W') { input_tuples = 100000;   attrnum = 10; }
+  else if (class == 'A') { input_tuples = 1000000;  attrnum = 15; }
+  else if (class == 'B') { input_tuples = 10000000; attrnum = 20; }
+  else {
+    printf("setparams: Internal error: invalid class %c\n", class);
+    exit(1);
+  }
+  fprintf(fp, "long long int input_tuples=%ld, attrnum=%ld;\n",
+              input_tuples, attrnum);
+}
+
+
+/* 
+ * write_lu_info(): Write LU specific info to config file
+ */
+
+void write_lu_info(FILE *fp, char class) 
+{
+  int isiz1, isiz2, itmax, inorm, problem_size;
+  char *dt_default;
+
+  if      (class == 'S') { problem_size = 12;  dt_default = "0.5d0"; itmax = 50; }
+  else if (class == 'W') { problem_size = 33;  dt_default = "1.5d-3"; itmax = 300; }
+  else if (class == 'A') { problem_size = 64;  dt_default = "2.0d0"; itmax = 250; }
+  else if (class == 'B') { problem_size = 102; dt_default = "2.0d0"; itmax = 250; }
+  else if (class == 'C') { problem_size = 162; dt_default = "2.0d0"; itmax = 250; }
+  else if (class == 'D') { problem_size = 408; dt_default = "1.0d0"; itmax = 300; }
+  else if (class == 'E') { problem_size = 1020; dt_default = "0.5d0"; itmax = 300; }
+  else {
+    printf("setparams: Internal error: invalid class %c\n", class);
+    exit(1);
+  }
+  inorm = itmax;
+  isiz1 = problem_size;
+  isiz2 = problem_size;
+  
+
+  fprintf(fp, "\nc full problem size\n");
+  fprintf(fp, "%sinteger isiz1, isiz2, isiz3\n", FINDENT);
+  fprintf(fp, "%sparameter (isiz1=%d, isiz2=%d, isiz3=%d)\n", 
+	       FINDENT, isiz1, isiz2, problem_size );
+
+  fprintf(fp, "\nc number of iterations and how often to print the norm\n");
+  fprintf(fp, "%sinteger itmax_default, inorm_default\n", FINDENT);
+  fprintf(fp, "%sparameter (itmax_default=%d, inorm_default=%d)\n", 
+	  FINDENT, itmax, inorm);
+
+  fprintf(fp, "%sdouble precision dt_default\n", FINDENT);
+  fprintf(fp, "%sparameter (dt_default = %s)\n", FINDENT, dt_default);
+  
+}
+
+/* 
+ * write_mg_info(): Write MG specific info to config file
+ */
+
+void write_mg_info(FILE *fp, char class) 
+{
+  int problem_size, nit, log2_size, lt_default, lm;
+  int ndim1, ndim2, ndim3;
+  if      (class == 'S') { problem_size = 32; nit = 4; }
+/*  else if (class == 'W') { problem_size = 64; nit = 40; }*/
+  else if (class == 'W') { problem_size = 128; nit = 4; }
+  else if (class == 'A') { problem_size = 256; nit = 4; }
+  else if (class == 'B') { problem_size = 256; nit = 20; }
+  else if (class == 'C') { problem_size = 512; nit = 20; }
+  else if (class == 'D') { problem_size = 1024; nit = 50; }
+  else if (class == 'E') { problem_size = 2048; nit = 50; }
+  else {
+    printf("setparams: Internal error: invalid class type %c\n", class);
+    exit(1);
+  }
+  log2_size = ilog2(problem_size);
+  /* lt is log of largest total dimension */
+  lt_default = log2_size;
+  /* log of log of maximum dimension on a node */
+  lm = log2_size;
+  ndim1 = lm;
+  ndim3 = log2_size;
+  ndim2 = log2_size;
+
+  fprintf(fp, "%sinteger nx_default, ny_default, nz_default\n", FINDENT);
+  fprintf(fp, "%sparameter (nx_default=%d, ny_default=%d, nz_default=%d)\n", 
+	  FINDENT, problem_size, problem_size, problem_size);
+  fprintf(fp, "%sinteger nit_default, lm, lt_default\n", FINDENT);
+  fprintf(fp, "%sparameter (nit_default=%d, lm = %d, lt_default=%d)\n", 
+	  FINDENT, nit, lm, lt_default);
+  fprintf(fp, "%sinteger debug_default\n", FINDENT);
+  fprintf(fp, "%sparameter (debug_default=%d)\n", FINDENT, 0);
+  fprintf(fp, "%sinteger ndim1, ndim2, ndim3\n", FINDENT);
+  fprintf(fp, "%sparameter (ndim1 = %d, ndim2 = %d, ndim3 = %d)\n", 
+	  FINDENT, ndim1, ndim2, ndim3);
+  fprintf(fp, "%sinteger%s one, nr, nv, ir\n", 
+          FINDENT, (problem_size > 1024)? "*8" : "");
+  fprintf(fp, "%sparameter (one=1)\n", FINDENT);
+}
+
+
+/* 
+ * write_is_info(): Write IS specific info to config file
+ */
+
+void write_is_info(FILE *fp, char class) 
+{
+  if( class != 'S' &&
+      class != 'W' &&
+      class != 'A' &&
+      class != 'B' &&
+      class != 'C' &&
+      class != 'D')
+  {
+    printf("setparams: Internal error: invalid class type %c\n", class);
+    exit(1);
+  }
+}
+
+
+/* 
+ * write_cg_info(): Write CG specific info to config file
+ */
+
+void write_cg_info(FILE *fp, char class) 
+{
+  int na,nonzer,niter;
+  char *shift,*rcond="1.0d-1";
+  char *shiftS="10.",
+       *shiftW="12.",
+       *shiftA="20.",
+       *shiftB="60.",
+       *shiftC="110.",
+       *shiftD="500.",
+       *shiftE="1.5d3";
+
+
+  if( class == 'S' )
+  { na=1400; nonzer=7; niter=15; shift=shiftS; }
+  else if( class == 'W' )
+  { na=7000; nonzer=8; niter=15; shift=shiftW; }
+  else if( class == 'A' )
+  { na=14000; nonzer=11; niter=15; shift=shiftA; }
+  else if( class == 'B' )
+  { na=75000; nonzer=13; niter=75; shift=shiftB; }
+  else if( class == 'C' )
+  { na=150000; nonzer=15; niter=75; shift=shiftC; }
+  else if( class == 'D' )
+  { na=1500000; nonzer=21; niter=100; shift=shiftD; }
+  else if( class == 'E' )
+  { na=9000000; nonzer=26; niter=100; shift=shiftE; }
+  else
+  {
+    printf("setparams: Internal error: invalid class type %c\n", class);
+    exit(1);
+  }
+  fprintf( fp, "%sinteger            na, nonzer, niter\n", FINDENT );
+  fprintf( fp, "%sdouble precision   shift, rcond\n", FINDENT );
+  fprintf( fp, "%sparameter(  na=%d,\n", FINDENT, na );
+  fprintf( fp, "%s             nonzer=%d,\n", CONTINUE, nonzer );
+  fprintf( fp, "%s             niter=%d,\n", CONTINUE, niter );
+  fprintf( fp, "%s             shift=%s,\n", CONTINUE, shift );
+  fprintf( fp, "%s             rcond=%s )\n", CONTINUE, rcond );
+  
+}
+
+
+
+/* 
+ * write_ua_info(): Write UA specific info to config file
+ */
+
+void write_ua_info(FILE *fp, char class) 
+{
+  int lelt, lmor,refine_max, niter, nmxh, fre;
+  char *alpha;
+
+  fre = 5;
+  if( class == 'S' )
+  { lelt=250;lmor=11600;       refine_max=4;  niter=50;  nmxh=10; alpha="0.040d0"; }
+  else if( class == 'W' )
+  { lelt=700;lmor=26700;       refine_max=5;  niter=100; nmxh=10; alpha="0.060d0"; }
+  else if( class == 'A' )
+  { lelt=2400;lmor=92700;      refine_max=6;  niter=200; nmxh=10; alpha="0.076d0"; }
+  else if( class == 'B' )
+  { lelt=8800;  lmor=334600;   refine_max=7;  niter=200; nmxh=10; alpha="0.076d0"; }
+  else if( class == 'C' )
+  { lelt=33500; lmor=1262100;  refine_max=8;  niter=200; nmxh=10; alpha="0.067d0"; }
+  else if( class == 'D' )
+  { lelt=515000;lmor=19500000; refine_max=10; niter=250; nmxh=10; alpha="0.046d0"; }
+  else
+  {
+    printf("setparams: Internal error: invalid class type %c\n", class);
+    exit(1);
+  }
+  
+  fprintf( fp, "%sinteger          lelt, lmor, refine_max, fre_default\n", FINDENT );
+  fprintf( fp, "%sinteger          niter_default, nmxh_default\n", FINDENT );
+  fprintf( fp, "%scharacter        class_default\n", FINDENT );
+  fprintf( fp, "%sdouble precision alpha_default\n", FINDENT );
+  fprintf( fp, "%sparameter(  lelt=%d,\n", FINDENT, lelt );
+  fprintf( fp, "%s            lmor=%d,\n", CONTINUE, lmor );
+  fprintf( fp, "%s             refine_max=%d,\n", CONTINUE, refine_max );
+  fprintf( fp, "%s             fre_default=%d,\n", CONTINUE, fre );
+  fprintf( fp, "%s             niter_default=%d,\n", CONTINUE, niter );
+  fprintf( fp, "%s             nmxh_default=%d,\n", CONTINUE, nmxh );
+  fprintf( fp, "%s             class_default=\"%c\",\n", CONTINUE, class );
+  fprintf( fp, "%s             alpha_default=%s )\n", CONTINUE, alpha );
+  
+}
+
+
+/* 
+ * write_ft_info(): Write FT specific info to config file
+ */
+
+void write_ft_info(FILE *fp, char class) 
+{
+  /* easiest way (given the way the benchmark is written)
+   * is to specify the number of grid points in each
+   * direction nx, ny, nz. niter is the number of iterations
+   */
+  int nx, ny, nz, maxdim, niter;
+  if      (class == 'S') { nx = 64; ny = 64; nz = 64; niter = 6;}
+  else if (class == 'W') { nx = 128; ny = 128; nz = 32; niter = 6;}
+  else if (class == 'A') { nx = 256; ny = 256; nz = 128; niter = 6;}
+  else if (class == 'B') { nx = 512; ny = 256; nz = 256; niter =20;}
+  else if (class == 'C') { nx = 512; ny = 512; nz = 512; niter =20;}
+  else if (class == 'D') { nx = 2048; ny = 1024; nz = 1024; niter =25;}
+  else if (class == 'E') { nx = 4096; ny = 2048; nz = 2048; niter =25;}
+  else {
+    printf("setparams: Internal error: invalid class type %c\n", class);
+    exit(1);
+  }
+  maxdim = nx;
+  if (ny > maxdim) maxdim = ny;
+  if (nz > maxdim) maxdim = nz;
+  fprintf(fp, "%sinteger nx, ny, nz, maxdim, niter_default\n", FINDENT);
+  fprintf(fp, "%sinteger%s ntotal, nxp, nyp, ntotalp\n", FINDENT,
+          (nx > 1024)? "*8" : "");
+  fprintf(fp, "%sparameter (nx=%d, ny=%d, nz=%d, maxdim=%d)\n", 
+          FINDENT, nx, ny, nz, maxdim);
+  fprintf(fp, "%sparameter (niter_default=%d)\n", FINDENT, niter);
+  fprintf(fp, "%sparameter (nxp=nx+1, nyp=ny)\n", FINDENT);
+  fprintf(fp, "%sparameter (ntotal=nx*nyp*nz)\n", FINDENT);
+  fprintf(fp, "%sparameter (ntotalp=nxp*nyp*nz)\n", FINDENT);
+
+}
+
+/*
+ * write_ep_info(): Write EP specific info to config file
+ */
+
+void write_ep_info(FILE *fp, char class)
+{
+  /* easiest way (given the way the benchmark is written)
+   * is to specify m, the log (base 2) of the number of
+   * random-number pairs to generate for the given class
+   */
+  int m;
+  if      (class == 'S') { m = 24; }
+  else if (class == 'W') { m = 25; }
+  else if (class == 'A') { m = 28; }
+  else if (class == 'B') { m = 30; }
+  else if (class == 'C') { m = 32; }
+  else if (class == 'D') { m = 36; }
+  else if (class == 'E') { m = 40; }
+  else {
+    printf("setparams: Internal error: invalid class type %c\n", class);
+    exit(1);
+  }
+
+  fprintf(fp, "%scharacter class\n",FINDENT);
+  fprintf(fp, "%sparameter (class =\'%c\')\n",
+                  FINDENT, class);
+  fprintf(fp, "%sinteger m\n", FINDENT);
+  fprintf(fp, "%sparameter (m=%d)\n", FINDENT, m);
+}
+
+
+/* 
+ * This is a gross hack to allow the benchmarks to 
+ * print out how they were compiled. Various other ways
+ * of doing this have been tried and they all fail on
+ * some machine - due to a broken "make" program, or
+ * F77 limitations, or whatever. Hopefully this will
+ * always work because it uses very portable C. Unfortunately
+ * it relies on parsing the make.def file - YUK. 
+ * If your machine doesn't have <string.h> or <ctype.h>, happy hacking!
+ * 
+ */
+
+#define VERBOSE
+#define LL 400
+#define DEFFILE "../config/make.def"
+#define DEFAULT_MESSAGE "(none)"
+FILE *deffile;
+void write_compiler_info(int type, FILE *fp)
+{
+  char line[LL];
+  char f77[LL], flink[LL], f_lib[LL], f_inc[LL], fflags[LL], flinkflags[LL];
+  char compiletime[LL], randfile[LL];
+  char cc[LL], cflags[LL], clink[LL], clinkflags[LL],
+       c_lib[LL], c_inc[LL];
+  struct tm *tmp;
+  time_t t;
+  deffile = fopen(DEFFILE, "r");
+  if (deffile == NULL) {
+    printf("\n\
+setparams: File %s doesn't exist. To build the NAS benchmarks\n\
+           you need to create it according to the instructions\n\
+           in the README in the main directory and comments in \n\
+           the file config/make.def.template\n", DEFFILE);
+    exit(1);
+  }
+  strcpy(f77, DEFAULT_MESSAGE);
+  strcpy(flink, DEFAULT_MESSAGE);
+  strcpy(f_lib, DEFAULT_MESSAGE);
+  strcpy(f_inc, DEFAULT_MESSAGE);
+  strcpy(fflags, DEFAULT_MESSAGE);
+  strcpy(flinkflags, DEFAULT_MESSAGE);
+  strcpy(randfile, DEFAULT_MESSAGE);
+  strcpy(cc, DEFAULT_MESSAGE);
+  strcpy(cflags, DEFAULT_MESSAGE);
+  strcpy(clink, DEFAULT_MESSAGE);
+  strcpy(clinkflags, DEFAULT_MESSAGE);
+  strcpy(c_lib, DEFAULT_MESSAGE);
+  strcpy(c_inc, DEFAULT_MESSAGE);
+
+  while (fgets(line, LL, deffile) != NULL) {
+    if (*line == '#') continue;
+    /* yes, this is inefficient. but it's simple! */
+    check_line(line, "F77", f77);
+    check_line(line, "FLINK", flink);
+    check_line(line, "F_LIB", f_lib);
+    check_line(line, "F_INC", f_inc);
+    check_line(line, "FFLAGS", fflags);
+    check_line(line, "FLINKFLAGS", flinkflags);
+    check_line(line, "RAND", randfile);
+    check_line(line, "CC", cc);
+    check_line(line, "CFLAGS", cflags);
+    check_line(line, "CLINK", clink);
+    check_line(line, "CLINKFLAGS", clinkflags);
+    check_line(line, "C_LIB", c_lib);
+    check_line(line, "C_INC", c_inc);
+  }
+
+  
+  (void) time(&t);
+  tmp = localtime(&t);
+  (void) strftime(compiletime, (size_t)LL, "%d %b %Y", tmp);
+
+
+  switch(type) {
+      case FT:
+      case SP:
+      case BT:
+      case MG:
+      case LU:
+      case EP:
+      case CG:
+      case UA:
+          put_string(fp, "compiletime", compiletime);
+          put_string(fp, "npbversion", VERSION);
+          put_string(fp, "cs1", f77);
+          put_string(fp, "cs2", flink);
+          put_string(fp, "cs3", f_lib);
+          put_string(fp, "cs4", f_inc);
+          put_string(fp, "cs5", fflags);
+          put_string(fp, "cs6", flinkflags);
+          put_string(fp, "cs7", randfile);
+          break;
+      case IS:
+      case DC:
+          put_def_string(fp, "COMPILETIME", compiletime);
+          put_def_string(fp, "NPBVERSION", VERSION);
+          put_def_string(fp, "CC", cc);
+          put_def_string(fp, "CFLAGS", cflags);
+          put_def_string(fp, "CLINK", clink);
+          put_def_string(fp, "CLINKFLAGS", clinkflags);
+          put_def_string(fp, "C_LIB", c_lib);
+          put_def_string(fp, "C_INC", c_inc);
+          break;
+      default:
+          printf("setparams: (Internal error): Unknown benchmark type %d\n", 
+                                                                         type);
+          exit(1);
+  }
+
+}
+
+void check_line(char *line, char *label, char *val)
+{
+  char *original_line;
+  int n;
+  original_line = line;
+  /* compare beginning of line and label */
+  while (*label != '\0' && *line == *label) {
+    line++; label++; 
+  }
+  /* if *label is not EOS, we must have had a mismatch */
+  if (*label != '\0') return;
+  /* if *line is neither a space nor '=', actual label is longer than test label */
+  if (!isspace(*line) && *line != '=') return ; 
+  /* skip over white space */
+  while (isspace(*line)) line++;
+  /* next char should be '=' */
+  if (*line != '=') return;
+  /* skip over white space */
+  while (isspace(*++line));
+  /* if EOS, nothing was specified */
+  if (*line == '\0') return;
+  /* finally we've come to the value */
+  strcpy(val, line);
+  /* chop off the newline at the end */
+  n = strlen(val)-1;
+  if (n >= 0 && val[n] == '\n')
+    val[n--] = '\0';
+  if (n >= 0 && val[n] == '\r')
+    val[n--] = '\0';
+  /* treat continuation */
+  while (val[n] == '\\' && fgets(original_line, LL, deffile)) {
+     line = original_line;
+     while (isspace(*line)) line++;
+     if (isspace(*original_line)) val[n++] = ' ';
+     while (*line && *line != '\n' && *line != '\r' && n < LL-1)
+       val[n++] = *line++;
+     val[n] = '\0';
+     n--;
+  }
+/*  if (val[n] == '\\') {
+    printf("\n\
+setparams: Error in file make.def. Because of the way in which\n\
+           command line arguments are incorporated into the\n\
+           executable benchmark, you can't have any continued\n\
+           lines in the file make.def, that is, lines ending\n\
+           with the character \"\\\". Although it may be ugly, \n\
+           you should be able to reformat without continuation\n\
+           lines. The offending line is\n\
+  %s\n", original_line);
+    exit(1);
+  } */
+}
+
+int check_include_line(char *line, char *filename)
+{
+  char *include_string = "include";
+  /* compare beginning of line and "include" */
+  while (*include_string != '\0' && *line == *include_string) {
+    line++; include_string++; 
+  }
+  /* if *include_string is not EOS, we must have had a mismatch */
+  if (*include_string != '\0') return(0);
+  /* if *line is not a space, first word is not "include" */
+  if (!isspace(*line)) return(0); 
+  /* skip over white space */
+  while (isspace(*++line));
+  /* if EOS, nothing was specified */
+  if (*line == '\0') return(0);
+  /* next keyword should be name of include file in *filename */
+  while (*filename != '\0' && *line == *filename) {
+    line++; filename++; 
+  }  
+  if (*filename != '\0' || 
+      (*line != ' ' && *line != '\0' && *line !='\n')) return(0);
+  else return(1);
+}
+
+
+#define MAXL 46
+void put_string(FILE *fp, char *name, char *val)
+{
+  int len;
+  len = strlen(val);
+  if (len > MAXL) {
+    val[MAXL] = '\0';
+    val[MAXL-1] = '.';
+    val[MAXL-2] = '.';
+    val[MAXL-3] = '.';
+    len = MAXL;
+  }
+  fprintf(fp, "%scharacter %s*%d\n", FINDENT, name, len);
+  fprintf(fp, "%sparameter (%s=\'%s\')\n", FINDENT, name, val);
+}
+
+/* need to escape quote (") in val */
+int fix_string_quote(char *val, char *newval, int maxl)
+{
+  int len;
+  int i, j;
+  len = strlen(val);
+  i = j = 0;
+  while (i < len && j < maxl) {
+    if (val[i] == '"')
+      newval[j++] = '\\';
+    if (j < maxl)
+      newval[j++] = val[i++];
+  }
+  newval[j] = '\0';
+  return j;
+}
+
+/* NOTE: is the ... stuff necessary in C? */
+void put_def_string(FILE *fp, char *name, char *val0)
+{
+  int len;
+  char val[MAXL+3];
+  len = fix_string_quote(val0, val, MAXL+2);
+  if (len > MAXL) {
+    val[MAXL] = '\0';
+    val[MAXL-1] = '.';
+    val[MAXL-2] = '.';
+    val[MAXL-3] = '.';
+    len = MAXL;
+  }
+  fprintf(fp, "#define %s \"%s\"\n", name, val);
+}
+
+void put_def_variable(FILE *fp, char *name, char *val)
+{
+  int len;
+  len = strlen(val);
+  if (len > MAXL) {
+    val[MAXL] = '\0';
+    val[MAXL-1] = '.';
+    val[MAXL-2] = '.';
+    val[MAXL-3] = '.';
+    len = MAXL;
+  }
+  fprintf(fp, "#define %s %s\n", name, val);
+}
+
+
+
+#if 0
+
+/* this version allows arbitrarily long lines but 
+ * some compilers don't like that and they're rarely
+ * useful 
+ */
+
+#define LINELEN 65
+void put_string(FILE *fp, char *name, char *val)
+{
+  int len, nlines, pos, i;
+  char line[100];
+  len = strlen(val);
+  nlines = len/LINELEN;
+  if (nlines*LINELEN < len) nlines++;
+  fprintf(fp, "%scharacter*%d %s\n", FINDENT, nlines*LINELEN, name);
+  fprintf(fp, "%sparameter (%s = \n", FINDENT, name);
+  for (i = 0; i < nlines; i++) {
+    pos = i*LINELEN;
+    if (i == 0) fprintf(fp, "%s\'", CONTINUE);
+    else        fprintf(fp, "%s", CONTINUE);
+    /* number should be same as LINELEN */
+    fprintf(fp, "%.65s", val+pos);
+    if (i == nlines-1) fprintf(fp, "\')\n");
+    else             fprintf(fp, "\n");
+  }
+}
+
+#endif
+
+
+/* integer log base two. Return error (-1) if argument isn't
+ * a power of two or is less than or equal to zero
+ */
+
+int ilog2(int i)
+{
+  int log2;
+  int exp2 = 1;
+  if (i <= 0) return(-1);
+
+  for (log2 = 0; log2 < 30; log2++) {
+    if (exp2 == i) return(log2);
+    if (exp2 > i) break;
+    exp2 *= 2;
+  }
+  return(-1);
+}
+
+
+/* Power function. We could use pow from the math library, but then
+ * we would have to insist on always linking with the math library, just
+ * for this function. Since we only need pow with integer exponents,
+ * we'll code it ourselves here.
+ */
+
+double power(double base, int i)
+{
+  double x;
+
+  if (i==0) return (1.0);
+  else if (i<0) {
+    base = 1.0/base;
+    i = -i;
+  }
+  x = 1.0;
+  while (i>0) {
+    x *=base;
+    i--;
+  }
+  return (x);
+}
+    
+
+void write_convertdouble_info(int type, FILE *fp)
+{
+  switch(type) {
+  case SP:
+  case BT:
+  case LU:
+  case FT:
+  case MG:
+  case EP:
+  case CG:
+  case UA:
+    fprintf(fp, "%slogical  convertdouble\n", FINDENT);
+#ifdef CONVERTDOUBLE
+    fprintf(fp, "%sparameter (convertdouble = .true.)\n", FINDENT);
+#else
+    fprintf(fp, "%sparameter (convertdouble = .false.)\n", FINDENT);
+#endif
+    break;
+  }
+}
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/sys/suite.awk b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/sys/suite.awk
new file mode 100644
index 0000000..461adab
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-OMP/sys/suite.awk
@@ -0,0 +1,10 @@
+BEGIN { SMAKE = "make" } {
+  if ($1 !~ /^#/ &&  NF > 1) {
+    printf "cd `echo %s|tr '[a-z]' '[A-Z]'`; %s clean;", $1, SMAKE;
+    printf "%s CLASS=%s", SMAKE, $2;
+    if (NF > 2) {
+      printf " VERSION=%s", $3;
+    }
+    printf "; cd ..\n";
+  }
+}
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/Makefile b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/Makefile
new file mode 100644
index 0000000..24f6fea
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/Makefile
@@ -0,0 +1,60 @@
+SHELL=/bin/sh
+BENCHMARK=bt
+BENCHMARKU=BT
+VEC=
+
+include ../config/make.def
+
+
+OBJS = bt.o  initialize.o exact_solution.o exact_rhs.o \
+       set_constants.o adi.o  rhs.o      \
+       x_solve$(VEC).o y_solve$(VEC).o solve_subs.o  \
+       z_solve$(VEC).o add.o error.o verify.o \
+       ${COMMON}/print_results.o ${COMMON}/timers.o ${COMMON}/wtime.o
+
+include ../sys/make.common
+
+# npbparams.h is included by header.h
+# The following rule should do the trick but many make programs (not gmake)
+# will do the wrong thing and rebuild the world every time (because the
+# mod time on header.h is not changed). One solution would be to
+# touch header.h but this might cause confusion if someone has
+# accidentally deleted it. Instead, make the dependency on npbparams.h
+# explicit in all the lines below (even though dependence is indirect). 
+
+# header.h: npbparams.h
+
+${PROGRAM}: config
+	@if [ x$(VERSION) = xvec ] ; then	\
+		${MAKE} VEC=_vec exec;		\
+	elif [ x$(VERSION) = xVEC ] ; then	\
+		${MAKE} VEC=_vec exec;		\
+	else					\
+		${MAKE} exec;			\
+	fi
+
+exec: $(OBJS)
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} ${F_LIB}
+
+.f.o:
+	${FCOMPILE} $<
+
+
+bt.o:             bt.f  header.h npbparams.h
+initialize.o:     initialize.f  header.h npbparams.h
+exact_solution.o: exact_solution.f  header.h npbparams.h
+exact_rhs.o:      exact_rhs.f  header.h npbparams.h
+set_constants.o:  set_constants.f  header.h npbparams.h
+adi.o:            adi.f  header.h npbparams.h
+rhs.o:            rhs.f  header.h npbparams.h
+x_solve$(VEC).o:  x_solve$(VEC).f  header.h work_lhs$(VEC).h npbparams.h
+y_solve$(VEC).o:  y_solve$(VEC).f  header.h work_lhs$(VEC).h npbparams.h
+z_solve$(VEC).o:  z_solve$(VEC).f  header.h work_lhs$(VEC).h npbparams.h
+solve_subs.o:     solve_subs.f  npbparams.h
+add.o:            add.f  header.h npbparams.h
+error.o:          error.f  header.h npbparams.h
+verify.o:         verify.f  header.h npbparams.h
+
+clean:
+	- rm -f *.o *~ mputil*
+	- rm -f  npbparams.h core
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/add.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/add.f
new file mode 100644
index 0000000..f859822
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/add.f
@@ -0,0 +1,30 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine  add
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     addition of update to the vector u
+c---------------------------------------------------------------------
+
+      include 'header.h'
+
+      integer i, j, k, m
+
+      if (timeron) call timer_start(t_add)
+      do     k = 1, grid_points(3)-2
+         do     j = 1, grid_points(2)-2
+            do     i = 1, grid_points(1)-2
+               do    m = 1, 5
+                  u(m,i,j,k) = u(m,i,j,k) + rhs(m,i,j,k)
+               enddo
+            enddo
+         enddo
+      enddo
+      if (timeron) call timer_stop(t_add)
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/adi.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/adi.f
new file mode 100644
index 0000000..4b45494
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/adi.f
@@ -0,0 +1,21 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine  adi
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      call compute_rhs
+
+      call x_solve
+
+      call y_solve
+
+      call z_solve
+
+      call add
+
+      return
+      end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/bt.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/bt.f
new file mode 100644
index 0000000..f9629ae
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/bt.f
@@ -0,0 +1,210 @@
+!-------------------------------------------------------------------------!
+!                                                                         !
+!        N  A  S     P A R A L L E L     B E N C H M A R K S  3.3         !
+!                                                                         !
+!                      S E R I A L     V E R S I O N                      !
+!                                                                         !
+!                                   B T                                   !
+!                                                                         !
+!-------------------------------------------------------------------------!
+!                                                                         !
+!    This benchmark is a serial version of the NPB BT code.               !
+!    Refer to NAS Technical Reports 95-020 and 99-011 for details.        !
+!                                                                         !
+!    Permission to use, copy, distribute and modify this software         !
+!    for any purpose with or without fee is hereby granted.  We           !
+!    request, however, that all derived work reference the NAS            !
+!    Parallel Benchmarks 3.3. This software is provided "as is"           !
+!    without express or implied warranty.                                 !
+!                                                                         !
+!    Information on NPB 3.3, including the technical report, the          !
+!    original specifications, source code, results and information        !
+!    on how to submit new results, is available at:                       !
+!                                                                         !
+!           http://www.nas.nasa.gov/Software/NPB/                         !
+!                                                                         !
+!    Send comments or suggestions to  npb@nas.nasa.gov                    !
+!                                                                         !
+!          NAS Parallel Benchmarks Group                                  !
+!          NASA Ames Research Center                                      !
+!          Mail Stop: T27A-1                                              !
+!          Moffett Field, CA   94035-1000                                 !
+!                                                                         !
+!          E-mail:  npb@nas.nasa.gov                                      !
+!          Fax:     (650) 604-3957                                        !
+!                                                                         !
+!-------------------------------------------------------------------------!
+
+c---------------------------------------------------------------------
+c
+c Authors: R. Van der Wijngaart
+c          T. Harris
+c          M. Yarrow
+c          H. Jin
+c
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+       program BT
+c---------------------------------------------------------------------
+
+       include  'header.h'
+      
+       integer i, niter, step, fstatus
+       double precision navg, mflops, n3
+
+       external timer_read
+       double precision tmax, timer_read, t, trecs(t_last)
+       logical verified
+       character class
+       character        t_names(t_last)*8
+
+c---------------------------------------------------------------------
+c      Root node reads input file (if it exists) else takes
+c      defaults from parameters
+c---------------------------------------------------------------------
+          
+       open (unit=2,file='timer.flag',status='old', iostat=fstatus)
+       if (fstatus .eq. 0) then
+         timeron = .true.
+         t_names(t_total) = 'total'
+         t_names(t_rhsx) = 'rhsx'
+         t_names(t_rhsy) = 'rhsy'
+         t_names(t_rhsz) = 'rhsz'
+         t_names(t_rhs) = 'rhs'
+         t_names(t_xsolve) = 'xsolve'
+         t_names(t_ysolve) = 'ysolve'
+         t_names(t_zsolve) = 'zsolve'
+         t_names(t_rdis1) = 'redist1'
+         t_names(t_rdis2) = 'redist2'
+         t_names(t_add) = 'add'
+         close(2)
+       else
+         timeron = .false.
+       endif
+
+       write(*, 1000)
+       open (unit=2,file='inputbt.data',status='old', iostat=fstatus)
+
+       if (fstatus .eq. 0) then
+         write(*,233) 
+ 233     format(' Reading from input file inputbt.data')
+         read (2,*) niter
+         read (2,*) dt
+         read (2,*) grid_points(1), grid_points(2), grid_points(3)
+         close(2)
+       else
+         write(*,234) 
+         niter = niter_default
+         dt    = dt_default
+         grid_points(1) = problem_size
+         grid_points(2) = problem_size
+         grid_points(3) = problem_size
+       endif
+ 234   format(' No input file inputbt.data. Using compiled defaults')
+
+       write(*, 1001) grid_points(1), grid_points(2), grid_points(3)
+       write(*, 1002) niter, dt
+       write(*, *)
+
+ 1000  format(//, ' NAS Parallel Benchmarks (NPB3.3-SER)',
+     >            ' - BT Benchmark',/)
+ 1001  format(' Size: ', i4, 'x', i4, 'x', i4)
+ 1002  format(' Iterations: ', i4, '    dt: ', F10.6)
+
+       if ( (grid_points(1) .gt. IMAX) .or.
+     >      (grid_points(2) .gt. JMAX) .or.
+     >      (grid_points(3) .gt. KMAX) ) then
+             print *, (grid_points(i),i=1,3)
+             print *,' Problem size too big for compiled array sizes'
+             goto 999
+       endif
+
+
+       call set_constants
+
+       do i = 1, t_last
+          call timer_clear(i)
+       end do
+
+       call initialize
+
+       call exact_rhs
+
+c---------------------------------------------------------------------
+c      do one time step to touch all code, and reinitialize
+c---------------------------------------------------------------------
+       call adi
+       call initialize
+
+       do i = 1, t_last
+          call timer_clear(i)
+       end do
+       call timer_start(1)
+
+       do  step = 1, niter
+
+          if (mod(step, 20) .eq. 0 .or. 
+     >        step .eq. 1) then
+             write(*, 200) step
+ 200         format(' Time step ', i4)
+          endif
+
+          call adi
+
+       end do
+
+       call timer_stop(1)
+       tmax = timer_read(1)
+       
+       call verify(niter, class, verified)
+
+       n3 = 1.0d0*grid_points(1)*grid_points(2)*grid_points(3)
+       navg = (grid_points(1)+grid_points(2)+grid_points(3))/3.0
+       if( tmax .ne. 0. ) then
+          mflops = 1.0e-6*float(niter)*
+     >  (3478.8*n3-17655.7*navg**2+28023.7*navg)
+     >  / tmax
+       else
+          mflops = 0.0
+       endif
+       call print_results('BT', class, grid_points(1), 
+     >  grid_points(2), grid_points(3), niter,
+     >  tmax, mflops, '          floating point', 
+     >  verified, npbversion,compiletime, cs1, cs2, cs3, cs4, cs5, 
+     >  cs6, '(none)')
+
+c---------------------------------------------------------------------
+c      More timers
+c---------------------------------------------------------------------
+       if (.not.timeron) goto 999
+
+       do i=1, t_last
+          trecs(i) = timer_read(i)
+       end do
+
+       if (tmax .eq. 0.0) tmax = 1.0
+       write(*,800)
+ 800   format('  SECTION   Time (secs)')
+       do i=1, t_last
+          write(*,810) t_names(i), trecs(i), trecs(i)*100./tmax
+          if (i.eq.t_rhs) then
+             t = trecs(t_rhsx) + trecs(t_rhsy) + trecs(t_rhsz)
+             write(*,820) 'sub-rhs', t, t*100./tmax
+             t = trecs(t_rhs) - t
+             write(*,820) 'rest-rhs', t, t*100./tmax
+          elseif (i.eq.t_zsolve) then
+             t = trecs(t_zsolve) - trecs(t_rdis1) - trecs(t_rdis2)
+             write(*,820) 'sub-zsol', t, t*100./tmax
+          elseif (i.eq.t_rdis2) then
+             t = trecs(t_rdis1) + trecs(t_rdis2)
+             write(*,820) 'redist', t, t*100./tmax
+          endif
+ 810      format(2x,a8,':',f9.3,'  (',f6.2,'%)')
+ 820      format('    --> ',a8,':',f9.3,'  (',f6.2,'%)')
+       end do
+
+ 999   continue
+
+       end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/error.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/error.f
new file mode 100644
index 0000000..1a3491a
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/error.f
@@ -0,0 +1,87 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine error_norm(rms)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     this subroutine computes the norm of the difference between the
+c     computed solution and the exact solution
+c---------------------------------------------------------------------
+
+      include 'header.h'
+
+      integer i, j, k, m, d
+      double precision xi, eta, zeta, u_exact(5), rms(5), add
+
+      do m = 1, 5 
+         rms(m) = 0.0d0
+      enddo
+
+      do k = 0, grid_points(3)-1
+         zeta = dble(k) * dnzm1
+         do j = 0, grid_points(2)-1
+            eta = dble(j) * dnym1
+            do i = 0, grid_points(1)-1
+               xi = dble(i) * dnxm1
+               call exact_solution(xi, eta, zeta, u_exact)
+
+               do m = 1, 5
+                  add = u(m,i,j,k)-u_exact(m)
+                  rms(m) = rms(m) + add*add
+               enddo
+            enddo
+          enddo
+       enddo
+
+      do m = 1, 5
+         do d = 1, 3
+            rms(m) = rms(m) / dble(grid_points(d)-2)
+         enddo
+         rms(m) = dsqrt(rms(m))
+      enddo
+
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine rhs_norm(rms)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      include 'header.h'
+
+      integer i, j, k, d, m
+      double precision rms(5), add
+
+      do m = 1, 5
+         rms(m) = 0.0d0
+      enddo 
+
+      do k = 1, grid_points(3)-2
+         do j = 1, grid_points(2)-2
+            do i = 1, grid_points(1)-2
+               do m = 1, 5
+                  add = rhs(m,i,j,k)
+                  rms(m) = rms(m) + add*add
+               enddo 
+            enddo 
+         enddo 
+      enddo 
+
+      do m = 1, 5
+         do d = 1, 3
+            rms(m) = rms(m) / dble(grid_points(d)-2)
+         enddo 
+         rms(m) = dsqrt(rms(m))
+      enddo 
+
+      return
+      end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/exact_rhs.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/exact_rhs.f
new file mode 100644
index 0000000..2f6c38b
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/exact_rhs.f
@@ -0,0 +1,341 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine exact_rhs
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     compute the right hand side based on exact solution
+c---------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision dtemp(5), xi, eta, zeta, dtpp
+      integer m, i, j, k, ip1, im1, jp1, jm1, km1, kp1
+
+c---------------------------------------------------------------------
+c     initialize                                  
+c---------------------------------------------------------------------
+      do k= 0, grid_points(3)-1
+         do j = 0, grid_points(2)-1
+            do i = 0, grid_points(1)-1
+               do m = 1, 5
+                  forcing(m,i,j,k) = 0.0d0
+               enddo
+            enddo
+         enddo
+      enddo
+
+c---------------------------------------------------------------------
+c     xi-direction flux differences                      
+c---------------------------------------------------------------------
+      do k = 1, grid_points(3)-2
+         zeta = dble(k) * dnzm1
+         do j = 1, grid_points(2)-2
+            eta = dble(j) * dnym1
+
+            do i=0, grid_points(1)-1
+               xi = dble(i) * dnxm1
+
+               call exact_solution(xi, eta, zeta, dtemp)
+               do m = 1, 5
+                  ue(i,m) = dtemp(m)
+               enddo
+
+               dtpp = 1.0d0 / dtemp(1)
+
+               do m = 2, 5
+                  buf(i,m) = dtpp * dtemp(m)
+               enddo
+
+               cuf(i)   = buf(i,2) * buf(i,2)
+               buf(i,1) = cuf(i) + buf(i,3) * buf(i,3) + 
+     >                 buf(i,4) * buf(i,4) 
+               q(i) = 0.5d0*(buf(i,2)*ue(i,2) + buf(i,3)*ue(i,3) +
+     >                 buf(i,4)*ue(i,4))
+
+            enddo
+               
+            do i = 1, grid_points(1)-2
+               im1 = i-1
+               ip1 = i+1
+
+               forcing(1,i,j,k) = forcing(1,i,j,k) -
+     >                 tx2*( ue(ip1,2)-ue(im1,2) )+
+     >                 dx1tx1*(ue(ip1,1)-2.0d0*ue(i,1)+ue(im1,1))
+
+               forcing(2,i,j,k) = forcing(2,i,j,k) - tx2 * (
+     >                 (ue(ip1,2)*buf(ip1,2)+c2*(ue(ip1,5)-q(ip1)))-
+     >                 (ue(im1,2)*buf(im1,2)+c2*(ue(im1,5)-q(im1))))+
+     >                 xxcon1*(buf(ip1,2)-2.0d0*buf(i,2)+buf(im1,2))+
+     >                 dx2tx1*( ue(ip1,2)-2.0d0* ue(i,2)+ue(im1,2))
+
+               forcing(3,i,j,k) = forcing(3,i,j,k) - tx2 * (
+     >                 ue(ip1,3)*buf(ip1,2)-ue(im1,3)*buf(im1,2))+
+     >                 xxcon2*(buf(ip1,3)-2.0d0*buf(i,3)+buf(im1,3))+
+     >                 dx3tx1*( ue(ip1,3)-2.0d0*ue(i,3) +ue(im1,3))
+                  
+               forcing(4,i,j,k) = forcing(4,i,j,k) - tx2*(
+     >                 ue(ip1,4)*buf(ip1,2)-ue(im1,4)*buf(im1,2))+
+     >                 xxcon2*(buf(ip1,4)-2.0d0*buf(i,4)+buf(im1,4))+
+     >                 dx4tx1*( ue(ip1,4)-2.0d0* ue(i,4)+ ue(im1,4))
+
+               forcing(5,i,j,k) = forcing(5,i,j,k) - tx2*(
+     >                 buf(ip1,2)*(c1*ue(ip1,5)-c2*q(ip1))-
+     >                 buf(im1,2)*(c1*ue(im1,5)-c2*q(im1)))+
+     >                 0.5d0*xxcon3*(buf(ip1,1)-2.0d0*buf(i,1)+
+     >                 buf(im1,1))+
+     >                 xxcon4*(cuf(ip1)-2.0d0*cuf(i)+cuf(im1))+
+     >                 xxcon5*(buf(ip1,5)-2.0d0*buf(i,5)+buf(im1,5))+
+     >                 dx5tx1*( ue(ip1,5)-2.0d0* ue(i,5)+ ue(im1,5))
+            enddo
+
+c---------------------------------------------------------------------
+c     Fourth-order dissipation                         
+c---------------------------------------------------------------------
+
+            do m = 1, 5
+               i = 1
+               forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                    (5.0d0*ue(i,m) - 4.0d0*ue(i+1,m) +ue(i+2,m))
+               i = 2
+               forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                    (-4.0d0*ue(i-1,m) + 6.0d0*ue(i,m) -
+     >                    4.0d0*ue(i+1,m) +       ue(i+2,m))
+            enddo
+
+            do i = 3, grid_points(1)-4
+               do m = 1, 5
+                  forcing(m,i,j,k) = forcing(m,i,j,k) - dssp*
+     >                    (ue(i-2,m) - 4.0d0*ue(i-1,m) +
+     >                    6.0d0*ue(i,m) - 4.0d0*ue(i+1,m) + ue(i+2,m))
+               enddo
+            enddo
+
+            do m = 1, 5
+               i = grid_points(1)-3
+               forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                    (ue(i-2,m) - 4.0d0*ue(i-1,m) +
+     >                    6.0d0*ue(i,m) - 4.0d0*ue(i+1,m))
+               i = grid_points(1)-2
+               forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                    (ue(i-2,m) - 4.0d0*ue(i-1,m) + 5.0d0*ue(i,m))
+            enddo
+
+         enddo
+      enddo
+
+c---------------------------------------------------------------------
+c     eta-direction flux differences             
+c---------------------------------------------------------------------
+      do k = 1, grid_points(3)-2          
+         zeta = dble(k) * dnzm1
+         do i=1, grid_points(1)-2
+            xi = dble(i) * dnxm1
+
+            do j=0, grid_points(2)-1
+               eta = dble(j) * dnym1
+
+               call exact_solution(xi, eta, zeta, dtemp)
+               do m = 1, 5 
+                  ue(j,m) = dtemp(m)
+               enddo
+                  
+               dtpp = 1.0d0/dtemp(1)
+
+               do m = 2, 5
+                  buf(j,m) = dtpp * dtemp(m)
+               enddo
+
+               cuf(j)   = buf(j,3) * buf(j,3)
+               buf(j,1) = cuf(j) + buf(j,2) * buf(j,2) + 
+     >                 buf(j,4) * buf(j,4)
+               q(j) = 0.5d0*(buf(j,2)*ue(j,2) + buf(j,3)*ue(j,3) +
+     >                 buf(j,4)*ue(j,4))
+            enddo
+
+            do j = 1, grid_points(2)-2
+               jm1 = j-1
+               jp1 = j+1
+                  
+               forcing(1,i,j,k) = forcing(1,i,j,k) -
+     >                 ty2*( ue(jp1,3)-ue(jm1,3) )+
+     >                 dy1ty1*(ue(jp1,1)-2.0d0*ue(j,1)+ue(jm1,1))
+
+               forcing(2,i,j,k) = forcing(2,i,j,k) - ty2*(
+     >                 ue(jp1,2)*buf(jp1,3)-ue(jm1,2)*buf(jm1,3))+
+     >                 yycon2*(buf(jp1,2)-2.0d0*buf(j,2)+buf(jm1,2))+
+     >                 dy2ty1*( ue(jp1,2)-2.0d0* ue(j,2)+ ue(jm1,2))
+
+               forcing(3,i,j,k) = forcing(3,i,j,k) - ty2*(
+     >                 (ue(jp1,3)*buf(jp1,3)+c2*(ue(jp1,5)-q(jp1)))-
+     >                 (ue(jm1,3)*buf(jm1,3)+c2*(ue(jm1,5)-q(jm1))))+
+     >                 yycon1*(buf(jp1,3)-2.0d0*buf(j,3)+buf(jm1,3))+
+     >                 dy3ty1*( ue(jp1,3)-2.0d0*ue(j,3) +ue(jm1,3))
+
+               forcing(4,i,j,k) = forcing(4,i,j,k) - ty2*(
+     >                 ue(jp1,4)*buf(jp1,3)-ue(jm1,4)*buf(jm1,3))+
+     >                 yycon2*(buf(jp1,4)-2.0d0*buf(j,4)+buf(jm1,4))+
+     >                 dy4ty1*( ue(jp1,4)-2.0d0*ue(j,4)+ ue(jm1,4))
+
+               forcing(5,i,j,k) = forcing(5,i,j,k) - ty2*(
+     >                 buf(jp1,3)*(c1*ue(jp1,5)-c2*q(jp1))-
+     >                 buf(jm1,3)*(c1*ue(jm1,5)-c2*q(jm1)))+
+     >                 0.5d0*yycon3*(buf(jp1,1)-2.0d0*buf(j,1)+
+     >                 buf(jm1,1))+
+     >                 yycon4*(cuf(jp1)-2.0d0*cuf(j)+cuf(jm1))+
+     >                 yycon5*(buf(jp1,5)-2.0d0*buf(j,5)+buf(jm1,5))+
+     >                 dy5ty1*(ue(jp1,5)-2.0d0*ue(j,5)+ue(jm1,5))
+            enddo
+
+c---------------------------------------------------------------------
+c     Fourth-order dissipation                      
+c---------------------------------------------------------------------
+            do m = 1, 5
+               j = 1
+               forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                    (5.0d0*ue(j,m) - 4.0d0*ue(j+1,m) +ue(j+2,m))
+               j = 2
+               forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                    (-4.0d0*ue(j-1,m) + 6.0d0*ue(j,m) -
+     >                    4.0d0*ue(j+1,m) +       ue(j+2,m))
+            enddo
+
+            do j = 3, grid_points(2)-4
+               do m = 1, 5
+                  forcing(m,i,j,k) = forcing(m,i,j,k) - dssp*
+     >                    (ue(j-2,m) - 4.0d0*ue(j-1,m) +
+     >                    6.0d0*ue(j,m) - 4.0d0*ue(j+1,m) + ue(j+2,m))
+               enddo
+            enddo
+
+            do m = 1, 5
+               j = grid_points(2)-3
+               forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                    (ue(j-2,m) - 4.0d0*ue(j-1,m) +
+     >                    6.0d0*ue(j,m) - 4.0d0*ue(j+1,m))
+               j = grid_points(2)-2
+               forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                    (ue(j-2,m) - 4.0d0*ue(j-1,m) + 5.0d0*ue(j,m))
+
+            enddo
+
+         enddo
+      enddo
+
+c---------------------------------------------------------------------
+c     zeta-direction flux differences                      
+c---------------------------------------------------------------------
+      do j=1, grid_points(2)-2
+         eta = dble(j) * dnym1
+         do i = 1, grid_points(1)-2
+            xi = dble(i) * dnxm1
+
+            do k=0, grid_points(3)-1
+               zeta = dble(k) * dnzm1
+
+               call exact_solution(xi, eta, zeta, dtemp)
+               do m = 1, 5
+                  ue(k,m) = dtemp(m)
+               enddo
+
+               dtpp = 1.0d0/dtemp(1)
+
+               do m = 2, 5
+                  buf(k,m) = dtpp * dtemp(m)
+               enddo
+
+               cuf(k)   = buf(k,4) * buf(k,4)
+               buf(k,1) = cuf(k) + buf(k,2) * buf(k,2) + 
+     >                 buf(k,3) * buf(k,3)
+               q(k) = 0.5d0*(buf(k,2)*ue(k,2) + buf(k,3)*ue(k,3) +
+     >                 buf(k,4)*ue(k,4))
+            enddo
+
+            do k=1, grid_points(3)-2
+               km1 = k-1
+               kp1 = k+1
+                  
+               forcing(1,i,j,k) = forcing(1,i,j,k) -
+     >                 tz2*( ue(kp1,4)-ue(km1,4) )+
+     >                 dz1tz1*(ue(kp1,1)-2.0d0*ue(k,1)+ue(km1,1))
+
+               forcing(2,i,j,k) = forcing(2,i,j,k) - tz2 * (
+     >                 ue(kp1,2)*buf(kp1,4)-ue(km1,2)*buf(km1,4))+
+     >                 zzcon2*(buf(kp1,2)-2.0d0*buf(k,2)+buf(km1,2))+
+     >                 dz2tz1*( ue(kp1,2)-2.0d0* ue(k,2)+ ue(km1,2))
+
+               forcing(3,i,j,k) = forcing(3,i,j,k) - tz2 * (
+     >                 ue(kp1,3)*buf(kp1,4)-ue(km1,3)*buf(km1,4))+
+     >                 zzcon2*(buf(kp1,3)-2.0d0*buf(k,3)+buf(km1,3))+
+     >                 dz3tz1*(ue(kp1,3)-2.0d0*ue(k,3)+ue(km1,3))
+
+               forcing(4,i,j,k) = forcing(4,i,j,k) - tz2 * (
+     >                 (ue(kp1,4)*buf(kp1,4)+c2*(ue(kp1,5)-q(kp1)))-
+     >                 (ue(km1,4)*buf(km1,4)+c2*(ue(km1,5)-q(km1))))+
+     >                 zzcon1*(buf(kp1,4)-2.0d0*buf(k,4)+buf(km1,4))+
+     >                 dz4tz1*( ue(kp1,4)-2.0d0*ue(k,4) +ue(km1,4))
+
+               forcing(5,i,j,k) = forcing(5,i,j,k) - tz2 * (
+     >                 buf(kp1,4)*(c1*ue(kp1,5)-c2*q(kp1))-
+     >                 buf(km1,4)*(c1*ue(km1,5)-c2*q(km1)))+
+     >                 0.5d0*zzcon3*(buf(kp1,1)-2.0d0*buf(k,1)
+     >                 +buf(km1,1))+
+     >                 zzcon4*(cuf(kp1)-2.0d0*cuf(k)+cuf(km1))+
+     >                 zzcon5*(buf(kp1,5)-2.0d0*buf(k,5)+buf(km1,5))+
+     >                 dz5tz1*( ue(kp1,5)-2.0d0*ue(k,5)+ ue(km1,5))
+            enddo
+
+c---------------------------------------------------------------------
+c     Fourth-order dissipation                        
+c---------------------------------------------------------------------
+            do m = 1, 5
+               k = 1
+               forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                    (5.0d0*ue(k,m) - 4.0d0*ue(k+1,m) +ue(k+2,m))
+               k = 2
+               forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                    (-4.0d0*ue(k-1,m) + 6.0d0*ue(k,m) -
+     >                    4.0d0*ue(k+1,m) +       ue(k+2,m))
+            enddo
+
+            do k = 3, grid_points(3)-4
+               do m = 1, 5
+                  forcing(m,i,j,k) = forcing(m,i,j,k) - dssp*
+     >                    (ue(k-2,m) - 4.0d0*ue(k-1,m) +
+     >                    6.0d0*ue(k,m) - 4.0d0*ue(k+1,m) + ue(k+2,m))
+               enddo
+            enddo
+
+            do m = 1, 5
+               k = grid_points(3)-3
+               forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                    (ue(k-2,m) - 4.0d0*ue(k-1,m) +
+     >                    6.0d0*ue(k,m) - 4.0d0*ue(k+1,m))
+               k = grid_points(3)-2
+               forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                    (ue(k-2,m) - 4.0d0*ue(k-1,m) + 5.0d0*ue(k,m))
+            enddo
+
+         enddo
+      enddo
+
+c---------------------------------------------------------------------
+c     now change the sign of the forcing function
+c---------------------------------------------------------------------
+      do k = 1, grid_points(3)-2
+         do j = 1, grid_points(2)-2
+            do i = 1, grid_points(1)-2
+               do m = 1, 5
+                  forcing(m,i,j,k) = -1.d0 * forcing(m,i,j,k)
+               enddo
+            enddo
+         enddo
+      enddo
+
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/exact_solution.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/exact_solution.f
new file mode 100644
index 0000000..b093b46
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/exact_solution.f
@@ -0,0 +1,29 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine exact_solution(xi,eta,zeta,dtemp)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     this subroutine returns the exact solution at point xi, eta, zeta
+c---------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision  xi, eta, zeta, dtemp(5)
+      integer m
+
+      do m = 1, 5
+         dtemp(m) =  ce(m,1) +
+     >     xi*(ce(m,2) + xi*(ce(m,5) + xi*(ce(m,8) + xi*ce(m,11)))) +
+     >     eta*(ce(m,3) + eta*(ce(m,6) + eta*(ce(m,9) + eta*ce(m,12))))+
+     >     zeta*(ce(m,4) + zeta*(ce(m,7) + zeta*(ce(m,10) + 
+     >     zeta*ce(m,13))))
+      enddo
+
+      return
+      end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/header.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/header.h
new file mode 100644
index 0000000..e771ef0
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/header.h
@@ -0,0 +1,105 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+c
+c  header.h
+c
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+ 
+      implicit none
+
+c---------------------------------------------------------------------
+c The following include file is generated automatically by the
+c "setparams" utility. It defines 
+c      maxcells:      the square root of the maximum number of processors
+c      problem_size:  12, 64, 102, 162 (for class T, A, B, C)
+c      dt_default:    default time step for this problem size if no
+c                     config file
+c      niter_default: default number of iterations for this problem size
+c---------------------------------------------------------------------
+
+      include 'npbparams.h'
+
+      integer           aa, bb, cc, BLOCK_SIZE
+      parameter        (aa=1, bb=2, cc=3, BLOCK_SIZE=5)
+
+      integer           grid_points(3)
+      double precision  elapsed_time
+      logical           timeron
+      common /global/   elapsed_time, grid_points, timeron
+
+      double precision  tx1, tx2, tx3, ty1, ty2, ty3, tz1, tz2, tz3, 
+     >                  dx1, dx2, dx3, dx4, dx5, dy1, dy2, dy3, dy4, 
+     >                  dy5, dz1, dz2, dz3, dz4, dz5, dssp, dt, 
+     >                  ce(5,13), dxmax, dymax, dzmax, xxcon1, xxcon2, 
+     >                  xxcon3, xxcon4, xxcon5, dx1tx1, dx2tx1, dx3tx1,
+     >                  dx4tx1, dx5tx1, yycon1, yycon2, yycon3, yycon4,
+     >                  yycon5, dy1ty1, dy2ty1, dy3ty1, dy4ty1, dy5ty1,
+     >                  zzcon1, zzcon2, zzcon3, zzcon4, zzcon5, dz1tz1, 
+     >                  dz2tz1, dz3tz1, dz4tz1, dz5tz1, dnxm1, dnym1, 
+     >                  dnzm1, c1c2, c1c5, c3c4, c1345, conz1, c1, c2, 
+     >                  c3, c4, c5, c4dssp, c5dssp, dtdssp, dttx1,
+     >                  dttx2, dtty1, dtty2, dttz1, dttz2, c2dttx1, 
+     >                  c2dtty1, c2dttz1, comz1, comz4, comz5, comz6, 
+     >                  c3c4tx3, c3c4ty3, c3c4tz3, c2iv, con43, con16
+
+      common /constants/ tx1, tx2, tx3, ty1, ty2, ty3, tz1, tz2, tz3,
+     >                  dx1, dx2, dx3, dx4, dx5, dy1, dy2, dy3, dy4, 
+     >                  dy5, dz1, dz2, dz3, dz4, dz5, dssp, dt, 
+     >                  ce, dxmax, dymax, dzmax, xxcon1, xxcon2, 
+     >                  xxcon3, xxcon4, xxcon5, dx1tx1, dx2tx1, dx3tx1,
+     >                  dx4tx1, dx5tx1, yycon1, yycon2, yycon3, yycon4,
+     >                  yycon5, dy1ty1, dy2ty1, dy3ty1, dy4ty1, dy5ty1,
+     >                  zzcon1, zzcon2, zzcon3, zzcon4, zzcon5, dz1tz1, 
+     >                  dz2tz1, dz3tz1, dz4tz1, dz5tz1, dnxm1, dnym1, 
+     >                  dnzm1, c1c2, c1c5, c3c4, c1345, conz1, c1, c2, 
+     >                  c3, c4, c5, c4dssp, c5dssp, dtdssp, dttx1,
+     >                  dttx2, dtty1, dtty2, dttz1, dttz2, c2dttx1, 
+     >                  c2dtty1, c2dttz1, comz1, comz4, comz5, comz6, 
+     >                  c3c4tx3, c3c4ty3, c3c4tz3, c2iv, con43, con16
+
+      integer IMAX, JMAX, KMAX, IMAXP, JMAXP
+
+      parameter (IMAX=problem_size,JMAX=problem_size,KMAX=problem_size)
+      parameter (IMAXP=IMAX/2*2,JMAXP=JMAX/2*2)
+
+c
+c   to improve cache performance, grid dimensions padded by 1 
+c   for even number sizes only.
+c
+      double precision 
+     >   us      (   0:IMAXP, 0:JMAXP, 0:KMAX-1),
+     >   vs      (   0:IMAXP, 0:JMAXP, 0:KMAX-1),
+     >   ws      (   0:IMAXP, 0:JMAXP, 0:KMAX-1),
+     >   qs      (   0:IMAXP, 0:JMAXP, 0:KMAX-1),
+     >   rho_i   (   0:IMAXP, 0:JMAXP, 0:KMAX-1),
+     >   square  (   0:IMAXP, 0:JMAXP, 0:KMAX-1),
+     >   forcing (5, 0:IMAXP, 0:JMAXP, 0:KMAX-1),
+     >   u       (5, 0:IMAXP, 0:JMAXP, 0:KMAX-1),
+     >   rhs     (5, 0:IMAXP, 0:JMAXP, 0:KMAX-1)
+      common /fields/  u, us, vs, ws, qs, rho_i, square, 
+     >                 rhs, forcing
+
+      double precision cuf(0:problem_size),   q  (0:problem_size),
+     >                 ue (0:problem_size,5), buf(0:problem_size,5)
+      common /work_1d/ cuf, q, ue, buf
+      
+
+c-----------------------------------------------------------------------
+c   Timer constants
+c-----------------------------------------------------------------------
+      integer t_rhsx,t_rhsy,t_rhsz,t_xsolve,t_ysolve,t_zsolve,
+     >        t_rdis1,t_rdis2,t_add,
+     >        t_rhs,t_last,t_total
+      parameter (t_total = 1)
+      parameter (t_rhsx = 2)
+      parameter (t_rhsy = 3)
+      parameter (t_rhsz = 4)
+      parameter (t_rhs = 5)
+      parameter (t_xsolve = 6)
+      parameter (t_ysolve = 7)
+      parameter (t_zsolve = 8)
+      parameter (t_rdis1 = 9)
+      parameter (t_rdis2 = 10)
+      parameter (t_add = 11)
+      parameter (t_last = 11)
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/initialize.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/initialize.f
new file mode 100644
index 0000000..b3c98fd
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/initialize.f
@@ -0,0 +1,228 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine  initialize
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     This subroutine initializes the field variable u using 
+c     tri-linear transfinite interpolation of the boundary values     
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      
+      integer i, j, k, m, ix, iy, iz
+      double precision  xi, eta, zeta, Pface(5,3,2), Pxi, Peta, 
+     >     Pzeta, temp(5)
+
+c---------------------------------------------------------------------
+c  Later (in compute_rhs) we compute 1/u for every element. A few of 
+c  the corner elements are not used, but it is convenient (and faster) 
+c  to compute the whole thing with a simple loop. Make sure those 
+c  values are nonzero by initializing the whole thing here. 
+c---------------------------------------------------------------------
+      do k = 0, grid_points(3)-1
+         do j = 0, grid_points(2)-1
+            do i = 0, grid_points(1)-1
+               do m = 1, 5
+                  u(m,i,j,k) = 1.0d0
+               end do
+            end do
+         end do
+      end do
+c---------------------------------------------------------------------
+
+
+
+c---------------------------------------------------------------------
+c     first store the "interpolated" values everywhere on the grid    
+c---------------------------------------------------------------------
+
+      do k = 0, grid_points(3)-1
+         zeta = dble(k) * dnzm1
+         do j = 0, grid_points(2)-1
+            eta = dble(j) * dnym1
+            do i = 0, grid_points(1)-1
+               xi = dble(i) * dnxm1
+                  
+               do ix = 1, 2
+                  call exact_solution(dble(ix-1), eta, zeta, 
+     >                    Pface(1,1,ix))
+               enddo
+
+               do iy = 1, 2
+                  call exact_solution(xi, dble(iy-1) , zeta, 
+     >                    Pface(1,2,iy))
+               enddo
+
+               do iz = 1, 2
+                  call exact_solution(xi, eta, dble(iz-1),   
+     >                    Pface(1,3,iz))
+               enddo
+
+               do m = 1, 5
+                  Pxi   = xi   * Pface(m,1,2) + 
+     >                    (1.0d0-xi)   * Pface(m,1,1)
+                  Peta  = eta  * Pface(m,2,2) + 
+     >                    (1.0d0-eta)  * Pface(m,2,1)
+                  Pzeta = zeta * Pface(m,3,2) + 
+     >                    (1.0d0-zeta) * Pface(m,3,1)
+                     
+                  u(m,i,j,k) = Pxi + Peta + Pzeta - 
+     >                    Pxi*Peta - Pxi*Pzeta - Peta*Pzeta + 
+     >                    Pxi*Peta*Pzeta
+
+               enddo
+            enddo
+         enddo
+      enddo
+
+c---------------------------------------------------------------------
+c     now store the exact values on the boundaries        
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     west face                                                  
+c---------------------------------------------------------------------
+      i = 0
+      xi = 0.0d0
+      do k = 0, grid_points(3)-1
+         zeta = dble(k) * dnzm1
+         do j = 0, grid_points(2)-1
+            eta = dble(j) * dnym1
+            call exact_solution(xi, eta, zeta, temp)
+            do m = 1, 5
+               u(m,i,j,k) = temp(m)
+            enddo
+         enddo
+      enddo
+
+c---------------------------------------------------------------------
+c     east face                                                      
+c---------------------------------------------------------------------
+
+      i = grid_points(1)-1
+      xi = 1.0d0
+      do k = 0, grid_points(3)-1
+         zeta = dble(k) * dnzm1
+         do j = 0, grid_points(2)-1
+            eta = dble(j) * dnym1
+            call exact_solution(xi, eta, zeta, temp)
+            do m = 1, 5
+               u(m,i,j,k) = temp(m)
+            enddo
+         enddo
+      enddo
+
+c---------------------------------------------------------------------
+c     south face                                                 
+c---------------------------------------------------------------------
+      j = 0
+      eta = 0.0d0
+      do k = 0, grid_points(3)-1
+         zeta = dble(k) * dnzm1
+         do i = 0, grid_points(1)-1
+            xi = dble(i) * dnxm1
+            call exact_solution(xi, eta, zeta, temp)
+            do m = 1, 5
+               u(m,i,j,k) = temp(m)
+            enddo
+         enddo
+      enddo
+
+
+c---------------------------------------------------------------------
+c     north face                                    
+c---------------------------------------------------------------------
+      j = grid_points(2)-1
+      eta = 1.0d0
+      do k = 0, grid_points(3)-1
+         zeta = dble(k) * dnzm1
+         do i = 0, grid_points(1)-1
+            xi = dble(i) * dnxm1
+            call exact_solution(xi, eta, zeta, temp)
+            do m = 1, 5
+               u(m,i,j,k) = temp(m)
+            enddo
+         enddo
+      enddo
+
+c---------------------------------------------------------------------
+c     bottom face                                       
+c---------------------------------------------------------------------
+      k = 0
+      zeta = 0.0d0
+      do j = 0, grid_points(2)-1
+         eta = dble(j) * dnym1
+         do i =0, grid_points(1)-1
+            xi = dble(i) *dnxm1
+            call exact_solution(xi, eta, zeta, temp)
+            do m = 1, 5
+               u(m,i,j,k) = temp(m)
+            enddo
+         enddo
+      enddo
+
+c---------------------------------------------------------------------
+c     top face     
+c---------------------------------------------------------------------
+      k = grid_points(3)-1
+      zeta = 1.0d0
+      do j = 0, grid_points(2)-1
+         eta = dble(j) * dnym1
+         do i =0, grid_points(1)-1
+            xi = dble(i) * dnxm1
+            call exact_solution(xi, eta, zeta, temp)
+            do m = 1, 5
+               u(m,i,j,k) = temp(m)
+            enddo
+         enddo
+      enddo
+
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine lhsinit(lhs, size)
+      implicit none
+      integer size
+      double precision lhs(5,5,3,0:size)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      integer i, m, n
+
+      i = size
+c---------------------------------------------------------------------
+c     zero the whole left hand side for starters
+c---------------------------------------------------------------------
+      do m = 1, 5
+         do n = 1, 5
+            lhs(m,n,1,0) = 0.0d0
+            lhs(m,n,2,0) = 0.0d0
+            lhs(m,n,3,0) = 0.0d0
+            lhs(m,n,1,i) = 0.0d0
+            lhs(m,n,2,i) = 0.0d0
+            lhs(m,n,3,i) = 0.0d0
+         enddo
+      enddo
+
+c---------------------------------------------------------------------
+c     next, set all diagonal values to 1. This is overkill, but convenient
+c---------------------------------------------------------------------
+      do m = 1, 5
+         lhs(m,m,2,0) = 1.0d0
+         lhs(m,m,2,i) = 1.0d0
+      enddo
+
+      return
+      end
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/inputbt.data.sample b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/inputbt.data.sample
new file mode 100644
index 0000000..d47ca91
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/inputbt.data.sample
@@ -0,0 +1,3 @@
+60       number of time steps
+0.01d0   dt for class A = 0.0008d0. class B = 0.0003d0  class C = 0.0001d0
+12 12 12
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/rhs.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/rhs.f
new file mode 100644
index 0000000..df3142f
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/rhs.f
@@ -0,0 +1,401 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine compute_rhs
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      include 'header.h'
+
+      integer i, j, k, m
+      double precision rho_inv, uijk, up1, um1, vijk, vp1, vm1,
+     >     wijk, wp1, wm1
+
+
+      if (timeron) call timer_start(t_rhs)
+c---------------------------------------------------------------------
+c     compute the reciprocal of density, the velocity components,
+c     and the kinetic energy terms (square and qs).
+c---------------------------------------------------------------------
+      do k = 0, grid_points(3)-1
+         do j = 0, grid_points(2)-1
+            do i = 0, grid_points(1)-1
+               rho_inv = 1.0d0/u(1,i,j,k)
+               rho_i(i,j,k) = rho_inv
+               us(i,j,k) = u(2,i,j,k) * rho_inv
+               vs(i,j,k) = u(3,i,j,k) * rho_inv
+               ws(i,j,k) = u(4,i,j,k) * rho_inv
+               square(i,j,k)     = 0.5d0* (
+     >                 u(2,i,j,k)*u(2,i,j,k) + 
+     >                 u(3,i,j,k)*u(3,i,j,k) +
+     >                 u(4,i,j,k)*u(4,i,j,k) ) * rho_inv
+               qs(i,j,k) = square(i,j,k) * rho_inv
+            enddo
+         enddo
+      enddo
+
+c---------------------------------------------------------------------
+c copy the exact forcing term to the right hand side;  because 
+c this forcing term is known, we can store it on the whole grid
+c including the boundary                   
+c---------------------------------------------------------------------
+
+      do k = 0, grid_points(3)-1
+         do j = 0, grid_points(2)-1
+            do i = 0, grid_points(1)-1
+               do m = 1, 5
+                  rhs(m,i,j,k) = forcing(m,i,j,k)
+               enddo
+            enddo
+         enddo
+      enddo
+
+
+      if (timeron) call timer_start(t_rhsx)
+c---------------------------------------------------------------------
+c     compute xi-direction fluxes 
+c---------------------------------------------------------------------
+      do k = 1, grid_points(3)-2
+         do j = 1, grid_points(2)-2
+            do i = 1, grid_points(1)-2
+               uijk = us(i,j,k)
+               up1  = us(i+1,j,k)
+               um1  = us(i-1,j,k)
+
+               rhs(1,i,j,k) = rhs(1,i,j,k) + dx1tx1 * 
+     >                 (u(1,i+1,j,k) - 2.0d0*u(1,i,j,k) + 
+     >                 u(1,i-1,j,k)) -
+     >                 tx2 * (u(2,i+1,j,k) - u(2,i-1,j,k))
+
+               rhs(2,i,j,k) = rhs(2,i,j,k) + dx2tx1 * 
+     >                 (u(2,i+1,j,k) - 2.0d0*u(2,i,j,k) + 
+     >                 u(2,i-1,j,k)) +
+     >                 xxcon2*con43 * (up1 - 2.0d0*uijk + um1) -
+     >                 tx2 * (u(2,i+1,j,k)*up1 - 
+     >                 u(2,i-1,j,k)*um1 +
+     >                 (u(5,i+1,j,k)- square(i+1,j,k)-
+     >                 u(5,i-1,j,k)+ square(i-1,j,k))*
+     >                 c2)
+
+               rhs(3,i,j,k) = rhs(3,i,j,k) + dx3tx1 * 
+     >                 (u(3,i+1,j,k) - 2.0d0*u(3,i,j,k) +
+     >                 u(3,i-1,j,k)) +
+     >                 xxcon2 * (vs(i+1,j,k) - 2.0d0*vs(i,j,k) +
+     >                 vs(i-1,j,k)) -
+     >                 tx2 * (u(3,i+1,j,k)*up1 - 
+     >                 u(3,i-1,j,k)*um1)
+
+               rhs(4,i,j,k) = rhs(4,i,j,k) + dx4tx1 * 
+     >                 (u(4,i+1,j,k) - 2.0d0*u(4,i,j,k) +
+     >                 u(4,i-1,j,k)) +
+     >                 xxcon2 * (ws(i+1,j,k) - 2.0d0*ws(i,j,k) +
+     >                 ws(i-1,j,k)) -
+     >                 tx2 * (u(4,i+1,j,k)*up1 - 
+     >                 u(4,i-1,j,k)*um1)
+
+               rhs(5,i,j,k) = rhs(5,i,j,k) + dx5tx1 * 
+     >                 (u(5,i+1,j,k) - 2.0d0*u(5,i,j,k) +
+     >                 u(5,i-1,j,k)) +
+     >                 xxcon3 * (qs(i+1,j,k) - 2.0d0*qs(i,j,k) +
+     >                 qs(i-1,j,k)) +
+     >                 xxcon4 * (up1*up1 -       2.0d0*uijk*uijk + 
+     >                 um1*um1) +
+     >                 xxcon5 * (u(5,i+1,j,k)*rho_i(i+1,j,k) - 
+     >                 2.0d0*u(5,i,j,k)*rho_i(i,j,k) +
+     >                 u(5,i-1,j,k)*rho_i(i-1,j,k)) -
+     >                 tx2 * ( (c1*u(5,i+1,j,k) - 
+     >                 c2*square(i+1,j,k))*up1 -
+     >                 (c1*u(5,i-1,j,k) - 
+     >                 c2*square(i-1,j,k))*um1 )
+            enddo
+         enddo
+
+c---------------------------------------------------------------------
+c     add fourth order xi-direction dissipation               
+c---------------------------------------------------------------------
+         do j = 1, grid_points(2)-2
+            i = 1
+            do m = 1, 5
+               rhs(m,i,j,k) = rhs(m,i,j,k)- dssp * 
+     >                    ( 5.0d0*u(m,i,j,k) - 4.0d0*u(m,i+1,j,k) +
+     >                    u(m,i+2,j,k))
+            enddo
+
+            i = 2
+            do m = 1, 5
+               rhs(m,i,j,k) = rhs(m,i,j,k) - dssp * 
+     >                    (-4.0d0*u(m,i-1,j,k) + 6.0d0*u(m,i,j,k) -
+     >                    4.0d0*u(m,i+1,j,k) + u(m,i+2,j,k))
+            enddo
+         enddo
+
+         do j = 1, grid_points(2)-2
+            do i = 3,grid_points(1)-4
+               do m = 1, 5
+                  rhs(m,i,j,k) = rhs(m,i,j,k) - dssp * 
+     >                    (  u(m,i-2,j,k) - 4.0d0*u(m,i-1,j,k) + 
+     >                    6.0d0*u(m,i,j,k) - 4.0d0*u(m,i+1,j,k) + 
+     >                    u(m,i+2,j,k) )
+               enddo
+            enddo
+         enddo
+         
+         do j = 1, grid_points(2)-2
+            i = grid_points(1)-3
+            do m = 1, 5
+               rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *
+     >                    ( u(m,i-2,j,k) - 4.0d0*u(m,i-1,j,k) + 
+     >                    6.0d0*u(m,i,j,k) - 4.0d0*u(m,i+1,j,k) )
+            enddo
+
+            i = grid_points(1)-2
+            do m = 1, 5
+               rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *
+     >                    ( u(m,i-2,j,k) - 4.d0*u(m,i-1,j,k) +
+     >                    5.d0*u(m,i,j,k) )
+            enddo
+         enddo
+      enddo
+      if (timeron) call timer_stop(t_rhsx)
+
+      if (timeron) call timer_start(t_rhsy)
+c---------------------------------------------------------------------
+c     compute eta-direction fluxes 
+c---------------------------------------------------------------------
+      do k = 1, grid_points(3)-2
+         do j = 1, grid_points(2)-2
+            do i = 1, grid_points(1)-2
+               vijk = vs(i,j,k)
+               vp1  = vs(i,j+1,k)
+               vm1  = vs(i,j-1,k)
+               rhs(1,i,j,k) = rhs(1,i,j,k) + dy1ty1 * 
+     >                 (u(1,i,j+1,k) - 2.0d0*u(1,i,j,k) + 
+     >                 u(1,i,j-1,k)) -
+     >                 ty2 * (u(3,i,j+1,k) - u(3,i,j-1,k))
+               rhs(2,i,j,k) = rhs(2,i,j,k) + dy2ty1 * 
+     >                 (u(2,i,j+1,k) - 2.0d0*u(2,i,j,k) + 
+     >                 u(2,i,j-1,k)) +
+     >                 yycon2 * (us(i,j+1,k) - 2.0d0*us(i,j,k) + 
+     >                 us(i,j-1,k)) -
+     >                 ty2 * (u(2,i,j+1,k)*vp1 - 
+     >                 u(2,i,j-1,k)*vm1)
+               rhs(3,i,j,k) = rhs(3,i,j,k) + dy3ty1 * 
+     >                 (u(3,i,j+1,k) - 2.0d0*u(3,i,j,k) + 
+     >                 u(3,i,j-1,k)) +
+     >                 yycon2*con43 * (vp1 - 2.0d0*vijk + vm1) -
+     >                 ty2 * (u(3,i,j+1,k)*vp1 - 
+     >                 u(3,i,j-1,k)*vm1 +
+     >                 (u(5,i,j+1,k) - square(i,j+1,k) - 
+     >                 u(5,i,j-1,k) + square(i,j-1,k))
+     >                 *c2)
+               rhs(4,i,j,k) = rhs(4,i,j,k) + dy4ty1 * 
+     >                 (u(4,i,j+1,k) - 2.0d0*u(4,i,j,k) + 
+     >                 u(4,i,j-1,k)) +
+     >                 yycon2 * (ws(i,j+1,k) - 2.0d0*ws(i,j,k) + 
+     >                 ws(i,j-1,k)) -
+     >                 ty2 * (u(4,i,j+1,k)*vp1 - 
+     >                 u(4,i,j-1,k)*vm1)
+               rhs(5,i,j,k) = rhs(5,i,j,k) + dy5ty1 * 
+     >                 (u(5,i,j+1,k) - 2.0d0*u(5,i,j,k) + 
+     >                 u(5,i,j-1,k)) +
+     >                 yycon3 * (qs(i,j+1,k) - 2.0d0*qs(i,j,k) + 
+     >                 qs(i,j-1,k)) +
+     >                 yycon4 * (vp1*vp1       - 2.0d0*vijk*vijk + 
+     >                 vm1*vm1) +
+     >                 yycon5 * (u(5,i,j+1,k)*rho_i(i,j+1,k) - 
+     >                 2.0d0*u(5,i,j,k)*rho_i(i,j,k) +
+     >                 u(5,i,j-1,k)*rho_i(i,j-1,k)) -
+     >                 ty2 * ((c1*u(5,i,j+1,k) - 
+     >                 c2*square(i,j+1,k)) * vp1 -
+     >                 (c1*u(5,i,j-1,k) - 
+     >                 c2*square(i,j-1,k)) * vm1)
+            enddo
+         enddo
+
+c---------------------------------------------------------------------
+c     add fourth order eta-direction dissipation         
+c---------------------------------------------------------------------
+         j = 1
+         do i = 1, grid_points(1)-2
+            do m = 1, 5
+               rhs(m,i,j,k) = rhs(m,i,j,k)- dssp * 
+     >                    ( 5.0d0*u(m,i,j,k) - 4.0d0*u(m,i,j+1,k) +
+     >                    u(m,i,j+2,k))
+            enddo
+         enddo
+
+         j = 2
+         do i = 1, grid_points(1)-2
+            do m = 1, 5
+               rhs(m,i,j,k) = rhs(m,i,j,k) - dssp * 
+     >                    (-4.0d0*u(m,i,j-1,k) + 6.0d0*u(m,i,j,k) -
+     >                    4.0d0*u(m,i,j+1,k) + u(m,i,j+2,k))
+            enddo
+         enddo
+
+         do j = 3, grid_points(2)-4
+            do i = 1,grid_points(1)-2
+               do m = 1, 5
+                  rhs(m,i,j,k) = rhs(m,i,j,k) - dssp * 
+     >                    (  u(m,i,j-2,k) - 4.0d0*u(m,i,j-1,k) + 
+     >                    6.0d0*u(m,i,j,k) - 4.0d0*u(m,i,j+1,k) +
+     >                    u(m,i,j+2,k) )
+               enddo
+            enddo
+         enddo
+         
+         j = grid_points(2)-3
+         do i = 1, grid_points(1)-2
+            do m = 1, 5
+               rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *
+     >                    ( u(m,i,j-2,k) - 4.0d0*u(m,i,j-1,k) + 
+     >                    6.0d0*u(m,i,j,k) - 4.0d0*u(m,i,j+1,k) )
+            enddo
+         enddo
+
+         j = grid_points(2)-2
+         do i = 1, grid_points(1)-2
+            do m = 1, 5
+               rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *
+     >                    ( u(m,i,j-2,k) - 4.d0*u(m,i,j-1,k) +
+     >                    5.d0*u(m,i,j,k) )
+            enddo
+         enddo
+      enddo
+      if (timeron) call timer_stop(t_rhsy)
+
+      if (timeron) call timer_start(t_rhsz)
+c---------------------------------------------------------------------
+c     compute zeta-direction fluxes 
+c---------------------------------------------------------------------
+      do k = 1, grid_points(3)-2
+         do j = 1, grid_points(2)-2
+            do i = 1, grid_points(1)-2
+               wijk = ws(i,j,k)
+               wp1  = ws(i,j,k+1)
+               wm1  = ws(i,j,k-1)
+
+               rhs(1,i,j,k) = rhs(1,i,j,k) + dz1tz1 * 
+     >                 (u(1,i,j,k+1) - 2.0d0*u(1,i,j,k) + 
+     >                 u(1,i,j,k-1)) -
+     >                 tz2 * (u(4,i,j,k+1) - u(4,i,j,k-1))
+               rhs(2,i,j,k) = rhs(2,i,j,k) + dz2tz1 * 
+     >                 (u(2,i,j,k+1) - 2.0d0*u(2,i,j,k) + 
+     >                 u(2,i,j,k-1)) +
+     >                 zzcon2 * (us(i,j,k+1) - 2.0d0*us(i,j,k) + 
+     >                 us(i,j,k-1)) -
+     >                 tz2 * (u(2,i,j,k+1)*wp1 - 
+     >                 u(2,i,j,k-1)*wm1)
+               rhs(3,i,j,k) = rhs(3,i,j,k) + dz3tz1 * 
+     >                 (u(3,i,j,k+1) - 2.0d0*u(3,i,j,k) + 
+     >                 u(3,i,j,k-1)) +
+     >                 zzcon2 * (vs(i,j,k+1) - 2.0d0*vs(i,j,k) + 
+     >                 vs(i,j,k-1)) -
+     >                 tz2 * (u(3,i,j,k+1)*wp1 - 
+     >                 u(3,i,j,k-1)*wm1)
+               rhs(4,i,j,k) = rhs(4,i,j,k) + dz4tz1 * 
+     >                 (u(4,i,j,k+1) - 2.0d0*u(4,i,j,k) + 
+     >                 u(4,i,j,k-1)) +
+     >                 zzcon2*con43 * (wp1 - 2.0d0*wijk + wm1) -
+     >                 tz2 * (u(4,i,j,k+1)*wp1 - 
+     >                 u(4,i,j,k-1)*wm1 +
+     >                 (u(5,i,j,k+1) - square(i,j,k+1) - 
+     >                 u(5,i,j,k-1) + square(i,j,k-1))
+     >                 *c2)
+               rhs(5,i,j,k) = rhs(5,i,j,k) + dz5tz1 * 
+     >                 (u(5,i,j,k+1) - 2.0d0*u(5,i,j,k) + 
+     >                 u(5,i,j,k-1)) +
+     >                 zzcon3 * (qs(i,j,k+1) - 2.0d0*qs(i,j,k) + 
+     >                 qs(i,j,k-1)) +
+     >                 zzcon4 * (wp1*wp1 - 2.0d0*wijk*wijk + 
+     >                 wm1*wm1) +
+     >                 zzcon5 * (u(5,i,j,k+1)*rho_i(i,j,k+1) - 
+     >                 2.0d0*u(5,i,j,k)*rho_i(i,j,k) +
+     >                 u(5,i,j,k-1)*rho_i(i,j,k-1)) -
+     >                 tz2 * ( (c1*u(5,i,j,k+1) - 
+     >                 c2*square(i,j,k+1))*wp1 -
+     >                 (c1*u(5,i,j,k-1) - 
+     >                 c2*square(i,j,k-1))*wm1)
+            enddo
+         enddo
+      enddo
+
+c---------------------------------------------------------------------
+c     add fourth order zeta-direction dissipation                
+c---------------------------------------------------------------------
+      k = 1
+      do j = 1, grid_points(2)-2
+         do i = 1, grid_points(1)-2
+            do m = 1, 5
+               rhs(m,i,j,k) = rhs(m,i,j,k)- dssp * 
+     >                    ( 5.0d0*u(m,i,j,k) - 4.0d0*u(m,i,j,k+1) +
+     >                    u(m,i,j,k+2))
+            enddo
+         enddo
+      enddo
+
+      k = 2
+      do j = 1, grid_points(2)-2
+         do i = 1, grid_points(1)-2
+            do m = 1, 5
+               rhs(m,i,j,k) = rhs(m,i,j,k) - dssp * 
+     >                    (-4.0d0*u(m,i,j,k-1) + 6.0d0*u(m,i,j,k) -
+     >                    4.0d0*u(m,i,j,k+1) + u(m,i,j,k+2))
+            enddo
+         enddo
+      enddo
+
+      do k = 3, grid_points(3)-4
+         do j = 1, grid_points(2)-2
+            do i = 1,grid_points(1)-2
+               do m = 1, 5
+                  rhs(m,i,j,k) = rhs(m,i,j,k) - dssp * 
+     >                    (  u(m,i,j,k-2) - 4.0d0*u(m,i,j,k-1) + 
+     >                    6.0d0*u(m,i,j,k) - 4.0d0*u(m,i,j,k+1) +
+     >                    u(m,i,j,k+2) )
+               enddo
+            enddo
+         enddo
+      enddo
+         
+      k = grid_points(3)-3
+      do j = 1, grid_points(2)-2
+         do i = 1, grid_points(1)-2
+            do m = 1, 5
+               rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *
+     >                    ( u(m,i,j,k-2) - 4.0d0*u(m,i,j,k-1) + 
+     >                    6.0d0*u(m,i,j,k) - 4.0d0*u(m,i,j,k+1) )
+            enddo
+         enddo
+      enddo
+
+      k = grid_points(3)-2
+      do j = 1, grid_points(2)-2
+         do i = 1, grid_points(1)-2
+            do m = 1, 5
+               rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *
+     >                    ( u(m,i,j,k-2) - 4.d0*u(m,i,j,k-1) +
+     >                    5.d0*u(m,i,j,k) )
+            enddo
+         enddo
+      enddo
+      if (timeron) call timer_stop(t_rhsz)
+
+      do k = 1, grid_points(3)-2
+         do j = 1, grid_points(2)-2
+            do i = 1, grid_points(1)-2
+               do m = 1, 5
+                  rhs(m,i,j,k) = rhs(m,i,j,k) * dt
+               enddo
+            enddo
+         enddo
+      enddo
+      if (timeron) call timer_stop(t_rhs)
+
+      return
+      end
+
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/set_constants.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/set_constants.f
new file mode 100644
index 0000000..6492e42
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/set_constants.f
@@ -0,0 +1,200 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine  set_constants
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      
+      ce(1,1)  = 2.0d0
+      ce(1,2)  = 0.0d0
+      ce(1,3)  = 0.0d0
+      ce(1,4)  = 4.0d0
+      ce(1,5)  = 5.0d0
+      ce(1,6)  = 3.0d0
+      ce(1,7)  = 0.5d0
+      ce(1,8)  = 0.02d0
+      ce(1,9)  = 0.01d0
+      ce(1,10) = 0.03d0
+      ce(1,11) = 0.5d0
+      ce(1,12) = 0.4d0
+      ce(1,13) = 0.3d0
+      
+      ce(2,1)  = 1.0d0
+      ce(2,2)  = 0.0d0
+      ce(2,3)  = 0.0d0
+      ce(2,4)  = 0.0d0
+      ce(2,5)  = 1.0d0
+      ce(2,6)  = 2.0d0
+      ce(2,7)  = 3.0d0
+      ce(2,8)  = 0.01d0
+      ce(2,9)  = 0.03d0
+      ce(2,10) = 0.02d0
+      ce(2,11) = 0.4d0
+      ce(2,12) = 0.3d0
+      ce(2,13) = 0.5d0
+
+      ce(3,1)  = 2.0d0
+      ce(3,2)  = 2.0d0
+      ce(3,3)  = 0.0d0
+      ce(3,4)  = 0.0d0
+      ce(3,5)  = 0.0d0
+      ce(3,6)  = 2.0d0
+      ce(3,7)  = 3.0d0
+      ce(3,8)  = 0.04d0
+      ce(3,9)  = 0.03d0
+      ce(3,10) = 0.05d0
+      ce(3,11) = 0.3d0
+      ce(3,12) = 0.5d0
+      ce(3,13) = 0.4d0
+
+      ce(4,1)  = 2.0d0
+      ce(4,2)  = 2.0d0
+      ce(4,3)  = 0.0d0
+      ce(4,4)  = 0.0d0
+      ce(4,5)  = 0.0d0
+      ce(4,6)  = 2.0d0
+      ce(4,7)  = 3.0d0
+      ce(4,8)  = 0.03d0
+      ce(4,9)  = 0.05d0
+      ce(4,10) = 0.04d0
+      ce(4,11) = 0.2d0
+      ce(4,12) = 0.1d0
+      ce(4,13) = 0.3d0
+
+      ce(5,1)  = 5.0d0
+      ce(5,2)  = 4.0d0
+      ce(5,3)  = 3.0d0
+      ce(5,4)  = 2.0d0
+      ce(5,5)  = 0.1d0
+      ce(5,6)  = 0.4d0
+      ce(5,7)  = 0.3d0
+      ce(5,8)  = 0.05d0
+      ce(5,9)  = 0.04d0
+      ce(5,10) = 0.03d0
+      ce(5,11) = 0.1d0
+      ce(5,12) = 0.3d0
+      ce(5,13) = 0.2d0
+
+      c1 = 1.4d0
+      c2 = 0.4d0
+      c3 = 0.1d0
+      c4 = 1.0d0
+      c5 = 1.4d0
+
+      dnxm1 = 1.0d0 / dble(grid_points(1)-1)
+      dnym1 = 1.0d0 / dble(grid_points(2)-1)
+      dnzm1 = 1.0d0 / dble(grid_points(3)-1)
+
+      c1c2 = c1 * c2
+      c1c5 = c1 * c5
+      c3c4 = c3 * c4
+      c1345 = c1c5 * c3c4
+
+      conz1 = (1.0d0-c1c5)
+
+      tx1 = 1.0d0 / (dnxm1 * dnxm1)
+      tx2 = 1.0d0 / (2.0d0 * dnxm1)
+      tx3 = 1.0d0 / dnxm1
+
+      ty1 = 1.0d0 / (dnym1 * dnym1)
+      ty2 = 1.0d0 / (2.0d0 * dnym1)
+      ty3 = 1.0d0 / dnym1
+      
+      tz1 = 1.0d0 / (dnzm1 * dnzm1)
+      tz2 = 1.0d0 / (2.0d0 * dnzm1)
+      tz3 = 1.0d0 / dnzm1
+
+      dx1 = 0.75d0
+      dx2 = 0.75d0
+      dx3 = 0.75d0
+      dx4 = 0.75d0
+      dx5 = 0.75d0
+
+      dy1 = 0.75d0
+      dy2 = 0.75d0
+      dy3 = 0.75d0
+      dy4 = 0.75d0
+      dy5 = 0.75d0
+
+      dz1 = 1.0d0
+      dz2 = 1.0d0
+      dz3 = 1.0d0
+      dz4 = 1.0d0
+      dz5 = 1.0d0
+
+      dxmax = dmax1(dx3, dx4)
+      dymax = dmax1(dy2, dy4)
+      dzmax = dmax1(dz2, dz3)
+
+      dssp = 0.25d0 * dmax1(dx1, dmax1(dy1, dz1) )
+
+      c4dssp = 4.0d0 * dssp
+      c5dssp = 5.0d0 * dssp
+
+      dttx1 = dt*tx1
+      dttx2 = dt*tx2
+      dtty1 = dt*ty1
+      dtty2 = dt*ty2
+      dttz1 = dt*tz1
+      dttz2 = dt*tz2
+
+      c2dttx1 = 2.0d0*dttx1
+      c2dtty1 = 2.0d0*dtty1
+      c2dttz1 = 2.0d0*dttz1
+
+      dtdssp = dt*dssp
+
+      comz1  = dtdssp
+      comz4  = 4.0d0*dtdssp
+      comz5  = 5.0d0*dtdssp
+      comz6  = 6.0d0*dtdssp
+
+      c3c4tx3 = c3c4*tx3
+      c3c4ty3 = c3c4*ty3
+      c3c4tz3 = c3c4*tz3
+
+      dx1tx1 = dx1*tx1
+      dx2tx1 = dx2*tx1
+      dx3tx1 = dx3*tx1
+      dx4tx1 = dx4*tx1
+      dx5tx1 = dx5*tx1
+      
+      dy1ty1 = dy1*ty1
+      dy2ty1 = dy2*ty1
+      dy3ty1 = dy3*ty1
+      dy4ty1 = dy4*ty1
+      dy5ty1 = dy5*ty1
+      
+      dz1tz1 = dz1*tz1
+      dz2tz1 = dz2*tz1
+      dz3tz1 = dz3*tz1
+      dz4tz1 = dz4*tz1
+      dz5tz1 = dz5*tz1
+
+      c2iv  = 2.5d0
+      con43 = 4.0d0/3.0d0
+      con16 = 1.0d0/6.0d0
+      
+      xxcon1 = c3c4tx3*con43*tx3
+      xxcon2 = c3c4tx3*tx3
+      xxcon3 = c3c4tx3*conz1*tx3
+      xxcon4 = c3c4tx3*con16*tx3
+      xxcon5 = c3c4tx3*c1c5*tx3
+
+      yycon1 = c3c4ty3*con43*ty3
+      yycon2 = c3c4ty3*ty3
+      yycon3 = c3c4ty3*conz1*ty3
+      yycon4 = c3c4ty3*con16*ty3
+      yycon5 = c3c4ty3*c1c5*ty3
+
+      zzcon1 = c3c4tz3*con43*tz3
+      zzcon2 = c3c4tz3*tz3
+      zzcon3 = c3c4tz3*conz1*tz3
+      zzcon4 = c3c4tz3*con16*tz3
+      zzcon5 = c3c4tz3*c1c5*tz3
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/solve_subs.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/solve_subs.f
new file mode 100644
index 0000000..b2e5479
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/solve_subs.f
@@ -0,0 +1,642 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine matvec_sub(ablock,avec,bvec)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     computes bvec = bvec - ablock*avec
+c---------------------------------------------------------------------
+
+      implicit none
+
+      double precision ablock,avec,bvec
+      dimension ablock(5,5),avec(5),bvec(5)
+
+c---------------------------------------------------------------------
+c            rhs(i,ic,jc,kc) = rhs(i,ic,jc,kc) 
+c     $           - lhs(i,1,ablock,ia)*
+c---------------------------------------------------------------------
+         bvec(1) = bvec(1) - ablock(1,1)*avec(1)
+     >                     - ablock(1,2)*avec(2)
+     >                     - ablock(1,3)*avec(3)
+     >                     - ablock(1,4)*avec(4)
+     >                     - ablock(1,5)*avec(5)
+         bvec(2) = bvec(2) - ablock(2,1)*avec(1)
+     >                     - ablock(2,2)*avec(2)
+     >                     - ablock(2,3)*avec(3)
+     >                     - ablock(2,4)*avec(4)
+     >                     - ablock(2,5)*avec(5)
+         bvec(3) = bvec(3) - ablock(3,1)*avec(1)
+     >                     - ablock(3,2)*avec(2)
+     >                     - ablock(3,3)*avec(3)
+     >                     - ablock(3,4)*avec(4)
+     >                     - ablock(3,5)*avec(5)
+         bvec(4) = bvec(4) - ablock(4,1)*avec(1)
+     >                     - ablock(4,2)*avec(2)
+     >                     - ablock(4,3)*avec(3)
+     >                     - ablock(4,4)*avec(4)
+     >                     - ablock(4,5)*avec(5)
+         bvec(5) = bvec(5) - ablock(5,1)*avec(1)
+     >                     - ablock(5,2)*avec(2)
+     >                     - ablock(5,3)*avec(3)
+     >                     - ablock(5,4)*avec(4)
+     >                     - ablock(5,5)*avec(5)
+
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine matmul_sub(ablock, bblock, cblock)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     computes cblock = cblock - ablock*bblock (5x5 matrix product)
+c---------------------------------------------------------------------
+
+      implicit none
+
+      double precision ablock, bblock, cblock
+      dimension ablock(5,5), bblock(5,5), cblock(5,5)
+
+
+         cblock(1,1) = cblock(1,1) - ablock(1,1)*bblock(1,1)
+     >                             - ablock(1,2)*bblock(2,1)
+     >                             - ablock(1,3)*bblock(3,1)
+     >                             - ablock(1,4)*bblock(4,1)
+     >                             - ablock(1,5)*bblock(5,1)
+         cblock(2,1) = cblock(2,1) - ablock(2,1)*bblock(1,1)
+     >                             - ablock(2,2)*bblock(2,1)
+     >                             - ablock(2,3)*bblock(3,1)
+     >                             - ablock(2,4)*bblock(4,1)
+     >                             - ablock(2,5)*bblock(5,1)
+         cblock(3,1) = cblock(3,1) - ablock(3,1)*bblock(1,1)
+     >                             - ablock(3,2)*bblock(2,1)
+     >                             - ablock(3,3)*bblock(3,1)
+     >                             - ablock(3,4)*bblock(4,1)
+     >                             - ablock(3,5)*bblock(5,1)
+         cblock(4,1) = cblock(4,1) - ablock(4,1)*bblock(1,1)
+     >                             - ablock(4,2)*bblock(2,1)
+     >                             - ablock(4,3)*bblock(3,1)
+     >                             - ablock(4,4)*bblock(4,1)
+     >                             - ablock(4,5)*bblock(5,1)
+         cblock(5,1) = cblock(5,1) - ablock(5,1)*bblock(1,1)
+     >                             - ablock(5,2)*bblock(2,1)
+     >                             - ablock(5,3)*bblock(3,1)
+     >                             - ablock(5,4)*bblock(4,1)
+     >                             - ablock(5,5)*bblock(5,1)
+         cblock(1,2) = cblock(1,2) - ablock(1,1)*bblock(1,2)
+     >                             - ablock(1,2)*bblock(2,2)
+     >                             - ablock(1,3)*bblock(3,2)
+     >                             - ablock(1,4)*bblock(4,2)
+     >                             - ablock(1,5)*bblock(5,2)
+         cblock(2,2) = cblock(2,2) - ablock(2,1)*bblock(1,2)
+     >                             - ablock(2,2)*bblock(2,2)
+     >                             - ablock(2,3)*bblock(3,2)
+     >                             - ablock(2,4)*bblock(4,2)
+     >                             - ablock(2,5)*bblock(5,2)
+         cblock(3,2) = cblock(3,2) - ablock(3,1)*bblock(1,2)
+     >                             - ablock(3,2)*bblock(2,2)
+     >                             - ablock(3,3)*bblock(3,2)
+     >                             - ablock(3,4)*bblock(4,2)
+     >                             - ablock(3,5)*bblock(5,2)
+         cblock(4,2) = cblock(4,2) - ablock(4,1)*bblock(1,2)
+     >                             - ablock(4,2)*bblock(2,2)
+     >                             - ablock(4,3)*bblock(3,2)
+     >                             - ablock(4,4)*bblock(4,2)
+     >                             - ablock(4,5)*bblock(5,2)
+         cblock(5,2) = cblock(5,2) - ablock(5,1)*bblock(1,2)
+     >                             - ablock(5,2)*bblock(2,2)
+     >                             - ablock(5,3)*bblock(3,2)
+     >                             - ablock(5,4)*bblock(4,2)
+     >                             - ablock(5,5)*bblock(5,2)
+         cblock(1,3) = cblock(1,3) - ablock(1,1)*bblock(1,3)
+     >                             - ablock(1,2)*bblock(2,3)
+     >                             - ablock(1,3)*bblock(3,3)
+     >                             - ablock(1,4)*bblock(4,3)
+     >                             - ablock(1,5)*bblock(5,3)
+         cblock(2,3) = cblock(2,3) - ablock(2,1)*bblock(1,3)
+     >                             - ablock(2,2)*bblock(2,3)
+     >                             - ablock(2,3)*bblock(3,3)
+     >                             - ablock(2,4)*bblock(4,3)
+     >                             - ablock(2,5)*bblock(5,3)
+         cblock(3,3) = cblock(3,3) - ablock(3,1)*bblock(1,3)
+     >                             - ablock(3,2)*bblock(2,3)
+     >                             - ablock(3,3)*bblock(3,3)
+     >                             - ablock(3,4)*bblock(4,3)
+     >                             - ablock(3,5)*bblock(5,3)
+         cblock(4,3) = cblock(4,3) - ablock(4,1)*bblock(1,3)
+     >                             - ablock(4,2)*bblock(2,3)
+     >                             - ablock(4,3)*bblock(3,3)
+     >                             - ablock(4,4)*bblock(4,3)
+     >                             - ablock(4,5)*bblock(5,3)
+         cblock(5,3) = cblock(5,3) - ablock(5,1)*bblock(1,3)
+     >                             - ablock(5,2)*bblock(2,3)
+     >                             - ablock(5,3)*bblock(3,3)
+     >                             - ablock(5,4)*bblock(4,3)
+     >                             - ablock(5,5)*bblock(5,3)
+         cblock(1,4) = cblock(1,4) - ablock(1,1)*bblock(1,4)
+     >                             - ablock(1,2)*bblock(2,4)
+     >                             - ablock(1,3)*bblock(3,4)
+     >                             - ablock(1,4)*bblock(4,4)
+     >                             - ablock(1,5)*bblock(5,4)
+         cblock(2,4) = cblock(2,4) - ablock(2,1)*bblock(1,4)
+     >                             - ablock(2,2)*bblock(2,4)
+     >                             - ablock(2,3)*bblock(3,4)
+     >                             - ablock(2,4)*bblock(4,4)
+     >                             - ablock(2,5)*bblock(5,4)
+         cblock(3,4) = cblock(3,4) - ablock(3,1)*bblock(1,4)
+     >                             - ablock(3,2)*bblock(2,4)
+     >                             - ablock(3,3)*bblock(3,4)
+     >                             - ablock(3,4)*bblock(4,4)
+     >                             - ablock(3,5)*bblock(5,4)
+         cblock(4,4) = cblock(4,4) - ablock(4,1)*bblock(1,4)
+     >                             - ablock(4,2)*bblock(2,4)
+     >                             - ablock(4,3)*bblock(3,4)
+     >                             - ablock(4,4)*bblock(4,4)
+     >                             - ablock(4,5)*bblock(5,4)
+         cblock(5,4) = cblock(5,4) - ablock(5,1)*bblock(1,4)
+     >                             - ablock(5,2)*bblock(2,4)
+     >                             - ablock(5,3)*bblock(3,4)
+     >                             - ablock(5,4)*bblock(4,4)
+     >                             - ablock(5,5)*bblock(5,4)
+         cblock(1,5) = cblock(1,5) - ablock(1,1)*bblock(1,5)
+     >                             - ablock(1,2)*bblock(2,5)
+     >                             - ablock(1,3)*bblock(3,5)
+     >                             - ablock(1,4)*bblock(4,5)
+     >                             - ablock(1,5)*bblock(5,5)
+         cblock(2,5) = cblock(2,5) - ablock(2,1)*bblock(1,5)
+     >                             - ablock(2,2)*bblock(2,5)
+     >                             - ablock(2,3)*bblock(3,5)
+     >                             - ablock(2,4)*bblock(4,5)
+     >                             - ablock(2,5)*bblock(5,5)
+         cblock(3,5) = cblock(3,5) - ablock(3,1)*bblock(1,5)
+     >                             - ablock(3,2)*bblock(2,5)
+     >                             - ablock(3,3)*bblock(3,5)
+     >                             - ablock(3,4)*bblock(4,5)
+     >                             - ablock(3,5)*bblock(5,5)
+         cblock(4,5) = cblock(4,5) - ablock(4,1)*bblock(1,5)
+     >                             - ablock(4,2)*bblock(2,5)
+     >                             - ablock(4,3)*bblock(3,5)
+     >                             - ablock(4,4)*bblock(4,5)
+     >                             - ablock(4,5)*bblock(5,5)
+         cblock(5,5) = cblock(5,5) - ablock(5,1)*bblock(1,5)
+     >                             - ablock(5,2)*bblock(2,5)
+     >                             - ablock(5,3)*bblock(3,5)
+     >                             - ablock(5,4)*bblock(4,5)
+     >                             - ablock(5,5)*bblock(5,5)
+
+              
+      return
+      end
+
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine binvcrhs( lhs,c,r )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     in place: c = inv(lhs)*c, r = inv(lhs)*r (lhs overwritten)
+c---------------------------------------------------------------------
+
+      implicit none
+
+      double precision pivot, coeff, lhs
+      dimension lhs(5,5)
+      double precision c(5,5), r(5)
+
+c---------------------------------------------------------------------
+c     unrolled Gauss-Jordan elimination without row pivoting
+c---------------------------------------------------------------------
+
+      pivot = 1.00d0/lhs(1,1)
+      lhs(1,2) = lhs(1,2)*pivot
+      lhs(1,3) = lhs(1,3)*pivot
+      lhs(1,4) = lhs(1,4)*pivot
+      lhs(1,5) = lhs(1,5)*pivot
+      c(1,1) = c(1,1)*pivot
+      c(1,2) = c(1,2)*pivot
+      c(1,3) = c(1,3)*pivot
+      c(1,4) = c(1,4)*pivot
+      c(1,5) = c(1,5)*pivot
+      r(1)   = r(1)  *pivot
+
+      coeff = lhs(2,1)
+      lhs(2,2)= lhs(2,2) - coeff*lhs(1,2)
+      lhs(2,3)= lhs(2,3) - coeff*lhs(1,3)
+      lhs(2,4)= lhs(2,4) - coeff*lhs(1,4)
+      lhs(2,5)= lhs(2,5) - coeff*lhs(1,5)
+      c(2,1) = c(2,1) - coeff*c(1,1)
+      c(2,2) = c(2,2) - coeff*c(1,2)
+      c(2,3) = c(2,3) - coeff*c(1,3)
+      c(2,4) = c(2,4) - coeff*c(1,4)
+      c(2,5) = c(2,5) - coeff*c(1,5)
+      r(2)   = r(2)   - coeff*r(1)
+
+      coeff = lhs(3,1)
+      lhs(3,2)= lhs(3,2) - coeff*lhs(1,2)
+      lhs(3,3)= lhs(3,3) - coeff*lhs(1,3)
+      lhs(3,4)= lhs(3,4) - coeff*lhs(1,4)
+      lhs(3,5)= lhs(3,5) - coeff*lhs(1,5)
+      c(3,1) = c(3,1) - coeff*c(1,1)
+      c(3,2) = c(3,2) - coeff*c(1,2)
+      c(3,3) = c(3,3) - coeff*c(1,3)
+      c(3,4) = c(3,4) - coeff*c(1,4)
+      c(3,5) = c(3,5) - coeff*c(1,5)
+      r(3)   = r(3)   - coeff*r(1)
+
+      coeff = lhs(4,1)
+      lhs(4,2)= lhs(4,2) - coeff*lhs(1,2)
+      lhs(4,3)= lhs(4,3) - coeff*lhs(1,3)
+      lhs(4,4)= lhs(4,4) - coeff*lhs(1,4)
+      lhs(4,5)= lhs(4,5) - coeff*lhs(1,5)
+      c(4,1) = c(4,1) - coeff*c(1,1)
+      c(4,2) = c(4,2) - coeff*c(1,2)
+      c(4,3) = c(4,3) - coeff*c(1,3)
+      c(4,4) = c(4,4) - coeff*c(1,4)
+      c(4,5) = c(4,5) - coeff*c(1,5)
+      r(4)   = r(4)   - coeff*r(1)
+
+      coeff = lhs(5,1)
+      lhs(5,2)= lhs(5,2) - coeff*lhs(1,2)
+      lhs(5,3)= lhs(5,3) - coeff*lhs(1,3)
+      lhs(5,4)= lhs(5,4) - coeff*lhs(1,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(1,5)
+      c(5,1) = c(5,1) - coeff*c(1,1)
+      c(5,2) = c(5,2) - coeff*c(1,2)
+      c(5,3) = c(5,3) - coeff*c(1,3)
+      c(5,4) = c(5,4) - coeff*c(1,4)
+      c(5,5) = c(5,5) - coeff*c(1,5)
+      r(5)   = r(5)   - coeff*r(1)
+
+
+      pivot = 1.00d0/lhs(2,2)
+      lhs(2,3) = lhs(2,3)*pivot
+      lhs(2,4) = lhs(2,4)*pivot
+      lhs(2,5) = lhs(2,5)*pivot
+      c(2,1) = c(2,1)*pivot
+      c(2,2) = c(2,2)*pivot
+      c(2,3) = c(2,3)*pivot
+      c(2,4) = c(2,4)*pivot
+      c(2,5) = c(2,5)*pivot
+      r(2)   = r(2)  *pivot
+
+      coeff = lhs(1,2)
+      lhs(1,3)= lhs(1,3) - coeff*lhs(2,3)
+      lhs(1,4)= lhs(1,4) - coeff*lhs(2,4)
+      lhs(1,5)= lhs(1,5) - coeff*lhs(2,5)
+      c(1,1) = c(1,1) - coeff*c(2,1)
+      c(1,2) = c(1,2) - coeff*c(2,2)
+      c(1,3) = c(1,3) - coeff*c(2,3)
+      c(1,4) = c(1,4) - coeff*c(2,4)
+      c(1,5) = c(1,5) - coeff*c(2,5)
+      r(1)   = r(1)   - coeff*r(2)
+
+      coeff = lhs(3,2)
+      lhs(3,3)= lhs(3,3) - coeff*lhs(2,3)
+      lhs(3,4)= lhs(3,4) - coeff*lhs(2,4)
+      lhs(3,5)= lhs(3,5) - coeff*lhs(2,5)
+      c(3,1) = c(3,1) - coeff*c(2,1)
+      c(3,2) = c(3,2) - coeff*c(2,2)
+      c(3,3) = c(3,3) - coeff*c(2,3)
+      c(3,4) = c(3,4) - coeff*c(2,4)
+      c(3,5) = c(3,5) - coeff*c(2,5)
+      r(3)   = r(3)   - coeff*r(2)
+
+      coeff = lhs(4,2)
+      lhs(4,3)= lhs(4,3) - coeff*lhs(2,3)
+      lhs(4,4)= lhs(4,4) - coeff*lhs(2,4)
+      lhs(4,5)= lhs(4,5) - coeff*lhs(2,5)
+      c(4,1) = c(4,1) - coeff*c(2,1)
+      c(4,2) = c(4,2) - coeff*c(2,2)
+      c(4,3) = c(4,3) - coeff*c(2,3)
+      c(4,4) = c(4,4) - coeff*c(2,4)
+      c(4,5) = c(4,5) - coeff*c(2,5)
+      r(4)   = r(4)   - coeff*r(2)
+
+      coeff = lhs(5,2)
+      lhs(5,3)= lhs(5,3) - coeff*lhs(2,3)
+      lhs(5,4)= lhs(5,4) - coeff*lhs(2,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(2,5)
+      c(5,1) = c(5,1) - coeff*c(2,1)
+      c(5,2) = c(5,2) - coeff*c(2,2)
+      c(5,3) = c(5,3) - coeff*c(2,3)
+      c(5,4) = c(5,4) - coeff*c(2,4)
+      c(5,5) = c(5,5) - coeff*c(2,5)
+      r(5)   = r(5)   - coeff*r(2)
+
+
+      pivot = 1.00d0/lhs(3,3)
+      lhs(3,4) = lhs(3,4)*pivot
+      lhs(3,5) = lhs(3,5)*pivot
+      c(3,1) = c(3,1)*pivot
+      c(3,2) = c(3,2)*pivot
+      c(3,3) = c(3,3)*pivot
+      c(3,4) = c(3,4)*pivot
+      c(3,5) = c(3,5)*pivot
+      r(3)   = r(3)  *pivot
+
+      coeff = lhs(1,3)
+      lhs(1,4)= lhs(1,4) - coeff*lhs(3,4)
+      lhs(1,5)= lhs(1,5) - coeff*lhs(3,5)
+      c(1,1) = c(1,1) - coeff*c(3,1)
+      c(1,2) = c(1,2) - coeff*c(3,2)
+      c(1,3) = c(1,3) - coeff*c(3,3)
+      c(1,4) = c(1,4) - coeff*c(3,4)
+      c(1,5) = c(1,5) - coeff*c(3,5)
+      r(1)   = r(1)   - coeff*r(3)
+
+      coeff = lhs(2,3)
+      lhs(2,4)= lhs(2,4) - coeff*lhs(3,4)
+      lhs(2,5)= lhs(2,5) - coeff*lhs(3,5)
+      c(2,1) = c(2,1) - coeff*c(3,1)
+      c(2,2) = c(2,2) - coeff*c(3,2)
+      c(2,3) = c(2,3) - coeff*c(3,3)
+      c(2,4) = c(2,4) - coeff*c(3,4)
+      c(2,5) = c(2,5) - coeff*c(3,5)
+      r(2)   = r(2)   - coeff*r(3)
+
+      coeff = lhs(4,3)
+      lhs(4,4)= lhs(4,4) - coeff*lhs(3,4)
+      lhs(4,5)= lhs(4,5) - coeff*lhs(3,5)
+      c(4,1) = c(4,1) - coeff*c(3,1)
+      c(4,2) = c(4,2) - coeff*c(3,2)
+      c(4,3) = c(4,3) - coeff*c(3,3)
+      c(4,4) = c(4,4) - coeff*c(3,4)
+      c(4,5) = c(4,5) - coeff*c(3,5)
+      r(4)   = r(4)   - coeff*r(3)
+
+      coeff = lhs(5,3)
+      lhs(5,4)= lhs(5,4) - coeff*lhs(3,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(3,5)
+      c(5,1) = c(5,1) - coeff*c(3,1)
+      c(5,2) = c(5,2) - coeff*c(3,2)
+      c(5,3) = c(5,3) - coeff*c(3,3)
+      c(5,4) = c(5,4) - coeff*c(3,4)
+      c(5,5) = c(5,5) - coeff*c(3,5)
+      r(5)   = r(5)   - coeff*r(3)
+
+
+      pivot = 1.00d0/lhs(4,4)
+      lhs(4,5) = lhs(4,5)*pivot
+      c(4,1) = c(4,1)*pivot
+      c(4,2) = c(4,2)*pivot
+      c(4,3) = c(4,3)*pivot
+      c(4,4) = c(4,4)*pivot
+      c(4,5) = c(4,5)*pivot
+      r(4)   = r(4)  *pivot
+
+      coeff = lhs(1,4)
+      lhs(1,5)= lhs(1,5) - coeff*lhs(4,5)
+      c(1,1) = c(1,1) - coeff*c(4,1)
+      c(1,2) = c(1,2) - coeff*c(4,2)
+      c(1,3) = c(1,3) - coeff*c(4,3)
+      c(1,4) = c(1,4) - coeff*c(4,4)
+      c(1,5) = c(1,5) - coeff*c(4,5)
+      r(1)   = r(1)   - coeff*r(4)
+
+      coeff = lhs(2,4)
+      lhs(2,5)= lhs(2,5) - coeff*lhs(4,5)
+      c(2,1) = c(2,1) - coeff*c(4,1)
+      c(2,2) = c(2,2) - coeff*c(4,2)
+      c(2,3) = c(2,3) - coeff*c(4,3)
+      c(2,4) = c(2,4) - coeff*c(4,4)
+      c(2,5) = c(2,5) - coeff*c(4,5)
+      r(2)   = r(2)   - coeff*r(4)
+
+      coeff = lhs(3,4)
+      lhs(3,5)= lhs(3,5) - coeff*lhs(4,5)
+      c(3,1) = c(3,1) - coeff*c(4,1)
+      c(3,2) = c(3,2) - coeff*c(4,2)
+      c(3,3) = c(3,3) - coeff*c(4,3)
+      c(3,4) = c(3,4) - coeff*c(4,4)
+      c(3,5) = c(3,5) - coeff*c(4,5)
+      r(3)   = r(3)   - coeff*r(4)
+
+      coeff = lhs(5,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(4,5)
+      c(5,1) = c(5,1) - coeff*c(4,1)
+      c(5,2) = c(5,2) - coeff*c(4,2)
+      c(5,3) = c(5,3) - coeff*c(4,3)
+      c(5,4) = c(5,4) - coeff*c(4,4)
+      c(5,5) = c(5,5) - coeff*c(4,5)
+      r(5)   = r(5)   - coeff*r(4)
+
+
+      pivot = 1.00d0/lhs(5,5)
+      c(5,1) = c(5,1)*pivot
+      c(5,2) = c(5,2)*pivot
+      c(5,3) = c(5,3)*pivot
+      c(5,4) = c(5,4)*pivot
+      c(5,5) = c(5,5)*pivot
+      r(5)   = r(5)  *pivot
+
+      coeff = lhs(1,5)
+      c(1,1) = c(1,1) - coeff*c(5,1)
+      c(1,2) = c(1,2) - coeff*c(5,2)
+      c(1,3) = c(1,3) - coeff*c(5,3)
+      c(1,4) = c(1,4) - coeff*c(5,4)
+      c(1,5) = c(1,5) - coeff*c(5,5)
+      r(1)   = r(1)   - coeff*r(5)
+
+      coeff = lhs(2,5)
+      c(2,1) = c(2,1) - coeff*c(5,1)
+      c(2,2) = c(2,2) - coeff*c(5,2)
+      c(2,3) = c(2,3) - coeff*c(5,3)
+      c(2,4) = c(2,4) - coeff*c(5,4)
+      c(2,5) = c(2,5) - coeff*c(5,5)
+      r(2)   = r(2)   - coeff*r(5)
+
+      coeff = lhs(3,5)
+      c(3,1) = c(3,1) - coeff*c(5,1)
+      c(3,2) = c(3,2) - coeff*c(5,2)
+      c(3,3) = c(3,3) - coeff*c(5,3)
+      c(3,4) = c(3,4) - coeff*c(5,4)
+      c(3,5) = c(3,5) - coeff*c(5,5)
+      r(3)   = r(3)   - coeff*r(5)
+
+      coeff = lhs(4,5)
+      c(4,1) = c(4,1) - coeff*c(5,1)
+      c(4,2) = c(4,2) - coeff*c(5,2)
+      c(4,3) = c(4,3) - coeff*c(5,3)
+      c(4,4) = c(4,4) - coeff*c(5,4)
+      c(4,5) = c(4,5) - coeff*c(5,5)
+      r(4)   = r(4)   - coeff*r(5)
+
+
+      return
+      end
+
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine binvrhs( lhs,r )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     
+c---------------------------------------------------------------------
+
+      implicit none
+
+      double precision pivot, coeff, lhs
+      dimension lhs(5,5)
+      double precision r(5)
+
+c---------------------------------------------------------------------
+c     
+c---------------------------------------------------------------------
+
+
+      pivot = 1.00d0/lhs(1,1)
+      lhs(1,2) = lhs(1,2)*pivot
+      lhs(1,3) = lhs(1,3)*pivot
+      lhs(1,4) = lhs(1,4)*pivot
+      lhs(1,5) = lhs(1,5)*pivot
+      r(1)   = r(1)  *pivot
+
+      coeff = lhs(2,1)
+      lhs(2,2)= lhs(2,2) - coeff*lhs(1,2)
+      lhs(2,3)= lhs(2,3) - coeff*lhs(1,3)
+      lhs(2,4)= lhs(2,4) - coeff*lhs(1,4)
+      lhs(2,5)= lhs(2,5) - coeff*lhs(1,5)
+      r(2)   = r(2)   - coeff*r(1)
+
+      coeff = lhs(3,1)
+      lhs(3,2)= lhs(3,2) - coeff*lhs(1,2)
+      lhs(3,3)= lhs(3,3) - coeff*lhs(1,3)
+      lhs(3,4)= lhs(3,4) - coeff*lhs(1,4)
+      lhs(3,5)= lhs(3,5) - coeff*lhs(1,5)
+      r(3)   = r(3)   - coeff*r(1)
+
+      coeff = lhs(4,1)
+      lhs(4,2)= lhs(4,2) - coeff*lhs(1,2)
+      lhs(4,3)= lhs(4,3) - coeff*lhs(1,3)
+      lhs(4,4)= lhs(4,4) - coeff*lhs(1,4)
+      lhs(4,5)= lhs(4,5) - coeff*lhs(1,5)
+      r(4)   = r(4)   - coeff*r(1)
+
+      coeff = lhs(5,1)
+      lhs(5,2)= lhs(5,2) - coeff*lhs(1,2)
+      lhs(5,3)= lhs(5,3) - coeff*lhs(1,3)
+      lhs(5,4)= lhs(5,4) - coeff*lhs(1,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(1,5)
+      r(5)   = r(5)   - coeff*r(1)
+
+
+      pivot = 1.00d0/lhs(2,2)
+      lhs(2,3) = lhs(2,3)*pivot
+      lhs(2,4) = lhs(2,4)*pivot
+      lhs(2,5) = lhs(2,5)*pivot
+      r(2)   = r(2)  *pivot
+
+      coeff = lhs(1,2)
+      lhs(1,3)= lhs(1,3) - coeff*lhs(2,3)
+      lhs(1,4)= lhs(1,4) - coeff*lhs(2,4)
+      lhs(1,5)= lhs(1,5) - coeff*lhs(2,5)
+      r(1)   = r(1)   - coeff*r(2)
+
+      coeff = lhs(3,2)
+      lhs(3,3)= lhs(3,3) - coeff*lhs(2,3)
+      lhs(3,4)= lhs(3,4) - coeff*lhs(2,4)
+      lhs(3,5)= lhs(3,5) - coeff*lhs(2,5)
+      r(3)   = r(3)   - coeff*r(2)
+
+      coeff = lhs(4,2)
+      lhs(4,3)= lhs(4,3) - coeff*lhs(2,3)
+      lhs(4,4)= lhs(4,4) - coeff*lhs(2,4)
+      lhs(4,5)= lhs(4,5) - coeff*lhs(2,5)
+      r(4)   = r(4)   - coeff*r(2)
+
+      coeff = lhs(5,2)
+      lhs(5,3)= lhs(5,3) - coeff*lhs(2,3)
+      lhs(5,4)= lhs(5,4) - coeff*lhs(2,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(2,5)
+      r(5)   = r(5)   - coeff*r(2)
+
+
+      pivot = 1.00d0/lhs(3,3)
+      lhs(3,4) = lhs(3,4)*pivot
+      lhs(3,5) = lhs(3,5)*pivot
+      r(3)   = r(3)  *pivot
+
+      coeff = lhs(1,3)
+      lhs(1,4)= lhs(1,4) - coeff*lhs(3,4)
+      lhs(1,5)= lhs(1,5) - coeff*lhs(3,5)
+      r(1)   = r(1)   - coeff*r(3)
+
+      coeff = lhs(2,3)
+      lhs(2,4)= lhs(2,4) - coeff*lhs(3,4)
+      lhs(2,5)= lhs(2,5) - coeff*lhs(3,5)
+      r(2)   = r(2)   - coeff*r(3)
+
+      coeff = lhs(4,3)
+      lhs(4,4)= lhs(4,4) - coeff*lhs(3,4)
+      lhs(4,5)= lhs(4,5) - coeff*lhs(3,5)
+      r(4)   = r(4)   - coeff*r(3)
+
+      coeff = lhs(5,3)
+      lhs(5,4)= lhs(5,4) - coeff*lhs(3,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(3,5)
+      r(5)   = r(5)   - coeff*r(3)
+
+
+      pivot = 1.00d0/lhs(4,4)
+      lhs(4,5) = lhs(4,5)*pivot
+      r(4)   = r(4)  *pivot
+
+      coeff = lhs(1,4)
+      lhs(1,5)= lhs(1,5) - coeff*lhs(4,5)
+      r(1)   = r(1)   - coeff*r(4)
+
+      coeff = lhs(2,4)
+      lhs(2,5)= lhs(2,5) - coeff*lhs(4,5)
+      r(2)   = r(2)   - coeff*r(4)
+
+      coeff = lhs(3,4)
+      lhs(3,5)= lhs(3,5) - coeff*lhs(4,5)
+      r(3)   = r(3)   - coeff*r(4)
+
+      coeff = lhs(5,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(4,5)
+      r(5)   = r(5)   - coeff*r(4)
+
+
+      pivot = 1.00d0/lhs(5,5)
+      r(5)   = r(5)  *pivot
+
+      coeff = lhs(1,5)
+      r(1)   = r(1)   - coeff*r(5)
+
+      coeff = lhs(2,5)
+      r(2)   = r(2)   - coeff*r(5)
+
+      coeff = lhs(3,5)
+      r(3)   = r(3)   - coeff*r(5)
+
+      coeff = lhs(4,5)
+      r(4)   = r(4)   - coeff*r(5)
+
+
+      return
+      end
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/verify.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/verify.f
new file mode 100644
index 0000000..52551bf
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/verify.f
@@ -0,0 +1,358 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+        subroutine verify(no_time_steps, class, verified)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c  verification routine                         
+c---------------------------------------------------------------------
+
+        include 'header.h'
+
+        double precision xcrref(5),xceref(5),xcrdif(5),xcedif(5), 
+     >                   epsilon, xce(5), xcr(5), dtref
+        integer m, no_time_steps
+        character class
+        logical verified
+
+c---------------------------------------------------------------------
+c   tolerance level
+c---------------------------------------------------------------------
+        epsilon = 1.0d-08
+
+
+c---------------------------------------------------------------------
+c   compute the error norm and the residual norm
+c---------------------------------------------------------------------
+        call error_norm(xce)
+        call compute_rhs
+
+        call rhs_norm(xcr)
+
+        do m = 1, 5
+           xcr(m) = xcr(m) / dt
+        enddo
+
+
+        class = 'U'
+        verified = .true.
+
+        do m = 1,5
+           xcrref(m) = 1.0
+           xceref(m) = 1.0
+        end do
+
+c---------------------------------------------------------------------
+c    reference data for 12X12X12 grids after 60 time steps, with DT = 1.0d-02
+c---------------------------------------------------------------------
+        if ( (grid_points(1)  .eq. 12     ) .and. 
+     >       (grid_points(2)  .eq. 12     ) .and.
+     >       (grid_points(3)  .eq. 12     ) .and.
+     >       (no_time_steps   .eq. 60    ))  then
+
+           class = 'S'
+           dtref = 1.0d-2
+
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+         xcrref(1) = 1.7034283709541311d-01
+         xcrref(2) = 1.2975252070034097d-02
+         xcrref(3) = 3.2527926989486055d-02
+         xcrref(4) = 2.6436421275166801d-02
+         xcrref(5) = 1.9211784131744430d-01
+
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+         xceref(1) = 4.9976913345811579d-04
+         xceref(2) = 4.5195666782961927d-05
+         xceref(3) = 7.3973765172921357d-05
+         xceref(4) = 7.3821238632439731d-05
+         xceref(5) = 8.9269630987491446d-04
+
+c---------------------------------------------------------------------
+c    reference data for 24X24X24 grids after 200 time steps, with DT = 0.8d-3
+c---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 24) .and. 
+     >           (grid_points(2) .eq. 24) .and.
+     >           (grid_points(3) .eq. 24) .and.
+     >           (no_time_steps . eq. 200) ) then
+
+           class = 'W'
+           dtref = 0.8d-3
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+           xcrref(1) = 0.1125590409344d+03
+           xcrref(2) = 0.1180007595731d+02
+           xcrref(3) = 0.2710329767846d+02
+           xcrref(4) = 0.2469174937669d+02
+           xcrref(5) = 0.2638427874317d+03
+
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+           xceref(1) = 0.4419655736008d+01
+           xceref(2) = 0.4638531260002d+00
+           xceref(3) = 0.1011551749967d+01
+           xceref(4) = 0.9235878729944d+00
+           xceref(5) = 0.1018045837718d+02
+
+
+c---------------------------------------------------------------------
+c    reference data for 64X64X64 grids after 200 time steps, with DT = 0.8d-3
+c---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 64) .and. 
+     >           (grid_points(2) .eq. 64) .and.
+     >           (grid_points(3) .eq. 64) .and.
+     >           (no_time_steps . eq. 200) ) then
+
+           class = 'A'
+           dtref = 0.8d-3
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+         xcrref(1) = 1.0806346714637264d+02
+         xcrref(2) = 1.1319730901220813d+01
+         xcrref(3) = 2.5974354511582465d+01
+         xcrref(4) = 2.3665622544678910d+01
+         xcrref(5) = 2.5278963211748344d+02
+
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+         xceref(1) = 4.2348416040525025d+00
+         xceref(2) = 4.4390282496995698d-01
+         xceref(3) = 9.6692480136345650d-01
+         xceref(4) = 8.8302063039765474d-01
+         xceref(5) = 9.7379901770829278d+00
+
+c---------------------------------------------------------------------
+c    reference data for 102X102X102 grids after 200 time steps,
+c    with DT = 3.0d-04
+c---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 102) .and. 
+     >           (grid_points(2) .eq. 102) .and.
+     >           (grid_points(3) .eq. 102) .and.
+     >           (no_time_steps . eq. 200) ) then
+
+           class = 'B'
+           dtref = 3.0d-4
+
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+         xcrref(1) = 1.4233597229287254d+03
+         xcrref(2) = 9.9330522590150238d+01
+         xcrref(3) = 3.5646025644535285d+02
+         xcrref(4) = 3.2485447959084092d+02
+         xcrref(5) = 3.2707541254659363d+03
+
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+         xceref(1) = 5.2969847140936856d+01
+         xceref(2) = 4.4632896115670668d+00
+         xceref(3) = 1.3122573342210174d+01
+         xceref(4) = 1.2006925323559144d+01
+         xceref(5) = 1.2459576151035986d+02
+
+c---------------------------------------------------------------------
+c    reference data for 162X162X162 grids after 200 time steps,
+c    with DT = 1.0d-04
+c---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 162) .and. 
+     >           (grid_points(2) .eq. 162) .and.
+     >           (grid_points(3) .eq. 162) .and.
+     >           (no_time_steps . eq. 200) ) then
+
+           class = 'C'
+           dtref = 1.0d-4
+
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+         xcrref(1) = 0.62398116551764615d+04
+         xcrref(2) = 0.50793239190423964d+03
+         xcrref(3) = 0.15423530093013596d+04
+         xcrref(4) = 0.13302387929291190d+04
+         xcrref(5) = 0.11604087428436455d+05
+
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+         xceref(1) = 0.16462008369091265d+03
+         xceref(2) = 0.11497107903824313d+02
+         xceref(3) = 0.41207446207461508d+02
+         xceref(4) = 0.37087651059694167d+02
+         xceref(5) = 0.36211053051841265d+03
+
+c---------------------------------------------------------------------
+c    reference data for 408x408x408 grids after 250 time steps,
+c    with DT = 0.2d-04
+c---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 408) .and. 
+     >           (grid_points(2) .eq. 408) .and.
+     >           (grid_points(3) .eq. 408) .and.
+     >           (no_time_steps . eq. 250) ) then
+
+           class = 'D'
+           dtref = 0.2d-4
+
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+         xcrref(1) = 0.2533188551738d+05
+         xcrref(2) = 0.2346393716980d+04
+         xcrref(3) = 0.6294554366904d+04
+         xcrref(4) = 0.5352565376030d+04
+         xcrref(5) = 0.3905864038618d+05
+
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+
+         xceref(1) = 0.3100009377557d+03
+         xceref(2) = 0.2424086324913d+02
+         xceref(3) = 0.7782212022645d+02
+         xceref(4) = 0.6835623860116d+02
+         xceref(5) = 0.6065737200368d+03
+
+
+c---------------------------------------------------------------------
+c    reference data for 1020x1020x1020 grids after 250 time steps,
+c    with DT = 0.4d-05
+c---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 1020) .and. 
+     >           (grid_points(2) .eq. 1020) .and.
+     >           (grid_points(3) .eq. 1020) .and.
+     >           (no_time_steps . eq. 250) ) then
+
+           class = 'E'
+           dtref = 0.4d-5
+
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+         xcrref(1) = 0.9795372484517d+05
+         xcrref(2) = 0.9739814511521d+04
+         xcrref(3) = 0.2467606342965d+05
+         xcrref(4) = 0.2092419572860d+05
+         xcrref(5) = 0.1392138856939d+06
+
+c---------------------------------------------------------------------
+c  Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+
+         xceref(1) = 0.4327562208414d+03
+         xceref(2) = 0.3699051964887d+02
+         xceref(3) = 0.1089845040954d+03
+         xceref(4) = 0.9462517622043d+02
+         xceref(5) = 0.7765512765309d+03
+
+
+        else
+           verified = .false.
+        endif
+
+c---------------------------------------------------------------------
+c    verification test for residuals if gridsize is one of 
+c    the defined grid sizes above (class .ne. 'U')
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c    Compute the difference of solution values and the known reference values.
+c---------------------------------------------------------------------
+        do m = 1, 5
+           
+           xcrdif(m) = dabs((xcr(m)-xcrref(m))/xcrref(m)) 
+           xcedif(m) = dabs((xce(m)-xceref(m))/xceref(m))
+           
+        enddo
+
+c---------------------------------------------------------------------
+c    Output the comparison of computed results to known cases.
+c---------------------------------------------------------------------
+
+        if (class .ne. 'U') then
+           write(*, 1990) class
+ 1990      format(' Verification being performed for class ', a)
+           write (*,2000) epsilon
+ 2000      format(' accuracy setting for epsilon = ', E20.13)
+           verified = (dabs(dt-dtref) .le. epsilon)
+           if (.not.verified) then  
+              class = 'U'
+              write (*,1000) dtref
+ 1000         format(' DT does not match the reference value of ', 
+     >                 E15.8)
+           endif
+        else 
+           write(*, 1995)
+ 1995      format(' Unknown class')
+        endif
+
+
+        if (class .ne. 'U') then
+           write (*, 2001) 
+        else
+           write (*, 2005)
+        endif
+
+ 2001   format(' Comparison of RMS-norms of residual')
+ 2005   format(' RMS-norms of residual')
+        do m = 1, 5
+           if (class .eq. 'U') then
+              write(*, 2015) m, xcr(m)
+           else if (xcrdif(m) .le. epsilon) then
+              write (*,2011) m,xcr(m),xcrref(m),xcrdif(m)
+           else 
+              verified = .false.
+              write (*,2010) m,xcr(m),xcrref(m),xcrdif(m)
+           endif
+        enddo
+
+        if (class .ne. 'U') then
+           write (*,2002)
+        else
+           write (*,2006)
+        endif
+ 2002   format(' Comparison of RMS-norms of solution error')
+ 2006   format(' RMS-norms of solution error')
+        
+        do m = 1, 5
+           if (class .eq. 'U') then
+              write(*, 2015) m, xce(m)
+           else if (xcedif(m) .le. epsilon) then
+              write (*,2011) m,xce(m),xceref(m),xcedif(m)
+           else
+              verified = .false.
+              write (*,2010) m,xce(m),xceref(m),xcedif(m)
+           endif
+        enddo
+        
+ 2010   format(' FAILURE: ', i2, E20.13, E20.13, E20.13)
+ 2011   format('          ', i2, E20.13, E20.13, E20.13)
+ 2015   format('          ', i2, E20.13)
+        
+        if (class .eq. 'U') then
+           write(*, 2022)
+           write(*, 2023)
+ 2022      format(' No reference values provided')
+ 2023      format(' No verification performed')
+        else if (verified) then
+           write(*, 2020)
+ 2020      format(' Verification Successful')
+        else
+           write(*, 2021)
+ 2021      format(' Verification failed')
+        endif
+
+        return
+
+
+        end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/work_lhs.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/work_lhs.h
new file mode 100644
index 0000000..d3c499a
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/work_lhs.h
@@ -0,0 +1,13 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+c
+c  work_lhs.h
+c
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+c
+      double precision fjac(5, 5,    0:problem_size),
+     >                 njac(5, 5,    0:problem_size),
+     >                 lhs (5, 5, 3, 0:problem_size),
+     >                 tmp1, tmp2, tmp3
+      common /work_lhs/ fjac, njac, lhs, tmp1, tmp2, tmp3
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/work_lhs_vec.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/work_lhs_vec.h
new file mode 100644
index 0000000..2ee2be7
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/work_lhs_vec.h
@@ -0,0 +1,13 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+c
+c  work_lhs_vec.h
+c
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+c
+      double precision fjac(5, 5,    0:problem_size, 0:problem_size),
+     >                 njac(5, 5,    0:problem_size, 0:problem_size),
+     >                 lhs (5, 5, 3, 0:problem_size, 0:problem_size),
+     >                 tmp1, tmp2, tmp3
+      common /work_lhs/ fjac, njac, lhs, tmp1, tmp2, tmp3
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/x_solve.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/x_solve.f
new file mode 100644
index 0000000..3b761d2
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/x_solve.f
@@ -0,0 +1,399 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine x_solve
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     
+c     Performs line solves in X direction by first factoring
+c     the block-tridiagonal matrix into an upper triangular matrix, 
+c     and then performing back substitution to solve for the unknown
+c     vectors of each line.  
+c     
+c     Make sure we treat elements zero to cell_size in the direction
+c     of the sweep.
+c     
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'work_lhs.h'
+
+      integer i,j,k,m,n,isize
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      if (timeron) call timer_start(t_xsolve)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     This function computes the left hand side in the xi-direction
+c---------------------------------------------------------------------
+
+      isize = grid_points(1)-1
+
+c---------------------------------------------------------------------
+c     determine a (labeled f) and n jacobians
+c---------------------------------------------------------------------
+      do k = 1, grid_points(3)-2
+         do j = 1, grid_points(2)-2
+            do i = 0, isize
+
+               tmp1 = rho_i(i,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+c---------------------------------------------------------------------
+c     
+c---------------------------------------------------------------------
+               fjac(1,1,i) = 0.0d+00
+               fjac(1,2,i) = 1.0d+00
+               fjac(1,3,i) = 0.0d+00
+               fjac(1,4,i) = 0.0d+00
+               fjac(1,5,i) = 0.0d+00
+
+               fjac(2,1,i) = -(u(2,i,j,k) * tmp2 * 
+     >              u(2,i,j,k))
+     >              + c2 * qs(i,j,k)
+               fjac(2,2,i) = ( 2.0d+00 - c2 )
+     >              * ( u(2,i,j,k) / u(1,i,j,k) )
+               fjac(2,3,i) = - c2 * ( u(3,i,j,k) * tmp1 )
+               fjac(2,4,i) = - c2 * ( u(4,i,j,k) * tmp1 )
+               fjac(2,5,i) = c2
+
+               fjac(3,1,i) = - ( u(2,i,j,k)*u(3,i,j,k) ) * tmp2
+               fjac(3,2,i) = u(3,i,j,k) * tmp1
+               fjac(3,3,i) = u(2,i,j,k) * tmp1
+               fjac(3,4,i) = 0.0d+00
+               fjac(3,5,i) = 0.0d+00
+
+               fjac(4,1,i) = - ( u(2,i,j,k)*u(4,i,j,k) ) * tmp2
+               fjac(4,2,i) = u(4,i,j,k) * tmp1
+               fjac(4,3,i) = 0.0d+00
+               fjac(4,4,i) = u(2,i,j,k) * tmp1
+               fjac(4,5,i) = 0.0d+00
+
+               fjac(5,1,i) = ( c2 * 2.0d0 * square(i,j,k)
+     >              - c1 * u(5,i,j,k) )
+     >              * ( u(2,i,j,k) * tmp2 )
+               fjac(5,2,i) = c1 *  u(5,i,j,k) * tmp1 
+     >              - c2
+     >              * ( u(2,i,j,k)*u(2,i,j,k) * tmp2
+     >              + qs(i,j,k) )
+               fjac(5,3,i) = - c2 * ( u(3,i,j,k)*u(2,i,j,k) )
+     >              * tmp2
+               fjac(5,4,i) = - c2 * ( u(4,i,j,k)*u(2,i,j,k) )
+     >              * tmp2
+               fjac(5,5,i) = c1 * ( u(2,i,j,k) * tmp1 )
+
+               njac(1,1,i) = 0.0d+00
+               njac(1,2,i) = 0.0d+00
+               njac(1,3,i) = 0.0d+00
+               njac(1,4,i) = 0.0d+00
+               njac(1,5,i) = 0.0d+00
+
+               njac(2,1,i) = - con43 * c3c4 * tmp2 * u(2,i,j,k)
+               njac(2,2,i) =   con43 * c3c4 * tmp1
+               njac(2,3,i) =   0.0d+00
+               njac(2,4,i) =   0.0d+00
+               njac(2,5,i) =   0.0d+00
+
+               njac(3,1,i) = - c3c4 * tmp2 * u(3,i,j,k)
+               njac(3,2,i) =   0.0d+00
+               njac(3,3,i) =   c3c4 * tmp1
+               njac(3,4,i) =   0.0d+00
+               njac(3,5,i) =   0.0d+00
+
+               njac(4,1,i) = - c3c4 * tmp2 * u(4,i,j,k)
+               njac(4,2,i) =   0.0d+00 
+               njac(4,3,i) =   0.0d+00
+               njac(4,4,i) =   c3c4 * tmp1
+               njac(4,5,i) =   0.0d+00
+
+               njac(5,1,i) = - ( con43 * c3c4
+     >              - c1345 ) * tmp3 * (u(2,i,j,k)**2)
+     >              - ( c3c4 - c1345 ) * tmp3 * (u(3,i,j,k)**2)
+     >              - ( c3c4 - c1345 ) * tmp3 * (u(4,i,j,k)**2)
+     >              - c1345 * tmp2 * u(5,i,j,k)
+
+               njac(5,2,i) = ( con43 * c3c4
+     >              - c1345 ) * tmp2 * u(2,i,j,k)
+               njac(5,3,i) = ( c3c4 - c1345 ) * tmp2 * u(3,i,j,k)
+               njac(5,4,i) = ( c3c4 - c1345 ) * tmp2 * u(4,i,j,k)
+               njac(5,5,i) = ( c1345 ) * tmp1
+
+            enddo
+c---------------------------------------------------------------------
+c     now jacobians set, so form left hand side in x direction
+c---------------------------------------------------------------------
+            call lhsinit(lhs, isize)
+            do i = 1, isize-1
+
+               tmp1 = dt * tx1
+               tmp2 = dt * tx2
+
+               lhs(1,1,aa,i) = - tmp2 * fjac(1,1,i-1)
+     >              - tmp1 * njac(1,1,i-1)
+     >              - tmp1 * dx1 
+               lhs(1,2,aa,i) = - tmp2 * fjac(1,2,i-1)
+     >              - tmp1 * njac(1,2,i-1)
+               lhs(1,3,aa,i) = - tmp2 * fjac(1,3,i-1)
+     >              - tmp1 * njac(1,3,i-1)
+               lhs(1,4,aa,i) = - tmp2 * fjac(1,4,i-1)
+     >              - tmp1 * njac(1,4,i-1)
+               lhs(1,5,aa,i) = - tmp2 * fjac(1,5,i-1)
+     >              - tmp1 * njac(1,5,i-1)
+
+               lhs(2,1,aa,i) = - tmp2 * fjac(2,1,i-1)
+     >              - tmp1 * njac(2,1,i-1)
+               lhs(2,2,aa,i) = - tmp2 * fjac(2,2,i-1)
+     >              - tmp1 * njac(2,2,i-1)
+     >              - tmp1 * dx2
+               lhs(2,3,aa,i) = - tmp2 * fjac(2,3,i-1)
+     >              - tmp1 * njac(2,3,i-1)
+               lhs(2,4,aa,i) = - tmp2 * fjac(2,4,i-1)
+     >              - tmp1 * njac(2,4,i-1)
+               lhs(2,5,aa,i) = - tmp2 * fjac(2,5,i-1)
+     >              - tmp1 * njac(2,5,i-1)
+
+               lhs(3,1,aa,i) = - tmp2 * fjac(3,1,i-1)
+     >              - tmp1 * njac(3,1,i-1)
+               lhs(3,2,aa,i) = - tmp2 * fjac(3,2,i-1)
+     >              - tmp1 * njac(3,2,i-1)
+               lhs(3,3,aa,i) = - tmp2 * fjac(3,3,i-1)
+     >              - tmp1 * njac(3,3,i-1)
+     >              - tmp1 * dx3 
+               lhs(3,4,aa,i) = - tmp2 * fjac(3,4,i-1)
+     >              - tmp1 * njac(3,4,i-1)
+               lhs(3,5,aa,i) = - tmp2 * fjac(3,5,i-1)
+     >              - tmp1 * njac(3,5,i-1)
+
+               lhs(4,1,aa,i) = - tmp2 * fjac(4,1,i-1)
+     >              - tmp1 * njac(4,1,i-1)
+               lhs(4,2,aa,i) = - tmp2 * fjac(4,2,i-1)
+     >              - tmp1 * njac(4,2,i-1)
+               lhs(4,3,aa,i) = - tmp2 * fjac(4,3,i-1)
+     >              - tmp1 * njac(4,3,i-1)
+               lhs(4,4,aa,i) = - tmp2 * fjac(4,4,i-1)
+     >              - tmp1 * njac(4,4,i-1)
+     >              - tmp1 * dx4
+               lhs(4,5,aa,i) = - tmp2 * fjac(4,5,i-1)
+     >              - tmp1 * njac(4,5,i-1)
+
+               lhs(5,1,aa,i) = - tmp2 * fjac(5,1,i-1)
+     >              - tmp1 * njac(5,1,i-1)
+               lhs(5,2,aa,i) = - tmp2 * fjac(5,2,i-1)
+     >              - tmp1 * njac(5,2,i-1)
+               lhs(5,3,aa,i) = - tmp2 * fjac(5,3,i-1)
+     >              - tmp1 * njac(5,3,i-1)
+               lhs(5,4,aa,i) = - tmp2 * fjac(5,4,i-1)
+     >              - tmp1 * njac(5,4,i-1)
+               lhs(5,5,aa,i) = - tmp2 * fjac(5,5,i-1)
+     >              - tmp1 * njac(5,5,i-1)
+     >              - tmp1 * dx5
+
+               lhs(1,1,bb,i) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(1,1,i)
+     >              + tmp1 * 2.0d+00 * dx1
+               lhs(1,2,bb,i) = tmp1 * 2.0d+00 * njac(1,2,i)
+               lhs(1,3,bb,i) = tmp1 * 2.0d+00 * njac(1,3,i)
+               lhs(1,4,bb,i) = tmp1 * 2.0d+00 * njac(1,4,i)
+               lhs(1,5,bb,i) = tmp1 * 2.0d+00 * njac(1,5,i)
+
+               lhs(2,1,bb,i) = tmp1 * 2.0d+00 * njac(2,1,i)
+               lhs(2,2,bb,i) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(2,2,i)
+     >              + tmp1 * 2.0d+00 * dx2
+               lhs(2,3,bb,i) = tmp1 * 2.0d+00 * njac(2,3,i)
+               lhs(2,4,bb,i) = tmp1 * 2.0d+00 * njac(2,4,i)
+               lhs(2,5,bb,i) = tmp1 * 2.0d+00 * njac(2,5,i)
+
+               lhs(3,1,bb,i) = tmp1 * 2.0d+00 * njac(3,1,i)
+               lhs(3,2,bb,i) = tmp1 * 2.0d+00 * njac(3,2,i)
+               lhs(3,3,bb,i) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(3,3,i)
+     >              + tmp1 * 2.0d+00 * dx3
+               lhs(3,4,bb,i) = tmp1 * 2.0d+00 * njac(3,4,i)
+               lhs(3,5,bb,i) = tmp1 * 2.0d+00 * njac(3,5,i)
+
+               lhs(4,1,bb,i) = tmp1 * 2.0d+00 * njac(4,1,i)
+               lhs(4,2,bb,i) = tmp1 * 2.0d+00 * njac(4,2,i)
+               lhs(4,3,bb,i) = tmp1 * 2.0d+00 * njac(4,3,i)
+               lhs(4,4,bb,i) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(4,4,i)
+     >              + tmp1 * 2.0d+00 * dx4
+               lhs(4,5,bb,i) = tmp1 * 2.0d+00 * njac(4,5,i)
+
+               lhs(5,1,bb,i) = tmp1 * 2.0d+00 * njac(5,1,i)
+               lhs(5,2,bb,i) = tmp1 * 2.0d+00 * njac(5,2,i)
+               lhs(5,3,bb,i) = tmp1 * 2.0d+00 * njac(5,3,i)
+               lhs(5,4,bb,i) = tmp1 * 2.0d+00 * njac(5,4,i)
+               lhs(5,5,bb,i) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(5,5,i)
+     >              + tmp1 * 2.0d+00 * dx5
+
+               lhs(1,1,cc,i) =  tmp2 * fjac(1,1,i+1)
+     >              - tmp1 * njac(1,1,i+1)
+     >              - tmp1 * dx1
+               lhs(1,2,cc,i) =  tmp2 * fjac(1,2,i+1)
+     >              - tmp1 * njac(1,2,i+1)
+               lhs(1,3,cc,i) =  tmp2 * fjac(1,3,i+1)
+     >              - tmp1 * njac(1,3,i+1)
+               lhs(1,4,cc,i) =  tmp2 * fjac(1,4,i+1)
+     >              - tmp1 * njac(1,4,i+1)
+               lhs(1,5,cc,i) =  tmp2 * fjac(1,5,i+1)
+     >              - tmp1 * njac(1,5,i+1)
+
+               lhs(2,1,cc,i) =  tmp2 * fjac(2,1,i+1)
+     >              - tmp1 * njac(2,1,i+1)
+               lhs(2,2,cc,i) =  tmp2 * fjac(2,2,i+1)
+     >              - tmp1 * njac(2,2,i+1)
+     >              - tmp1 * dx2
+               lhs(2,3,cc,i) =  tmp2 * fjac(2,3,i+1)
+     >              - tmp1 * njac(2,3,i+1)
+               lhs(2,4,cc,i) =  tmp2 * fjac(2,4,i+1)
+     >              - tmp1 * njac(2,4,i+1)
+               lhs(2,5,cc,i) =  tmp2 * fjac(2,5,i+1)
+     >              - tmp1 * njac(2,5,i+1)
+
+               lhs(3,1,cc,i) =  tmp2 * fjac(3,1,i+1)
+     >              - tmp1 * njac(3,1,i+1)
+               lhs(3,2,cc,i) =  tmp2 * fjac(3,2,i+1)
+     >              - tmp1 * njac(3,2,i+1)
+               lhs(3,3,cc,i) =  tmp2 * fjac(3,3,i+1)
+     >              - tmp1 * njac(3,3,i+1)
+     >              - tmp1 * dx3
+               lhs(3,4,cc,i) =  tmp2 * fjac(3,4,i+1)
+     >              - tmp1 * njac(3,4,i+1)
+               lhs(3,5,cc,i) =  tmp2 * fjac(3,5,i+1)
+     >              - tmp1 * njac(3,5,i+1)
+
+               lhs(4,1,cc,i) =  tmp2 * fjac(4,1,i+1)
+     >              - tmp1 * njac(4,1,i+1)
+               lhs(4,2,cc,i) =  tmp2 * fjac(4,2,i+1)
+     >              - tmp1 * njac(4,2,i+1)
+               lhs(4,3,cc,i) =  tmp2 * fjac(4,3,i+1)
+     >              - tmp1 * njac(4,3,i+1)
+               lhs(4,4,cc,i) =  tmp2 * fjac(4,4,i+1)
+     >              - tmp1 * njac(4,4,i+1)
+     >              - tmp1 * dx4
+               lhs(4,5,cc,i) =  tmp2 * fjac(4,5,i+1)
+     >              - tmp1 * njac(4,5,i+1)
+
+               lhs(5,1,cc,i) =  tmp2 * fjac(5,1,i+1)
+     >              - tmp1 * njac(5,1,i+1)
+               lhs(5,2,cc,i) =  tmp2 * fjac(5,2,i+1)
+     >              - tmp1 * njac(5,2,i+1)
+               lhs(5,3,cc,i) =  tmp2 * fjac(5,3,i+1)
+     >              - tmp1 * njac(5,3,i+1)
+               lhs(5,4,cc,i) =  tmp2 * fjac(5,4,i+1)
+     >              - tmp1 * njac(5,4,i+1)
+               lhs(5,5,cc,i) =  tmp2 * fjac(5,5,i+1)
+     >              - tmp1 * njac(5,5,i+1)
+     >              - tmp1 * dx5
+
+            enddo
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     performs Gaussian elimination on this cell.
+c     
+c     assumes that unpacking routines for non-first cells 
+c     preload C' and rhs' from previous cell.
+c     
+c     assumed send happens outside this routine, but that
+c     c'(IMAX) and rhs'(IMAX) will be sent to next cell
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     outer most do loops - sweeping in i direction
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     multiply c(0,j,k) by b_inverse and copy back to c
+c     multiply rhs(0) by b_inverse(0) and copy to rhs
+c---------------------------------------------------------------------
+            call binvcrhs( lhs(1,1,bb,0),
+     >                        lhs(1,1,cc,0),
+     >                        rhs(1,0,j,k) )
+
+c---------------------------------------------------------------------
+c     begin inner most do loop
+c     do all the elements of the cell unless last 
+c---------------------------------------------------------------------
+            do i=1,isize-1
+
+c---------------------------------------------------------------------
+c     rhs(i) = rhs(i) - A*rhs(i-1)
+c---------------------------------------------------------------------
+               call matvec_sub(lhs(1,1,aa,i),
+     >                         rhs(1,i-1,j,k),rhs(1,i,j,k))
+
+c---------------------------------------------------------------------
+c     B(i) = B(i) - C(i-1)*A(i)
+c---------------------------------------------------------------------
+               call matmul_sub(lhs(1,1,aa,i),
+     >                         lhs(1,1,cc,i-1),
+     >                         lhs(1,1,bb,i))
+
+
+c---------------------------------------------------------------------
+c     multiply c(i,j,k) by b_inverse and copy back to c
+c     multiply rhs(i,j,k) by b_inverse(i,j,k) and copy to rhs
+c---------------------------------------------------------------------
+               call binvcrhs( lhs(1,1,bb,i),
+     >                        lhs(1,1,cc,i),
+     >                        rhs(1,i,j,k) )
+
+            enddo
+
+c---------------------------------------------------------------------
+c     rhs(isize) = rhs(isize) - A*rhs(isize-1)
+c---------------------------------------------------------------------
+            call matvec_sub(lhs(1,1,aa,isize),
+     >                         rhs(1,isize-1,j,k),rhs(1,isize,j,k))
+
+c---------------------------------------------------------------------
+c     B(isize) = B(isize) - C(isize-1)*A(isize)
+c---------------------------------------------------------------------
+            call matmul_sub(lhs(1,1,aa,isize),
+     >                         lhs(1,1,cc,isize-1),
+     >                         lhs(1,1,bb,isize))
+
+c---------------------------------------------------------------------
+c     multiply rhs() by b_inverse() and copy to rhs
+c---------------------------------------------------------------------
+            call binvrhs( lhs(1,1,bb,isize),
+     >                       rhs(1,isize,j,k) )
+
+
+c---------------------------------------------------------------------
+c     back solve: if last cell, then generate U(isize)=rhs(isize)
+c     else assume U(isize) is loaded in un pack backsub_info
+c     so just use it
+c     after call u(istart) will be sent to next cell
+c---------------------------------------------------------------------
+
+            do i=isize-1,0,-1
+               do m=1,BLOCK_SIZE
+                  do n=1,BLOCK_SIZE
+                     rhs(m,i,j,k) = rhs(m,i,j,k) 
+     >                    - lhs(m,n,cc,i)*rhs(n,i+1,j,k)
+                  enddo
+               enddo
+            enddo
+
+         enddo
+      enddo
+      if (timeron) call timer_stop(t_xsolve)
+
+      return
+      end
+      
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/x_solve_vec.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/x_solve_vec.f
new file mode 100644
index 0000000..3ee9a37
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/x_solve_vec.f
@@ -0,0 +1,432 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine x_solve
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     
+c     Performs line solves in X direction by first factoring
+c     the block-tridiagonal matrix into an upper triangular matrix, 
+c     and then performing back substitution to solve for the unknown
+c     vectors of each line.  
+c     
+c     Make sure we treat elements zero to cell_size in the direction
+c     of the sweep.
+c     
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'work_lhs_vec.h'
+
+      integer i,j,k,m,n,isize
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      if (timeron) call timer_start(t_xsolve)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     This function computes the left hand side in the xi-direction
+c---------------------------------------------------------------------
+
+      isize = grid_points(1)-1
+
+c---------------------------------------------------------------------
+c     determine a (labeled f) and n jacobians
+c---------------------------------------------------------------------
+      do k = 1, grid_points(3)-2
+         do j = 1, grid_points(2)-2
+            do i = 0, isize
+
+               tmp1 = rho_i(i,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+c---------------------------------------------------------------------
+c     
+c---------------------------------------------------------------------
+               fjac(1,1,i,j) = 0.0d+00
+               fjac(1,2,i,j) = 1.0d+00
+               fjac(1,3,i,j) = 0.0d+00
+               fjac(1,4,i,j) = 0.0d+00
+               fjac(1,5,i,j) = 0.0d+00
+
+               fjac(2,1,i,j) = -(u(2,i,j,k) * tmp2 * 
+     >              u(2,i,j,k))
+     >              + c2 * qs(i,j,k)
+               fjac(2,2,i,j) = ( 2.0d+00 - c2 )
+     >              * ( u(2,i,j,k) / u(1,i,j,k) )
+               fjac(2,3,i,j) = - c2 * ( u(3,i,j,k) * tmp1 )
+               fjac(2,4,i,j) = - c2 * ( u(4,i,j,k) * tmp1 )
+               fjac(2,5,i,j) = c2
+
+               fjac(3,1,i,j) = - ( u(2,i,j,k)*u(3,i,j,k) ) * tmp2
+               fjac(3,2,i,j) = u(3,i,j,k) * tmp1
+               fjac(3,3,i,j) = u(2,i,j,k) * tmp1
+               fjac(3,4,i,j) = 0.0d+00
+               fjac(3,5,i,j) = 0.0d+00
+
+               fjac(4,1,i,j) = - ( u(2,i,j,k)*u(4,i,j,k) ) * tmp2
+               fjac(4,2,i,j) = u(4,i,j,k) * tmp1
+               fjac(4,3,i,j) = 0.0d+00
+               fjac(4,4,i,j) = u(2,i,j,k) * tmp1
+               fjac(4,5,i,j) = 0.0d+00
+
+               fjac(5,1,i,j) = ( c2 * 2.0d0 * square(i,j,k)
+     >              - c1 * u(5,i,j,k) )
+     >              * ( u(2,i,j,k) * tmp2 )
+               fjac(5,2,i,j) = c1 *  u(5,i,j,k) * tmp1 
+     >              - c2
+     >              * ( u(2,i,j,k)*u(2,i,j,k) * tmp2
+     >              + qs(i,j,k) )
+               fjac(5,3,i,j) = - c2 * ( u(3,i,j,k)*u(2,i,j,k) )
+     >              * tmp2
+               fjac(5,4,i,j) = - c2 * ( u(4,i,j,k)*u(2,i,j,k) )
+     >              * tmp2
+               fjac(5,5,i,j) = c1 * ( u(2,i,j,k) * tmp1 )
+
+               njac(1,1,i,j) = 0.0d+00
+               njac(1,2,i,j) = 0.0d+00
+               njac(1,3,i,j) = 0.0d+00
+               njac(1,4,i,j) = 0.0d+00
+               njac(1,5,i,j) = 0.0d+00
+
+               njac(2,1,i,j) = - con43 * c3c4 * tmp2 * u(2,i,j,k)
+               njac(2,2,i,j) =   con43 * c3c4 * tmp1
+               njac(2,3,i,j) =   0.0d+00
+               njac(2,4,i,j) =   0.0d+00
+               njac(2,5,i,j) =   0.0d+00
+
+               njac(3,1,i,j) = - c3c4 * tmp2 * u(3,i,j,k)
+               njac(3,2,i,j) =   0.0d+00
+               njac(3,3,i,j) =   c3c4 * tmp1
+               njac(3,4,i,j) =   0.0d+00
+               njac(3,5,i,j) =   0.0d+00
+
+               njac(4,1,i,j) = - c3c4 * tmp2 * u(4,i,j,k)
+               njac(4,2,i,j) =   0.0d+00 
+               njac(4,3,i,j) =   0.0d+00
+               njac(4,4,i,j) =   c3c4 * tmp1
+               njac(4,5,i,j) =   0.0d+00
+
+               njac(5,1,i,j) = - ( con43 * c3c4
+     >              - c1345 ) * tmp3 * (u(2,i,j,k)**2)
+     >              - ( c3c4 - c1345 ) * tmp3 * (u(3,i,j,k)**2)
+     >              - ( c3c4 - c1345 ) * tmp3 * (u(4,i,j,k)**2)
+     >              - c1345 * tmp2 * u(5,i,j,k)
+
+               njac(5,2,i,j) = ( con43 * c3c4
+     >              - c1345 ) * tmp2 * u(2,i,j,k)
+               njac(5,3,i,j) = ( c3c4 - c1345 ) * tmp2 * u(3,i,j,k)
+               njac(5,4,i,j) = ( c3c4 - c1345 ) * tmp2 * u(4,i,j,k)
+               njac(5,5,i,j) = ( c1345 ) * tmp1
+
+            enddo
+         enddo
+
+c---------------------------------------------------------------------
+c     zero the whole left hand side for starters
+c     set all diagonal values to 1. This is overkill, but convenient
+c---------------------------------------------------------------------
+         do j = 1, grid_points(2)-2
+            do m = 1, 5
+               do n = 1, 5
+                  lhs(m,n,aa,0,j) = 0.0d0
+                  lhs(m,n,bb,0,j) = 0.0d0
+                  lhs(m,n,cc,0,j) = 0.0d0
+                  lhs(m,n,aa,isize,j) = 0.0d0
+                  lhs(m,n,bb,isize,j) = 0.0d0
+                  lhs(m,n,cc,isize,j) = 0.0d0
+               end do
+               lhs(m,m,bb,0,j) = 1.0d0
+               lhs(m,m,bb,isize,j) = 1.0d0
+            end do
+         enddo
+
+c---------------------------------------------------------------------
+c     now jacobians set, so form left hand side in x direction
+c---------------------------------------------------------------------
+         do j = 1, grid_points(2)-2
+            do i = 1, isize-1
+
+               tmp1 = dt * tx1
+               tmp2 = dt * tx2
+
+               lhs(1,1,aa,i,j) = - tmp2 * fjac(1,1,i-1,j)
+     >              - tmp1 * njac(1,1,i-1,j)
+     >              - tmp1 * dx1 
+               lhs(1,2,aa,i,j) = - tmp2 * fjac(1,2,i-1,j)
+     >              - tmp1 * njac(1,2,i-1,j)
+               lhs(1,3,aa,i,j) = - tmp2 * fjac(1,3,i-1,j)
+     >              - tmp1 * njac(1,3,i-1,j)
+               lhs(1,4,aa,i,j) = - tmp2 * fjac(1,4,i-1,j)
+     >              - tmp1 * njac(1,4,i-1,j)
+               lhs(1,5,aa,i,j) = - tmp2 * fjac(1,5,i-1,j)
+     >              - tmp1 * njac(1,5,i-1,j)
+
+               lhs(2,1,aa,i,j) = - tmp2 * fjac(2,1,i-1,j)
+     >              - tmp1 * njac(2,1,i-1,j)
+               lhs(2,2,aa,i,j) = - tmp2 * fjac(2,2,i-1,j)
+     >              - tmp1 * njac(2,2,i-1,j)
+     >              - tmp1 * dx2
+               lhs(2,3,aa,i,j) = - tmp2 * fjac(2,3,i-1,j)
+     >              - tmp1 * njac(2,3,i-1,j)
+               lhs(2,4,aa,i,j) = - tmp2 * fjac(2,4,i-1,j)
+     >              - tmp1 * njac(2,4,i-1,j)
+               lhs(2,5,aa,i,j) = - tmp2 * fjac(2,5,i-1,j)
+     >              - tmp1 * njac(2,5,i-1,j)
+
+               lhs(3,1,aa,i,j) = - tmp2 * fjac(3,1,i-1,j)
+     >              - tmp1 * njac(3,1,i-1,j)
+               lhs(3,2,aa,i,j) = - tmp2 * fjac(3,2,i-1,j)
+     >              - tmp1 * njac(3,2,i-1,j)
+               lhs(3,3,aa,i,j) = - tmp2 * fjac(3,3,i-1,j)
+     >              - tmp1 * njac(3,3,i-1,j)
+     >              - tmp1 * dx3 
+               lhs(3,4,aa,i,j) = - tmp2 * fjac(3,4,i-1,j)
+     >              - tmp1 * njac(3,4,i-1,j)
+               lhs(3,5,aa,i,j) = - tmp2 * fjac(3,5,i-1,j)
+     >              - tmp1 * njac(3,5,i-1,j)
+
+               lhs(4,1,aa,i,j) = - tmp2 * fjac(4,1,i-1,j)
+     >              - tmp1 * njac(4,1,i-1,j)
+               lhs(4,2,aa,i,j) = - tmp2 * fjac(4,2,i-1,j)
+     >              - tmp1 * njac(4,2,i-1,j)
+               lhs(4,3,aa,i,j) = - tmp2 * fjac(4,3,i-1,j)
+     >              - tmp1 * njac(4,3,i-1,j)
+               lhs(4,4,aa,i,j) = - tmp2 * fjac(4,4,i-1,j)
+     >              - tmp1 * njac(4,4,i-1,j)
+     >              - tmp1 * dx4
+               lhs(4,5,aa,i,j) = - tmp2 * fjac(4,5,i-1,j)
+     >              - tmp1 * njac(4,5,i-1,j)
+
+               lhs(5,1,aa,i,j) = - tmp2 * fjac(5,1,i-1,j)
+     >              - tmp1 * njac(5,1,i-1,j)
+               lhs(5,2,aa,i,j) = - tmp2 * fjac(5,2,i-1,j)
+     >              - tmp1 * njac(5,2,i-1,j)
+               lhs(5,3,aa,i,j) = - tmp2 * fjac(5,3,i-1,j)
+     >              - tmp1 * njac(5,3,i-1,j)
+               lhs(5,4,aa,i,j) = - tmp2 * fjac(5,4,i-1,j)
+     >              - tmp1 * njac(5,4,i-1,j)
+               lhs(5,5,aa,i,j) = - tmp2 * fjac(5,5,i-1,j)
+     >              - tmp1 * njac(5,5,i-1,j)
+     >              - tmp1 * dx5
+
+               lhs(1,1,bb,i,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(1,1,i,j)
+     >              + tmp1 * 2.0d+00 * dx1
+               lhs(1,2,bb,i,j) = tmp1 * 2.0d+00 * njac(1,2,i,j)
+               lhs(1,3,bb,i,j) = tmp1 * 2.0d+00 * njac(1,3,i,j)
+               lhs(1,4,bb,i,j) = tmp1 * 2.0d+00 * njac(1,4,i,j)
+               lhs(1,5,bb,i,j) = tmp1 * 2.0d+00 * njac(1,5,i,j)
+
+               lhs(2,1,bb,i,j) = tmp1 * 2.0d+00 * njac(2,1,i,j)
+               lhs(2,2,bb,i,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(2,2,i,j)
+     >              + tmp1 * 2.0d+00 * dx2
+               lhs(2,3,bb,i,j) = tmp1 * 2.0d+00 * njac(2,3,i,j)
+               lhs(2,4,bb,i,j) = tmp1 * 2.0d+00 * njac(2,4,i,j)
+               lhs(2,5,bb,i,j) = tmp1 * 2.0d+00 * njac(2,5,i,j)
+
+               lhs(3,1,bb,i,j) = tmp1 * 2.0d+00 * njac(3,1,i,j)
+               lhs(3,2,bb,i,j) = tmp1 * 2.0d+00 * njac(3,2,i,j)
+               lhs(3,3,bb,i,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(3,3,i,j)
+     >              + tmp1 * 2.0d+00 * dx3
+               lhs(3,4,bb,i,j) = tmp1 * 2.0d+00 * njac(3,4,i,j)
+               lhs(3,5,bb,i,j) = tmp1 * 2.0d+00 * njac(3,5,i,j)
+
+               lhs(4,1,bb,i,j) = tmp1 * 2.0d+00 * njac(4,1,i,j)
+               lhs(4,2,bb,i,j) = tmp1 * 2.0d+00 * njac(4,2,i,j)
+               lhs(4,3,bb,i,j) = tmp1 * 2.0d+00 * njac(4,3,i,j)
+               lhs(4,4,bb,i,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(4,4,i,j)
+     >              + tmp1 * 2.0d+00 * dx4
+               lhs(4,5,bb,i,j) = tmp1 * 2.0d+00 * njac(4,5,i,j)
+
+               lhs(5,1,bb,i,j) = tmp1 * 2.0d+00 * njac(5,1,i,j)
+               lhs(5,2,bb,i,j) = tmp1 * 2.0d+00 * njac(5,2,i,j)
+               lhs(5,3,bb,i,j) = tmp1 * 2.0d+00 * njac(5,3,i,j)
+               lhs(5,4,bb,i,j) = tmp1 * 2.0d+00 * njac(5,4,i,j)
+               lhs(5,5,bb,i,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(5,5,i,j)
+     >              + tmp1 * 2.0d+00 * dx5
+
+               lhs(1,1,cc,i,j) =  tmp2 * fjac(1,1,i+1,j)
+     >              - tmp1 * njac(1,1,i+1,j)
+     >              - tmp1 * dx1
+               lhs(1,2,cc,i,j) =  tmp2 * fjac(1,2,i+1,j)
+     >              - tmp1 * njac(1,2,i+1,j)
+               lhs(1,3,cc,i,j) =  tmp2 * fjac(1,3,i+1,j)
+     >              - tmp1 * njac(1,3,i+1,j)
+               lhs(1,4,cc,i,j) =  tmp2 * fjac(1,4,i+1,j)
+     >              - tmp1 * njac(1,4,i+1,j)
+               lhs(1,5,cc,i,j) =  tmp2 * fjac(1,5,i+1,j)
+     >              - tmp1 * njac(1,5,i+1,j)
+
+               lhs(2,1,cc,i,j) =  tmp2 * fjac(2,1,i+1,j)
+     >              - tmp1 * njac(2,1,i+1,j)
+               lhs(2,2,cc,i,j) =  tmp2 * fjac(2,2,i+1,j)
+     >              - tmp1 * njac(2,2,i+1,j)
+     >              - tmp1 * dx2
+               lhs(2,3,cc,i,j) =  tmp2 * fjac(2,3,i+1,j)
+     >              - tmp1 * njac(2,3,i+1,j)
+               lhs(2,4,cc,i,j) =  tmp2 * fjac(2,4,i+1,j)
+     >              - tmp1 * njac(2,4,i+1,j)
+               lhs(2,5,cc,i,j) =  tmp2 * fjac(2,5,i+1,j)
+     >              - tmp1 * njac(2,5,i+1,j)
+
+               lhs(3,1,cc,i,j) =  tmp2 * fjac(3,1,i+1,j)
+     >              - tmp1 * njac(3,1,i+1,j)
+               lhs(3,2,cc,i,j) =  tmp2 * fjac(3,2,i+1,j)
+     >              - tmp1 * njac(3,2,i+1,j)
+               lhs(3,3,cc,i,j) =  tmp2 * fjac(3,3,i+1,j)
+     >              - tmp1 * njac(3,3,i+1,j)
+     >              - tmp1 * dx3
+               lhs(3,4,cc,i,j) =  tmp2 * fjac(3,4,i+1,j)
+     >              - tmp1 * njac(3,4,i+1,j)
+               lhs(3,5,cc,i,j) =  tmp2 * fjac(3,5,i+1,j)
+     >              - tmp1 * njac(3,5,i+1,j)
+
+               lhs(4,1,cc,i,j) =  tmp2 * fjac(4,1,i+1,j)
+     >              - tmp1 * njac(4,1,i+1,j)
+               lhs(4,2,cc,i,j) =  tmp2 * fjac(4,2,i+1,j)
+     >              - tmp1 * njac(4,2,i+1,j)
+               lhs(4,3,cc,i,j) =  tmp2 * fjac(4,3,i+1,j)
+     >              - tmp1 * njac(4,3,i+1,j)
+               lhs(4,4,cc,i,j) =  tmp2 * fjac(4,4,i+1,j)
+     >              - tmp1 * njac(4,4,i+1,j)
+     >              - tmp1 * dx4
+               lhs(4,5,cc,i,j) =  tmp2 * fjac(4,5,i+1,j)
+     >              - tmp1 * njac(4,5,i+1,j)
+
+               lhs(5,1,cc,i,j) =  tmp2 * fjac(5,1,i+1,j)
+     >              - tmp1 * njac(5,1,i+1,j)
+               lhs(5,2,cc,i,j) =  tmp2 * fjac(5,2,i+1,j)
+     >              - tmp1 * njac(5,2,i+1,j)
+               lhs(5,3,cc,i,j) =  tmp2 * fjac(5,3,i+1,j)
+     >              - tmp1 * njac(5,3,i+1,j)
+               lhs(5,4,cc,i,j) =  tmp2 * fjac(5,4,i+1,j)
+     >              - tmp1 * njac(5,4,i+1,j)
+               lhs(5,5,cc,i,j) =  tmp2 * fjac(5,5,i+1,j)
+     >              - tmp1 * njac(5,5,i+1,j)
+     >              - tmp1 * dx5
+
+            enddo
+         enddo
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     performs Gaussian elimination on this cell.
+c     
+c     assumes that unpacking routines for non-first cells 
+c     preload C' and rhs' from previous cell.
+c     
+c     assumed send happens outside this routine, but that
+c     c'(IMAX) and rhs'(IMAX) will be sent to next cell
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     outer most do loops - sweeping in i direction
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     multiply c(0,j,k) by b_inverse and copy back to c
+c     multiply rhs(0) by b_inverse(0) and copy to rhs
+c---------------------------------------------------------------------
+!dir$ ivdep
+         do j = 1, grid_points(2)-2
+            call binvcrhs( lhs(1,1,bb,0,j),
+     >                        lhs(1,1,cc,0,j),
+     >                        rhs(1,0,j,k) )
+         enddo
+
+c---------------------------------------------------------------------
+c     begin inner most do loop
+c     do all the elements of the cell unless last 
+c---------------------------------------------------------------------
+!dir$ ivdep
+!dir$ interchange(i,j)
+         do j = 1, grid_points(2)-2
+            do i=1,isize-1
+
+c---------------------------------------------------------------------
+c     rhs(i) = rhs(i) - A*rhs(i-1)
+c---------------------------------------------------------------------
+               call matvec_sub(lhs(1,1,aa,i,j),
+     >                         rhs(1,i-1,j,k),rhs(1,i,j,k))
+
+c---------------------------------------------------------------------
+c     B(i) = B(i) - C(i-1)*A(i)
+c---------------------------------------------------------------------
+               call matmul_sub(lhs(1,1,aa,i,j),
+     >                         lhs(1,1,cc,i-1,j),
+     >                         lhs(1,1,bb,i,j))
+
+
+c---------------------------------------------------------------------
+c     multiply c(i,j,k) by b_inverse and copy back to c
+c     multiply rhs(i,j,k) by b_inverse(i,j,k) and copy to rhs
+c---------------------------------------------------------------------
+               call binvcrhs( lhs(1,1,bb,i,j),
+     >                        lhs(1,1,cc,i,j),
+     >                        rhs(1,i,j,k) )
+
+            enddo
+         enddo
+
+c---------------------------------------------------------------------
+c     rhs(isize) = rhs(isize) - A*rhs(isize-1)
+c---------------------------------------------------------------------
+!dir$ ivdep
+         do j = 1, grid_points(2)-2
+            call matvec_sub(lhs(1,1,aa,isize,j),
+     >                         rhs(1,isize-1,j,k),rhs(1,isize,j,k))
+
+c---------------------------------------------------------------------
+c     B(isize) = B(isize) - C(isize-1)*A(isize)
+c---------------------------------------------------------------------
+            call matmul_sub(lhs(1,1,aa,isize,j),
+     >                         lhs(1,1,cc,isize-1,j),
+     >                         lhs(1,1,bb,isize,j))
+
+c---------------------------------------------------------------------
+c     multiply rhs() by b_inverse() and copy to rhs
+c---------------------------------------------------------------------
+            call binvrhs( lhs(1,1,bb,isize,j),
+     >                       rhs(1,isize,j,k) )
+         enddo
+
+
+c---------------------------------------------------------------------
+c     back solve: if last cell, then generate U(isize)=rhs(isize)
+c     else assume U(isize) is loaded in un pack backsub_info
+c     so just use it
+c     after call u(istart) will be sent to next cell
+c---------------------------------------------------------------------
+
+         do j = 1, grid_points(2)-2
+            do i=isize-1,0,-1
+               do m=1,BLOCK_SIZE
+                  do n=1,BLOCK_SIZE
+                     rhs(m,i,j,k) = rhs(m,i,j,k) 
+     >                    - lhs(m,n,cc,i,j)*rhs(n,i+1,j,k)
+                  enddo
+               enddo
+            enddo
+         enddo
+
+      enddo
+      if (timeron) call timer_stop(t_xsolve)
+
+      return
+      end
+      
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/y_solve.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/y_solve.f
new file mode 100644
index 0000000..43cbdec
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/y_solve.f
@@ -0,0 +1,399 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine y_solve
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     Performs line solves in Y direction by first factoring
+c     the block-tridiagonal matrix into an upper triangular matrix, 
+c     and then performing back substitution to solve for the unknown
+c     vectors of each line.  
+c     
+c     Make sure we treat elements zero to cell_size in the direction
+c     of the sweep.
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'work_lhs.h'
+
+      integer i, j, k, m, n, jsize
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      if (timeron) call timer_start(t_ysolve)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     This function computes the left hand side for the three y-factors   
+c---------------------------------------------------------------------
+
+      jsize = grid_points(2)-1
+
+c---------------------------------------------------------------------
+c     Compute the indices for storing the tri-diagonal matrix;
+c     determine a (labeled f) and n jacobians for cell c
+c---------------------------------------------------------------------
+      do k = 1, grid_points(3)-2
+         do i = 1, grid_points(1)-2
+            do j = 0, jsize
+
+               tmp1 = rho_i(i,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               fjac(1,1,j) = 0.0d+00
+               fjac(1,2,j) = 0.0d+00
+               fjac(1,3,j) = 1.0d+00
+               fjac(1,4,j) = 0.0d+00
+               fjac(1,5,j) = 0.0d+00
+
+               fjac(2,1,j) = - ( u(2,i,j,k)*u(3,i,j,k) )
+     >              * tmp2
+               fjac(2,2,j) = u(3,i,j,k) * tmp1
+               fjac(2,3,j) = u(2,i,j,k) * tmp1
+               fjac(2,4,j) = 0.0d+00
+               fjac(2,5,j) = 0.0d+00
+
+               fjac(3,1,j) = - ( u(3,i,j,k)*u(3,i,j,k)*tmp2)
+     >              + c2 * qs(i,j,k)
+               fjac(3,2,j) = - c2 *  u(2,i,j,k) * tmp1
+               fjac(3,3,j) = ( 2.0d+00 - c2 )
+     >              *  u(3,i,j,k) * tmp1 
+               fjac(3,4,j) = - c2 * u(4,i,j,k) * tmp1 
+               fjac(3,5,j) = c2
+
+               fjac(4,1,j) = - ( u(3,i,j,k)*u(4,i,j,k) )
+     >              * tmp2
+               fjac(4,2,j) = 0.0d+00
+               fjac(4,3,j) = u(4,i,j,k) * tmp1
+               fjac(4,4,j) = u(3,i,j,k) * tmp1
+               fjac(4,5,j) = 0.0d+00
+
+               fjac(5,1,j) = ( c2 * 2.0d0 * square(i,j,k)
+     >              - c1 * u(5,i,j,k) )
+     >              * u(3,i,j,k) * tmp2
+               fjac(5,2,j) = - c2 * u(2,i,j,k)*u(3,i,j,k) 
+     >              * tmp2
+               fjac(5,3,j) = c1 * u(5,i,j,k) * tmp1 
+     >              - c2 
+     >              * ( qs(i,j,k)
+     >              + u(3,i,j,k)*u(3,i,j,k) * tmp2 )
+               fjac(5,4,j) = - c2 * ( u(3,i,j,k)*u(4,i,j,k) )
+     >              * tmp2
+               fjac(5,5,j) = c1 * u(3,i,j,k) * tmp1 
+
+               njac(1,1,j) = 0.0d+00
+               njac(1,2,j) = 0.0d+00
+               njac(1,3,j) = 0.0d+00
+               njac(1,4,j) = 0.0d+00
+               njac(1,5,j) = 0.0d+00
+
+               njac(2,1,j) = - c3c4 * tmp2 * u(2,i,j,k)
+               njac(2,2,j) =   c3c4 * tmp1
+               njac(2,3,j) =   0.0d+00
+               njac(2,4,j) =   0.0d+00
+               njac(2,5,j) =   0.0d+00
+
+               njac(3,1,j) = - con43 * c3c4 * tmp2 * u(3,i,j,k)
+               njac(3,2,j) =   0.0d+00
+               njac(3,3,j) =   con43 * c3c4 * tmp1
+               njac(3,4,j) =   0.0d+00
+               njac(3,5,j) =   0.0d+00
+
+               njac(4,1,j) = - c3c4 * tmp2 * u(4,i,j,k)
+               njac(4,2,j) =   0.0d+00
+               njac(4,3,j) =   0.0d+00
+               njac(4,4,j) =   c3c4 * tmp1
+               njac(4,5,j) =   0.0d+00
+
+               njac(5,1,j) = - (  c3c4
+     >              - c1345 ) * tmp3 * (u(2,i,j,k)**2)
+     >              - ( con43 * c3c4
+     >              - c1345 ) * tmp3 * (u(3,i,j,k)**2)
+     >              - ( c3c4 - c1345 ) * tmp3 * (u(4,i,j,k)**2)
+     >              - c1345 * tmp2 * u(5,i,j,k)
+
+               njac(5,2,j) = (  c3c4 - c1345 ) * tmp2 * u(2,i,j,k)
+               njac(5,3,j) = ( con43 * c3c4
+     >              - c1345 ) * tmp2 * u(3,i,j,k)
+               njac(5,4,j) = ( c3c4 - c1345 ) * tmp2 * u(4,i,j,k)
+               njac(5,5,j) = ( c1345 ) * tmp1
+
+            enddo
+
+c---------------------------------------------------------------------
+c     now jacobians set, so form left hand side in y direction
+c---------------------------------------------------------------------
+            call lhsinit(lhs, jsize)
+            do j = 1, jsize-1
+
+               tmp1 = dt * ty1
+               tmp2 = dt * ty2
+
+               lhs(1,1,aa,j) = - tmp2 * fjac(1,1,j-1)
+     >              - tmp1 * njac(1,1,j-1)
+     >              - tmp1 * dy1 
+               lhs(1,2,aa,j) = - tmp2 * fjac(1,2,j-1)
+     >              - tmp1 * njac(1,2,j-1)
+               lhs(1,3,aa,j) = - tmp2 * fjac(1,3,j-1)
+     >              - tmp1 * njac(1,3,j-1)
+               lhs(1,4,aa,j) = - tmp2 * fjac(1,4,j-1)
+     >              - tmp1 * njac(1,4,j-1)
+               lhs(1,5,aa,j) = - tmp2 * fjac(1,5,j-1)
+     >              - tmp1 * njac(1,5,j-1)
+
+               lhs(2,1,aa,j) = - tmp2 * fjac(2,1,j-1)
+     >              - tmp1 * njac(2,1,j-1)
+               lhs(2,2,aa,j) = - tmp2 * fjac(2,2,j-1)
+     >              - tmp1 * njac(2,2,j-1)
+     >              - tmp1 * dy2
+               lhs(2,3,aa,j) = - tmp2 * fjac(2,3,j-1)
+     >              - tmp1 * njac(2,3,j-1)
+               lhs(2,4,aa,j) = - tmp2 * fjac(2,4,j-1)
+     >              - tmp1 * njac(2,4,j-1)
+               lhs(2,5,aa,j) = - tmp2 * fjac(2,5,j-1)
+     >              - tmp1 * njac(2,5,j-1)
+
+               lhs(3,1,aa,j) = - tmp2 * fjac(3,1,j-1)
+     >              - tmp1 * njac(3,1,j-1)
+               lhs(3,2,aa,j) = - tmp2 * fjac(3,2,j-1)
+     >              - tmp1 * njac(3,2,j-1)
+               lhs(3,3,aa,j) = - tmp2 * fjac(3,3,j-1)
+     >              - tmp1 * njac(3,3,j-1)
+     >              - tmp1 * dy3 
+               lhs(3,4,aa,j) = - tmp2 * fjac(3,4,j-1)
+     >              - tmp1 * njac(3,4,j-1)
+               lhs(3,5,aa,j) = - tmp2 * fjac(3,5,j-1)
+     >              - tmp1 * njac(3,5,j-1)
+
+               lhs(4,1,aa,j) = - tmp2 * fjac(4,1,j-1)
+     >              - tmp1 * njac(4,1,j-1)
+               lhs(4,2,aa,j) = - tmp2 * fjac(4,2,j-1)
+     >              - tmp1 * njac(4,2,j-1)
+               lhs(4,3,aa,j) = - tmp2 * fjac(4,3,j-1)
+     >              - tmp1 * njac(4,3,j-1)
+               lhs(4,4,aa,j) = - tmp2 * fjac(4,4,j-1)
+     >              - tmp1 * njac(4,4,j-1)
+     >              - tmp1 * dy4
+               lhs(4,5,aa,j) = - tmp2 * fjac(4,5,j-1)
+     >              - tmp1 * njac(4,5,j-1)
+
+               lhs(5,1,aa,j) = - tmp2 * fjac(5,1,j-1)
+     >              - tmp1 * njac(5,1,j-1)
+               lhs(5,2,aa,j) = - tmp2 * fjac(5,2,j-1)
+     >              - tmp1 * njac(5,2,j-1)
+               lhs(5,3,aa,j) = - tmp2 * fjac(5,3,j-1)
+     >              - tmp1 * njac(5,3,j-1)
+               lhs(5,4,aa,j) = - tmp2 * fjac(5,4,j-1)
+     >              - tmp1 * njac(5,4,j-1)
+               lhs(5,5,aa,j) = - tmp2 * fjac(5,5,j-1)
+     >              - tmp1 * njac(5,5,j-1)
+     >              - tmp1 * dy5
+
+               lhs(1,1,bb,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(1,1,j)
+     >              + tmp1 * 2.0d+00 * dy1
+               lhs(1,2,bb,j) = tmp1 * 2.0d+00 * njac(1,2,j)
+               lhs(1,3,bb,j) = tmp1 * 2.0d+00 * njac(1,3,j)
+               lhs(1,4,bb,j) = tmp1 * 2.0d+00 * njac(1,4,j)
+               lhs(1,5,bb,j) = tmp1 * 2.0d+00 * njac(1,5,j)
+
+               lhs(2,1,bb,j) = tmp1 * 2.0d+00 * njac(2,1,j)
+               lhs(2,2,bb,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(2,2,j)
+     >              + tmp1 * 2.0d+00 * dy2
+               lhs(2,3,bb,j) = tmp1 * 2.0d+00 * njac(2,3,j)
+               lhs(2,4,bb,j) = tmp1 * 2.0d+00 * njac(2,4,j)
+               lhs(2,5,bb,j) = tmp1 * 2.0d+00 * njac(2,5,j)
+
+               lhs(3,1,bb,j) = tmp1 * 2.0d+00 * njac(3,1,j)
+               lhs(3,2,bb,j) = tmp1 * 2.0d+00 * njac(3,2,j)
+               lhs(3,3,bb,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(3,3,j)
+     >              + tmp1 * 2.0d+00 * dy3
+               lhs(3,4,bb,j) = tmp1 * 2.0d+00 * njac(3,4,j)
+               lhs(3,5,bb,j) = tmp1 * 2.0d+00 * njac(3,5,j)
+
+               lhs(4,1,bb,j) = tmp1 * 2.0d+00 * njac(4,1,j)
+               lhs(4,2,bb,j) = tmp1 * 2.0d+00 * njac(4,2,j)
+               lhs(4,3,bb,j) = tmp1 * 2.0d+00 * njac(4,3,j)
+               lhs(4,4,bb,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(4,4,j)
+     >              + tmp1 * 2.0d+00 * dy4
+               lhs(4,5,bb,j) = tmp1 * 2.0d+00 * njac(4,5,j)
+
+               lhs(5,1,bb,j) = tmp1 * 2.0d+00 * njac(5,1,j)
+               lhs(5,2,bb,j) = tmp1 * 2.0d+00 * njac(5,2,j)
+               lhs(5,3,bb,j) = tmp1 * 2.0d+00 * njac(5,3,j)
+               lhs(5,4,bb,j) = tmp1 * 2.0d+00 * njac(5,4,j)
+               lhs(5,5,bb,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(5,5,j) 
+     >              + tmp1 * 2.0d+00 * dy5
+
+               lhs(1,1,cc,j) =  tmp2 * fjac(1,1,j+1)
+     >              - tmp1 * njac(1,1,j+1)
+     >              - tmp1 * dy1
+               lhs(1,2,cc,j) =  tmp2 * fjac(1,2,j+1)
+     >              - tmp1 * njac(1,2,j+1)
+               lhs(1,3,cc,j) =  tmp2 * fjac(1,3,j+1)
+     >              - tmp1 * njac(1,3,j+1)
+               lhs(1,4,cc,j) =  tmp2 * fjac(1,4,j+1)
+     >              - tmp1 * njac(1,4,j+1)
+               lhs(1,5,cc,j) =  tmp2 * fjac(1,5,j+1)
+     >              - tmp1 * njac(1,5,j+1)
+
+               lhs(2,1,cc,j) =  tmp2 * fjac(2,1,j+1)
+     >              - tmp1 * njac(2,1,j+1)
+               lhs(2,2,cc,j) =  tmp2 * fjac(2,2,j+1)
+     >              - tmp1 * njac(2,2,j+1)
+     >              - tmp1 * dy2
+               lhs(2,3,cc,j) =  tmp2 * fjac(2,3,j+1)
+     >              - tmp1 * njac(2,3,j+1)
+               lhs(2,4,cc,j) =  tmp2 * fjac(2,4,j+1)
+     >              - tmp1 * njac(2,4,j+1)
+               lhs(2,5,cc,j) =  tmp2 * fjac(2,5,j+1)
+     >              - tmp1 * njac(2,5,j+1)
+
+               lhs(3,1,cc,j) =  tmp2 * fjac(3,1,j+1)
+     >              - tmp1 * njac(3,1,j+1)
+               lhs(3,2,cc,j) =  tmp2 * fjac(3,2,j+1)
+     >              - tmp1 * njac(3,2,j+1)
+               lhs(3,3,cc,j) =  tmp2 * fjac(3,3,j+1)
+     >              - tmp1 * njac(3,3,j+1)
+     >              - tmp1 * dy3
+               lhs(3,4,cc,j) =  tmp2 * fjac(3,4,j+1)
+     >              - tmp1 * njac(3,4,j+1)
+               lhs(3,5,cc,j) =  tmp2 * fjac(3,5,j+1)
+     >              - tmp1 * njac(3,5,j+1)
+
+               lhs(4,1,cc,j) =  tmp2 * fjac(4,1,j+1)
+     >              - tmp1 * njac(4,1,j+1)
+               lhs(4,2,cc,j) =  tmp2 * fjac(4,2,j+1)
+     >              - tmp1 * njac(4,2,j+1)
+               lhs(4,3,cc,j) =  tmp2 * fjac(4,3,j+1)
+     >              - tmp1 * njac(4,3,j+1)
+               lhs(4,4,cc,j) =  tmp2 * fjac(4,4,j+1)
+     >              - tmp1 * njac(4,4,j+1)
+     >              - tmp1 * dy4
+               lhs(4,5,cc,j) =  tmp2 * fjac(4,5,j+1)
+     >              - tmp1 * njac(4,5,j+1)
+
+               lhs(5,1,cc,j) =  tmp2 * fjac(5,1,j+1)
+     >              - tmp1 * njac(5,1,j+1)
+               lhs(5,2,cc,j) =  tmp2 * fjac(5,2,j+1)
+     >              - tmp1 * njac(5,2,j+1)
+               lhs(5,3,cc,j) =  tmp2 * fjac(5,3,j+1)
+     >              - tmp1 * njac(5,3,j+1)
+               lhs(5,4,cc,j) =  tmp2 * fjac(5,4,j+1)
+     >              - tmp1 * njac(5,4,j+1)
+               lhs(5,5,cc,j) =  tmp2 * fjac(5,5,j+1)
+     >              - tmp1 * njac(5,5,j+1)
+     >              - tmp1 * dy5
+
+            enddo
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     performs gaussian elimination on this cell.
+c     
+c     assumes that unpacking routines for non-first cells 
+c     preload C' and rhs' from previous cell.
+c     
+c     assumed send happens outside this routine, but that
+c     c'(JMAX) and rhs'(JMAX) will be sent to next cell
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     multiply c(i,0,k) by b_inverse and copy back to c
+c     multiply rhs(0) by b_inverse(0) and copy to rhs
+c---------------------------------------------------------------------
+            call binvcrhs( lhs(1,1,bb,0),
+     >                        lhs(1,1,cc,0),
+     >                        rhs(1,i,0,k) )
+
+c---------------------------------------------------------------------
+c     begin inner most do loop
+c     do all the elements of the cell unless last 
+c---------------------------------------------------------------------
+            do j=1,jsize-1
+
+c---------------------------------------------------------------------
+c     subtract A*lhs_vector(j-1) from lhs_vector(j)
+c     
+c     rhs(j) = rhs(j) - A*rhs(j-1)
+c---------------------------------------------------------------------
+               call matvec_sub(lhs(1,1,aa,j),
+     >                         rhs(1,i,j-1,k),rhs(1,i,j,k))
+
+c---------------------------------------------------------------------
+c     B(j) = B(j) - C(j-1)*A(j)
+c---------------------------------------------------------------------
+               call matmul_sub(lhs(1,1,aa,j),
+     >                         lhs(1,1,cc,j-1),
+     >                         lhs(1,1,bb,j))
+
+c---------------------------------------------------------------------
+c     multiply c(i,j,k) by b_inverse and copy back to c
+c     multiply rhs(i,j,k) by b_inverse(i,j,k) and copy to rhs
+c---------------------------------------------------------------------
+               call binvcrhs( lhs(1,1,bb,j),
+     >                        lhs(1,1,cc,j),
+     >                        rhs(1,i,j,k) )
+
+            enddo
+
+
+c---------------------------------------------------------------------
+c     rhs(jsize) = rhs(jsize) - A*rhs(jsize-1)
+c---------------------------------------------------------------------
+            call matvec_sub(lhs(1,1,aa,jsize),
+     >                         rhs(1,i,jsize-1,k),rhs(1,i,jsize,k))
+
+c---------------------------------------------------------------------
+c     B(jsize) = B(jsize) - C(jsize-1)*A(jsize)
+c     call matmul_sub(aa,i,jsize,k,c,
+c     $              cc,i,jsize-1,k,c,bb,i,jsize,k)
+c---------------------------------------------------------------------
+            call matmul_sub(lhs(1,1,aa,jsize),
+     >                         lhs(1,1,cc,jsize-1),
+     >                         lhs(1,1,bb,jsize))
+
+c---------------------------------------------------------------------
+c     multiply rhs(jsize) by b_inverse(jsize) and copy to rhs
+c---------------------------------------------------------------------
+            call binvrhs( lhs(1,1,bb,jsize),
+     >                       rhs(1,i,jsize,k) )
+
+
+c---------------------------------------------------------------------
+c     back solve: if last cell, then generate U(jsize)=rhs(jsize)
+c     else assume U(jsize) is loaded in un pack backsub_info
+c     so just use it
+c     after call u(jstart) will be sent to next cell
+c---------------------------------------------------------------------
+      
+            do j=jsize-1,0,-1
+               do m=1,BLOCK_SIZE
+                  do n=1,BLOCK_SIZE
+                     rhs(m,i,j,k) = rhs(m,i,j,k) 
+     >                    - lhs(m,n,cc,j)*rhs(n,i,j+1,k)
+                  enddo
+               enddo
+            enddo
+
+         enddo
+      enddo
+      if (timeron) call timer_stop(t_ysolve)
+
+      return
+      end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/y_solve_vec.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/y_solve_vec.f
new file mode 100644
index 0000000..fa9aa1e
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/y_solve_vec.f
@@ -0,0 +1,430 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine y_solve
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     Performs line solves in Y direction by first factoring
+c     the block-tridiagonal matrix into an upper triangular matrix, 
+c     and then performing back substitution to solve for the unknown
+c     vectors of each line.  
+c     
+c     Make sure we treat elements zero to cell_size in the direction
+c     of the sweep.
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'work_lhs_vec.h'
+
+      integer i, j, k, m, n, jsize
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      if (timeron) call timer_start(t_ysolve)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     This function computes the left hand side for the three y-factors 
+c---------------------------------------------------------------------
+
+      jsize = grid_points(2)-1
+
+c---------------------------------------------------------------------
+c     Compute the indices for storing the tri-diagonal matrix;
+c     determine a (labeled f) and n jacobians for cell c
+c---------------------------------------------------------------------
+      do k = 1, grid_points(3)-2
+         do j = 0, jsize
+            do i = 1, grid_points(1)-2
+
+               tmp1 = rho_i(i,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               fjac(1,1,i,j) = 0.0d+00
+               fjac(1,2,i,j) = 0.0d+00
+               fjac(1,3,i,j) = 1.0d+00
+               fjac(1,4,i,j) = 0.0d+00
+               fjac(1,5,i,j) = 0.0d+00
+
+               fjac(2,1,i,j) = - ( u(2,i,j,k)*u(3,i,j,k) )
+     >              * tmp2
+               fjac(2,2,i,j) = u(3,i,j,k) * tmp1
+               fjac(2,3,i,j) = u(2,i,j,k) * tmp1
+               fjac(2,4,i,j) = 0.0d+00
+               fjac(2,5,i,j) = 0.0d+00
+
+               fjac(3,1,i,j) = - ( u(3,i,j,k)*u(3,i,j,k)*tmp2)
+     >              + c2 * qs(i,j,k)
+               fjac(3,2,i,j) = - c2 *  u(2,i,j,k) * tmp1
+               fjac(3,3,i,j) = ( 2.0d+00 - c2 )
+     >              *  u(3,i,j,k) * tmp1 
+               fjac(3,4,i,j) = - c2 * u(4,i,j,k) * tmp1 
+               fjac(3,5,i,j) = c2
+
+               fjac(4,1,i,j) = - ( u(3,i,j,k)*u(4,i,j,k) )
+     >              * tmp2
+               fjac(4,2,i,j) = 0.0d+00
+               fjac(4,3,i,j) = u(4,i,j,k) * tmp1
+               fjac(4,4,i,j) = u(3,i,j,k) * tmp1
+               fjac(4,5,i,j) = 0.0d+00
+
+               fjac(5,1,i,j) = ( c2 * 2.0d0 * square(i,j,k)
+     >              - c1 * u(5,i,j,k) )
+     >              * u(3,i,j,k) * tmp2
+               fjac(5,2,i,j) = - c2 * u(2,i,j,k)*u(3,i,j,k) 
+     >              * tmp2
+               fjac(5,3,i,j) = c1 * u(5,i,j,k) * tmp1 
+     >              - c2 
+     >              * ( qs(i,j,k)
+     >              + u(3,i,j,k)*u(3,i,j,k) * tmp2 )
+               fjac(5,4,i,j) = - c2 * ( u(3,i,j,k)*u(4,i,j,k) )
+     >              * tmp2
+               fjac(5,5,i,j) = c1 * u(3,i,j,k) * tmp1 
+
+               njac(1,1,i,j) = 0.0d+00
+               njac(1,2,i,j) = 0.0d+00
+               njac(1,3,i,j) = 0.0d+00
+               njac(1,4,i,j) = 0.0d+00
+               njac(1,5,i,j) = 0.0d+00
+
+               njac(2,1,i,j) = - c3c4 * tmp2 * u(2,i,j,k)
+               njac(2,2,i,j) =   c3c4 * tmp1
+               njac(2,3,i,j) =   0.0d+00
+               njac(2,4,i,j) =   0.0d+00
+               njac(2,5,i,j) =   0.0d+00
+
+               njac(3,1,i,j) = - con43 * c3c4 * tmp2 * u(3,i,j,k)
+               njac(3,2,i,j) =   0.0d+00
+               njac(3,3,i,j) =   con43 * c3c4 * tmp1
+               njac(3,4,i,j) =   0.0d+00
+               njac(3,5,i,j) =   0.0d+00
+
+               njac(4,1,i,j) = - c3c4 * tmp2 * u(4,i,j,k)
+               njac(4,2,i,j) =   0.0d+00
+               njac(4,3,i,j) =   0.0d+00
+               njac(4,4,i,j) =   c3c4 * tmp1
+               njac(4,5,i,j) =   0.0d+00
+
+               njac(5,1,i,j) = - (  c3c4
+     >              - c1345 ) * tmp3 * (u(2,i,j,k)**2)
+     >              - ( con43 * c3c4
+     >              - c1345 ) * tmp3 * (u(3,i,j,k)**2)
+     >              - ( c3c4 - c1345 ) * tmp3 * (u(4,i,j,k)**2)
+     >              - c1345 * tmp2 * u(5,i,j,k)
+
+               njac(5,2,i,j) = (  c3c4 - c1345 ) * tmp2 * u(2,i,j,k)
+               njac(5,3,i,j) = ( con43 * c3c4
+     >              - c1345 ) * tmp2 * u(3,i,j,k)
+               njac(5,4,i,j) = ( c3c4 - c1345 ) * tmp2 * u(4,i,j,k)
+               njac(5,5,i,j) = ( c1345 ) * tmp1
+
+            enddo
+         enddo
+
+c---------------------------------------------------------------------
+c     zero the whole left hand side for starters
+c     set all diagonal values to 1. This is overkill, but convenient
+c---------------------------------------------------------------------
+         do i = 1, grid_points(1)-2
+            do m = 1, 5
+               do n = 1, 5
+                  lhs(m,n,aa,i,0) = 0.0d0
+                  lhs(m,n,bb,i,0) = 0.0d0
+                  lhs(m,n,cc,i,0) = 0.0d0
+                  lhs(m,n,aa,i,jsize) = 0.0d0
+                  lhs(m,n,bb,i,jsize) = 0.0d0
+                  lhs(m,n,cc,i,jsize) = 0.0d0
+               end do
+               lhs(m,m,bb,i,0) = 1.0d0
+               lhs(m,m,bb,i,jsize) = 1.0d0
+            end do
+         enddo
+
+c---------------------------------------------------------------------
+c     now jacobians set, so form left hand side in y direction
+c---------------------------------------------------------------------
+         do j = 1, jsize-1
+            do i = 1, grid_points(1)-2
+
+               tmp1 = dt * ty1
+               tmp2 = dt * ty2
+
+               lhs(1,1,aa,i,j) = - tmp2 * fjac(1,1,i,j-1)
+     >              - tmp1 * njac(1,1,i,j-1)
+     >              - tmp1 * dy1 
+               lhs(1,2,aa,i,j) = - tmp2 * fjac(1,2,i,j-1)
+     >              - tmp1 * njac(1,2,i,j-1)
+               lhs(1,3,aa,i,j) = - tmp2 * fjac(1,3,i,j-1)
+     >              - tmp1 * njac(1,3,i,j-1)
+               lhs(1,4,aa,i,j) = - tmp2 * fjac(1,4,i,j-1)
+     >              - tmp1 * njac(1,4,i,j-1)
+               lhs(1,5,aa,i,j) = - tmp2 * fjac(1,5,i,j-1)
+     >              - tmp1 * njac(1,5,i,j-1)
+
+               lhs(2,1,aa,i,j) = - tmp2 * fjac(2,1,i,j-1)
+     >              - tmp1 * njac(2,1,i,j-1)
+               lhs(2,2,aa,i,j) = - tmp2 * fjac(2,2,i,j-1)
+     >              - tmp1 * njac(2,2,i,j-1)
+     >              - tmp1 * dy2
+               lhs(2,3,aa,i,j) = - tmp2 * fjac(2,3,i,j-1)
+     >              - tmp1 * njac(2,3,i,j-1)
+               lhs(2,4,aa,i,j) = - tmp2 * fjac(2,4,i,j-1)
+     >              - tmp1 * njac(2,4,i,j-1)
+               lhs(2,5,aa,i,j) = - tmp2 * fjac(2,5,i,j-1)
+     >              - tmp1 * njac(2,5,i,j-1)
+
+               lhs(3,1,aa,i,j) = - tmp2 * fjac(3,1,i,j-1)
+     >              - tmp1 * njac(3,1,i,j-1)
+               lhs(3,2,aa,i,j) = - tmp2 * fjac(3,2,i,j-1)
+     >              - tmp1 * njac(3,2,i,j-1)
+               lhs(3,3,aa,i,j) = - tmp2 * fjac(3,3,i,j-1)
+     >              - tmp1 * njac(3,3,i,j-1)
+     >              - tmp1 * dy3 
+               lhs(3,4,aa,i,j) = - tmp2 * fjac(3,4,i,j-1)
+     >              - tmp1 * njac(3,4,i,j-1)
+               lhs(3,5,aa,i,j) = - tmp2 * fjac(3,5,i,j-1)
+     >              - tmp1 * njac(3,5,i,j-1)
+
+               lhs(4,1,aa,i,j) = - tmp2 * fjac(4,1,i,j-1)
+     >              - tmp1 * njac(4,1,i,j-1)
+               lhs(4,2,aa,i,j) = - tmp2 * fjac(4,2,i,j-1)
+     >              - tmp1 * njac(4,2,i,j-1)
+               lhs(4,3,aa,i,j) = - tmp2 * fjac(4,3,i,j-1)
+     >              - tmp1 * njac(4,3,i,j-1)
+               lhs(4,4,aa,i,j) = - tmp2 * fjac(4,4,i,j-1)
+     >              - tmp1 * njac(4,4,i,j-1)
+     >              - tmp1 * dy4
+               lhs(4,5,aa,i,j) = - tmp2 * fjac(4,5,i,j-1)
+     >              - tmp1 * njac(4,5,i,j-1)
+
+               lhs(5,1,aa,i,j) = - tmp2 * fjac(5,1,i,j-1)
+     >              - tmp1 * njac(5,1,i,j-1)
+               lhs(5,2,aa,i,j) = - tmp2 * fjac(5,2,i,j-1)
+     >              - tmp1 * njac(5,2,i,j-1)
+               lhs(5,3,aa,i,j) = - tmp2 * fjac(5,3,i,j-1)
+     >              - tmp1 * njac(5,3,i,j-1)
+               lhs(5,4,aa,i,j) = - tmp2 * fjac(5,4,i,j-1)
+     >              - tmp1 * njac(5,4,i,j-1)
+               lhs(5,5,aa,i,j) = - tmp2 * fjac(5,5,i,j-1)
+     >              - tmp1 * njac(5,5,i,j-1)
+     >              - tmp1 * dy5
+
+               lhs(1,1,bb,i,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(1,1,i,j)
+     >              + tmp1 * 2.0d+00 * dy1
+               lhs(1,2,bb,i,j) = tmp1 * 2.0d+00 * njac(1,2,i,j)
+               lhs(1,3,bb,i,j) = tmp1 * 2.0d+00 * njac(1,3,i,j)
+               lhs(1,4,bb,i,j) = tmp1 * 2.0d+00 * njac(1,4,i,j)
+               lhs(1,5,bb,i,j) = tmp1 * 2.0d+00 * njac(1,5,i,j)
+
+               lhs(2,1,bb,i,j) = tmp1 * 2.0d+00 * njac(2,1,i,j)
+               lhs(2,2,bb,i,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(2,2,i,j)
+     >              + tmp1 * 2.0d+00 * dy2
+               lhs(2,3,bb,i,j) = tmp1 * 2.0d+00 * njac(2,3,i,j)
+               lhs(2,4,bb,i,j) = tmp1 * 2.0d+00 * njac(2,4,i,j)
+               lhs(2,5,bb,i,j) = tmp1 * 2.0d+00 * njac(2,5,i,j)
+
+               lhs(3,1,bb,i,j) = tmp1 * 2.0d+00 * njac(3,1,i,j)
+               lhs(3,2,bb,i,j) = tmp1 * 2.0d+00 * njac(3,2,i,j)
+               lhs(3,3,bb,i,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(3,3,i,j)
+     >              + tmp1 * 2.0d+00 * dy3
+               lhs(3,4,bb,i,j) = tmp1 * 2.0d+00 * njac(3,4,i,j)
+               lhs(3,5,bb,i,j) = tmp1 * 2.0d+00 * njac(3,5,i,j)
+
+               lhs(4,1,bb,i,j) = tmp1 * 2.0d+00 * njac(4,1,i,j)
+               lhs(4,2,bb,i,j) = tmp1 * 2.0d+00 * njac(4,2,i,j)
+               lhs(4,3,bb,i,j) = tmp1 * 2.0d+00 * njac(4,3,i,j)
+               lhs(4,4,bb,i,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(4,4,i,j)
+     >              + tmp1 * 2.0d+00 * dy4
+               lhs(4,5,bb,i,j) = tmp1 * 2.0d+00 * njac(4,5,i,j)
+
+               lhs(5,1,bb,i,j) = tmp1 * 2.0d+00 * njac(5,1,i,j)
+               lhs(5,2,bb,i,j) = tmp1 * 2.0d+00 * njac(5,2,i,j)
+               lhs(5,3,bb,i,j) = tmp1 * 2.0d+00 * njac(5,3,i,j)
+               lhs(5,4,bb,i,j) = tmp1 * 2.0d+00 * njac(5,4,i,j)
+               lhs(5,5,bb,i,j) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(5,5,i,j) 
+     >              + tmp1 * 2.0d+00 * dy5
+
+               lhs(1,1,cc,i,j) =  tmp2 * fjac(1,1,i,j+1)
+     >              - tmp1 * njac(1,1,i,j+1)
+     >              - tmp1 * dy1
+               lhs(1,2,cc,i,j) =  tmp2 * fjac(1,2,i,j+1)
+     >              - tmp1 * njac(1,2,i,j+1)
+               lhs(1,3,cc,i,j) =  tmp2 * fjac(1,3,i,j+1)
+     >              - tmp1 * njac(1,3,i,j+1)
+               lhs(1,4,cc,i,j) =  tmp2 * fjac(1,4,i,j+1)
+     >              - tmp1 * njac(1,4,i,j+1)
+               lhs(1,5,cc,i,j) =  tmp2 * fjac(1,5,i,j+1)
+     >              - tmp1 * njac(1,5,i,j+1)
+
+               lhs(2,1,cc,i,j) =  tmp2 * fjac(2,1,i,j+1)
+     >              - tmp1 * njac(2,1,i,j+1)
+               lhs(2,2,cc,i,j) =  tmp2 * fjac(2,2,i,j+1)
+     >              - tmp1 * njac(2,2,i,j+1)
+     >              - tmp1 * dy2
+               lhs(2,3,cc,i,j) =  tmp2 * fjac(2,3,i,j+1)
+     >              - tmp1 * njac(2,3,i,j+1)
+               lhs(2,4,cc,i,j) =  tmp2 * fjac(2,4,i,j+1)
+     >              - tmp1 * njac(2,4,i,j+1)
+               lhs(2,5,cc,i,j) =  tmp2 * fjac(2,5,i,j+1)
+     >              - tmp1 * njac(2,5,i,j+1)
+
+               lhs(3,1,cc,i,j) =  tmp2 * fjac(3,1,i,j+1)
+     >              - tmp1 * njac(3,1,i,j+1)
+               lhs(3,2,cc,i,j) =  tmp2 * fjac(3,2,i,j+1)
+     >              - tmp1 * njac(3,2,i,j+1)
+               lhs(3,3,cc,i,j) =  tmp2 * fjac(3,3,i,j+1)
+     >              - tmp1 * njac(3,3,i,j+1)
+     >              - tmp1 * dy3
+               lhs(3,4,cc,i,j) =  tmp2 * fjac(3,4,i,j+1)
+     >              - tmp1 * njac(3,4,i,j+1)
+               lhs(3,5,cc,i,j) =  tmp2 * fjac(3,5,i,j+1)
+     >              - tmp1 * njac(3,5,i,j+1)
+
+               lhs(4,1,cc,i,j) =  tmp2 * fjac(4,1,i,j+1)
+     >              - tmp1 * njac(4,1,i,j+1)
+               lhs(4,2,cc,i,j) =  tmp2 * fjac(4,2,i,j+1)
+     >              - tmp1 * njac(4,2,i,j+1)
+               lhs(4,3,cc,i,j) =  tmp2 * fjac(4,3,i,j+1)
+     >              - tmp1 * njac(4,3,i,j+1)
+               lhs(4,4,cc,i,j) =  tmp2 * fjac(4,4,i,j+1)
+     >              - tmp1 * njac(4,4,i,j+1)
+     >              - tmp1 * dy4
+               lhs(4,5,cc,i,j) =  tmp2 * fjac(4,5,i,j+1)
+     >              - tmp1 * njac(4,5,i,j+1)
+
+               lhs(5,1,cc,i,j) =  tmp2 * fjac(5,1,i,j+1)
+     >              - tmp1 * njac(5,1,i,j+1)
+               lhs(5,2,cc,i,j) =  tmp2 * fjac(5,2,i,j+1)
+     >              - tmp1 * njac(5,2,i,j+1)
+               lhs(5,3,cc,i,j) =  tmp2 * fjac(5,3,i,j+1)
+     >              - tmp1 * njac(5,3,i,j+1)
+               lhs(5,4,cc,i,j) =  tmp2 * fjac(5,4,i,j+1)
+     >              - tmp1 * njac(5,4,i,j+1)
+               lhs(5,5,cc,i,j) =  tmp2 * fjac(5,5,i,j+1)
+     >              - tmp1 * njac(5,5,i,j+1)
+     >              - tmp1 * dy5
+
+            enddo
+         enddo
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     performs gaussian elimination on this cell.
+c     
+c     assumes that unpacking routines for non-first cells 
+c     preload C' and rhs' from previous cell.
+c     
+c     assumed send happens outside this routine, but that
+c     c'(JMAX) and rhs'(JMAX) will be sent to next cell
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     multiply c(i,0,k) by b_inverse and copy back to c
+c     multiply rhs(0) by b_inverse(0) and copy to rhs
+c---------------------------------------------------------------------
+!dir$ ivdep
+         do i = 1, grid_points(1)-2
+            call binvcrhs( lhs(1,1,bb,i,0),
+     >                        lhs(1,1,cc,i,0),
+     >                        rhs(1,i,0,k) )
+         enddo
+
+c---------------------------------------------------------------------
+c     begin inner most do loop
+c     do all the elements of the cell unless last 
+c---------------------------------------------------------------------
+         do j=1,jsize-1
+!dir$ ivdep
+            do i = 1, grid_points(1)-2
+
+c---------------------------------------------------------------------
+c     subtract A*lhs_vector(j-1) from lhs_vector(j)
+c     
+c     rhs(j) = rhs(j) - A*rhs(j-1)
+c---------------------------------------------------------------------
+               call matvec_sub(lhs(1,1,aa,i,j),
+     >                         rhs(1,i,j-1,k),rhs(1,i,j,k))
+
+c---------------------------------------------------------------------
+c     B(j) = B(j) - C(j-1)*A(j)
+c---------------------------------------------------------------------
+               call matmul_sub(lhs(1,1,aa,i,j),
+     >                         lhs(1,1,cc,i,j-1),
+     >                         lhs(1,1,bb,i,j))
+
+c---------------------------------------------------------------------
+c     multiply c(i,j,k) by b_inverse and copy back to c
+c     multiply rhs(i,j,k) by b_inverse(i,j,k) and copy to rhs
+c---------------------------------------------------------------------
+               call binvcrhs( lhs(1,1,bb,i,j),
+     >                        lhs(1,1,cc,i,j),
+     >                        rhs(1,i,j,k) )
+
+            enddo
+         enddo
+
+
+c---------------------------------------------------------------------
+c     rhs(jsize) = rhs(jsize) - A*rhs(jsize-1)
+c---------------------------------------------------------------------
+!dir$ ivdep
+         do i = 1, grid_points(1)-2
+            call matvec_sub(lhs(1,1,aa,i,jsize),
+     >                         rhs(1,i,jsize-1,k),rhs(1,i,jsize,k))
+
+c---------------------------------------------------------------------
+c     B(jsize) = B(jsize) - C(jsize-1)*A(jsize)
+c     call matmul_sub(aa,i,jsize,k,c,
+c     $              cc,i,jsize-1,k,c,bb,i,jsize,k)
+c---------------------------------------------------------------------
+            call matmul_sub(lhs(1,1,aa,i,jsize),
+     >                         lhs(1,1,cc,i,jsize-1),
+     >                         lhs(1,1,bb,i,jsize))
+
+c---------------------------------------------------------------------
+c     multiply rhs(jsize) by b_inverse(jsize) and copy to rhs
+c---------------------------------------------------------------------
+            call binvrhs( lhs(1,1,bb,i,jsize),
+     >                       rhs(1,i,jsize,k) )
+         enddo
+
+
+c---------------------------------------------------------------------
+c     back solve: if last cell, then generate U(jsize)=rhs(jsize)
+c     else assume U(jsize) is loaded in un pack backsub_info
+c     so just use it
+c     after call u(jstart) will be sent to next cell
+c---------------------------------------------------------------------
+      
+         do j=jsize-1,0,-1
+            do i = 1, grid_points(1)-2
+               do m=1,BLOCK_SIZE
+                  do n=1,BLOCK_SIZE
+                     rhs(m,i,j,k) = rhs(m,i,j,k) 
+     >                    - lhs(m,n,cc,i,j)*rhs(n,i,j+1,k)
+                  enddo
+               enddo
+            enddo
+         enddo
+
+      enddo
+      if (timeron) call timer_stop(t_ysolve)
+
+      return
+      end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/z_solve.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/z_solve.f
new file mode 100644
index 0000000..1d1906f
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/z_solve.f
@@ -0,0 +1,410 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine z_solve
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     Performs line solves in Z direction by first factoring
+c     the block-tridiagonal matrix into an upper triangular matrix, 
+c     and then performing back substitution to solve for the unknown
+c     vectors of each line.  
+c     
+c     Make sure we treat elements zero to cell_size in the direction
+c     of the sweep.
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'work_lhs.h'
+
+      integer i, j, k, m, n, ksize
+      
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      if (timeron) call timer_start(t_zsolve)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     This function computes the left hand side for the three z-factors   
+c---------------------------------------------------------------------
+
+      ksize = grid_points(3)-1
+
+c---------------------------------------------------------------------
+c     Compute the indices for storing the block-diagonal matrix;
+c     determine c (labeled f) and s jacobians
+c---------------------------------------------------------------------
+      do j = 1, grid_points(2)-2
+         do i = 1, grid_points(1)-2
+            do k = 0, ksize
+
+               tmp1 = 1.0d+00 / u(1,i,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               fjac(1,1,k) = 0.0d+00
+               fjac(1,2,k) = 0.0d+00
+               fjac(1,3,k) = 0.0d+00
+               fjac(1,4,k) = 1.0d+00
+               fjac(1,5,k) = 0.0d+00
+
+               fjac(2,1,k) = - ( u(2,i,j,k)*u(4,i,j,k) ) 
+     >              * tmp2 
+               fjac(2,2,k) = u(4,i,j,k) * tmp1
+               fjac(2,3,k) = 0.0d+00
+               fjac(2,4,k) = u(2,i,j,k) * tmp1
+               fjac(2,5,k) = 0.0d+00
+
+               fjac(3,1,k) = - ( u(3,i,j,k)*u(4,i,j,k) )
+     >              * tmp2 
+               fjac(3,2,k) = 0.0d+00
+               fjac(3,3,k) = u(4,i,j,k) * tmp1
+               fjac(3,4,k) = u(3,i,j,k) * tmp1
+               fjac(3,5,k) = 0.0d+00
+
+               fjac(4,1,k) = - (u(4,i,j,k)*u(4,i,j,k) * tmp2 ) 
+     >              + c2 * qs(i,j,k)
+               fjac(4,2,k) = - c2 *  u(2,i,j,k) * tmp1 
+               fjac(4,3,k) = - c2 *  u(3,i,j,k) * tmp1
+               fjac(4,4,k) = ( 2.0d+00 - c2 )
+     >              *  u(4,i,j,k) * tmp1 
+               fjac(4,5,k) = c2
+
+               fjac(5,1,k) = ( c2 * 2.0d0 * square(i,j,k) 
+     >              - c1 * u(5,i,j,k) )
+     >              * u(4,i,j,k) * tmp2
+               fjac(5,2,k) = - c2 * ( u(2,i,j,k)*u(4,i,j,k) )
+     >              * tmp2 
+               fjac(5,3,k) = - c2 * ( u(3,i,j,k)*u(4,i,j,k) )
+     >              * tmp2
+               fjac(5,4,k) = c1 * ( u(5,i,j,k) * tmp1 )
+     >              - c2
+     >              * ( qs(i,j,k)
+     >              + u(4,i,j,k)*u(4,i,j,k) * tmp2 )
+               fjac(5,5,k) = c1 * u(4,i,j,k) * tmp1
+
+               njac(1,1,k) = 0.0d+00
+               njac(1,2,k) = 0.0d+00
+               njac(1,3,k) = 0.0d+00
+               njac(1,4,k) = 0.0d+00
+               njac(1,5,k) = 0.0d+00
+
+               njac(2,1,k) = - c3c4 * tmp2 * u(2,i,j,k)
+               njac(2,2,k) =   c3c4 * tmp1
+               njac(2,3,k) =   0.0d+00
+               njac(2,4,k) =   0.0d+00
+               njac(2,5,k) =   0.0d+00
+
+               njac(3,1,k) = - c3c4 * tmp2 * u(3,i,j,k)
+               njac(3,2,k) =   0.0d+00
+               njac(3,3,k) =   c3c4 * tmp1
+               njac(3,4,k) =   0.0d+00
+               njac(3,5,k) =   0.0d+00
+
+               njac(4,1,k) = - con43 * c3c4 * tmp2 * u(4,i,j,k)
+               njac(4,2,k) =   0.0d+00
+               njac(4,3,k) =   0.0d+00
+               njac(4,4,k) =   con43 * c3 * c4 * tmp1
+               njac(4,5,k) =   0.0d+00
+
+               njac(5,1,k) = - (  c3c4
+     >              - c1345 ) * tmp3 * (u(2,i,j,k)**2)
+     >              - ( c3c4 - c1345 ) * tmp3 * (u(3,i,j,k)**2)
+     >              - ( con43 * c3c4
+     >              - c1345 ) * tmp3 * (u(4,i,j,k)**2)
+     >              - c1345 * tmp2 * u(5,i,j,k)
+
+               njac(5,2,k) = (  c3c4 - c1345 ) * tmp2 * u(2,i,j,k)
+               njac(5,3,k) = (  c3c4 - c1345 ) * tmp2 * u(3,i,j,k)
+               njac(5,4,k) = ( con43 * c3c4
+     >              - c1345 ) * tmp2 * u(4,i,j,k)
+               njac(5,5,k) = ( c1345 )* tmp1
+
+
+            enddo
+
+c---------------------------------------------------------------------
+c     now jacobians set, so form left hand side in z direction
+c---------------------------------------------------------------------
+            call lhsinit(lhs, ksize)
+            do k = 1, ksize-1
+
+               tmp1 = dt * tz1
+               tmp2 = dt * tz2
+
+               lhs(1,1,aa,k) = - tmp2 * fjac(1,1,k-1)
+     >              - tmp1 * njac(1,1,k-1)
+     >              - tmp1 * dz1 
+               lhs(1,2,aa,k) = - tmp2 * fjac(1,2,k-1)
+     >              - tmp1 * njac(1,2,k-1)
+               lhs(1,3,aa,k) = - tmp2 * fjac(1,3,k-1)
+     >              - tmp1 * njac(1,3,k-1)
+               lhs(1,4,aa,k) = - tmp2 * fjac(1,4,k-1)
+     >              - tmp1 * njac(1,4,k-1)
+               lhs(1,5,aa,k) = - tmp2 * fjac(1,5,k-1)
+     >              - tmp1 * njac(1,5,k-1)
+
+               lhs(2,1,aa,k) = - tmp2 * fjac(2,1,k-1)
+     >              - tmp1 * njac(2,1,k-1)
+               lhs(2,2,aa,k) = - tmp2 * fjac(2,2,k-1)
+     >              - tmp1 * njac(2,2,k-1)
+     >              - tmp1 * dz2
+               lhs(2,3,aa,k) = - tmp2 * fjac(2,3,k-1)
+     >              - tmp1 * njac(2,3,k-1)
+               lhs(2,4,aa,k) = - tmp2 * fjac(2,4,k-1)
+     >              - tmp1 * njac(2,4,k-1)
+               lhs(2,5,aa,k) = - tmp2 * fjac(2,5,k-1)
+     >              - tmp1 * njac(2,5,k-1)
+
+               lhs(3,1,aa,k) = - tmp2 * fjac(3,1,k-1)
+     >              - tmp1 * njac(3,1,k-1)
+               lhs(3,2,aa,k) = - tmp2 * fjac(3,2,k-1)
+     >              - tmp1 * njac(3,2,k-1)
+               lhs(3,3,aa,k) = - tmp2 * fjac(3,3,k-1)
+     >              - tmp1 * njac(3,3,k-1)
+     >              - tmp1 * dz3 
+               lhs(3,4,aa,k) = - tmp2 * fjac(3,4,k-1)
+     >              - tmp1 * njac(3,4,k-1)
+               lhs(3,5,aa,k) = - tmp2 * fjac(3,5,k-1)
+     >              - tmp1 * njac(3,5,k-1)
+
+               lhs(4,1,aa,k) = - tmp2 * fjac(4,1,k-1)
+     >              - tmp1 * njac(4,1,k-1)
+               lhs(4,2,aa,k) = - tmp2 * fjac(4,2,k-1)
+     >              - tmp1 * njac(4,2,k-1)
+               lhs(4,3,aa,k) = - tmp2 * fjac(4,3,k-1)
+     >              - tmp1 * njac(4,3,k-1)
+               lhs(4,4,aa,k) = - tmp2 * fjac(4,4,k-1)
+     >              - tmp1 * njac(4,4,k-1)
+     >              - tmp1 * dz4
+               lhs(4,5,aa,k) = - tmp2 * fjac(4,5,k-1)
+     >              - tmp1 * njac(4,5,k-1)
+
+               lhs(5,1,aa,k) = - tmp2 * fjac(5,1,k-1)
+     >              - tmp1 * njac(5,1,k-1)
+               lhs(5,2,aa,k) = - tmp2 * fjac(5,2,k-1)
+     >              - tmp1 * njac(5,2,k-1)
+               lhs(5,3,aa,k) = - tmp2 * fjac(5,3,k-1)
+     >              - tmp1 * njac(5,3,k-1)
+               lhs(5,4,aa,k) = - tmp2 * fjac(5,4,k-1)
+     >              - tmp1 * njac(5,4,k-1)
+               lhs(5,5,aa,k) = - tmp2 * fjac(5,5,k-1)
+     >              - tmp1 * njac(5,5,k-1)
+     >              - tmp1 * dz5
+
+               lhs(1,1,bb,k) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(1,1,k)
+     >              + tmp1 * 2.0d+00 * dz1
+               lhs(1,2,bb,k) = tmp1 * 2.0d+00 * njac(1,2,k)
+               lhs(1,3,bb,k) = tmp1 * 2.0d+00 * njac(1,3,k)
+               lhs(1,4,bb,k) = tmp1 * 2.0d+00 * njac(1,4,k)
+               lhs(1,5,bb,k) = tmp1 * 2.0d+00 * njac(1,5,k)
+
+               lhs(2,1,bb,k) = tmp1 * 2.0d+00 * njac(2,1,k)
+               lhs(2,2,bb,k) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(2,2,k)
+     >              + tmp1 * 2.0d+00 * dz2
+               lhs(2,3,bb,k) = tmp1 * 2.0d+00 * njac(2,3,k)
+               lhs(2,4,bb,k) = tmp1 * 2.0d+00 * njac(2,4,k)
+               lhs(2,5,bb,k) = tmp1 * 2.0d+00 * njac(2,5,k)
+
+               lhs(3,1,bb,k) = tmp1 * 2.0d+00 * njac(3,1,k)
+               lhs(3,2,bb,k) = tmp1 * 2.0d+00 * njac(3,2,k)
+               lhs(3,3,bb,k) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(3,3,k)
+     >              + tmp1 * 2.0d+00 * dz3
+               lhs(3,4,bb,k) = tmp1 * 2.0d+00 * njac(3,4,k)
+               lhs(3,5,bb,k) = tmp1 * 2.0d+00 * njac(3,5,k)
+
+               lhs(4,1,bb,k) = tmp1 * 2.0d+00 * njac(4,1,k)
+               lhs(4,2,bb,k) = tmp1 * 2.0d+00 * njac(4,2,k)
+               lhs(4,3,bb,k) = tmp1 * 2.0d+00 * njac(4,3,k)
+               lhs(4,4,bb,k) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(4,4,k)
+     >              + tmp1 * 2.0d+00 * dz4
+               lhs(4,5,bb,k) = tmp1 * 2.0d+00 * njac(4,5,k)
+
+               lhs(5,1,bb,k) = tmp1 * 2.0d+00 * njac(5,1,k)
+               lhs(5,2,bb,k) = tmp1 * 2.0d+00 * njac(5,2,k)
+               lhs(5,3,bb,k) = tmp1 * 2.0d+00 * njac(5,3,k)
+               lhs(5,4,bb,k) = tmp1 * 2.0d+00 * njac(5,4,k)
+               lhs(5,5,bb,k) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(5,5,k) 
+     >              + tmp1 * 2.0d+00 * dz5
+
+               lhs(1,1,cc,k) =  tmp2 * fjac(1,1,k+1)
+     >              - tmp1 * njac(1,1,k+1)
+     >              - tmp1 * dz1
+               lhs(1,2,cc,k) =  tmp2 * fjac(1,2,k+1)
+     >              - tmp1 * njac(1,2,k+1)
+               lhs(1,3,cc,k) =  tmp2 * fjac(1,3,k+1)
+     >              - tmp1 * njac(1,3,k+1)
+               lhs(1,4,cc,k) =  tmp2 * fjac(1,4,k+1)
+     >              - tmp1 * njac(1,4,k+1)
+               lhs(1,5,cc,k) =  tmp2 * fjac(1,5,k+1)
+     >              - tmp1 * njac(1,5,k+1)
+
+               lhs(2,1,cc,k) =  tmp2 * fjac(2,1,k+1)
+     >              - tmp1 * njac(2,1,k+1)
+               lhs(2,2,cc,k) =  tmp2 * fjac(2,2,k+1)
+     >              - tmp1 * njac(2,2,k+1)
+     >              - tmp1 * dz2
+               lhs(2,3,cc,k) =  tmp2 * fjac(2,3,k+1)
+     >              - tmp1 * njac(2,3,k+1)
+               lhs(2,4,cc,k) =  tmp2 * fjac(2,4,k+1)
+     >              - tmp1 * njac(2,4,k+1)
+               lhs(2,5,cc,k) =  tmp2 * fjac(2,5,k+1)
+     >              - tmp1 * njac(2,5,k+1)
+
+               lhs(3,1,cc,k) =  tmp2 * fjac(3,1,k+1)
+     >              - tmp1 * njac(3,1,k+1)
+               lhs(3,2,cc,k) =  tmp2 * fjac(3,2,k+1)
+     >              - tmp1 * njac(3,2,k+1)
+               lhs(3,3,cc,k) =  tmp2 * fjac(3,3,k+1)
+     >              - tmp1 * njac(3,3,k+1)
+     >              - tmp1 * dz3
+               lhs(3,4,cc,k) =  tmp2 * fjac(3,4,k+1)
+     >              - tmp1 * njac(3,4,k+1)
+               lhs(3,5,cc,k) =  tmp2 * fjac(3,5,k+1)
+     >              - tmp1 * njac(3,5,k+1)
+
+               lhs(4,1,cc,k) =  tmp2 * fjac(4,1,k+1)
+     >              - tmp1 * njac(4,1,k+1)
+               lhs(4,2,cc,k) =  tmp2 * fjac(4,2,k+1)
+     >              - tmp1 * njac(4,2,k+1)
+               lhs(4,3,cc,k) =  tmp2 * fjac(4,3,k+1)
+     >              - tmp1 * njac(4,3,k+1)
+               lhs(4,4,cc,k) =  tmp2 * fjac(4,4,k+1)
+     >              - tmp1 * njac(4,4,k+1)
+     >              - tmp1 * dz4
+               lhs(4,5,cc,k) =  tmp2 * fjac(4,5,k+1)
+     >              - tmp1 * njac(4,5,k+1)
+
+               lhs(5,1,cc,k) =  tmp2 * fjac(5,1,k+1)
+     >              - tmp1 * njac(5,1,k+1)
+               lhs(5,2,cc,k) =  tmp2 * fjac(5,2,k+1)
+     >              - tmp1 * njac(5,2,k+1)
+               lhs(5,3,cc,k) =  tmp2 * fjac(5,3,k+1)
+     >              - tmp1 * njac(5,3,k+1)
+               lhs(5,4,cc,k) =  tmp2 * fjac(5,4,k+1)
+     >              - tmp1 * njac(5,4,k+1)
+               lhs(5,5,cc,k) =  tmp2 * fjac(5,5,k+1)
+     >              - tmp1 * njac(5,5,k+1)
+     >              - tmp1 * dz5
+
+            enddo
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     performs Gaussian elimination on this cell.
+c     
+c     assumes that unpacking routines for non-first cells 
+c     preload C' and rhs' from previous cell.
+c     
+c     assumed send happens outside this routine, but that
+c     c'(KMAX) and rhs'(KMAX) will be sent to next cell.
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     outermost do loops - sweeping in i direction
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     multiply c(i,j,0) by b_inverse and copy back to c
+c     multiply rhs(0) by b_inverse(0) and copy to rhs
+c---------------------------------------------------------------------
+            call binvcrhs( lhs(1,1,bb,0),
+     >                        lhs(1,1,cc,0),
+     >                        rhs(1,i,j,0) )
+
+
+c---------------------------------------------------------------------
+c     begin innermost do loop
+c     do all the elements of the cell unless last 
+c---------------------------------------------------------------------
+            do k=1,ksize-1
+
+c---------------------------------------------------------------------
+c     subtract A*lhs_vector(k-1) from lhs_vector(k)
+c     
+c     rhs(k) = rhs(k) - A*rhs(k-1)
+c---------------------------------------------------------------------
+               call matvec_sub(lhs(1,1,aa,k),
+     >                         rhs(1,i,j,k-1),rhs(1,i,j,k))
+
+c---------------------------------------------------------------------
+c     B(k) = B(k) - C(k-1)*A(k)
+c     call matmul_sub(aa,i,j,k,c,cc,i,j,k-1,c,bb,i,j,k)
+c---------------------------------------------------------------------
+               call matmul_sub(lhs(1,1,aa,k),
+     >                         lhs(1,1,cc,k-1),
+     >                         lhs(1,1,bb,k))
+
+c---------------------------------------------------------------------
+c     multiply c(i,j,k) by b_inverse and copy back to c
+c     multiply rhs(i,j,k) by b_inverse(i,j,k) and copy to rhs
+c---------------------------------------------------------------------
+               call binvcrhs( lhs(1,1,bb,k),
+     >                        lhs(1,1,cc,k),
+     >                        rhs(1,i,j,k) )
+
+            enddo
+
+c---------------------------------------------------------------------
+c     Now finish up special cases for last cell
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     rhs(ksize) = rhs(ksize) - A*rhs(ksize-1)
+c---------------------------------------------------------------------
+            call matvec_sub(lhs(1,1,aa,ksize),
+     >                         rhs(1,i,j,ksize-1),rhs(1,i,j,ksize))
+
+c---------------------------------------------------------------------
+c     B(ksize) = B(ksize) - C(ksize-1)*A(ksize)
+c     call matmul_sub(aa,i,j,ksize,c,
+c     $              cc,i,j,ksize-1,c,bb,i,j,ksize)
+c---------------------------------------------------------------------
+            call matmul_sub(lhs(1,1,aa,ksize),
+     >                         lhs(1,1,cc,ksize-1),
+     >                         lhs(1,1,bb,ksize))
+
+c---------------------------------------------------------------------
+c     multiply rhs(ksize) by b_inverse(ksize) and copy to rhs
+c---------------------------------------------------------------------
+            call binvrhs( lhs(1,1,bb,ksize),
+     >                       rhs(1,i,j,ksize) )
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     back solve: if last cell, then generate U(ksize)=rhs(ksize)
+c     else assume U(ksize) is loaded in unpack backsub_info
+c     so just use it
+c     after call u(kstart) will be sent to next cell
+c---------------------------------------------------------------------
+
+            do k=ksize-1,0,-1
+               do m=1,BLOCK_SIZE
+                  do n=1,BLOCK_SIZE
+                     rhs(m,i,j,k) = rhs(m,i,j,k) 
+     >                    - lhs(m,n,cc,k)*rhs(n,i,j,k+1)
+                  enddo
+               enddo
+            enddo
+
+         enddo
+      enddo
+      if (timeron) call timer_stop(t_zsolve)
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/z_solve_vec.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/z_solve_vec.f
new file mode 100644
index 0000000..a0bd4b6
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/BT/z_solve_vec.f
@@ -0,0 +1,441 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine z_solve
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     Performs line solves in Z direction by first factoring
+c     the block-tridiagonal matrix into an upper triangular matrix, 
+c     and then performing back substitution to solve for the unknown
+c     vectors of each line.  
+c     
+c     Make sure we treat elements zero to cell_size in the direction
+c     of the sweep.
+c---------------------------------------------------------------------
+
+      include 'header.h'
+      include 'work_lhs_vec.h'
+
+      integer i, j, k, m, n, ksize
+      
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      if (timeron) call timer_start(t_zsolve)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     This function computes the left hand side for the three z-factors 
+c---------------------------------------------------------------------
+
+      ksize = grid_points(3)-1
+
+c---------------------------------------------------------------------
+c     Compute the indices for storing the block-diagonal matrix;
+c     determine c (labeled f) and s jacobians
+c---------------------------------------------------------------------
+      do j = 1, grid_points(2)-2
+         do k = 0, ksize
+            do i = 1, grid_points(1)-2
+
+               tmp1 = 1.0d+00 / u(1,i,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               fjac(1,1,i,k) = 0.0d+00
+               fjac(1,2,i,k) = 0.0d+00
+               fjac(1,3,i,k) = 0.0d+00
+               fjac(1,4,i,k) = 1.0d+00
+               fjac(1,5,i,k) = 0.0d+00
+
+               fjac(2,1,i,k) = - ( u(2,i,j,k)*u(4,i,j,k) ) 
+     >              * tmp2 
+               fjac(2,2,i,k) = u(4,i,j,k) * tmp1
+               fjac(2,3,i,k) = 0.0d+00
+               fjac(2,4,i,k) = u(2,i,j,k) * tmp1
+               fjac(2,5,i,k) = 0.0d+00
+
+               fjac(3,1,i,k) = - ( u(3,i,j,k)*u(4,i,j,k) )
+     >              * tmp2 
+               fjac(3,2,i,k) = 0.0d+00
+               fjac(3,3,i,k) = u(4,i,j,k) * tmp1
+               fjac(3,4,i,k) = u(3,i,j,k) * tmp1
+               fjac(3,5,i,k) = 0.0d+00
+
+               fjac(4,1,i,k) = - (u(4,i,j,k)*u(4,i,j,k) * tmp2 ) 
+     >              + c2 * qs(i,j,k)
+               fjac(4,2,i,k) = - c2 *  u(2,i,j,k) * tmp1 
+               fjac(4,3,i,k) = - c2 *  u(3,i,j,k) * tmp1
+               fjac(4,4,i,k) = ( 2.0d+00 - c2 )
+     >              *  u(4,i,j,k) * tmp1 
+               fjac(4,5,i,k) = c2
+
+               fjac(5,1,i,k) = ( c2 * 2.0d0 * square(i,j,k) 
+     >              - c1 * u(5,i,j,k) )
+     >              * u(4,i,j,k) * tmp2
+               fjac(5,2,i,k) = - c2 * ( u(2,i,j,k)*u(4,i,j,k) )
+     >              * tmp2 
+               fjac(5,3,i,k) = - c2 * ( u(3,i,j,k)*u(4,i,j,k) )
+     >              * tmp2
+               fjac(5,4,i,k) = c1 * ( u(5,i,j,k) * tmp1 )
+     >              - c2
+     >              * ( qs(i,j,k)
+     >              + u(4,i,j,k)*u(4,i,j,k) * tmp2 )
+               fjac(5,5,i,k) = c1 * u(4,i,j,k) * tmp1
+
+               njac(1,1,i,k) = 0.0d+00
+               njac(1,2,i,k) = 0.0d+00
+               njac(1,3,i,k) = 0.0d+00
+               njac(1,4,i,k) = 0.0d+00
+               njac(1,5,i,k) = 0.0d+00
+
+               njac(2,1,i,k) = - c3c4 * tmp2 * u(2,i,j,k)
+               njac(2,2,i,k) =   c3c4 * tmp1
+               njac(2,3,i,k) =   0.0d+00
+               njac(2,4,i,k) =   0.0d+00
+               njac(2,5,i,k) =   0.0d+00
+
+               njac(3,1,i,k) = - c3c4 * tmp2 * u(3,i,j,k)
+               njac(3,2,i,k) =   0.0d+00
+               njac(3,3,i,k) =   c3c4 * tmp1
+               njac(3,4,i,k) =   0.0d+00
+               njac(3,5,i,k) =   0.0d+00
+
+               njac(4,1,i,k) = - con43 * c3c4 * tmp2 * u(4,i,j,k)
+               njac(4,2,i,k) =   0.0d+00
+               njac(4,3,i,k) =   0.0d+00
+               njac(4,4,i,k) =   con43 * c3 * c4 * tmp1
+               njac(4,5,i,k) =   0.0d+00
+
+               njac(5,1,i,k) = - (  c3c4
+     >              - c1345 ) * tmp3 * (u(2,i,j,k)**2)
+     >              - ( c3c4 - c1345 ) * tmp3 * (u(3,i,j,k)**2)
+     >              - ( con43 * c3c4
+     >              - c1345 ) * tmp3 * (u(4,i,j,k)**2)
+     >              - c1345 * tmp2 * u(5,i,j,k)
+
+               njac(5,2,i,k) = (  c3c4 - c1345 ) * tmp2 * u(2,i,j,k)
+               njac(5,3,i,k) = (  c3c4 - c1345 ) * tmp2 * u(3,i,j,k)
+               njac(5,4,i,k) = ( con43 * c3c4
+     >              - c1345 ) * tmp2 * u(4,i,j,k)
+               njac(5,5,i,k) = ( c1345 )* tmp1
+
+
+            enddo
+         enddo
+
+c---------------------------------------------------------------------
+c     zero the whole left hand side for starters
+c     set all diagonal values to 1. This is overkill, but convenient
+c---------------------------------------------------------------------
+         do i = 1, grid_points(1)-2
+            do m = 1, 5
+               do n = 1, 5
+                  lhs(m,n,aa,i,0) = 0.0d0
+                  lhs(m,n,bb,i,0) = 0.0d0
+                  lhs(m,n,cc,i,0) = 0.0d0
+                  lhs(m,n,aa,i,ksize) = 0.0d0
+                  lhs(m,n,bb,i,ksize) = 0.0d0
+                  lhs(m,n,cc,i,ksize) = 0.0d0
+               end do
+               lhs(m,m,bb,i,0) = 1.0d0
+               lhs(m,m,bb,i,ksize) = 1.0d0
+            end do
+         enddo
+
+c---------------------------------------------------------------------
+c     now jacobians set, so form left hand side in z direction
+c---------------------------------------------------------------------
+         do k = 1, ksize-1
+            do i = 1, grid_points(1)-2
+
+               tmp1 = dt * tz1
+               tmp2 = dt * tz2
+
+               lhs(1,1,aa,i,k) = - tmp2 * fjac(1,1,i,k-1)
+     >              - tmp1 * njac(1,1,i,k-1)
+     >              - tmp1 * dz1 
+               lhs(1,2,aa,i,k) = - tmp2 * fjac(1,2,i,k-1)
+     >              - tmp1 * njac(1,2,i,k-1)
+               lhs(1,3,aa,i,k) = - tmp2 * fjac(1,3,i,k-1)
+     >              - tmp1 * njac(1,3,i,k-1)
+               lhs(1,4,aa,i,k) = - tmp2 * fjac(1,4,i,k-1)
+     >              - tmp1 * njac(1,4,i,k-1)
+               lhs(1,5,aa,i,k) = - tmp2 * fjac(1,5,i,k-1)
+     >              - tmp1 * njac(1,5,i,k-1)
+
+               lhs(2,1,aa,i,k) = - tmp2 * fjac(2,1,i,k-1)
+     >              - tmp1 * njac(2,1,i,k-1)
+               lhs(2,2,aa,i,k) = - tmp2 * fjac(2,2,i,k-1)
+     >              - tmp1 * njac(2,2,i,k-1)
+     >              - tmp1 * dz2
+               lhs(2,3,aa,i,k) = - tmp2 * fjac(2,3,i,k-1)
+     >              - tmp1 * njac(2,3,i,k-1)
+               lhs(2,4,aa,i,k) = - tmp2 * fjac(2,4,i,k-1)
+     >              - tmp1 * njac(2,4,i,k-1)
+               lhs(2,5,aa,i,k) = - tmp2 * fjac(2,5,i,k-1)
+     >              - tmp1 * njac(2,5,i,k-1)
+
+               lhs(3,1,aa,i,k) = - tmp2 * fjac(3,1,i,k-1)
+     >              - tmp1 * njac(3,1,i,k-1)
+               lhs(3,2,aa,i,k) = - tmp2 * fjac(3,2,i,k-1)
+     >              - tmp1 * njac(3,2,i,k-1)
+               lhs(3,3,aa,i,k) = - tmp2 * fjac(3,3,i,k-1)
+     >              - tmp1 * njac(3,3,i,k-1)
+     >              - tmp1 * dz3 
+               lhs(3,4,aa,i,k) = - tmp2 * fjac(3,4,i,k-1)
+     >              - tmp1 * njac(3,4,i,k-1)
+               lhs(3,5,aa,i,k) = - tmp2 * fjac(3,5,i,k-1)
+     >              - tmp1 * njac(3,5,i,k-1)
+
+               lhs(4,1,aa,i,k) = - tmp2 * fjac(4,1,i,k-1)
+     >              - tmp1 * njac(4,1,i,k-1)
+               lhs(4,2,aa,i,k) = - tmp2 * fjac(4,2,i,k-1)
+     >              - tmp1 * njac(4,2,i,k-1)
+               lhs(4,3,aa,i,k) = - tmp2 * fjac(4,3,i,k-1)
+     >              - tmp1 * njac(4,3,i,k-1)
+               lhs(4,4,aa,i,k) = - tmp2 * fjac(4,4,i,k-1)
+     >              - tmp1 * njac(4,4,i,k-1)
+     >              - tmp1 * dz4
+               lhs(4,5,aa,i,k) = - tmp2 * fjac(4,5,i,k-1)
+     >              - tmp1 * njac(4,5,i,k-1)
+
+               lhs(5,1,aa,i,k) = - tmp2 * fjac(5,1,i,k-1)
+     >              - tmp1 * njac(5,1,i,k-1)
+               lhs(5,2,aa,i,k) = - tmp2 * fjac(5,2,i,k-1)
+     >              - tmp1 * njac(5,2,i,k-1)
+               lhs(5,3,aa,i,k) = - tmp2 * fjac(5,3,i,k-1)
+     >              - tmp1 * njac(5,3,i,k-1)
+               lhs(5,4,aa,i,k) = - tmp2 * fjac(5,4,i,k-1)
+     >              - tmp1 * njac(5,4,i,k-1)
+               lhs(5,5,aa,i,k) = - tmp2 * fjac(5,5,i,k-1)
+     >              - tmp1 * njac(5,5,i,k-1)
+     >              - tmp1 * dz5
+
+               lhs(1,1,bb,i,k) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(1,1,i,k)
+     >              + tmp1 * 2.0d+00 * dz1
+               lhs(1,2,bb,i,k) = tmp1 * 2.0d+00 * njac(1,2,i,k)
+               lhs(1,3,bb,i,k) = tmp1 * 2.0d+00 * njac(1,3,i,k)
+               lhs(1,4,bb,i,k) = tmp1 * 2.0d+00 * njac(1,4,i,k)
+               lhs(1,5,bb,i,k) = tmp1 * 2.0d+00 * njac(1,5,i,k)
+
+               lhs(2,1,bb,i,k) = tmp1 * 2.0d+00 * njac(2,1,i,k)
+               lhs(2,2,bb,i,k) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(2,2,i,k)
+     >              + tmp1 * 2.0d+00 * dz2
+               lhs(2,3,bb,i,k) = tmp1 * 2.0d+00 * njac(2,3,i,k)
+               lhs(2,4,bb,i,k) = tmp1 * 2.0d+00 * njac(2,4,i,k)
+               lhs(2,5,bb,i,k) = tmp1 * 2.0d+00 * njac(2,5,i,k)
+
+               lhs(3,1,bb,i,k) = tmp1 * 2.0d+00 * njac(3,1,i,k)
+               lhs(3,2,bb,i,k) = tmp1 * 2.0d+00 * njac(3,2,i,k)
+               lhs(3,3,bb,i,k) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(3,3,i,k)
+     >              + tmp1 * 2.0d+00 * dz3
+               lhs(3,4,bb,i,k) = tmp1 * 2.0d+00 * njac(3,4,i,k)
+               lhs(3,5,bb,i,k) = tmp1 * 2.0d+00 * njac(3,5,i,k)
+
+               lhs(4,1,bb,i,k) = tmp1 * 2.0d+00 * njac(4,1,i,k)
+               lhs(4,2,bb,i,k) = tmp1 * 2.0d+00 * njac(4,2,i,k)
+               lhs(4,3,bb,i,k) = tmp1 * 2.0d+00 * njac(4,3,i,k)
+               lhs(4,4,bb,i,k) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(4,4,i,k)
+     >              + tmp1 * 2.0d+00 * dz4
+               lhs(4,5,bb,i,k) = tmp1 * 2.0d+00 * njac(4,5,i,k)
+
+               lhs(5,1,bb,i,k) = tmp1 * 2.0d+00 * njac(5,1,i,k)
+               lhs(5,2,bb,i,k) = tmp1 * 2.0d+00 * njac(5,2,i,k)
+               lhs(5,3,bb,i,k) = tmp1 * 2.0d+00 * njac(5,3,i,k)
+               lhs(5,4,bb,i,k) = tmp1 * 2.0d+00 * njac(5,4,i,k)
+               lhs(5,5,bb,i,k) = 1.0d+00
+     >              + tmp1 * 2.0d+00 * njac(5,5,i,k) 
+     >              + tmp1 * 2.0d+00 * dz5
+
+               lhs(1,1,cc,i,k) =  tmp2 * fjac(1,1,i,k+1)
+     >              - tmp1 * njac(1,1,i,k+1)
+     >              - tmp1 * dz1
+               lhs(1,2,cc,i,k) =  tmp2 * fjac(1,2,i,k+1)
+     >              - tmp1 * njac(1,2,i,k+1)
+               lhs(1,3,cc,i,k) =  tmp2 * fjac(1,3,i,k+1)
+     >              - tmp1 * njac(1,3,i,k+1)
+               lhs(1,4,cc,i,k) =  tmp2 * fjac(1,4,i,k+1)
+     >              - tmp1 * njac(1,4,i,k+1)
+               lhs(1,5,cc,i,k) =  tmp2 * fjac(1,5,i,k+1)
+     >              - tmp1 * njac(1,5,i,k+1)
+
+               lhs(2,1,cc,i,k) =  tmp2 * fjac(2,1,i,k+1)
+     >              - tmp1 * njac(2,1,i,k+1)
+               lhs(2,2,cc,i,k) =  tmp2 * fjac(2,2,i,k+1)
+     >              - tmp1 * njac(2,2,i,k+1)
+     >              - tmp1 * dz2
+               lhs(2,3,cc,i,k) =  tmp2 * fjac(2,3,i,k+1)
+     >              - tmp1 * njac(2,3,i,k+1)
+               lhs(2,4,cc,i,k) =  tmp2 * fjac(2,4,i,k+1)
+     >              - tmp1 * njac(2,4,i,k+1)
+               lhs(2,5,cc,i,k) =  tmp2 * fjac(2,5,i,k+1)
+     >              - tmp1 * njac(2,5,i,k+1)
+
+               lhs(3,1,cc,i,k) =  tmp2 * fjac(3,1,i,k+1)
+     >              - tmp1 * njac(3,1,i,k+1)
+               lhs(3,2,cc,i,k) =  tmp2 * fjac(3,2,i,k+1)
+     >              - tmp1 * njac(3,2,i,k+1)
+               lhs(3,3,cc,i,k) =  tmp2 * fjac(3,3,i,k+1)
+     >              - tmp1 * njac(3,3,i,k+1)
+     >              - tmp1 * dz3
+               lhs(3,4,cc,i,k) =  tmp2 * fjac(3,4,i,k+1)
+     >              - tmp1 * njac(3,4,i,k+1)
+               lhs(3,5,cc,i,k) =  tmp2 * fjac(3,5,i,k+1)
+     >              - tmp1 * njac(3,5,i,k+1)
+
+               lhs(4,1,cc,i,k) =  tmp2 * fjac(4,1,i,k+1)
+     >              - tmp1 * njac(4,1,i,k+1)
+               lhs(4,2,cc,i,k) =  tmp2 * fjac(4,2,i,k+1)
+     >              - tmp1 * njac(4,2,i,k+1)
+               lhs(4,3,cc,i,k) =  tmp2 * fjac(4,3,i,k+1)
+     >              - tmp1 * njac(4,3,i,k+1)
+               lhs(4,4,cc,i,k) =  tmp2 * fjac(4,4,i,k+1)
+     >              - tmp1 * njac(4,4,i,k+1)
+     >              - tmp1 * dz4
+               lhs(4,5,cc,i,k) =  tmp2 * fjac(4,5,i,k+1)
+     >              - tmp1 * njac(4,5,i,k+1)
+
+               lhs(5,1,cc,i,k) =  tmp2 * fjac(5,1,i,k+1)
+     >              - tmp1 * njac(5,1,i,k+1)
+               lhs(5,2,cc,i,k) =  tmp2 * fjac(5,2,i,k+1)
+     >              - tmp1 * njac(5,2,i,k+1)
+               lhs(5,3,cc,i,k) =  tmp2 * fjac(5,3,i,k+1)
+     >              - tmp1 * njac(5,3,i,k+1)
+               lhs(5,4,cc,i,k) =  tmp2 * fjac(5,4,i,k+1)
+     >              - tmp1 * njac(5,4,i,k+1)
+               lhs(5,5,cc,i,k) =  tmp2 * fjac(5,5,i,k+1)
+     >              - tmp1 * njac(5,5,i,k+1)
+     >              - tmp1 * dz5
+
+            enddo
+         enddo
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     performs Gaussian elimination on this cell.
+c     
+c     assumes that unpacking routines for non-first cells 
+c     preload C' and rhs' from previous cell.
+c     
+c     assumed send happens outside this routine, but that
+c     c'(KMAX) and rhs'(KMAX) will be sent to next cell.
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     outermost do loops - sweeping in i direction
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     multiply c(i,j,0) by b_inverse and copy back to c
+c     multiply rhs(0) by b_inverse(0) and copy to rhs
+c---------------------------------------------------------------------
+!dir$ ivdep
+         do i = 1, grid_points(1)-2
+            call binvcrhs( lhs(1,1,bb,i,0),
+     >                        lhs(1,1,cc,i,0),
+     >                        rhs(1,i,j,0) )
+         enddo
+
+
+c---------------------------------------------------------------------
+c     begin inner most do loop
+c     do all the elements of the cell unless last 
+c---------------------------------------------------------------------
+         do k=1,ksize-1
+!dir$ ivdep
+            do i = 1, grid_points(1)-2
+
+c---------------------------------------------------------------------
+c     subtract A*lhs_vector(k-1) from lhs_vector(k)
+c     
+c     rhs(k) = rhs(k) - A*rhs(k-1)
+c---------------------------------------------------------------------
+               call matvec_sub(lhs(1,1,aa,i,k),
+     >                         rhs(1,i,j,k-1),rhs(1,i,j,k))
+
+c---------------------------------------------------------------------
+c     B(k) = B(k) - C(k-1)*A(k)
+c     call matmul_sub(aa,i,j,k,c,cc,i,j,k-1,c,bb,i,j,k)
+c---------------------------------------------------------------------
+               call matmul_sub(lhs(1,1,aa,i,k),
+     >                         lhs(1,1,cc,i,k-1),
+     >                         lhs(1,1,bb,i,k))
+
+c---------------------------------------------------------------------
+c     multiply c(i,j,k) by b_inverse and copy back to c
+c     multiply rhs(i,j,1) by b_inverse(i,j,1) and copy to rhs
+c---------------------------------------------------------------------
+               call binvcrhs( lhs(1,1,bb,i,k),
+     >                        lhs(1,1,cc,i,k),
+     >                        rhs(1,i,j,k) )
+
+            enddo
+         enddo
+
+c---------------------------------------------------------------------
+c     Now finish up special cases for last cell
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     rhs(ksize) = rhs(ksize) - A*rhs(ksize-1)
+c---------------------------------------------------------------------
+!dir$ ivdep
+         do i = 1, grid_points(1)-2
+            call matvec_sub(lhs(1,1,aa,i,ksize),
+     >                         rhs(1,i,j,ksize-1),rhs(1,i,j,ksize))
+
+c---------------------------------------------------------------------
+c     B(ksize) = B(ksize) - C(ksize-1)*A(ksize)
+c     call matmul_sub(aa,i,j,ksize,c,
+c     $              cc,i,j,ksize-1,c,bb,i,j,ksize)
+c---------------------------------------------------------------------
+            call matmul_sub(lhs(1,1,aa,i,ksize),
+     >                         lhs(1,1,cc,i,ksize-1),
+     >                         lhs(1,1,bb,i,ksize))
+
+c---------------------------------------------------------------------
+c     multiply rhs(ksize) by b_inverse(ksize) and copy to rhs
+c---------------------------------------------------------------------
+            call binvrhs( lhs(1,1,bb,i,ksize),
+     >                       rhs(1,i,j,ksize) )
+         enddo
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     back solve: if last cell, then generate U(ksize)=rhs(ksize)
+c     else assume U(ksize) is loaded in un pack backsub_info
+c     so just use it
+c     after call u(kstart) will be sent to next cell
+c---------------------------------------------------------------------
+
+         do k=ksize-1,0,-1
+            do i = 1, grid_points(1)-2
+               do m=1,BLOCK_SIZE
+                  do n=1,BLOCK_SIZE
+                     rhs(m,i,j,k) = rhs(m,i,j,k) 
+     >                    - lhs(m,n,cc,i,k)*rhs(n,i,j,k+1)
+                  enddo
+               enddo
+            enddo
+         enddo
+
+      enddo
+      if (timeron) call timer_stop(t_zsolve)
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/CG/Makefile b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/CG/Makefile
new file mode 100644
index 0000000..61c9ac8
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/CG/Makefile
@@ -0,0 +1,23 @@
+SHELL=/bin/sh
+BENCHMARK=cg
+BENCHMARKU=CG
+
+include ../config/make.def
+
+OBJS = cg.o ${COMMON}/print_results.o  \
+       ${COMMON}/${RAND}.o ${COMMON}/timers.o ${COMMON}/wtime.o
+
+include ../sys/make.common
+
+${PROGRAM}: config ${OBJS}
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} ${F_LIB}
+
+cg.o:		cg.f  globals.h npbparams.h
+	${FCOMPILE} cg.f
+
+clean:
+	- rm -f *.o *~ 
+	- rm -f npbparams.h core
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/CG/README.carefully b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/CG/README.carefully
new file mode 100644
index 0000000..cdcc366
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/CG/README.carefully
@@ -0,0 +1,16 @@
+Note: please observe that in the routine conj_grad three 
+implementations of the sparse matrix-vector multiply have
+been supplied.  The default matrix-vector multiply is not
+loop unrolled.  The alternate implementations are unrolled
+to a depth of 2 and unrolled to a depth of 8.  Please
+experiment with these to find the fastest for your particular
+architecture.  If reporting timing results, any of these three may
+be used without penalty.
+
+Performance examples:
+The non-unrolled version of the multiply is actually (slightly: 
+maybe 5%) faster on the sp2-66MHz-WN on 16 nodes than is the 
+unrolled-by-2 version below.   On the Cray t3d, the reverse is true, 
+i.e., the unrolled-by-two version is some 10% faster.  
+The unrolled-by-8 version below is significantly faster
+on the Cray t3d - overall speed of code is 1.5 times faster.
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/CG/cg.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/CG/cg.f
new file mode 100644
index 0000000..556b92b
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/CG/cg.f
@@ -0,0 +1,1035 @@
+!-------------------------------------------------------------------------!
+!                                                                         !
+!        N  A  S     P A R A L L E L     B E N C H M A R K S  3.3         !
+!                                                                         !
+!                      S E R I A L     V E R S I O N                      !
+!                                                                         !
+!                                   C G                                   !
+!                                                                         !
+!-------------------------------------------------------------------------!
+!                                                                         !
+!    This benchmark is a serial version of the NPB CG code.               !
+!    Refer to NAS Technical Reports 95-020 for details.                   !
+!                                                                         !
+!    Permission to use, copy, distribute and modify this software         !
+!    for any purpose with or without fee is hereby granted.  We           !
+!    request, however, that all derived work reference the NAS            !
+!    Parallel Benchmarks 3.3. This software is provided "as is"           !
+!    without express or implied warranty.                                 !
+!                                                                         !
+!    Information on NPB 3.3, including the technical report, the          !
+!    original specifications, source code, results and information        !
+!    on how to submit new results, is available at:                       !
+!                                                                         !
+!           http://www.nas.nasa.gov/Software/NPB/                         !
+!                                                                         !
+!    Send comments or suggestions to  npb@nas.nasa.gov                    !
+!                                                                         !
+!          NAS Parallel Benchmarks Group                                  !
+!          NASA Ames Research Center                                      !
+!          Mail Stop: T27A-1                                              !
+!          Moffett Field, CA   94035-1000                                 !
+!                                                                         !
+!          E-mail:  npb@nas.nasa.gov                                      !
+!          Fax:     (650) 604-3957                                        !
+!                                                                         !
+!-------------------------------------------------------------------------!
+
+
+c---------------------------------------------------------------------
+c      NPB CG serial version      
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c
+c Authors: M. Yarrow
+c          C. Kuszmaul
+c
+c---------------------------------------------------------------------
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      program cg
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+
+      implicit none
+
+      include 'globals.h'
+
+
+      common / main_int_mem /  colidx,     rowstr,
+     >                         iv,         arow,     acol
+      integer                  colidx(nz), rowstr(na+1),
+     >                         iv(na),  arow(na), acol(naz)
+
+
+      common / main_flt_mem /  aelt,     a,
+     >                         x,
+     >                         z,
+     >                         p,
+     >                         q,
+     >                         r
+      double precision         aelt(naz), a(nz),
+     >                         x(na+2),
+     >                         z(na+2),
+     >                         p(na+2),
+     >                         q(na+2),
+     >                         r(na+2)
+
+
+
+      integer            i, j, k, it
+
+      double precision   zeta, randlc
+      external           randlc
+      double precision   rnorm
+      double precision   norm_temp1,norm_temp2
+
+      double precision   t, mflops, tmax
+      character          class
+      logical            verified
+      double precision   zeta_verify_value, epsilon, err
+
+      integer   fstatus
+      character t_names(t_last)*8
+
+      do i = 1, T_last
+         call timer_clear( i )
+      end do
+
+      open(unit=2, file='timer.flag', status='old', iostat=fstatus)
+      if (fstatus .eq. 0) then
+         timeron = .true.
+         t_names(t_init) = 'init'
+         t_names(t_bench) = 'benchmk'
+         t_names(t_conj_grad) = 'conjgd'
+         close(2)
+      else
+         timeron = .false.
+      endif
+
+      call timer_start( T_init )
+
+      firstrow = 1
+      lastrow  = na
+      firstcol = 1
+      lastcol  = na
+
+
+      if( na .eq. 1400 .and. 
+     &    nonzer .eq. 7 .and. 
+     &    niter .eq. 15 .and.
+     &    shift .eq. 10.d0 ) then
+         class = 'S'
+         zeta_verify_value = 8.5971775078648d0
+      else if( na .eq. 7000 .and. 
+     &         nonzer .eq. 8 .and. 
+     &         niter .eq. 15 .and.
+     &         shift .eq. 12.d0 ) then
+         class = 'W'
+         zeta_verify_value = 10.362595087124d0
+      else if( na .eq. 14000 .and. 
+     &         nonzer .eq. 11 .and. 
+     &         niter .eq. 15 .and.
+     &         shift .eq. 20.d0 ) then
+         class = 'A'
+         zeta_verify_value = 17.130235054029d0
+      else if( na .eq. 75000 .and. 
+     &         nonzer .eq. 13 .and. 
+     &         niter .eq. 75 .and.
+     &         shift .eq. 60.d0 ) then
+         class = 'B'
+         zeta_verify_value = 22.712745482631d0
+      else if( na .eq. 150000 .and. 
+     &         nonzer .eq. 15 .and. 
+     &         niter .eq. 75 .and.
+     &         shift .eq. 110.d0 ) then
+         class = 'C'
+         zeta_verify_value = 28.973605592845d0
+      else if( na .eq. 1500000 .and. 
+     &         nonzer .eq. 21 .and. 
+     &         niter .eq. 100 .and.
+     &         shift .eq. 500.d0 ) then
+         class = 'D'
+         zeta_verify_value = 52.514532105794d0
+      else if( na .eq. 9000000 .and. 
+     &         nonzer .eq. 26 .and. 
+     &         niter .eq. 100 .and.
+     &         shift .eq. 1.5d3 ) then
+         class = 'E'
+         zeta_verify_value = 77.522164599383d0
+      else
+         class = 'U'
+      endif
+
+      write( *,1000 ) 
+      write( *,1001 ) na
+      write( *,1002 ) niter
+      write( *,* )
+ 1000 format(//,' NAS Parallel Benchmarks (NPB3.3-SER)',
+     >          ' - CG Benchmark', /)
+ 1001 format(' Size: ', i11 )
+ 1002 format(' Iterations: ', i5 )
+
+
+
+      naa = na
+      nzz = nz
+
+
+c---------------------------------------------------------------------
+c  Initialize random number generator
+c---------------------------------------------------------------------
+      tran    = 314159265.0D0
+      amult   = 1220703125.0D0
+      zeta    = randlc( tran, amult )
+
+c---------------------------------------------------------------------
+c  
+c---------------------------------------------------------------------
+      call makea(naa, nzz, a, colidx, rowstr, 
+     >           firstrow, lastrow, firstcol, lastcol, 
+     >           arow, acol, aelt, iv)
+
+
+
+c---------------------------------------------------------------------
+c  Note: as a result of the above call to makea:
+c        values of j used in indexing rowstr go from 1 --> lastrow-firstrow+1
+c        values of colidx which are col indexes go from firstcol --> lastcol
+c        So:
+c        Shift the col index vals from actual (firstcol --> lastcol ) 
+c        to local, i.e., (1 --> lastcol-firstcol+1)
+c---------------------------------------------------------------------
+      do j=1,lastrow-firstrow+1
+         do k=rowstr(j),rowstr(j+1)-1
+            colidx(k) = colidx(k) - firstcol + 1
+         enddo
+      enddo
+
+c---------------------------------------------------------------------
+c  set starting vector to (1, 1, .... 1)
+c---------------------------------------------------------------------
+      do i = 1, na+1
+         x(i) = 1.0D0
+      enddo
+      do j=1, lastcol-firstcol+1
+         q(j) = 0.0d0
+         z(j) = 0.0d0
+         r(j) = 0.0d0
+         p(j) = 0.0d0
+      enddo
+
+      zeta  = 0.0d0
+
+c---------------------------------------------------------------------
+c---->
+c  Do one iteration untimed to init all code and data page tables
+c---->                    (then reinit, start timing, to niter its)
+c---------------------------------------------------------------------
+      do it = 1, 1
+
+c---------------------------------------------------------------------
+c  The call to the conjugate gradient routine:
+c---------------------------------------------------------------------
+         call conj_grad ( colidx,
+     >                    rowstr,
+     >                    x,
+     >                    z,
+     >                    a,
+     >                    p,
+     >                    q,
+     >                    r,
+     >                    rnorm )
+
+c---------------------------------------------------------------------
+c  zeta = shift + 1/(x.z)
+c  So, first: (x.z)
+c  Also, find norm of z
+c  So, first: (z.z)
+c---------------------------------------------------------------------
+         norm_temp1 = 0.0d0
+         norm_temp2 = 0.0d0
+         do j=1, lastcol-firstcol+1
+            norm_temp1 = norm_temp1 + x(j)*z(j)
+            norm_temp2 = norm_temp2 + z(j)*z(j)
+         enddo
+
+         norm_temp2 = 1.0d0 / sqrt( norm_temp2 )
+
+
+c---------------------------------------------------------------------
+c  Normalize z to obtain x
+c---------------------------------------------------------------------
+         do j=1, lastcol-firstcol+1      
+            x(j) = norm_temp2*z(j)    
+         enddo                           
+
+
+      enddo                              ! end of do one iteration untimed
+
+
+c---------------------------------------------------------------------
+c  set starting vector to (1, 1, .... 1)
+c---------------------------------------------------------------------
+c
+c  
+c
+      do i = 1, na+1
+         x(i) = 1.0D0
+      enddo
+
+      zeta  = 0.0d0
+
+      call timer_stop( T_init )
+
+      write (*, 2000) timer_read(T_init)
+ 2000 format(' Initialization time = ',f15.3,' seconds')
+
+      call timer_start( T_bench )
+
+c---------------------------------------------------------------------
+c---->
+c  Main Iteration for inverse power method
+c---->
+c---------------------------------------------------------------------
+      do it = 1, niter
+
+c---------------------------------------------------------------------
+c  The call to the conjugate gradient routine:
+c---------------------------------------------------------------------
+         if ( timeron ) call timer_start( T_conj_grad )
+         call conj_grad ( colidx,
+     >                    rowstr,
+     >                    x,
+     >                    z,
+     >                    a,
+     >                    p,
+     >                    q,
+     >                    r,
+     >                    rnorm )
+         if ( timeron ) call timer_stop( T_conj_grad )
+
+
+c---------------------------------------------------------------------
+c  zeta = shift + 1/(x.z)
+c  So, first: (x.z)
+c  Also, find norm of z
+c  So, first: (z.z)
+c---------------------------------------------------------------------
+         norm_temp1 = 0.0d0
+         norm_temp2 = 0.0d0
+         do j=1, lastcol-firstcol+1
+            norm_temp1 = norm_temp1 + x(j)*z(j)
+            norm_temp2 = norm_temp2 + z(j)*z(j)
+         enddo
+
+
+         norm_temp2 = 1.0d0 / sqrt( norm_temp2 )
+
+
+         zeta = shift + 1.0d0 / norm_temp1
+         if( it .eq. 1 ) write( *,9000 )
+         write( *,9001 ) it, rnorm, zeta
+
+ 9000    format( /,'   iteration           ||r||                 zeta' )
+ 9001    format( 4x, i5, 7x, e20.14, f20.13 )
+
+c---------------------------------------------------------------------
+c  Normalize z to obtain x
+c---------------------------------------------------------------------
+         do j=1, lastcol-firstcol+1      
+            x(j) = norm_temp2*z(j)    
+         enddo                           
+
+
+      enddo                              ! end of main iter inv pow meth
+
+      call timer_stop( T_bench )
+
+c---------------------------------------------------------------------
+c  End of timed section
+c---------------------------------------------------------------------
+
+      t = timer_read( T_bench )
+
+
+      write(*,100)
+ 100  format(' Benchmark completed ')
+
+      epsilon = 1.d-10
+      if (class .ne. 'U') then
+
+c         err = abs( zeta - zeta_verify_value)
+         err = abs( zeta - zeta_verify_value )/zeta_verify_value
+         if( err .le. epsilon ) then
+            verified = .TRUE.
+            write(*, 200)
+            write(*, 201) zeta
+            write(*, 202) err
+ 200        format(' VERIFICATION SUCCESSFUL ')
+ 201        format(' Zeta is    ', E20.13)
+ 202        format(' Error is   ', E20.13)
+         else
+            verified = .FALSE.
+            write(*, 300) 
+            write(*, 301) zeta
+            write(*, 302) zeta_verify_value
+ 300        format(' VERIFICATION FAILED')
+ 301        format(' Zeta                ', E20.13)
+ 302        format(' The correct zeta is ', E20.13)
+         endif
+      else
+         verified = .FALSE.
+         write (*, 400)
+         write (*, 401)
+         write (*, 201) zeta
+ 400     format(' Problem size unknown')
+ 401     format(' NO VERIFICATION PERFORMED')
+      endif
+
+
+      if( t .ne. 0. ) then
+         mflops = float( 2*niter*na )
+     &               * ( 3.+float( nonzer*(nonzer+1) )
+     &                 + 25.*(5.+float( nonzer*(nonzer+1) ))
+     &                 + 3. ) / t / 1000000.0
+      else
+         mflops = 0.0
+      endif
+
+
+         call print_results('CG', class, na, 0, 0,
+     >                      niter, t,
+     >                      mflops, '          floating point', 
+     >                      verified, npbversion, compiletime,
+     >                      cs1, cs2, cs3, cs4, cs5, cs6, cs7)
+
+
+
+ 600  format( i4, 2e19.12)
+
+
+c---------------------------------------------------------------------
+c      More timers
+c---------------------------------------------------------------------
+      if (.not.timeron) goto 999
+
+      tmax = timer_read(T_bench)
+      if (tmax .eq. 0.0) tmax = 1.0
+
+      write(*,800)
+ 800  format('  SECTION   Time (secs)')
+      do i=1, t_last
+         t = timer_read(i)
+         if (i.eq.t_init) then
+            write(*,810) t_names(i), t
+         else
+            write(*,810) t_names(i), t, t*100./tmax
+            if (i.eq.t_conj_grad) then
+               t = tmax - t
+               write(*,820) 'rest', t, t*100./tmax
+            endif
+         endif
+ 810     format(2x,a8,':',f9.3:'  (',f6.2,'%)')
+ 820     format('    --> ',a8,':',f9.3,'  (',f6.2,'%)')
+      end do
+
+ 999  continue
+
+
+      end                              ! end main
+
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      subroutine conj_grad ( colidx,
+     >                       rowstr,
+     >                       x,
+     >                       z,
+     >                       a,
+     >                       p,
+     >                       q,
+     >                       r,
+     >                       rnorm )
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c  Floating point arrays here are named as in NPB1 spec discussion of 
+c  CG algorithm
+c---------------------------------------------------------------------
+ 
+      implicit none
+
+
+      include 'globals.h'
+
+
+      double precision   x(*),
+     >                   z(*),
+     >                   a(nzz)
+      integer            colidx(nzz), rowstr(naa+1)
+
+      double precision   p(*),
+     >                   q(*),
+     >                   r(*)
+
+
+      integer   j, k
+      integer   cgit, cgitmax
+
+      double precision   d, sum, rho, rho0, alpha, beta, rnorm
+
+      data      cgitmax / 25 /
+
+
+      rho = 0.0d0
+
+c---------------------------------------------------------------------
+c  Initialize the CG algorithm:
+c---------------------------------------------------------------------
+      do j=1,naa+1
+         q(j) = 0.0d0
+         z(j) = 0.0d0
+         r(j) = x(j)
+         p(j) = r(j)
+      enddo
+
+
+c---------------------------------------------------------------------
+c  rho = r.r
+c  Now, obtain the norm of r: First, sum squares of r elements locally...
+c---------------------------------------------------------------------
+      do j=1, lastcol-firstcol+1
+         rho = rho + r(j)*r(j)
+      enddo
+
+c---------------------------------------------------------------------
+c---->
+c  The conj grad iteration loop
+c---->
+c---------------------------------------------------------------------
+      do cgit = 1, cgitmax
+
+c---------------------------------------------------------------------
+c  q = A.p
+c  The partition submatrix-vector multiply: use workspace w
+c---------------------------------------------------------------------
+C
+C  NOTE: this version of the multiply is actually (slightly: maybe 5%) 
+C        faster on the sp2 on 16 nodes than is the unrolled-by-2 version 
+C        below.   On the Cray t3d, the reverse is true, i.e., the 
+C        unrolled-by-two version is some 10% faster.  
+C        The unrolled-by-8 version below is significantly faster
+C        on the Cray t3d - overall speed of code is 1.5 times faster.
+C
+         do j=1,lastrow-firstrow+1
+            sum = 0.d0
+            do k=rowstr(j),rowstr(j+1)-1
+               sum = sum + a(k)*p(colidx(k))
+            enddo
+            q(j) = sum
+         enddo
+
+CC          do j=1,lastrow-firstrow+1
+CC             i = rowstr(j) 
+CC             iresidue = mod( rowstr(j+1)-i, 2 )
+CC             sum1 = 0.d0
+CC             sum2 = 0.d0
+CC             if( iresidue .eq. 1 )
+CC      &          sum1 = sum1 + a(i)*p(colidx(i))
+CC             do k=i+iresidue, rowstr(j+1)-2, 2
+CC                sum1 = sum1 + a(k)  *p(colidx(k))
+CC                sum2 = sum2 + a(k+1)*p(colidx(k+1))
+CC             enddo
+CC             q(j) = sum1 + sum2
+CC          enddo
+
+CC          do j=1,lastrow-firstrow+1
+CC             i = rowstr(j) 
+CC             iresidue = mod( rowstr(j+1)-i, 8 )
+CC             sum = 0.d0
+CC             do k=i,i+iresidue-1
+CC                sum = sum +  a(k)*p(colidx(k))
+CC             enddo
+CC             do k=i+iresidue, rowstr(j+1)-8, 8
+CC                sum = sum + a(k  )*p(colidx(k  ))
+CC      &                   + a(k+1)*p(colidx(k+1))
+CC      &                   + a(k+2)*p(colidx(k+2))
+CC      &                   + a(k+3)*p(colidx(k+3))
+CC      &                   + a(k+4)*p(colidx(k+4))
+CC      &                   + a(k+5)*p(colidx(k+5))
+CC      &                   + a(k+6)*p(colidx(k+6))
+CC      &                   + a(k+7)*p(colidx(k+7))
+CC             enddo
+CC             q(j) = sum
+CC          enddo
+            
+
+
+c---------------------------------------------------------------------
+c  Obtain p.q
+c---------------------------------------------------------------------
+         d = 0.0d0
+         do j=1, lastcol-firstcol+1
+            d = d + p(j)*q(j)
+         enddo
+
+
+c---------------------------------------------------------------------
+c  Obtain alpha = rho / (p.q)
+c---------------------------------------------------------------------
+         alpha = rho / d
+
+c---------------------------------------------------------------------
+c  Save a temporary of rho
+c---------------------------------------------------------------------
+         rho0 = rho
+
+c---------------------------------------------------------------------
+c  Obtain z = z + alpha*p
+c  and    r = r - alpha*q
+c---------------------------------------------------------------------
+         rho = 0.0d0
+         do j=1, lastcol-firstcol+1
+            z(j) = z(j) + alpha*p(j)
+            r(j) = r(j) - alpha*q(j)
+         enddo
+            
+c---------------------------------------------------------------------
+c  rho = r.r
+c  Now, obtain the norm of r: First, sum squares of r elements locally...
+c---------------------------------------------------------------------
+         do j=1, lastcol-firstcol+1
+            rho = rho + r(j)*r(j)
+         enddo
+
+c---------------------------------------------------------------------
+c  Obtain beta:
+c---------------------------------------------------------------------
+         beta = rho / rho0
+
+c---------------------------------------------------------------------
+c  p = r + beta*p
+c---------------------------------------------------------------------
+         do j=1, lastcol-firstcol+1
+            p(j) = r(j) + beta*p(j)
+         enddo
+
+
+      enddo                             ! end of do cgit=1,cgitmax
+
+
+c---------------------------------------------------------------------
+c  Compute residual norm explicitly:  ||r|| = ||x - A.z||
+c  First, form A.z
+c  The partition submatrix-vector multiply
+c---------------------------------------------------------------------
+      sum = 0.0d0
+      do j=1,lastrow-firstrow+1
+         d = 0.d0
+         do k=rowstr(j),rowstr(j+1)-1
+            d = d + a(k)*z(colidx(k))
+         enddo
+         r(j) = d
+      enddo
+
+
+c---------------------------------------------------------------------
+c  At this point, r contains A.z
+c---------------------------------------------------------------------
+      do j=1, lastcol-firstcol+1
+         d   = x(j) - r(j)         
+         sum = sum + d*d
+      enddo
+
+      rnorm = sqrt( sum )
+
+
+
+      return
+      end                               ! end of routine conj_grad
+
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      subroutine makea( n, nz, a, colidx, rowstr, 
+     >                  firstrow, lastrow, firstcol, lastcol,
+     >                  arow, acol, aelt, iv )
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit            none
+      include             'npbparams.h'
+
+      integer             n, nz
+      integer             firstrow, lastrow, firstcol, lastcol
+      integer             colidx(nz), rowstr(n+1)
+      integer             iv(n), arow(n), acol(nonzer+1,n)
+      double precision    aelt(nonzer+1,n)
+      double precision    a(nz)
+
+c---------------------------------------------------------------------
+c       generate the test problem for benchmark 6
+c       makea generates a sparse matrix with a
+c       prescribed sparsity distribution
+c
+c       parameter    type        usage
+c
+c       input
+c
+c       n            i           number of cols/rows of matrix
+c       nz           i           nonzeros as declared array size
+c       rcond        r*8         condition number
+c       shift        r*8         main diagonal shift
+c
+c       output
+c
+c       a            r*8         array for nonzeros
+c       colidx       i           col indices
+c       rowstr       i           row pointers
+c
+c       workspace
+c
+c       iv, arow, acol i
+c       aelt           r*8
+c---------------------------------------------------------------------
+
+      integer          i, iouter, ivelt, nzv, nn1
+      integer          ivc(nonzer+1)
+      double precision vc(nonzer+1)
+
+c---------------------------------------------------------------------
+c      nonzer is approximately  (int(sqrt(nnza /n)));
+c---------------------------------------------------------------------
+
+      external          sparse, sprnvc, vecset
+
+c---------------------------------------------------------------------
+c    nn1 is the smallest power of two not less than n
+c---------------------------------------------------------------------
+
+      nn1 = 1
+ 50   continue
+        nn1 = 2 * nn1
+        if (nn1 .lt. n) goto 50
+
+c---------------------------------------------------------------------
+c  Generate nonzero positions and save for the use in sparse.
+c---------------------------------------------------------------------
+
+      do iouter = 1, n
+         nzv = nonzer
+         call sprnvc( n, nzv, nn1, vc, ivc )
+         call vecset( n, vc, ivc, nzv, iouter, .5D0 )
+         arow(iouter) = nzv
+         do ivelt = 1, nzv
+            acol(ivelt, iouter) = ivc(ivelt)
+            aelt(ivelt, iouter) = vc(ivelt)
+         enddo
+      enddo
+
+c---------------------------------------------------------------------
+c       ... make the sparse matrix from list of elements with duplicates
+c           (iv is used as  workspace)
+c---------------------------------------------------------------------
+      call sparse( a, colidx, rowstr, n, nz, nonzer, arow, acol, 
+     >             aelt, firstrow, lastrow,
+     >             iv, rcond, shift )
+      return
+
+      end
+c-------end   of makea------------------------------
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      subroutine sparse( a, colidx, rowstr, n, nz, nonzer, arow, acol, 
+     >                   aelt, firstrow, lastrow,
+     >                   nzloc, rcond, shift )
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit           none
+      integer            colidx(*), rowstr(*)
+      integer            firstrow, lastrow
+      integer            n, nz, nonzer, arow(*), acol(nonzer+1,*)
+      double precision   a(*), aelt(nonzer+1,*), rcond, shift
+
+c---------------------------------------------------------------------
+c       rows range from firstrow to lastrow
+c       the rowstr pointers are defined for nrows = lastrow-firstrow+1 values
+c---------------------------------------------------------------------
+      integer            nzloc(n), nrows
+
+c---------------------------------------------------
+c       generate a sparse matrix from a list of
+c       [col, row, element] triples
+c---------------------------------------------------
+
+      integer            i, j, j1, j2, nza, k, kk, nzrow, jcol
+      double precision   xi, size, scale, ratio, va
+
+c---------------------------------------------------------------------
+c    how many rows of result
+c---------------------------------------------------------------------
+      nrows = lastrow - firstrow + 1
+
+c---------------------------------------------------------------------
+c     ...count the number of triples in each row
+c---------------------------------------------------------------------
+      do j = 1, nrows+1
+         rowstr(j) = 0
+      enddo
+
+      do i = 1, n
+         do nza = 1, arow(i)
+            j = acol(nza, i) + 1
+            rowstr(j) = rowstr(j) + arow(i)
+         end do
+      end do
+
+      rowstr(1) = 1
+      do j = 2, nrows+1
+         rowstr(j) = rowstr(j) + rowstr(j-1)
+      enddo
+      nza = rowstr(nrows+1) - 1
+
+c---------------------------------------------------------------------
+c     ... rowstr(j) now is the location of the first nonzero
+c           of row j of a
+c---------------------------------------------------------------------
+
+      if (nza .gt. nz) then
+         write(*,*) 'Space for matrix elements exceeded in sparse'
+         write(*,*) 'nza, nzmax = ',nza, nz
+         stop
+      endif
+
+
+c---------------------------------------------------------------------
+c     ... preload data pages
+c---------------------------------------------------------------------
+      do j = 1, nrows
+         do k = rowstr(j), rowstr(j+1)-1
+             a(k) = 0.d0
+             colidx(k) = 0
+         enddo
+         nzloc(j) = 0
+      enddo
+
+c---------------------------------------------------------------------
+c     ... generate actual values by summing duplicates
+c---------------------------------------------------------------------
+
+      size = 1.0D0
+      ratio = rcond ** (1.0D0 / dfloat(n))
+
+      do i = 1, n
+         do nza = 1, arow(i)
+            j = acol(nza, i)
+
+            scale = size * aelt(nza, i)
+            do nzrow = 1, arow(i)
+               jcol = acol(nzrow, i)
+               va = aelt(nzrow, i) * scale
+
+c---------------------------------------------------------------------
+c       ... add the identity * rcond to the generated matrix to bound
+c           the smallest eigenvalue from below by rcond
+c---------------------------------------------------------------------
+               if (jcol .eq. j .and. j .eq. i) then
+                  va = va + rcond - shift
+               endif
+
+               do k = rowstr(j), rowstr(j+1)-1
+                  if (colidx(k) .gt. jcol) then
+c---------------------------------------------------------------------
+c       ... insert colidx here orderly
+c---------------------------------------------------------------------
+                     do kk = rowstr(j+1)-2, k, -1
+                        if (colidx(kk) .gt. 0) then
+                           a(kk+1)  = a(kk)
+                           colidx(kk+1) = colidx(kk)
+                        endif
+                     enddo
+                     colidx(k) = jcol
+                     a(k)  = 0.d0
+                     goto 40
+                  else if (colidx(k) .eq. 0) then
+                     colidx(k) = jcol
+                     goto 40
+                  else if (colidx(k) .eq. jcol) then
+c---------------------------------------------------------------------
+c       ... mark the duplicated entry
+c---------------------------------------------------------------------
+                     nzloc(j) = nzloc(j) + 1
+                     goto 40
+                  endif
+               enddo
+               print *,'internal error in sparse: i=',i
+               stop
+   40          continue
+               a(k) = a(k) + va
+            enddo
+   60       continue
+         enddo
+         size = size * ratio
+      enddo
+
+
+c---------------------------------------------------------------------
+c       ... remove empty entries and generate final results
+c---------------------------------------------------------------------
+      do j = 2, nrows
+         nzloc(j) = nzloc(j) + nzloc(j-1)
+      enddo
+
+      do j = 1, nrows
+         if (j .gt. 1) then
+            j1 = rowstr(j) - nzloc(j-1)
+         else
+            j1 = 1
+         endif
+         j2 = rowstr(j+1) - nzloc(j) - 1
+         nza = rowstr(j)
+         do k = j1, j2
+            a(k) = a(nza)
+            colidx(k) = colidx(nza)
+            nza = nza + 1
+         enddo
+      enddo
+      do j = 2, nrows+1
+         rowstr(j) = rowstr(j) - nzloc(j-1)
+      enddo
+      nza = rowstr(nrows+1) - 1
+
+
+CC       write (*, 11000) nza
+      return
+11000   format ( //,'final nonzero count in sparse ',
+     1            /,'number of nonzeros       = ', i16 )
+      end
+c-------end   of sparse-----------------------------
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      subroutine sprnvc( n, nz, nn1, v, iv )
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit           none
+      double precision   v(*)
+      integer            n, nz, nn1, iv(*)
+      common /urando/    amult, tran
+      double precision   amult, tran
+
+
+c---------------------------------------------------------------------
+c       generate a sparse n-vector (v, iv)
+c       having nzv nonzeros
+c
+c       mark(i) is set to 1 if position i is nonzero.
+c       mark is all zero on entry and is reset to all zero before exit
+c       this corrects a performance bug found by John G. Lewis, caused by
+c       reinitialization of mark on every one of the n calls to sprnvc
+c---------------------------------------------------------------------
+
+        integer            nzv, ii, i, icnvrt
+
+        external           randlc, icnvrt
+        double precision   randlc, vecelt, vecloc
+
+
+        nzv = 0
+
+100     continue
+        if (nzv .ge. nz) goto 110
+
+         vecelt = randlc( tran, amult )
+
+c---------------------------------------------------------------------
+c   generate an integer between 1 and n in a portable manner
+c---------------------------------------------------------------------
+         vecloc = randlc(tran, amult)
+         i = icnvrt(vecloc, nn1) + 1
+         if (i .gt. n) goto 100
+
+c---------------------------------------------------------------------
+c  was this integer generated already?
+c---------------------------------------------------------------------
+         do ii = 1, nzv
+            if (iv(ii) .eq. i) goto 100
+         enddo
+         nzv = nzv + 1
+         v(nzv) = vecelt
+         iv(nzv) = i
+         goto 100
+110     continue
+
+      return
+      end
+c-------end   of sprnvc-----------------------------
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      function icnvrt(x, ipwr2)
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit           none
+      double precision   x
+      integer            ipwr2, icnvrt
+
+c---------------------------------------------------------------------
+c    scale a double precision number x in (0,1) by a power of 2 and chop it
+c---------------------------------------------------------------------
+      icnvrt = int(ipwr2 * x)
+
+      return
+      end
+c-------end   of icnvrt-----------------------------
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      subroutine vecset(n, v, iv, nzv, i, val)
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit           none
+      integer            n, iv(*), nzv, i, k
+      double precision   v(*), val
+
+c---------------------------------------------------------------------
+c       set ith element of sparse vector (v, iv) with
+c       nzv nonzeros to val
+c---------------------------------------------------------------------
+
+      logical set
+
+      set = .false.
+      do k = 1, nzv
+         if (iv(k) .eq. i) then
+            v(k) = val
+            set  = .true.
+         endif
+      enddo
+      if (.not. set) then
+         nzv     = nzv + 1
+         v(nzv)  = val
+         iv(nzv) = i
+      endif
+      return
+      end
+c-------end   of vecset-----------------------------
+
+
+
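Note on sprnvc: the routine draws each index by scaling a uniform value by nn1 (a power of two at least as large as n), rejecting draws above n and draws already present in iv. The C sketch below illustrates only that rejection loop. The names pow2_bound, toy_rand and sample_positions are hypothetical, and toy_rand is a simple stand-in generator, not the NPB randlc routine, so the values it produces will not match the benchmark. The sketch assumes nz <= n; otherwise the loop cannot terminate.

```c
#include <assert.h>

/* Mirrors the "nn1 = 2 * nn1" loop in makea: doubles at least once,
   then keeps doubling while the value is still below n. */
static int pow2_bound(int n)
{
    int nn1 = 1;
    do {
        nn1 *= 2;
    } while (nn1 < n);
    return nn1;
}

/* Hypothetical stand-in for randlc (assumption: any uniform source works
   for illustration). Returns a value strictly inside (0,1). */
static double toy_rand(unsigned long long *state)
{
    *state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
    return ((double)(*state >> 32) + 0.5) / 4294967296.0;
}

/* Fills v[0..nz-1] and iv[0..nz-1] with nz distinct 1-based positions
   in 1..n, following the control flow of sprnvc. Requires nz <= n. */
static int sample_positions(int n, int nz, unsigned long long *state,
                            double *v, int *iv)
{
    int nn1 = pow2_bound(n);
    int nzv = 0;

    while (nzv < nz) {
        double vecelt = toy_rand(state);      /* candidate value    */
        double vecloc = toy_rand(state);      /* candidate position */
        int i = (int)(nn1 * vecloc) + 1;      /* same role as icnvrt(...) + 1 */
        int k, dup = 0;

        if (i > n)
            continue;                         /* outside 1..n: draw again */
        for (k = 0; k < nzv; k++) {
            if (iv[k] == i) {                 /* position already used */
                dup = 1;
                break;
            }
        }
        if (dup)
            continue;
        v[nzv] = vecelt;
        iv[nzv] = i;
        nzv++;
    }
    return nzv;
}
```

Because nn1 is less than 2n whenever n is at least 2, more than half of the draws land inside 1..n, which keeps the expected number of rejected draws small.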
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/CG/globals.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/CG/globals.h
new file mode 100644
index 0000000..469ed32
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/CG/globals.h
@@ -0,0 +1,105 @@
+      include 'npbparams.h'
+
+c---------------------------------------------------------------------
+c  Note: please observe that in the routine conj_grad three 
+c  implementations of the sparse matrix-vector multiply have
+c  been supplied.  The default matrix-vector multiply is not
+c  loop unrolled.  The alternate implementations are unrolled
+c  to a depth of 2 and unrolled to a depth of 8.  Please
+c  experiment with these to find the fastest for your particular
+c  architecture.  If reporting timing results, any of these three may
+c  be used without penalty.
+c---------------------------------------------------------------------
+
+
+c---------------------------------------------------------------------
+c  Class specific parameters:
+c  They appear here for reference only.
+c  These are their values; however, this info is imported from the
+c  npbparams.h include file, which is written by the sys/setparams.c program.
+c---------------------------------------------------------------------
+
+C----------
+C  Class S:
+C----------
+CC       parameter( na=1400, 
+CC      >           nonzer=7, 
+CC      >           shift=10., 
+CC      >           niter=15,
+CC      >           rcond=1.0d-1 )
+C----------
+C  Class W:
+C----------
+CC       parameter( na=7000,
+CC      >           nonzer=8, 
+CC      >           shift=12., 
+CC      >           niter=15,
+CC      >           rcond=1.0d-1 )
+C----------
+C  Class A:
+C----------
+CC       parameter( na=14000,
+CC      >           nonzer=11, 
+CC      >           shift=20., 
+CC      >           niter=15,
+CC      >           rcond=1.0d-1 )
+C----------
+C  Class B:
+C----------
+CC       parameter( na=75000, 
+CC      >           nonzer=13, 
+CC      >           shift=60., 
+CC      >           niter=75,
+CC      >           rcond=1.0d-1 )
+C----------
+C  Class C:
+C----------
+CC       parameter( na=150000, 
+CC      >           nonzer=15, 
+CC      >           shift=110., 
+CC      >           niter=75,
+CC      >           rcond=1.0d-1 )
+C----------
+C  Class D:
+C----------
+CC       parameter( na=1500000, 
+CC      >           nonzer=21, 
+CC      >           shift=500., 
+CC      >           niter=100,
+CC      >           rcond=1.0d-1 )
+C----------
+C  Class E:
+C----------
+CC       parameter( na=9000000, 
+CC      >           nonzer=26, 
+CC      >           shift=1500., 
+CC      >           niter=100,
+CC      >           rcond=1.0d-1 )
+
+
+      integer    nz, naz
+      parameter( nz = na*(nonzer+1)*(nonzer+1) )
+      parameter( naz = na*(nonzer+1) )
+
+			      	
+      common / partit_size  / 	 naa, nzz, 
+     >                        	 firstrow, 
+     >                           lastrow, 
+     >                           firstcol, 
+     >                           lastcol
+      integer                 	 naa, nzz, 
+     >                        	 firstrow, 
+     >                           lastrow, 
+     >                           firstcol, 
+     >                           lastcol
+			      	
+      common /urando/         	 amult, tran
+      double precision           amult, tran
+
+      external         timer_read
+      double precision timer_read
+
+      integer T_init, T_bench, T_conj_grad, T_last
+      parameter (T_init=1, T_bench=2, T_conj_grad=3, T_last=3)
+      logical timeron
+      common /timers/ timeron
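Note on the nz and naz parameters in globals.h: both are plain products of na and nonzer, so the array bounds for a class can be checked by hand. The C sketch below repeats that arithmetic; the function names cg_nz and cg_naz are hypothetical and are not part of NPB. For class S (na=1400, nonzer=7) it gives nz = 89600 and naz = 11200.

```c
#include <assert.h>

/* nz = na*(nonzer+1)*(nonzer+1), as in the parameter statement in globals.h.
   long long is used here so that the largest classes do not overflow. */
static long long cg_nz(long long na, long long nonzer)
{
    return na * (nonzer + 1) * (nonzer + 1);
}

/* naz = na*(nonzer+1), as in the parameter statement in globals.h. */
static long long cg_naz(long long na, long long nonzer)
{
    return na * (nonzer + 1);
}
```

With the class E values listed in the comments (na=9000000, nonzer=26) the same formula gives nz = 6561000000, which is larger than the 32-bit signed maximum of 2147483647. How a given compiler and set of flags treats that size is outside the scope of this sketch.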
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/DC/ADC.par b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/DC/ADC.par
new file mode 100644
index 0000000..05f9ce7
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/DC/ADC.par
@@ -0,0 +1,5 @@
+attrNum=12
+measuresNum=1
+tuplesNum=100
+INVERSE_ENDIAN=0
+fileName=ADC
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/DC/Makefile b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/DC/Makefile
new file mode 100644
index 0000000..2db7a8c
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/DC/Makefile
@@ -0,0 +1,33 @@
+SHELL=/bin/sh
+BENCHMARK=dc
+BENCHMARKU=DC
+
+include ../config/make.def
+
+include ../sys/make.common
+
+OBJS = adc.o dc.o extbuild.o rbt.o jobcntl.o \
+	${COMMON}/c_print_results.o  \
+	${COMMON}/c_timers.o ${COMMON}/c_wtime.o
+
+
+# npbparams.h is provided for backward compatibility with NPB compilation
+# header.h: npbparams.h
+
+${PROGRAM}: config ${OBJS} 
+	${CLINK} ${CLINKFLAGS} -o ${PROGRAM} ${OBJS} ${C_LIB}
+
+.c.o:
+	$(CCOMPILE) $<
+
+adc.o:      adc.c npbparams.h
+dc.o:       dc.c adcc.h adc.h macrodef.h npbparams.h
+extbuild.o: extbuild.c adcc.h adc.h macrodef.h npbparams.h
+rbt.o:      rbt.c adcc.h adc.h rbt.h macrodef.h npbparams.h
+jobcntl.o:  jobcntl.c adcc.h adc.h macrodef.h npbparams.h
+
+clean:
+	- rm -f *.o 
+	- rm -f npbparams.h core
+	- rm -f {../,}ADC.{logf,view,dat,viewsz,groupby,chunks}.* 
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/DC/README b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/DC/README
new file mode 100644
index 0000000..cbe7a06
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/DC/README
@@ -0,0 +1,74 @@
+1. Compilation
+   The DC benchmark uses the same directory tree as NPB3.0 (and NPB2.3).
+   Before compilation, check the configuration file 'make.def' in the
+   config directory and modify it if necessary (an example make.def is
+   provided in the DC directory).
+   Then run:
+      make dc CLASS=S
+
+   If a compiler complains about type 'int64' already defined, add
+   "-DHAS_INT64" to the CFLAGS list in make.def.
+
+2. Run
+   A text file ADC.par is used to set DC parameters when the class 
+   is undefined (U). 
+   The file has 5 lines. The lines with the keywords attrNum, measuresNum,
+   and tuplesNum define the number of dimensions, measures,
+   and input tuples, respectively. A special parameter, INVERSE_ENDIAN,
+   allows data to be created in non-native endian format (INVERSE_ENDIAN=1).
+   The last parameter (fileName) specifies a DC file set name, optionally
+   including a full path to a directory which will contain all
+   DC-related files.
+
+   An example of the DC parameter file is as follows:
+
+   attrNum=9
+   measuresNum=1
+   tuplesNum=125000
+   class=U
+   INVERSE_ENDIAN=0
+   fileName=ADC
+   
+   After the parameters are set, run the benchmark:
+   bin/dc.S 100000000 DC/ADC.par
+   where 100000000 is the maximum memory size that may be allocated for
+   the in-core data.
+   
+3. DC processing modes
+   The DC benchmark can be run in two modes (in-core and out-of-core).
+   The desired mode should be set before compilation in the file adc.h.
+   If the IN_CORE flag is on, the benchmark will calculate all views in main
+   memory. In this case, the additional flag VIEW_FILE_OUTPUT can be used to
+   write all views to disk files.
+
+   If the flag IN_CORE is off, the DC benchmark will run in a regular mode
+   using disks to store interim and result data which may not fit in main
+   memory.
+
+   _FILE_OFFSET_BITS=64 and _LARGEFILE64_SOURCE are standard compiler flags
+   which allow DC to work with files larger than 2GB.
+
+   OPTIMIZATION turns on some nonstandard DC optimizations such as obtaining
+   a view by scanning existing views. These optimizations do not always
+   reduce the computing time.
+
+4. Tested architectures:
+   SUN Ultrasparc 60
+   SUNFire 880
+   Origin 2000, 3000, 3800
+   MAC G4 
+   Xeon + Mandrake Linux
+
+5. The setparams utility is used to generate the npbparams.h file only
+   for compatibility with the existing make facility of NPB. For the same
+   reason, CLASS is appended to the DC executable name. It does not limit
+   the problem sizes the executable can run. The class is an input value
+   specified in the ADC.par file. Providing ADC.par overrides the compiled
+   defaults in the npbparams.h file.
+
+6. Known issues
+   If the benchmark runs out of disk space, a message like
+   "Write error from WriteToFile()" may not be printed. Instead,
+   the benchmark returns with UNSUCCESSFUL verification. In this case 
+   users are advised to check whether the file system is full before 
+   reporting a problem with the benchmark.
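Note on the ADC.par format described in the README above: each line is a key=value pair. The C sketch below shows one way such a line could be read. It is an illustration under stated assumptions, not the parser in adc.c: the struct par_sketch and the function parse_par_line are hypothetical names, the key is matched exactly (adc.c searches for the keyword inside the line instead), and the value is copied into a fixed-size buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the fields named in the README. */
typedef struct {
    int       dim;             /* attrNum        */
    int       mnum;            /* measuresNum    */
    long long tuplenum;        /* tuplesNum      */
    int       inverse_endian;  /* INVERSE_ENDIAN */
    char      clss;            /* class          */
    char      filename[256];   /* fileName       */
} par_sketch;

/* Parses one "key=value" line. Returns 1 if a known key was stored,
   0 for comments, blank lines, unknown keys or malformed values. */
static int parse_par_line(const char *line, par_sketch *p)
{
    char key[64];
    char val[256];

    if (strchr(line, '#') != NULL)
        return 0;                              /* treat as a comment line */
    if (sscanf(line, " %63[^= \t] = %255s", key, val) != 2)
        return 0;                              /* not a key=value pair */

    if (strcmp(key, "attrNum") == 0)
        return sscanf(val, "%d", &p->dim) == 1;
    if (strcmp(key, "measuresNum") == 0)
        return sscanf(val, "%d", &p->mnum) == 1;
    if (strcmp(key, "tuplesNum") == 0)
        return sscanf(val, "%lld", &p->tuplenum) == 1;
    if (strcmp(key, "INVERSE_ENDIAN") == 0)
        return sscanf(val, "%d", &p->inverse_endian) == 1;
    if (strcmp(key, "class") == 0) {
        p->clss = val[0];
        return 1;
    }
    if (strcmp(key, "fileName") == 0) {
        strcpy(p->filename, val);              /* val holds at most 255 chars */
        return 1;
    }
    return 0;
}
```

Feeding the six lines of the example parameter file from section 2 through parse_par_line, one line at a time, fills the struct with 9 dimensions, 1 measure, 125000 tuples, class U, INVERSE_ENDIAN 0 and the file set name ADC.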
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/DC/adc.c b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/DC/adc.c
new file mode 100644
index 0000000..7151826
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/DC/adc.c
@@ -0,0 +1,636 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include "adc.h"
+
+#define BlockSize 1024
+
+void swap4(void * num){
+  char t, *p;
+  p = (char *) num;
+  t = *p; *p = *(p + 3); *(p + 3) = t;
+  t = *(p + 1); *(p + 1) = *(p + 2); *(p + 2) = t;
+}
+void swap8(void * num){
+  char t, *p;
+  p = (char *) num;	  
+  t = *p; *p = *(p + 7); *(p + 7) = t;
+  t = *(p + 1); *(p + 1) = *(p + 6); *(p + 6) = t;
+  t = *(p + 2); *(p + 2) = *(p + 5); *(p + 5) = t;
+  t = *(p + 3); *(p + 3) = *(p + 4); *(p + 4) = t;
+}
+void initADCpar(ADC_PAR *par){
+  par->ndid=0;
+  par->dim=5;
+  par->mnum=1;
+  par->tuplenum=100;
+/*  par->isascii=1; */
+  par->inverse_endian=0;
+  par->filename="ADC";
+  par->clss='U';
+}
+int ParseParFile(char* parfname,ADC_PAR *par);
+int GenerateADC(ADC_PAR *par);
+
+typedef struct Factorization{
+  long int *mlt;
+  long int *exp;
+  long int dim;
+} Factorization;
+
+void ShowFactorization(Factorization *nmbfct){
+  int i=0;
+  for(i=0;i<nmbfct->dim;i++){
+    if(nmbfct->mlt[i]==1){
+      if(i==0) fprintf(stdout,"prime.");
+      break;
+    }
+    if(i>0) fprintf(stdout,"*");
+    if(nmbfct->exp[i]==1)
+      fprintf(stdout,"%ld",nmbfct->mlt[i]);    
+    else 
+      fprintf(stdout,"%ld^%ld",nmbfct->mlt[i],
+                               nmbfct->exp[i]);
+  }
+  fprintf(stdout,"\n");
+}
+
+long int adcprime[]={
+  421,601,631,701,883,
+  419,443,647,21737,31769,
+  1427,18353,22817,34337,98717,
+  3527,8693,9677,11093,18233};
+  
+long int ListFirstPrimes(long int mpr,long int *prlist){
+/*
+  fprintf(stdout,"ListFirstPrimes: listing primes less than %ld...\n",
+                 mpr);
+*/
+  long int prnum=0;
+  int composed=0;
+  long int nmb=0,j=0;
+  prlist[prnum++]=2;
+  prlist[prnum++]=3;
+  prlist[prnum++]=5;
+  prlist[prnum++]=7;
+  for(nmb=8;nmb<mpr;nmb++){
+    composed=0;
+    for(j=0;prlist[j]*prlist[j]<=nmb;j++){
+      if(nmb-prlist[j]*((long int)(nmb/prlist[j]))==0){
+        composed=1;
+	break;
+      }
+    }
+    if(composed==0) prlist[prnum++]=nmb;
+  }
+/*  fprintf(stdout,"ListFirstPrimes: Done.\n"); */
+  return prnum;
+}
+
+long long int LARGE_NUM=0x4FFFFFFFFFFFFFFFLL;
+long long int maxprmfctr=59;
+
+long long int GetLCM(long long int mask,
+                     Factorization **fctlist,
+		     long int *adcexpons){
+  int i=0,j=0,k=0;
+  int* expons=(int*) calloc(maxprmfctr+1,sizeof(int));
+  long long int LCM=1;
+  long int pr=2;
+  int genexp=1,lexp=1,fct=2;
+
+  for(i=0;i<maxprmfctr+1;i++)expons[i]=0;
+  i=0;
+  while(mask>0){
+    if(mask==2*(mask/2)){
+      mask=mask>>1;
+      i++;  
+      continue;
+    }
+    pr=adcprime[i];
+    genexp=adcexpons[i];
+/*
+  fprintf(stdout,"[%ld,%ld]\n",pr,genexp);
+  ShowFactorization(fctlist[genexp]);
+*/
+    for(j=0;j<fctlist[pr-1]->dim;j++){
+      fct=fctlist[pr-1]->mlt[j];
+      lexp=fctlist[pr-1]->exp[j];
+
+      for(k=0;k<fctlist[genexp]->dim;k++){
+        if(fctlist[genexp]->mlt[k]==1) break;
+        if(fct!=fctlist[genexp]->mlt[k]) continue;
+        lexp-=fctlist[genexp]->exp[k];
+	break;
+      }
+      if(expons[fct]<lexp)expons[fct]=lexp;
+    }
+    mask=mask>>1;
+    i++;
+  }
+/*
+for(i=0;i<maxprmfctr;i++){
+  if(expons[i]>0) fprintf(stdout,"*%ld^%ld",i,expons[i]);
+}
+fprintf(stdout,"\n");
+*/
+  for(i=0;i<=maxprmfctr;i++){
+    while(expons[i]>0){
+      LCM*=i;
+      if(LCM>LARGE_NUM/maxprmfctr){ free(expons); return LCM; }
+      expons[i]--;
+    }
+  }
+/*  fprintf(stdout,"==== %lld\n",LCM); */
+  free(expons);
+  return LCM;
+}
+void ExtendFactors(long int nmb,long int firstdiv,
+                   Factorization *nmbfct,Factorization **fctlist){
+  Factorization *divfct=fctlist[nmb/firstdiv];
+  int fdivused=0;
+  int multnum=0;
+  int i=0;
+/*  fprintf(stdout,"==== %lld %ld %ld\n",divfct->dim,nmb,firstdiv); */
+   for(i=0;i<divfct->dim;i++){
+    if(divfct->mlt[i]==1){
+      if(fdivused==0){
+        nmbfct->mlt[multnum]=firstdiv;
+        nmbfct->exp[multnum]=1;   
+      }
+      break;
+    }
+    if(divfct->mlt[i]<firstdiv){
+      nmbfct->mlt[i]=divfct->mlt[i];
+      nmbfct->exp[i]=divfct->exp[i];
+      multnum++;
+    }else if(divfct->mlt[i]==firstdiv){
+      nmbfct->mlt[i]=divfct->mlt[i];
+      nmbfct->exp[i]=divfct->exp[i]+1;   
+      fdivused=1;
+    }else{
+      int j=i;
+      if(fdivused==0) j=i+1;
+      nmbfct->mlt[j]=divfct->mlt[i];
+      nmbfct->exp[j]=divfct->exp[i];    
+    }
+  }
+}
+void GetFactorization(long int prnum,long int *prlist,
+                            Factorization **fctlist){
+/*fprintf(stdout,"GetFactorization: factorizing first %ld numbers.\n",
+                prnum);*/
+  long int i=0,j=0;
+  Factorization *fct=(Factorization*)malloc(2*sizeof(Factorization)); 
+  long int len=0,isft=0,div=1,firstdiv=1;
+
+  fct->dim=2;
+  fct->mlt=(long int*)malloc(2*sizeof(long int));
+  fct->exp=(long int*)malloc(2*sizeof(long int));
+  for(i=0;i<fct->dim;i++){
+    fct->mlt[i]=1;
+    fct->exp[i]=0;
+  }
+  fct->mlt[0]=2;
+  fct->exp[0]=1;
+  fctlist[2]=fct;
+
+  fct=(Factorization*)malloc(2*sizeof(Factorization));
+  fct->dim=2;
+  fct->mlt=(long int*)malloc(2*sizeof(long int));
+  fct->exp=(long int*)malloc(2*sizeof(long int));
+  for(i=0;i<fct->dim;i++){
+    fct->mlt[i]=1;
+    fct->exp[i]=0;
+  }
+  fct->mlt[0]=3;
+  fct->exp[0]=1;
+  fctlist[3]=fct;
+ 
+  for(i=0;i<prlist[prnum-1];i++){
+    len=0;
+    isft=i;
+    while(isft>0){
+      len++;
+      isft=isft>>1;
+    }
+    fct=(Factorization*)malloc(2*sizeof(Factorization));
+    fct->dim=len;
+    if (len==0) len=1;
+    fct->mlt=(long int*)malloc(len*sizeof(long int));
+    fct->exp=(long int*)malloc(len*sizeof(long int));
+    for(j=0;j<fct->dim;j++){
+      fct->mlt[j]=1;
+      fct->exp[j]=0;
+    }
+    div=1;
+    for(j=0;prlist[j]*prlist[j]<=i;j++){
+      firstdiv=prlist[j];
+      if(i-firstdiv*((long int)i/firstdiv)==0){
+        div=firstdiv;
+        if(firstdiv*firstdiv==i){
+          fct->mlt[0]=firstdiv;
+          fct->exp[0]=2;	  
+	}else{
+	  ExtendFactors(i,firstdiv,fct,fctlist);
+        }
+	break;
+      }
+    }
+    if(div==1){
+      fct->mlt[0]=i;
+      fct->exp[0]=1;   
+    }
+    fctlist[i]=fct;
+/*
+     ShowFactorization(fct);
+*/
+  }
+/*  fprintf(stdout,"GetFactorization: Done.\n"); */
+}
+
+long int adcexp[]={
+  11,13,17,19,23,
+  23,29,31,37,41,	     	  
+  41,43,47,53,59,	     	  
+  3,5,7,11,13};
+long int adcexpS[]={
+  11,13,17,19,23};
+long int adcexpW[]={  
+  2*2,2*2*2*5,2*3,2*2*5,2*3*7,
+  23,29,31,2*2,2*2*19};
+long int adcexpA[]={  
+  2*2,2*2*2*5,2*3,2*2*5,2*3*7,
+  2*19,2*13,2*19,2*2*2*13*19,2*2*2*19*19,                    
+  2*23,2*2*2*2,2*2*2*2*2*23,2*2*2*2*2,2*2*23};
+long int adcexpB[]={  
+  2*2*7,2*2*2*5,2*3*7,2*2*5*7,2*3*7*7,
+  2*19,2*13,2*19,2*2*2*13*19,2*2*2*19*19,                      
+  2*31,2*2*2*2*31,2*2*2*2*2*31,2*2*2*2*2*29,2*2*29,
+  2*43,2*2,2*2,2*2*47,2*2*2*43};  
+long int UpPrimeLim=100000;
+
+typedef struct dc_view{
+  long long int vsize;
+  long int vidx;
+} DC_view;
+
+int CompareSizesByValue( const void* sz0, const void* sz1) {
+long long int *size0=(long long int*)sz0,
+              *size1=(long long int*)sz1;
+  int res=0;
+  if(*size0-*size1>0) res=1;
+  else if(*size0-*size1<0) res=-1;
+  return res;
+}
+int CompareViewsBySize( const void* vw0, const void* vw1) {
+DC_view *lvw0=(DC_view *)vw0, *lvw1=(DC_view *)vw1;
+  int res=0;
+  if(lvw0->vsize>lvw1->vsize) res=1;
+  else if(lvw0->vsize<lvw1->vsize) res=-1;
+  else if(lvw0->vidx>lvw1->vidx) res=1;
+  else if(lvw0->vidx<lvw1->vidx) res=-1;
+  return res;
+}
+
+int CalculateVeiwSizes(ADC_PAR *par){
+  unsigned long long totalInBytes = 0;
+  unsigned long long nViewDims, nCubeTuples = 0;
+ 
+  const char *adcfname=par->filename;
+  int NDID=par->ndid;
+  char clss=par->clss;
+  int dcdim=par->dim;
+  long long int tnum=par->tuplenum;
+  long long int i=0,j=0;
+  Factorization  
+    **fctlist=(Factorization **) calloc(UpPrimeLim,sizeof(Factorization *));
+  long int *prlist=(long int *) calloc(UpPrimeLim,sizeof(long int));
+  int prnum=ListFirstPrimes(UpPrimeLim,prlist);
+  DC_view *dcview=(DC_view *)calloc((1<<dcdim),sizeof(DC_view));
+  const char* vszefname0;
+  char *vszefname=NULL;
+  FILE* view=NULL;
+  int minvn=1, maxvn=(1<<dcdim), vinc=1;
+  long idx=0;
+
+  GetFactorization(prnum,prlist,fctlist); 
+  for(i=1;i<(1<<dcdim);i++){   
+    long long int LCM=1;
+    switch(clss){
+      case 'U':
+        LCM=GetLCM(i,fctlist,adcexp);
+      break;
+      case 'S':
+        LCM=GetLCM(i,fctlist,adcexpS);
+      break;
+      case 'W':
+        LCM=GetLCM(i,fctlist,adcexpW);
+      break;
+      case 'A':
+        LCM=GetLCM(i,fctlist,adcexpA);
+      break;
+      case 'B':
+        LCM=GetLCM(i,fctlist,adcexpB);
+      break;
+    }
+    if(LCM>tnum) LCM=tnum;
+    dcview[i].vsize=LCM;
+    dcview[i].vidx=i;
+  }
+  for(i=0;i<UpPrimeLim;i++){
+    if(!fctlist[i]) continue;
+    if(fctlist[i]->mlt) free(fctlist[i]->mlt); 
+    if(fctlist[i]->exp) free(fctlist[i]->exp); 
+    free(fctlist[i]);
+  }
+  free(fctlist);
+  free(prlist);
+   
+  vszefname0="view.sz";
+  vszefname=(char*)calloc(BlockSize,sizeof(char));
+  sprintf(vszefname,"%s.%s.%d",adcfname,vszefname0,NDID);
+  if(!(view = fopen(vszefname, "w+")) ) {
+    fprintf(stderr,"CalculateVeiwSizes: Can't open file: %s\n",vszefname);
+    return 0;
+  }
+  qsort( dcview, (1<<dcdim), sizeof(DC_view),CompareViewsBySize);	
+
+  switch(clss){
+    case 'U':
+      vinc=1<<3;
+    break;
+    case 'S':
+    break;
+    case 'W':
+    break;
+    case 'A':
+      vinc=1<<6;
+    break;
+    case 'B':
+      vinc=1<<14;
+    break;
+  }
+   for(i=minvn;i<maxvn;i+=vinc){   
+    nViewDims = 0;
+    fprintf(view,"Selection:");
+    idx=dcview[i].vidx;
+    for(j=0;j<dcdim;j++) 
+      if(((idx>>j)&0x1)==1) { fprintf(view," %lld",j+1); nViewDims++;}
+    fprintf(view,"\nView Size: %lld\n",dcview[i].vsize);
+
+    totalInBytes += (8+4*nViewDims)*dcview[i].vsize;
+    nCubeTuples += dcview[i].vsize;
+
+  }
+  fprintf(view,"\nTotal in bytes: %llu  Number of tuples: %llu\n", 
+          totalInBytes, nCubeTuples);
+  
+  fclose(view);
+  free(dcview);
+  fprintf(stdout,"View sizes are written into %s\n",vszefname);
+  free(vszefname);
+  return 1;
+}
+
+int ParseParFile(char* parfname,ADC_PAR *par){
+  char line[BlockSize];
+  FILE* parfile=NULL;
+  char* pos=strchr(parfname,'.');
+  int linenum=0,i=0;
+  const char *kwd;
+
+  if(!(parfile = fopen(parfname, "r")) ) {
+    fprintf(stderr,"ParseParFile: Can't open file: %s\n",parfname);
+    return 0;
+  }
+  if(pos) pos=strchr(pos+1,'.');
+  if(pos) sscanf(pos+1,"%d",&(par->ndid));
+  linenum=0;
+  while(fgets(&line[0],BlockSize,parfile)){
+    i=0;
+    kwd=adcKeyword[i];
+    while(kwd){
+      if(strstr(line,"#")) {
+        ;/*comment line, do nothing*/
+      }else if(strstr(line,kwd)){
+        char *pos=line+strlen(kwd)+1;
+        switch(i){
+          case 0:
+            sscanf(pos,"%d",&(par->dim));
+          break;
+          case 1:
+            sscanf(pos,"%d",&(par->mnum));
+          break;
+          case 2:
+            sscanf(pos,"%lld",&(par->tuplenum));
+          break;
+          case 3:
+/*            sscanf(pos,"%d",&(par->isascii)); */
+          break;
+          case 4:
+            sscanf(pos,"%d",&(par->inverse_endian));
+          break;
+          case 5:
+            par->filename=(char*) malloc((strlen(pos)+1)*sizeof(char));
+            sscanf(pos,"%s",par->filename);
+          break;
+          case 6:
+            sscanf(pos,"%c",&(par->clss));
+          break;
+        }
+        break;        
+      }
+      i++;
+      kwd=adcKeyword[i];
+    }
+    linenum++;
+  }
+  fclose(parfile);
+  switch(par->clss){/* overwriting parameters according the class */
+    case 'S':
+      par->dim=5;
+      par->mnum=1;
+      par->tuplenum=1000;
+    break;
+    case 'W':
+      par->dim=10;
+      par->mnum=1;
+      par->tuplenum=100000;
+    break;
+    case 'A':
+      par->dim=15;
+      par->mnum=1;
+      par->tuplenum=1000000;
+    break;
+    case 'B':
+      par->dim=20;
+      par->mnum=1;
+      par->tuplenum=10000000;
+    break;
+  }  
+  return 1;
+}
+int WriteADCPar(ADC_PAR *par,char* fname){
+  char *lname=(char*) calloc(BlockSize,sizeof(char));
+  FILE *parfile=NULL;
+
+  sprintf(lname,"%s",fname);
+  parfile=fopen(lname,"w");
+  if(!parfile){
+    fprintf(stderr,"WriteADCPar: can't open file %s\n",lname);
+    return 0;
+  }
+  fprintf(parfile,"attrNum=%d\n",par->dim);
+  fprintf(parfile,"measuresNum=%d\n",par->mnum);
+  fprintf(parfile,"tuplesNum=%lld\n",par->tuplenum);
+  fprintf(parfile,"class=%c\n",par->clss);
+/*  fprintf(parfile,"isASCII=%d\n",par->isascii); */
+  fprintf(parfile,"INVERSE_ENDIAN=%d\n",par->inverse_endian);
+  fprintf(parfile,"fileName=%s\n",par->filename);
+  fclose(parfile);
+  return 1;
+}
+void ShowADCPar(ADC_PAR *par){
+  fprintf(stdout,"******************** ADC parameters\n");
+  fprintf(stdout," id		%d\n",par->ndid);
+  fprintf(stdout," attributes 	%d\n",par->dim);
+  fprintf(stdout," measures   	%d\n",par->mnum);
+  fprintf(stdout," tuples     	%lld\n",par->tuplenum);
+  fprintf(stdout," class	\t%c\n",par->clss);
+  fprintf(stdout," filename       %s\n",par->filename);
+  fprintf(stdout,"***********************************\n");
+}
+
+long int adcgen[]={
+  2,7,3,2,2,
+  2,2,5,31,7,
+  2,3,3,3,2,
+  5,2,2,2,3};
+  
+int GetNextTuple(int dcdim, int measnum,
+                 long long int* attr,long long int* meas,
+		 char clss){
+  static int tuplenum=0;
+  static const int maxdim=20;
+  static int measbound=31415;
+  int i=0,j=0;
+  int maxattr=0;
+  static long int seed[20];
+  long int *locexp=NULL;
+
+  if(dcdim>maxdim){
+    fprintf(stderr,"GetNextTuple: number of dimensions is too large: %d\n",
+                    dcdim);
+    return 0;
+  }
+  if(measnum>measbound){
+    fprintf(stderr,"GetNextTuple: number of measures is too large: %d\n",
+                    measnum);
+    return 0;
+  }
+  locexp=adcexp;
+  switch(clss){
+    case 'S':
+    locexp=adcexpS;
+    break;
+    case 'W':
+    locexp=adcexpW;
+    break;
+    case 'A':
+    locexp=adcexpA;
+    break;
+    case 'B':
+    locexp=adcexpB;
+    break;
+  }  
+  if(tuplenum==0){
+    for(i=0;i<dcdim;i++){
+      int tmpgen=adcgen[i];
+      for(j=0;j<locexp[i]-1;j++){
+        tmpgen*=adcgen[i];
+	tmpgen=tmpgen%adcprime[i];
+      }
+      adcgen[i]=tmpgen;
+    }
+    fprintf(stdout,"Prime \tGenerator \tSeed\n");
+    for(i=0;i<dcdim;i++){
+      seed[i]=(adcprime[i]+1)/2;
+      fprintf(stdout," %ld\t %ld\t\t %ld\n",adcprime[i],adcgen[i],seed[i]);
+     }
+  }
+  tuplenum++;
+  maxattr=0;
+  for(i=0;i<dcdim;i++){
+    attr[i]=seed[i]*adcgen[i];
+    attr[i]-=adcprime[i]*((long long int)attr[i]/adcprime[i]); 
+    seed[i]=attr[i];
+    if(seed[i]>maxattr) maxattr=seed[i];
+  }		     	  
+  for(i=0;i<measnum;i++){
+    meas[i]=(long long int)(seed[i]*maxattr);
+    meas[i]-=measbound*(meas[i]/measbound);
+  }		     	  
+  return 1;
+}
+
+int GenerateADC(ADC_PAR *par){
+  int dcdim=par->dim,
+      mesnum=par->mnum,
+      tplnum=par->tuplenum;
+  char *adcfname=(char*)calloc(BlockSize,sizeof(char));
+  
+  FILE *adc;
+  int i=0,j=0;
+  long long int* attr=NULL,*mes=NULL; 
+/*
+   if(par->isascii==1){
+    sprintf(adcfname,"%s.tpl.%d",par->filename,par->ndid);
+    if(!(adc = fopen(adcfname, "w+"))) {
+      fprintf(stderr,"GenerateADC: Can't open file: %s\n",adcfname);
+      return 0;
+    }
+  }else{
+*/
+  sprintf(adcfname,"%s.dat.%d",par->filename,par->ndid);
+    if(!(adc = fopen(adcfname, "wb+"))){
+      fprintf(stderr,"GenerateADC: Can't open file: %s\n",adcfname);
+       return 0;
+    }
+/*  } */
+  attr=(long long int *)malloc(dcdim*sizeof(long long int));
+  mes=(long long int *)malloc(mesnum*sizeof(long long int));
+
+  fprintf(stdout,"\nGenerateADC: writing %d tuples of %d attributes and %d measures to %s\n",
+		  tplnum,dcdim,mesnum,adcfname);
+   for(i=0;i<tplnum;i++){
+    if(!GetNextTuple(dcdim,mesnum,attr,mes,par->clss)) return 0;
+/*
+     if(par->isascii==1){
+      for(int j=0;j<dcdim;j++)fprintf(adc,"%lld ",attr[j]);
+      for(int j=0;j<mesnum;j++)fprintf(adc,"%lld ",mes[j]);
+      fprintf(adc,"\n");
+    }else{
+*/
+      for(j=0;j<mesnum;j++){ 
+    	long long mv =  mes[j];
+	    if(par->inverse_endian==1) swap8(&mv);
+	    fwrite(&mv, 8, 1, adc); 
+      }
+      for(j=0;j<dcdim;j++){ 
+    	int av = attr[j]; 
+	if(par->inverse_endian==1) swap4(&av);
+	fwrite(&av, 4, 1, adc); 
+      }
+    }
+/*  } */
+  fclose(adc);
+  fprintf(stdout,"Binary ADC file %s ",adcfname);
+  fprintf(stdout,"has been generated.\n");
+  free(attr);
+  free(mes);
+  free(adcfname);
+  CalculateVeiwSizes(par);
+  return 1;
+}
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/DC/adc.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/DC/adc.h
new file mode 100644
index 0000000..e11f243
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/DC/adc.h
@@ -0,0 +1,167 @@
+#ifndef adc_h
+#define adc_h 1
+
+/* For checking of L2-cache performance influence */ 
+/*#define IN_CORE_*/
+/*#define VIEW_FILE_OUTPUT*/ /* it can be used with IN_CORE only */
+
+/* Optimizations: prefixed views and share-sorted views */
+/*#define OPTIMIZATION*/
+
+#ifdef WINNT
+#ifndef HAS_INT64
+typedef __int64             int64;
+typedef int                 int32;
+#endif
+typedef unsigned __int64   uint64;
+typedef unsigned int       uint32;
+#else
+#ifndef HAS_INT64
+typedef long long           int64;
+typedef int                 int32;
+#endif
+typedef unsigned long long uint64;
+typedef unsigned int       uint32;
+#endif
+
+#include "adcc.h"
+#include "rbt.h"
+
+static int measbound=31415;   /* upper limit on a view measure bound */
+
+enum { smallestParent, prefixedParent, sharedSortParent, noneParent };
+
+static const char* adcKeyword[]={
+  "attrNum",
+  "measuresNum",
+  "tuplesNum",
+  "INVERSE_ENDIAN",
+  "fileName",
+  "class",
+  NULL
+};
+
+typedef struct ADCpar{
+  int ndid;
+  int dim;
+  int mnum;
+  long long int tuplenum;
+  int inverse_endian;
+  const char *filename;
+  char clss;
+} ADC_PAR;
+
+typedef struct {
+    int32 ndid;
+   char   clss;
+   char          adcName[MAX_FILE_FULL_PATH_SIZE];
+   char   adcInpFileName[MAX_FILE_FULL_PATH_SIZE];
+   uint32 nd; 
+   uint32 nm;
+   uint32 nInputRecs;
+   uint32 memoryLimit;
+   uint32 nTasks;
+   /*  FILE *statf; */
+} ADC_VIEW_PARS;
+
+typedef struct job_pool{ 
+   uint32 grpb; 
+   uint32 nv;
+   uint32 nRows; 
+    int64 viewOffset; 
+} JOB_POOL;
+
+typedef struct layer{
+   uint32 layerIndex;
+   uint32 layerQuantityLimit;
+   uint32 layerCurrentPopulation;
+} LAYER;
+
+typedef struct chunks{
+   uint32 curChunkNum;
+    int64 chunkOffset;
+   uint32 posSubChunk;
+   uint32 curSubChunk;
+} CHUNKS;
+
+typedef struct tuplevsize {
+    uint64 viewsize;
+    uint64 tuple;
+} TUPLE_VIEWSIZE;
+
+typedef struct tupleones {
+    uint32 nOnes;
+    uint64 tuple;
+} TUPLE_ONES;
+
+typedef struct {
+   char adcName[MAX_FILE_FULL_PATH_SIZE];
+   uint32 retCode;
+   uint32 verificationFailed;
+   uint32 swapIt;
+   uint32 nTasks;
+   uint32 taskNumber;
+    int32 ndid;
+
+   uint32 nTopDims; /* given number of dimension attributes */
+   uint32 nm;       /* number of measures */ 
+   uint32 nd;       /* number of parent's dimensions */
+   uint32 nv;       /* number of child's dimensions */
+
+   uint32 nInputRecs;
+   uint32 nViewRows; 
+   uint32 totalOfViewRows;
+   uint32 nParentViewRows;
+
+    int64 viewOffset;
+    int64 accViewFileOffset;
+
+   uint32 inpRecSize;
+   uint32 outRecSize;
+
+   uint32 memoryLimit;
+ unsigned char * memPool;
+   uint32 * inpDataBuffer;
+
+   RBTree *tree;
+
+   uint32 numberOfChunks;
+   CHUNKS *chunksParams;
+
+     char       adcLogFileName[MAX_FILE_FULL_PATH_SIZE];
+     char          inpFileName[MAX_FILE_FULL_PATH_SIZE];
+     char         viewFileName[MAX_FILE_FULL_PATH_SIZE];
+     char       chunksFileName[MAX_FILE_FULL_PATH_SIZE];
+     char      groupbyFileName[MAX_FILE_FULL_PATH_SIZE];
+     char adcViewSizesFileName[MAX_FILE_FULL_PATH_SIZE];
+     char    viewSizesFileName[MAX_FILE_FULL_PATH_SIZE];
+
+     FILE *logf;
+     FILE *inpf;
+     FILE *viewFile;   
+     FILE *fileOfChunks;
+     FILE *groupbyFile;
+     FILE *adcViewSizesFile;
+     FILE *viewSizesFile;
+   
+    int64     mSums[MAX_NUM_OF_MEAS];
+   uint32 selection[MAX_NUM_OF_DIMS];
+    int64 checksums[MAX_NUM_OF_MEAS]; /* view checksums */
+    int64 totchs[MAX_NUM_OF_MEAS];    /* checksums of a group of views */
+
+ JOB_POOL *jpp;
+    LAYER *lpp;
+   uint32 nViewLimit;
+   uint32 groupby;
+   uint32 smallestParentLevel;
+   uint32 parBinRepTuple;
+   uint32 nRowsToRead;
+   uint32 fromParent;
+
+   uint64 totalViewFileSize; /* in bytes */
+   uint32 numberOfMadeViews;
+   uint32 numberOfViewsMadeFromInput;
+   uint32 numberOfPrefixedGroupbys;
+   uint32 numberOfSharedSortGroupbys;
+} ADC_VIEW_CNTL;
+#endif /* adc_h */
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/DC/adcc.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/DC/adcc.h
new file mode 100644
index 0000000..fe52718
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/DC/adcc.h
@@ -0,0 +1,82 @@
+/*
+!-------------------------------------------------------------------------!
+!				                                    	                  !
+!		           N A S   G R I D   B E N C H M A R K S                  !
+!									                                      !
+!		                	C + +	V E R S I O N		                  !
+!									                                      !
+!			                       A D C C . H 		                      !
+!									                                      !
+!-------------------------------------------------------------------------!
+!									                                      !
+!    This file contains constant definitions used for                     !
+!    building views.                                                      !
+!									                                      !
+!    Permission to use, copy, distribute and modify this software	      !
+!    for any purpose with or without fee is hereby granted.		          !
+!    We request, however, that all derived work reference the		      !
+!    NAS Grid Benchmarks 3.0 or GridNPB3.0. This software is provided	  !
+!    "as is" without expressed or implied warranty.			              !
+!									                                      !
+!    Information on GridNPB3.0, including the concept of		          !
+!    the NAS Grid Benchmarks, the specifications, source code,  	      !
+!    results and information on how to submit new results,		          !
+!    is available at:							                          !
+!									                                      !
+!	  http://www.nas.nasa.gov/Software/NPB  			                  !
+!									                                      !
+!    Send comments or suggestions to  ngb@nas.nasa.gov  		          !
+!    Send bug reports to	      ngb@nas.nasa.gov  		              !
+!									                                      !
+!	   E-mail:  ngb@nas.nasa.gov					                      !
+!	   Fax:     (650) 604-3957					                          !
+!									                                      !
+!-------------------------------------------------------------------------!
+! GridNPB3.0 C++ version						                          !
+!	  Michael Frumkin, Leonid Shabanov				                      !
+!-------------------------------------------------------------------------!
+*/
+#ifndef _ADCC_CONST_DEFS_H_
+#define _ADCC_CONST_DEFS_H_
+
+/*#define WINNT*/
+#define UNIX
+
+#define ADC_OK                        0
+#define ADC_WRITE_FAILED              1
+#define ADC_INTERNAL_ERROR            2
+#define ADC_TREE_DESTROY_FAILURE      3
+#define ADC_FILE_OPEN_FAILURE         4
+#define ADC_MEMORY_ALLOCATION_FAILURE 5
+#define ADC_FILE_DELETE_FAILURE       6
+#define ADC_VERIFICATION_FAILED       7
+#define ADC_SHMEMORY_FAILURE          8
+
+#define SSA_BUFFER_SIZE     (1024*1024)
+#define MAX_NUMBER_OF_TASKS         256
+
+#define MAX_PAR_FILE_LINE_SIZE      512
+#define MAX_FILE_FULL_PATH_SIZE     512
+#define MAX_ADC_NAME_SIZE            32
+
+#define DIM_FSZ                       4
+#define MSR_FSZ                       8
+
+#define MAX_NUM_OF_DIMS              20
+#define MAX_NUM_OF_MEAS               4
+
+#define MAX_NUM_OF_CHUNKS          1024      
+#define MAX_PARAM_LINE_SIZE        1024
+
+#define OUTPUT_BUFFER_SIZE (MAX_NUM_OF_DIMS + (MSR_FSZ/4)*MAX_NUM_OF_MEAS)
+#define MAX_VIEW_REC_SIZE ((DIM_FSZ*MAX_NUM_OF_DIMS)+(MSR_FSZ*MAX_NUM_OF_MEAS))     
+#define MAX_VIEW_ROW_SIZE_IN_INTS (MAX_NUM_OF_DIMS + 2*MAX_NUM_OF_MEAS)
+#define MLB32  0x80000000
+
+#ifdef WINNT
+#define MLB    0x8000000000000000
+#else
+#define MLB 0x8000000000000000LL
+#endif
+
+#endif /*  _ADCC_CONST_DEFS_H_ */
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/DC/dc.c b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/DC/dc.c
new file mode 100644
index 0000000..6984897
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/DC/dc.c
@@ -0,0 +1,292 @@
+/*
+!-------------------------------------------------------------------------!
+!                                                                         !
+!        N  A  S     P A R A L L E L     B E N C H M A R K S  3.3         !
+!                                                                         !
+!                      S E R I A L     V E R S I O N                      !
+!                                                                         !
+!                                   D C                                   !
+!                                                                         !
+!-------------------------------------------------------------------------!
+!                                                                         !
+!    DC creates all specified data-cube views.                            !
+!    Refer to NAS Technical Report 03-005 for details.                    !
+!    It calculates all groupbys in a top down manner using well known     !
+!    heuristics and optimizations.                                        !
+!                                                                         !
+!    Permission to use, copy, distribute and modify this software         !
+!    for any purpose with or without fee is hereby granted.  We           !
+!    request, however, that all derived work reference the NAS            !
+!    Parallel Benchmarks 3.3. This software is provided "as is"           !
+!    without express or implied warranty.                                 !
+!                                                                         !
+!    Information on NPB 3.3, including the technical report, the          !
+!    original specifications, source code, results and information        !
+!    on how to submit new results, is available at:                       !
+!                                                                         !
+!           http://www.nas.nasa.gov/Software/NPB/                         !
+!                                                                         !
+!    Send comments or suggestions to  npb@nas.nasa.gov                    !
+!                                                                         !
+!          NAS Parallel Benchmarks Group                                  !
+!          NASA Ames Research Center                                      !
+!          Mail Stop: T27A-1                                              !
+!          Moffett Field, CA   94035-1000                                 !
+!                                                                         !
+!          E-mail:  npb@nas.nasa.gov                                      !
+!          Fax:     (650) 604-3957                                        !
+!                                                                         !
+!-------------------------------------------------------------------------!
+! Author: Michael Frumkin                                                 !
+!         Leonid Shabanov                                                 !
+!-------------------------------------------------------------------------!
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+#include <ctype.h>
+#include <math.h>
+
+#include "adc.h"
+#include "macrodef.h"
+#include "npbparams.h"
+
+#ifdef UNIX
+#include <sys/types.h>
+#include <unistd.h>
+
+#define MAX_TIMERS 64  /* NPB maximum timers */
+  void    timer_clear(int);
+  void    timer_start(int);
+  void    timer_stop(int); 
+  double  timer_read(int);
+#endif
+
+void c_print_results( char   *name,
+                      char   clss,
+                      int    n1, 
+                      int    n2,
+                      int    n3,
+                      int    niter,
+                      double t,
+                      double mops,
+		      char   *optype,
+                      int    passed_verification,
+                      char   *npbversion,
+                      char   *compiletime,
+                      char   *cc,
+                      char   *clink,
+                      char   *c_lib,
+                      char   *c_inc,
+                      char   *cflags,
+                      char   *clinkflags );
+
+void initADCpar(ADC_PAR *par);
+int ParseParFile(char* parfname, ADC_PAR *par); 
+int GenerateADC(ADC_PAR *par);
+void ShowADCPar(ADC_PAR *par);
+int32 DC(ADC_VIEW_PARS *adcpp);
+int Verify(long long int checksum,ADC_VIEW_PARS *adcpp);
+
+#define BlockSize 1024
+
+int main ( int argc, char * argv[] ) 
+{
+  ADC_PAR *parp;
+  ADC_VIEW_PARS *adcpp;
+  int32 retCode;
+
+  fprintf(stdout,"\n\n NAS Parallel Benchmarks (NPB3.3-SER) - DC Benchmark\n\n" );
+  if(argc!=3){
+    fprintf(stdout," No parameter file. Using compiled defaults\n");
+  }
+  if(argc>3 || (argc>1 && !isdigit((unsigned char)argv[1][0]))){
+    fprintf(stderr,"Usage: <program name> <amount of memory>\n");
+    fprintf(stderr,"       <file of parameters>\n");
+    fprintf(stderr,"Example: bin/dc.S 1000000 DC/ADC.par\n");
+    fprintf(stderr,"The last argument (a parameter file) can be skipped\n");
+    exit(1);
+  }
+
+  if(  !(parp = (ADC_PAR*) malloc(sizeof(ADC_PAR)))
+     ||!(adcpp = (ADC_VIEW_PARS*) malloc(sizeof(ADC_VIEW_PARS)))){
+     PutErrMsg("main: malloc failed")
+     exit(1);
+  }
+  initADCpar(parp);
+  parp->clss=CLASS;
+  if(argc!=3){
+    parp->dim=attrnum;
+    parp->tuplenum=input_tuples;    
+  }else if( (argc==3)&&(!ParseParFile(argv[2], parp))) {
+    PutErrMsg("main.ParseParFile failed")
+    exit(1);
+  }
+  ShowADCPar(parp); 
+  if(!GenerateADC(parp)) {
+     PutErrMsg("main.GenerateAdc failed")
+     exit(1);
+  }
+
+  adcpp->ndid = parp->ndid;  
+  adcpp->clss = parp->clss;
+  adcpp->nd = parp->dim;
+  adcpp->nm = parp->mnum;
+  adcpp->nTasks = 1;
+  if(argc>=2)
+    adcpp->memoryLimit = atoi(argv[1]);
+  else
+    adcpp->memoryLimit = 0;
+  if(adcpp->memoryLimit <= 0){
+    /* size of rb-tree with tuplenum nodes */
+    adcpp->memoryLimit = parp->tuplenum*(50+5*parp->dim); 
+    fprintf(stdout,"Estimated rb-tree size = %u \n", adcpp->memoryLimit);
+  }
+  adcpp->nInputRecs = parp->tuplenum;
+  strcpy(adcpp->adcName, parp->filename);
+  strcpy(adcpp->adcInpFileName, parp->filename);
+
+  if((retCode=DC(adcpp))) {
+     PutErrMsg("main.DC failed")
+     fprintf(stderr, "main.DC failed: retcode = %d\n", retCode);
+     exit(1);
+  }
+
+  if(parp)  { free(parp);   parp = 0; }
+  if(adcpp) { free(adcpp); adcpp = 0; }
+  return 0;
+}
+
+int32		 CloseAdcView(ADC_VIEW_CNTL *adccntl);  
+int32		 PartitionCube(ADC_VIEW_CNTL *avp);				
+ADC_VIEW_CNTL *NewAdcViewCntl(ADC_VIEW_PARS *adcpp, uint32 pnum);
+int32		 ComputeGivenGroupbys(ADC_VIEW_CNTL *adccntl);
+
+int32 DC(ADC_VIEW_PARS *adcpp) {
+   int32 itsk=0;
+   double t_total=0.0;
+   int verified;
+
+   typedef struct { 
+      int    verificationFailed;
+      uint32 totalViewTuples;
+      uint64 totalViewSizesInBytes;
+      uint32 totalNumberOfMadeViews;
+      uint64 checksum;
+      double tm_max;
+   } PAR_VIEW_ST;
+   
+   PAR_VIEW_ST *pvstp;
+   ADC_VIEW_CNTL *adccntlp;
+
+   pvstp = (PAR_VIEW_ST*) malloc(sizeof(PAR_VIEW_ST));
+   pvstp->verificationFailed = 0;
+   pvstp->totalViewTuples = 0;
+   pvstp->totalViewSizesInBytes = 0;
+   pvstp->totalNumberOfMadeViews = 0;
+   pvstp->checksum = 0;
+   
+   adccntlp = NewAdcViewCntl(adcpp, itsk);
+   if (!adccntlp) { 
+      PutErrMsg("DC.NewAdcViewCntl: returned NULL")
+      return ADC_INTERNAL_ERROR;
+   }else{
+     if (adccntlp->retCode!=0) {
+   	fprintf(stderr, 
+   		 "DC.NewAdcViewCntl: return code = %d\n",
+   						adccntlp->retCode); 
+     }
+   }
+   if( PartitionCube(adccntlp) ) {
+      PutErrMsg("DC.PartitionCube failed");
+   }
+   timer_clear(itsk);
+   timer_start(itsk);
+   if( ComputeGivenGroupbys(adccntlp) ) {
+      PutErrMsg("DC.ComputeGivenGroupbys failed");
+   }
+   timer_stop(itsk);
+   pvstp->tm_max = timer_read(itsk);
+   pvstp->verificationFailed += adccntlp->verificationFailed;
+   if (!adccntlp->verificationFailed) {
+     pvstp->totalNumberOfMadeViews += adccntlp->numberOfMadeViews;
+     pvstp->totalViewSizesInBytes += adccntlp->totalViewFileSize;
+     pvstp->totalViewTuples += adccntlp->totalOfViewRows;
+     pvstp->checksum += adccntlp->totchs[0];
+   }   
+   if(CloseAdcView(adccntlp)) {
+     PutErrMsg("DC.CloseAdcView failed");
+     adccntlp->verificationFailed = 1;
+   }
+
+   t_total=pvstp->tm_max; 
+
+   pvstp->verificationFailed=Verify(pvstp->checksum,adcpp);
+   verified = (pvstp->verificationFailed == -1)? -1 :
+              (pvstp->verificationFailed ==  0)?  1 : 0;
+
+   fprintf(stdout,"\n*** DC Benchmark Results:\n");
+   fprintf(stdout," Benchmark Time   = %20.3f\n", t_total);
+   fprintf(stdout," Input Tuples     =         %12d\n", (int) adcpp->nInputRecs);
+   fprintf(stdout," Number of Views  =         %12d\n",
+           (int) pvstp->totalNumberOfMadeViews);
+   fprintf(stdout," Number of Tasks  =         %12d\n", (int) adcpp->nTasks);
+   fprintf(stdout," Tuples Generated = %20.0f\n",
+           (double) pvstp->totalViewTuples);
+   fprintf(stdout," Tuples/s         = %20.2f\n", 
+           (double) pvstp->totalViewTuples / t_total);
+   fprintf(stdout," Checksum         = %20.12e\n", (double) pvstp->checksum);
+   if (pvstp->verificationFailed)
+      fprintf(stdout, " Verification failed\n");
+
+   c_print_results("DC",
+  		   adcpp->clss,
+  		   (int)adcpp->nInputRecs,
+                   0,
+                   0,
+                   1,
+  		   t_total,
+  		   (double) pvstp->totalViewTuples * 1.e-6 / t_total, 
+  		   "Tuples generated", 
+  		   verified,
+  		   NPBVERSION,
+  		   COMPILETIME,
+  		   CC,
+  		   CLINK,
+  		   C_LIB,
+  		   C_INC,
+  		   CFLAGS,
+  		   CLINKFLAGS); 
+   return ADC_OK;
+}
+
+long long checksumS=464620213;
+long long checksumWlo=434318;
+long long checksumWhi=1401796;
+long long checksumAlo=178042;
+long long checksumAhi=7141688;
+long long checksumBlo=700453;
+long long checksumBhi=9348365;
+
+int Verify(long long int checksum,ADC_VIEW_PARS *adcpp){
+  switch(adcpp->clss){
+    case 'S':
+      if(checksum==checksumS) return 0;
+      break;
+    case 'W':
+      if(checksum==checksumWlo+1000000*checksumWhi) return 0;
+      break;
+    case 'A':
+      if(checksum==checksumAlo+1000000*checksumAhi) return 0;
+      break;
+    case 'B':
+      if(checksum==checksumBlo+1000000*checksumBhi) return 0;
+      break;
+    default:
+      return -1; /* CLASS U */
+  }
+  return 1;
+}
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/DC/extbuild.c b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/DC/extbuild.c
new file mode 100644
index 0000000..3550537
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/DC/extbuild.c
@@ -0,0 +1,988 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include "adc.h"
+#include "macrodef.h"
+#include "protots.h"
+
+#ifdef UNIX
+#include <errno.h>
+#endif
+
+extern int32 computeChecksum(ADC_VIEW_CNTL *avp,treeNode *t,uint64 *ordern);
+extern int32 WriteViewToDiskCS(ADC_VIEW_CNTL *avp,treeNode *t,uint64 *ordern);
+
+int32 ReadWholeInputData(ADC_VIEW_CNTL *avp, FILE *inpf){
+  uint32 iRec = 0;
+  uint32 inpBufferLineSize, inpBufferPace, inpRecSize, ib = 0;
+
+  FSEEK(inpf, 0L, SEEK_SET);
+  inpRecSize = 8*avp->nm+4*avp->nTopDims;
+  inpBufferLineSize = inpRecSize;
+  if (inpBufferLineSize%8) inpBufferLineSize += 4;
+  inpBufferPace = inpBufferLineSize/4;
+
+  while(fread(&avp->inpDataBuffer[ib], inpRecSize, 1, inpf)){
+     iRec++;
+     ib += inpBufferPace;      
+  }
+  avp->nRowsToRead = iRec;
+  FSEEK(inpf, 0L, SEEK_SET);
+  
+  if(avp->nInputRecs != iRec){
+     fprintf(stderr, " ReadWholeInputData(): unexpected number of input records read.\n");
+     return ADC_INTERNAL_ERROR;
+  }  
+  return ADC_OK;
+}
+int32 ComputeMemoryFittedView (ADC_VIEW_CNTL *avp){
+  uint32 iRec = 0;
+  uint32 viewBuf[MAX_VIEW_ROW_SIZE_IN_INTS];
+  uint32 inpBufferLineSize, inpBufferPace, inpRecSize,ib;
+  uint64 ordern=0;
+#ifdef VIEW_FILE_OUTPUT
+  uint32 retCode;
+#endif
+
+  FSEEK(avp->viewFile, 0L, SEEK_END);
+  inpRecSize = 8*avp->nm+4*avp->nTopDims;
+  inpBufferLineSize = inpRecSize;
+  if (inpBufferLineSize%8) inpBufferLineSize += 4;
+  inpBufferPace = inpBufferLineSize/4;
+
+  InitializeTree(avp->tree, avp->nv, avp->nm);
+
+  ib=0;
+  for ( iRec = 1; iRec <= avp->nRowsToRead; iRec++ ){ 
+      SelectToView( &avp->inpDataBuffer[ib], avp->selection, viewBuf, 
+  		             avp->nd, avp->nm, avp->nv );
+      ib += inpBufferPace;
+      TreeInsert(avp->tree, viewBuf);
+      if(avp->tree->memoryIsFull){
+  	fprintf(stderr, "ComputeMemoryFittedView(): Not enough memory.\n");
+  	return 1; 
+      }
+  }
+
+#ifdef VIEW_FILE_OUTPUT
+  if( (retCode = WriteViewToDiskCS(avp, avp->tree->root.left,&ordern)) ){
+    fprintf(stderr, "ComputeMemoryFittedView(): Write error occurred.\n");
+    return retCode;
+  }
+#else
+  computeChecksum(avp,avp->tree->root.left,&ordern);
+#endif
+ 
+  avp->nViewRows = avp->tree->count;
+  avp->totalOfViewRows += avp->nViewRows; 			      
+  InitializeTree(avp->tree, avp->nv, avp->nm);
+  return ADC_OK;
+}
+
+int32 SharedSortAggregate(ADC_VIEW_CNTL *avp){
+   int32 retCode;
+  uint32 iRec = 0;
+  uint32   attrs[MAX_VIEW_ROW_SIZE_IN_INTS]; 
+  uint32 currBuf[MAX_VIEW_ROW_SIZE_IN_INTS]; 
+   int64 chunkOffset = 0;
+   int64 inpfOffset;
+  uint32 nPart = 0;
+  uint32 prevV;
+  uint32 currV;
+  uint32 total = 0;
+  unsigned char *ib;
+  uint32 ibsize = SSA_BUFFER_SIZE;
+  uint32 nib;
+  uint32 iib;
+  uint32 nreg;
+  uint32 nlst;
+  uint32 nsgs;
+  uint32 ncur;
+  uint32 ibOffset = 0;
+  uint64 ordern=0;
+   
+  ib = (unsigned char*) malloc(ibsize); 
+  if (!ib){ 
+    fprintf(stderr,"SharedSortAggregate: memory allocation failed\n"); 
+    return ADC_MEMORY_ALLOCATION_FAILURE; 
+  }
+  
+  nib = ibsize/avp->inpRecSize;
+  nsgs = avp->nRowsToRead/nib;
+  
+  if (nsgs == 0){
+      nreg = avp->nRowsToRead; 
+      nlst = nreg; 
+      nsgs = 1; 
+  }else{
+     nreg = nib;
+     if (avp->nRowsToRead%nib) {
+       nsgs++; 
+       nlst = avp->nRowsToRead%nib;
+     }else{
+       nlst = nreg;			   
+     }
+  }
+  
+  avp->nViewRows = 0; 
+  for( iib = 1; iib <= nsgs; iib++ ){ 
+    if(iib > 1) FSEEK(avp->viewFile, inpfOffset, SEEK_SET);
+    if( iib == nsgs ) ncur = nlst; else ncur = nreg;
+    	  
+    fread(ib, ncur*avp->inpRecSize, 1, avp->viewFile);
+    inpfOffset = ftell(avp->viewFile);
+
+    for( ibOffset = 0, iRec = 1; iRec <= ncur; iRec++ ){
+      memcpy(attrs, &ib[ibOffset], avp->inpRecSize);
+      ibOffset += avp->inpRecSize;
+      SelectToView(attrs, avp->selection, currBuf, avp->nd, avp->nm, avp->nv); 
+      currV = currBuf[2*avp->nm];
+
+      if(iib == 1 && iRec == 1){ 
+        prevV = currV; 
+        nPart = 1;
+        InitializeTree(avp->tree, avp->nv, avp->nm);
+        TreeInsert(avp->tree, currBuf);
+      }else{
+         if (currV == prevV){
+            nPart++;
+	    TreeInsert (avp->tree, currBuf);
+            if (avp->tree->memoryIsFull){
+	      avp->chunksParams[avp->numberOfChunks].curChunkNum =
+	                                             avp->tree->count;
+	      avp->chunksParams[avp->numberOfChunks].chunkOffset = chunkOffset;
+              (avp->numberOfChunks)++;
+	      if(avp->numberOfChunks >= MAX_NUM_OF_CHUNKS){
+                fprintf(stderr,"Too many chunks were created.\n"); 
+		exit(1);
+              }
+              chunkOffset += (uint64)(avp->tree->count*avp->outRecSize);
+              retCode=WriteChunkToDisk(avp->outRecSize, avp->fileOfChunks,
+	                               avp->tree->root.left, avp->logf);                                       
+              if(retCode!=ADC_OK){
+		fprintf(stderr,"SharedSortAggregate: Write error occurred.\n");
+		return retCode;
+	      }
+              InitializeTree(avp->tree, avp->nv, avp->nm);
+	    } /* memoryIsFull */
+         }else{
+	   if(avp->numberOfChunks && avp->tree->count!=0){ 
+	     avp->chunksParams[avp->numberOfChunks].curChunkNum =
+	        				     avp->tree->count;
+	     avp->chunksParams[avp->numberOfChunks].chunkOffset = chunkOffset;
+             (avp->numberOfChunks)++;
+             chunkOffset += 
+	    	      (uint64)(avp->tree->count*(4*avp->nv + 8*avp->nm));
+	     retCode=WriteChunkToDisk( avp->outRecSize, avp->fileOfChunks,
+	   				 avp->tree->root.left, avp->logf);
+             if(retCode!=ADC_OK){
+	       fprintf(stderr,"SharedSortAggregate: Write error occurred.\n");
+	       return retCode;    
+	      }
+	    }
+            FSEEK(avp->viewFile, 0L, SEEK_END);
+            if(!avp->numberOfChunks){
+               avp->nViewRows += avp->tree->count;
+	       retCode = WriteViewToDiskCS(avp, avp->tree->root.left,&ordern);
+	       if(retCode!=ADC_OK){
+	          fprintf(stderr,
+	        	 "SharedSortAggregate: Write error occurred.\n");
+	          return retCode;
+	       }
+ 	     }else{
+	       retCode=MultiWayMerge(avp);
+	       if(retCode!=ADC_OK) {
+	         fprintf(stderr,"SharedSortAggregate.MultiWayMerge: failed.\n");
+	         return retCode;
+	       } 
+	     }
+             InitializeTree(avp->tree, avp->nv, avp->nm);
+             TreeInsert(avp->tree, currBuf);
+             total += nPart;
+             nPart = 1;
+          }
+       }
+       prevV = currV;
+    } /* iRec */
+  } /* iib */
+
+  if(avp->numberOfChunks && avp->tree->count!=0) { 
+    avp->chunksParams[avp->numberOfChunks].curChunkNum = avp->tree->count;
+    avp->chunksParams[avp->numberOfChunks].chunkOffset = chunkOffset;
+    (avp->numberOfChunks)++;
+    chunkOffset += (uint64)(avp->tree->count*(4*avp->nv + 8*avp->nm));
+    retCode=WriteChunkToDisk(avp->outRecSize, avp->fileOfChunks,
+    			     avp->tree->root.left, avp->logf);
+    if(retCode!=ADC_OK){
+      fprintf(stderr,"SharedSortAggregate: Write error occurred.\n");
+      return retCode;	 
+    }
+  }
+  FSEEK(avp->viewFile, 0L, SEEK_END);
+  if(!avp->numberOfChunks){
+    avp->nViewRows += avp->tree->count;
+    if( (retCode = WriteViewToDiskCS(avp, avp->tree->root.left,&ordern)) ){
+      fprintf(stderr, "SharedSortAggregate: Write error occurred.\n");
+      return retCode;
+    }	 
+  }else{
+     retCode=MultiWayMerge(avp);
+     if(retCode!=ADC_OK) {
+       fprintf(stderr,"SharedSortAggregate.MultiWayMerge failed.\n");
+       return retCode;
+     } 
+  }
+  FSEEK(avp->fileOfChunks, 0L, SEEK_SET);
+  
+  total += nPart;
+  avp->totalOfViewRows += avp->nViewRows;
+  if(ib) free(ib);
+  return  ADC_OK;
+}
+int32 PrefixedAggregate(ADC_VIEW_CNTL *avp, FILE *iof){
+   uint32 i;
+   uint32 iRec = 0;
+   uint32   attrs[MAX_VIEW_ROW_SIZE_IN_INTS]; 
+   uint32 aggrBuf[MAX_VIEW_ROW_SIZE_IN_INTS];
+   uint32 currBuf[MAX_VIEW_ROW_SIZE_IN_INTS]; 
+   uint32 prevBuf[MAX_VIEW_ROW_SIZE_IN_INTS];
+    int64 *aggrmp;
+    int64 *currmp;
+    int32 compRes;
+   uint32 nOut = 0; 
+   uint32 mpOffset = 0;
+   uint32 nOutBufRecs;
+   uint32 nViewRows = 0;
+    int64 inpfOffset;
+
+    aggrmp = (int64*) &aggrBuf[0];
+    currmp = (int64*) &currBuf[0];
+    
+    for(i = 0; i < 2*avp->nm+avp->nv; i++){prevBuf[i] = 0; aggrBuf[i] = 0;}
+    nOutBufRecs = avp->memoryLimit/avp->outRecSize;
+
+    for(iRec = 1; iRec <= avp->nRowsToRead; iRec++ ){ 
+      fread(attrs, avp->inpRecSize, 1, iof);
+      SelectToView(attrs, avp->selection, currBuf, avp->nd, avp->nm, avp->nv);
+      if (iRec == 1) memcpy(aggrBuf, currBuf, avp->outRecSize);
+      else{
+       compRes = KeyComp( &currBuf[2*avp->nm], &prevBuf[2*avp->nm], avp->nv);
+
+       switch(compRes){
+	  case  1: 
+	    memcpy(&avp->memPool[mpOffset], aggrBuf, avp->outRecSize);
+	    mpOffset += avp->outRecSize;
+	    nOut++;
+	    for ( i = 0; i < avp->nm; i++ ){
+	      avp->mSums[i] += aggrmp[i];
+	      avp->checksums[i] += nOut*aggrmp[i]%measbound;
+	    }    
+	    memcpy(aggrBuf, currBuf, avp->outRecSize);
+	    break;
+	  case  0: 
+	    for ( i = 0; i < avp->nm; i++ ) aggrmp[i] += currmp[i];
+	    break;
+	  case -1: 
+	    fprintf(stderr,"PrefixedAggregate: wrong parent view order.\n"); 
+	    exit(1);
+	    break; 
+	  default: 
+	    fprintf(stderr,"PrefixedAggregate: wrong KeyComp() result.\n"); 
+	    exit(1);
+	    break;
+       }     
+    
+       if (nOut == nOutBufRecs){
+	     inpfOffset = ftell(iof);
+	     FSEEK(iof, 0L, SEEK_END);
+	     WriteToFile(avp->memPool, nOut*avp->outRecSize, 1, iof, stderr);
+	     FSEEK(iof, inpfOffset, SEEK_SET);
+	     mpOffset = 0;
+	     nViewRows += nOut;
+	     nOut = 0; 
+       }
+     }
+     memcpy(prevBuf, currBuf, avp->outRecSize);
+   }
+   memcpy(&avp->memPool[mpOffset], aggrBuf, avp->outRecSize);
+   nOut++;
+   for ( i = 0; i < avp->nm; i++ ){
+     avp->mSums[i] += aggrmp[i];
+     avp->checksums[i] += nOut*aggrmp[i]%measbound;
+   }
+   FSEEK(iof, 0L, SEEK_END);
+   WriteToFile(avp->memPool, nOut*avp->outRecSize, 1, iof, stderr);
+   avp->nViewRows	 = nViewRows+nOut;
+   avp->totalOfViewRows += avp->nViewRows;
+   return ADC_OK;
+}
+int32 RunFormation (ADC_VIEW_CNTL *avp, FILE *inpf){
+   uint32 iRec = 0;
+   uint32 viewBuf[MAX_VIEW_ROW_SIZE_IN_INTS];
+   uint32   attrs[MAX_VIEW_ROW_SIZE_IN_INTS]; 
+    int64 chunkOffset = 0;
+
+   InitializeTree(avp->tree, avp->nv, avp->nm);
+
+   for(iRec = 1; iRec <= avp->nRowsToRead; iRec++ ){ 
+     fread(attrs, avp->inpRecSize, 1, inpf);
+     SelectToView(attrs, avp->selection, viewBuf, avp->nd, avp->nm, avp->nv); 
+     TreeInsert(avp->tree, viewBuf);
+
+     if(avp->tree->memoryIsFull) {
+        avp->chunksParams[avp->numberOfChunks].curChunkNum = avp->tree->count;
+	    avp->chunksParams[avp->numberOfChunks].chunkOffset  = chunkOffset;		 
+        (avp->numberOfChunks)++;
+	    if (avp->numberOfChunks >= MAX_NUM_OF_CHUNKS) {
+          fprintf(stderr, "RunFormation: Too many chunks were created.\n"); 
+          return ADC_INTERNAL_ERROR;
+        }
+        chunkOffset += (uint64)(avp->tree->count*avp->outRecSize);
+        if(WriteChunkToDisk( avp->outRecSize, avp->fileOfChunks,
+	                         avp->tree->root.left, avp->logf )){
+	       fprintf(stderr, 
+	         "RunFormation.WriteChunkToDisk: A write error occurred.\n");
+	       return ADC_WRITE_FAILED;
+	    }
+        InitializeTree(avp->tree, avp->nv, avp->nm);
+       }
+   } /* Insertion ... */
+   if(avp->numberOfChunks && avp->tree->count!=0) { 
+     avp->chunksParams[avp->numberOfChunks].curChunkNum = avp->tree->count;
+     avp->chunksParams[avp->numberOfChunks].chunkOffset  = chunkOffset;
+     (avp->numberOfChunks)++;
+     chunkOffset += (uint64)(avp->tree->count*(4*avp->nv + 8*avp->nm));
+     if(WriteChunkToDisk(avp->outRecSize, avp->fileOfChunks,
+                         avp->tree->root.left, avp->logf)){
+       fprintf(stderr, 
+            "RunFormation.WriteChunkToDisk: A write error occurred.\n");
+       return ADC_WRITE_FAILED;  
+     }
+   }
+   FSEEK(avp->viewFile, 0L, SEEK_END);
+   return ADC_OK;
+}
+void SeekAndReadNextSubChunk( uint32 multiChunkBuffer[], 
+                              uint32 k,
+                              FILE *inFile,
+		              uint32 chunkRecSize, 
+		              uint64 inFileOffs,
+		              uint32 subChunkNum){
+   int64 ret;
+  
+   ret = FSEEK(inFile, inFileOffs, SEEK_SET);
+   if (ret < 0){
+      fprintf(stderr,"SeekAndReadNextSubChunk.fseek() < 0\n");
+      exit(1); 
+   }
+   fread(&multiChunkBuffer[k], chunkRecSize*subChunkNum, 1, inFile);
+}
+void ReadSubChunk(
+            uint32 chunkRecSize,
+            uint32 *multiChunkBuffer,
+            uint32 mwBufRecSizeInInt,
+            uint32 iChunk,
+            uint32 regSubChunkSize,
+            CHUNKS *chunks,  
+              FILE *fileOfChunks
+            ){
+   if (chunks[iChunk].curChunkNum > 0){
+      if(chunks[iChunk].curChunkNum < regSubChunkSize){
+	SeekAndReadNextSubChunk(multiChunkBuffer,
+	   			(iChunk*regSubChunkSize +
+	   			(regSubChunkSize-chunks[iChunk].curChunkNum))*
+	   			mwBufRecSizeInInt,
+	   			fileOfChunks,
+	   			chunkRecSize,
+	   			chunks[iChunk].chunkOffset,
+	   			chunks[iChunk].curChunkNum);
+	chunks[iChunk].posSubChunk=regSubChunkSize-chunks[iChunk].curChunkNum;
+	chunks[iChunk].curSubChunk=chunks[iChunk].curChunkNum;
+	chunks[iChunk].curChunkNum=0;
+	chunks[iChunk].chunkOffset=-1;
+      }else{
+	SeekAndReadNextSubChunk(multiChunkBuffer,
+	   			iChunk*regSubChunkSize*mwBufRecSizeInInt,
+	   			fileOfChunks,
+	   			chunkRecSize,
+	   			chunks[iChunk].chunkOffset,
+	   			regSubChunkSize);
+	chunks[iChunk].posSubChunk = 0;
+	chunks[iChunk].curSubChunk = regSubChunkSize;
+	chunks[iChunk].curChunkNum -= regSubChunkSize;
+	chunks[iChunk].chunkOffset += regSubChunkSize * chunkRecSize;
+      }
+   }
+}
+int32 MultiWayMerge(ADC_VIEW_CNTL *avp){
+   uint32 outputBuffer[OUTPUT_BUFFER_SIZE];
+   uint32 r_buf       [OUTPUT_BUFFER_SIZE];
+   uint32 min_r_buf   [OUTPUT_BUFFER_SIZE];
+   uint32 first_one;
+   uint32 i;
+   uint32 iChunk;
+   uint32 min_r_chunk;
+   uint32 sPos;
+   uint32 iPos;
+   uint32 numEmptyBufs;
+   uint32 numEmptyRuns;
+   uint32 mwBufRecSizeInInt;
+   uint32 chunkRecSize;
+   uint32 *multiChunkBuffer;
+   uint32   regSubChunkSize;
+    int32 compRes;
+    int64 *m_min_r_buf;
+    int64 *m_outputBuffer;
+
+   FSEEK(avp->fileOfChunks, 0L, SEEK_SET);
+
+   multiChunkBuffer = (uint32*) &avp->memPool[0];
+   first_one = 1;
+   avp->nViewRows  = 0; 
+
+   chunkRecSize = avp->outRecSize;
+   mwBufRecSizeInInt = chunkRecSize/4;
+   m_min_r_buf = (int64*)&min_r_buf[0];
+   m_outputBuffer = (int64*)&outputBuffer[0];
+
+   mwBufRecSizeInInt = chunkRecSize/4;
+   regSubChunkSize = (avp->memoryLimit/avp->numberOfChunks)/chunkRecSize;
+	 
+   if (regSubChunkSize==0) {
+     fprintf(stderr,
+             "MultiWayMerge: Not enough memory to run the external sort\n");
+     return ADC_INTERNAL_ERROR;
+   }
+   multiChunkBuffer = (uint32*) &avp->memPool[0];
+
+   for(i = 0; i < avp->numberOfChunks; i++ ){
+      ReadSubChunk( 
+                   chunkRecSize,
+                   multiChunkBuffer,
+                   mwBufRecSizeInInt,
+                   i,
+                   regSubChunkSize,
+                   avp->chunksParams,  
+                   avp->fileOfChunks
+      );
+   }
+   while(1){
+     for(iChunk = 0;iChunk<avp->numberOfChunks;iChunk++){
+       if (avp->chunksParams[iChunk].curSubChunk > 0){
+     	sPos = iChunk*regSubChunkSize*mwBufRecSizeInInt;
+    	iPos = sPos+mwBufRecSizeInInt*avp->chunksParams[iChunk].posSubChunk;
+     	memcpy(&min_r_buf[0], &multiChunkBuffer[iPos], avp->outRecSize);
+	    min_r_chunk = iChunk;
+     	break;
+       }
+     }
+     for ( iChunk = min_r_chunk; iChunk < avp->numberOfChunks; iChunk++ ){
+       uint32 iPos;
+
+       if (avp->chunksParams[iChunk].curSubChunk > 0){
+          iPos = mwBufRecSizeInInt*(iChunk*regSubChunkSize+
+                                   avp->chunksParams[iChunk].posSubChunk);
+          memcpy(&r_buf[0],&multiChunkBuffer[iPos],avp->outRecSize);
+
+          compRes=KeyComp(&r_buf[2*avp->nm],&min_r_buf[2*avp->nm],avp->nv);	
+          if(compRes < 0) {
+     	      memcpy(&min_r_buf[0], &r_buf[0], avp->outRecSize);
+	          min_r_chunk = iChunk;
+          }
+       }
+     }
+     /* Step forward */
+     if(avp->chunksParams[min_r_chunk].curSubChunk != 0){
+       avp->chunksParams[min_r_chunk].curSubChunk--;
+       avp->chunksParams[min_r_chunk].posSubChunk++;
+     }
+
+       /* Aggregation if a duplicate is encountered */
+       if(first_one){
+         memcpy( &outputBuffer[0], &min_r_buf[0], avp->outRecSize);
+         first_one = 0;
+       }else{
+         compRes = KeyComp( &outputBuffer[2*avp->nm], 
+        		    &min_r_buf[2*avp->nm], avp->nv );
+         if(!compRes){
+           for(i = 0; i < avp->nm; i++ ){ 
+             m_outputBuffer[i] += m_min_r_buf[i]; 
+           }
+         }else{
+           WriteToFile(outputBuffer,avp->outRecSize,1,avp->viewFile,stderr);
+           avp->nViewRows++;
+           for(i=0;i<avp->nm;i++){
+	     avp->mSums[i]+=m_outputBuffer[i];
+	     avp->checksums[i] += avp->nViewRows*m_outputBuffer[i]%measbound;
+	   }
+           memcpy( &outputBuffer[0], &min_r_buf[0], avp->outRecSize );
+        }
+      }
+
+      for(numEmptyBufs = 0, 
+          numEmptyRuns = 0, i = 0; i < avp->numberOfChunks; i++ ){
+	     if (avp->chunksParams[i].curSubChunk == 0) numEmptyBufs++;
+         if (avp->chunksParams[i].curChunkNum == 0) numEmptyRuns++;
+      }
+      if(   numEmptyBufs == avp->numberOfChunks 
+          &&numEmptyRuns == avp->numberOfChunks) break;
+
+      if(avp->chunksParams[min_r_chunk].curSubChunk == 0) {
+        ReadSubChunk( 
+        	 chunkRecSize,
+        	 multiChunkBuffer,
+        	 mwBufRecSizeInInt,
+        	 min_r_chunk,
+        	 regSubChunkSize,
+        	 avp->chunksParams,
+        	 avp->fileOfChunks);
+      }
+   } /* while(1) */
+
+   WriteToFile( outputBuffer, avp->outRecSize, 1, avp->viewFile, stderr);	  
+   avp->nViewRows++;
+   for(i = 0; i < avp->nm; i++ ){ 
+     avp->mSums[i] += m_outputBuffer[i]; 
+     avp->checksums[i] += avp->nViewRows*m_outputBuffer[i]%measbound;
+   }
+
+   avp->totalOfViewRows += avp->nViewRows;
+   return ADC_OK;
+}
+void SelectToView( uint32 * ib, uint32 *ix, uint32 *viewBuf, 
+                   uint32 nd, uint32 nm, uint32 nv ){
+   uint32 i, j;
+   for ( j = 0, i = 0; i < nv; i++ ) viewBuf[2*nm+j++] = ib[2*nm+ix[i]-1];
+   memcpy(&viewBuf[0], &ib[0], MSR_FSZ*nm);
+}
+FILE * AdcFileOpen(const char *fileName, const char *mode){
+   FILE *fr;
+   if ((fr = (FILE*) fopen(fileName, mode))==NULL)
+      fprintf(stderr, "AdcFileOpen: Cannot open the file %s errno = %d\n",  
+                       fileName, errno);
+   return fr;
+}
+void AdcFileName(char *adcFileName, const char *adcName, 
+		 const char *fileName, uint32 taskNumber){
+  sprintf(adcFileName, "%s.%s.%d",adcName,fileName,taskNumber);
+}
+ADC_VIEW_CNTL * NewAdcViewCntl(ADC_VIEW_PARS *adcpp, uint32 pnum){
+   ADC_VIEW_CNTL *adccntl;
+   uint32 i, j, k;
+#ifdef IN_CORE
+   uint32 ux;
+#endif
+   char id[8+1];
+   
+   adccntl = (ADC_VIEW_CNTL *) malloc(sizeof(ADC_VIEW_CNTL));
+   if (adccntl==NULL) return NULL;
+   
+   adccntl->ndid = adcpp->ndid;
+   adccntl->taskNumber = pnum;
+   adccntl->retCode = 0;
+   adccntl->swapIt = 0;
+   strcpy(adccntl->adcName, adcpp->adcName);
+   adccntl->nTopDims = adcpp->nd;
+   adccntl->nd = adcpp->nd;
+   adccntl->nm = adcpp->nm;
+   adccntl->nInputRecs = adcpp->nInputRecs;
+   adccntl->inpRecSize = GetRecSize(adccntl->nd,adccntl->nm);
+   adccntl->outRecSize = GetRecSize(adccntl->nd,adccntl->nm); /* nv defaults to nd */
+   adccntl->accViewFileOffset = 0;
+   adccntl->totalViewFileSize = 0;
+   adccntl->numberOfMadeViews = 0;
+   adccntl->numberOfViewsMadeFromInput = 0;
+   adccntl->numberOfPrefixedGroupbys = 0;
+   adccntl->numberOfSharedSortGroupbys = 0;
+   adccntl->totalOfViewRows = 0;
+   adccntl->memoryLimit = adcpp->memoryLimit;
+   adccntl->nTasks = adcpp->nTasks;
+   strcpy(adccntl->inpFileName, adcpp->adcInpFileName);
+   sprintf(id, ".%d", adcpp->ndid);
+   
+   AdcFileName(adccntl->adcLogFileName, 
+               adccntl->adcName, "logf", adccntl->taskNumber);
+   strcat(adccntl->adcLogFileName, id);            
+   adccntl->logf = AdcFileOpen(adccntl->adcLogFileName, "w");
+
+   AdcFileName(adccntl->inpFileName, adccntl->adcName, "dat", adcpp->ndid);
+   adccntl->inpf = AdcFileOpen(adccntl->inpFileName, "rb");
+   if(!adccntl->inpf){ 
+     adccntl->retCode = ADC_FILE_OPEN_FAILURE; 
+     return(adccntl);
+   } 
+
+   AdcFileName(adccntl->viewFileName, adccntl->adcName, 
+               "view.dat", adccntl->taskNumber);
+   strcat(adccntl->viewFileName, id);            
+   adccntl->viewFile = AdcFileOpen(adccntl->viewFileName, "wb+");
+
+   AdcFileName(adccntl->chunksFileName, adccntl->adcName, 
+               "chunks.dat", adccntl->taskNumber);
+   strcat(adccntl->chunksFileName, id);            
+   adccntl->fileOfChunks = AdcFileOpen(adccntl->chunksFileName,"wb+");
+
+   AdcFileName(adccntl->groupbyFileName, adccntl->adcName, 
+               "groupby.dat", adccntl->taskNumber);
+   strcat(adccntl->groupbyFileName, id);
+   adccntl->groupbyFile = AdcFileOpen(adccntl->groupbyFileName,"wb+");
+
+   AdcFileName(adccntl->adcViewSizesFileName, adccntl->adcName, 
+               "view.sz", adcpp->ndid);
+   adccntl->adcViewSizesFile = AdcFileOpen(adccntl->adcViewSizesFileName,"r");
+   if(!adccntl->adcViewSizesFile){
+     adccntl->retCode = ADC_FILE_OPEN_FAILURE;
+     return(adccntl);
+   }
+
+   AdcFileName(adccntl->viewSizesFileName, adccntl->adcName, 
+               "viewsz.dat", adccntl->taskNumber);
+   strcat(adccntl->viewSizesFileName, id);            
+   adccntl->viewSizesFile = AdcFileOpen(adccntl->viewSizesFileName, "wb+");
+   
+   adccntl->chunksParams = (CHUNKS*) malloc(MAX_NUM_OF_CHUNKS*sizeof(CHUNKS));
+   if(adccntl->chunksParams==NULL){ 
+     fprintf(adccntl->logf,"NewAdcViewCntl: Cannot allocate 'chunksParams'\n");
+     adccntl->retCode = ADC_MEMORY_ALLOCATION_FAILURE;
+     return(adccntl);
+   }
+   adccntl->memPool = (unsigned char*) malloc(adccntl->memoryLimit);
+   if(adccntl->memPool == NULL ){
+      fprintf(adccntl->logf, 
+              "NewAdcViewCntl: Cannot allocate 'main memory pool'\n"); 
+      adccntl->retCode = ADC_MEMORY_ALLOCATION_FAILURE;
+      return(adccntl);
+   }
+   
+#ifdef IN_CORE   
+   /* add a condition to allocate this memory buffer, THIS is IMPORTANT */
+   ux = 4*adccntl->nTopDims + 8*adccntl->nm;
+   if (adccntl->nTopDims%8) ux += 4;
+   adccntl->inpDataBuffer = (uint32*) malloc(adccntl->nInputRecs*ux);
+   if(adccntl->inpDataBuffer == NULL ){
+      fprintf(adccntl->logf,
+              "NewAdcViewCntl: Cannot allocate 'input data buffer'\n"); 
+      adccntl->retCode = ADC_MEMORY_ALLOCATION_FAILURE;
+      return(adccntl);
+   }
+#endif
+   adccntl->numberOfChunks = 0;
+
+   for ( i = 0; i < adccntl->nm; i++ ){
+     adccntl->mSums[i] = 0;
+     adccntl->checksums[i] = 0;
+     adccntl->totchs[i] = 0;
+  }
+   adccntl->tree = CreateEmptyTree(adccntl->nd, adccntl->nm, 
+                                   adccntl->memoryLimit, adccntl->memPool);
+   if(!adccntl->tree){
+      fprintf(adccntl->logf,"\nNewAdcViewCntl.CreateEmptyTree failed.\n");
+      adccntl->retCode = ADC_MEMORY_ALLOCATION_FAILURE;
+      return(adccntl);
+   }
+
+   adccntl->nv = adcpp->nd; /* default */
+   for ( i = 0; i < adccntl->nv; i++ ) adccntl->selection[i]=i+1;
+   
+   adccntl->nViewLimit = (1<<adcpp->nd)-1;
+   adccntl->jpp=(JOB_POOL *) malloc((adccntl->nViewLimit+1)*sizeof(JOB_POOL));
+   if ( adccntl->jpp == NULL){
+      fprintf(adccntl->logf,
+        "\n Not enough space to allocate %ld bytes for a job pool.", 
+        (long)((adccntl->nViewLimit+1)*sizeof(JOB_POOL)));
+      adccntl->retCode = ADC_MEMORY_ALLOCATION_FAILURE; 
+      return(adccntl);
+   }
+   adccntl->lpp = (LAYER * ) malloc( (adcpp->nd+1)*sizeof(LAYER));
+   if ( adccntl->lpp == NULL){
+      fprintf(adccntl->logf,
+        "\n Not enough space to allocate %ld bytes for a layer reference array.", 
+        (long)((adcpp->nd+1)*sizeof(LAYER)));
+      adccntl->retCode = ADC_MEMORY_ALLOCATION_FAILURE;
+      return(adccntl);
+   }
+
+   for ( j = 1, i = 1; i <= adcpp->nd; i++ ) {
+      k =  NumOfCombsFromNbyK ( adcpp->nd, i );
+      adccntl->lpp[i].layerIndex = j;
+      j += k;
+      adccntl->lpp[i].layerQuantityLimit = k;
+      adccntl->lpp[i].layerCurrentPopulation = 0;
+   }    
+      
+   JobPoolInit ( adccntl->jpp, (adccntl->nViewLimit+1), adcpp->nd );
+
+   fprintf(adccntl->logf,"\nMeaning of the log file columns is as follows:\n");
+   fprintf(adccntl->logf,
+     "Row Number | Groupby | View Size | Measure Sums | Number of Chunks\n");
+
+   adccntl->verificationFailed = 1;
+   return adccntl;
+}
+void InitAdcViewCntl(ADC_VIEW_CNTL *adccntl, 
+		     uint32 nSelectedDims, 
+		     uint32 *selection, 
+		     uint32 fromParent ){
+   uint32 i;
+   
+   adccntl->nv = nSelectedDims;
+   
+   for (i = 0; i < adccntl->nm; i++ ) adccntl->mSums[i] = 0;
+   for (i = 0; i < adccntl->nv; i++ ) adccntl->selection[i] = selection[i];
+
+   adccntl->outRecSize = GetRecSize(adccntl->nv,adccntl->nm);
+   adccntl->numberOfChunks = 0;
+   adccntl->fromParent = fromParent;
+   adccntl->nViewRows = 0;
+
+   if(fromParent){
+     adccntl->nd = adccntl->smallestParentLevel;
+     FSEEK(adccntl->viewFile, adccntl->viewOffset, SEEK_SET);
+     adccntl->nRowsToRead = adccntl->nParentViewRows;
+   }else{
+     adccntl->nd = adccntl->nTopDims;
+     adccntl->nRowsToRead = adccntl->nInputRecs;
+   }
+   adccntl->inpRecSize = GetRecSize(adccntl->nd,adccntl->nm);
+   adccntl->outRecSize = GetRecSize(adccntl->nv,adccntl->nm);
+}
+int32 CloseAdcView(ADC_VIEW_CNTL *adccntl){
+   if (adccntl->inpf) fclose(adccntl->inpf);
+   if (adccntl->viewFile) fclose(adccntl->viewFile);
+   if (adccntl->fileOfChunks) fclose(adccntl->fileOfChunks);
+   if (adccntl->groupbyFile) fclose(adccntl->groupbyFile);
+   if (adccntl->adcViewSizesFile) fclose(adccntl->adcViewSizesFile);
+   if (adccntl->viewSizesFile) fclose(adccntl->viewSizesFile);
+   
+   if (DeleteOneFile(adccntl->chunksFileName))       
+      return ADC_FILE_DELETE_FAILURE;
+   if (DeleteOneFile(adccntl->viewSizesFileName))    
+      return ADC_FILE_DELETE_FAILURE;
+
+   if (DeleteOneFile(adccntl->groupbyFileName))      
+      return ADC_FILE_DELETE_FAILURE;
+
+   if (adccntl->chunksParams){ 
+     free(adccntl->chunksParams); 
+     adccntl->chunksParams=NULL; 
+   }  
+   if (adccntl->memPool){ free(adccntl->memPool); adccntl->memPool=NULL;} 
+   if (adccntl->jpp){ free(adccntl->jpp); adccntl->jpp=NULL; } 
+   if (adccntl->lpp){ free(adccntl->lpp); adccntl->lpp=NULL; } 
+
+   if (adccntl->logf) fclose(adccntl->logf);
+   free(adccntl);
+   return ADC_OK;
+}
+void AdcCntlLog(ADC_VIEW_CNTL *adccntlp){
+  fprintf(adccntlp->logf,"    memoryLimit = %20d\n",
+    adccntlp->memoryLimit);
+  fprintf(adccntlp->logf,"    treeNodeSize = %20d\n",
+    adccntlp->tree->treeNodeSize);
+  fprintf(adccntlp->logf," treeMemoryLimit = %20d\n",
+    adccntlp->tree->memoryLimit);
+  fprintf(adccntlp->logf,"    nNodesLimit = %20d\n",
+    adccntlp->tree->nNodesLimit);
+  fprintf(adccntlp->logf,"freeNodeCounter = %20d\n",
+    adccntlp->tree->freeNodeCounter);
+  fprintf(adccntlp->logf,"	nViewRows = %20d\n",
+    adccntlp->nViewRows);
+}
+int32 ViewSizesVerification(ADC_VIEW_CNTL *adccntlp){
+     char inps[MAX_PARAM_LINE_SIZE];
+     char msg[64];
+     uint32 *viewCounts;
+     uint32 selection_viewSize[2];
+     uint32 sz;
+     uint32 sel[64];
+     uint32 i;
+     uint32 k;
+     uint64 tx;
+     uint32 iTx; 
+   
+     viewCounts = (uint32 *) &adccntlp->memPool[0];
+     for ( i = 0; i <= adccntlp->nViewLimit; i++) viewCounts[i] = 0;
+     
+     FSEEK(adccntlp->viewSizesFile, 0L, SEEK_SET);
+     FSEEK(adccntlp->adcViewSizesFile, 0L, SEEK_SET);     
+
+     while(fread(selection_viewSize, 8, 1, adccntlp->viewSizesFile)){
+        viewCounts[selection_viewSize[0]] = selection_viewSize[1];
+     }
+     k = 0;
+     while ( fscanf(adccntlp->adcViewSizesFile, "%s", inps) != EOF ){
+        if ( strcmp(inps, "Selection:") == 0 ) {
+           while ( fscanf(adccntlp->adcViewSizesFile, "%s", inps) == 1 ) {
+             if ( strcmp(inps, "View") == 0 ) break; 
+             sel[k++] = atoi(inps);	  
+           }
+        }
+        
+        if ( strcmp(inps, "Size:") == 0 ) {
+           fscanf(adccntlp->adcViewSizesFile, "%s", inps);
+           sz = atoi(inps);
+           CreateBinTuple(&tx, sel, k);
+           iTx = (int32)(tx>>(64-adccntlp->nTopDims)); 
+           adccntlp->verificationFailed = 0;
+           if (!adccntlp->numberOfMadeViews) adccntlp->verificationFailed = 1;
+
+           if ( viewCounts[iTx] != 0){
+              if (viewCounts[iTx] != sz) {
+                 if (viewCounts[iTx] != adccntlp->nInputRecs){
+                   fprintf(adccntlp->logf, 
+                           "A view size is wrong: genSz=%d calcSz=%d\n",
+                   	                               sz, viewCounts[iTx]);
+                   adccntlp->verificationFailed = 1;
+                   return ADC_VERIFICATION_FAILED;
+                 }
+              }               
+           }
+           k = 0;
+        }  
+     } /* of while() */
+
+     fprintf(adccntlp->logf,
+       "\n\nMeaning of the log file columns is as follows:\n");
+     fprintf(adccntlp->logf, 
+       "Row Number | Groupby | View Size | Measure Sums | Number of Chunks\n");
+
+     if (!adccntlp->verificationFailed) 
+          strcpy(msg, "Verification=passed");
+     else strcpy(msg, "Verification=failed");
+     FSEEK(adccntlp->logf, 0L, SEEK_SET);
+     fprintf(adccntlp->logf, "%s", msg);
+     FSEEK(adccntlp->logf, 0L, SEEK_END);
+     FSEEK(adccntlp->viewSizesFile, 0L, SEEK_SET);
+     return ADC_OK;
+}
+int32 ComputeGivenGroupbys(ADC_VIEW_CNTL *adccntlp){
+    int32 retCode;
+   uint32 i;
+   uint64 binRepTuple;
+   uint32 ut32;
+   uint32 nViews = 0;
+   uint32 nSelectedDims;
+   uint32 smp;
+#ifdef IN_CORE
+   uint32 firstView = 1;
+#endif
+   uint32 selection_viewsize[2];
+   char ttout[16];
+
+   while (fread(&binRepTuple, 8, 1, adccntlp->groupbyFile )){
+     for(i = 0; i < adccntlp->nm; i++) adccntlp->checksums[i]=0;
+     nViews++;
+     swap8(&binRepTuple);
+
+     GetRegTupleFromBin64(binRepTuple, adccntlp->selection,
+                          adccntlp->nTopDims, &nSelectedDims);
+     ut32 = (uint32)(binRepTuple>>(64-adccntlp->nTopDims));
+     selection_viewsize[0] = ut32;
+     ut32 <<= (32-adccntlp->nTopDims);
+     adccntlp->groupby = ut32;
+#ifndef IN_CORE
+     smp = GetParent(adccntlp, ut32);
+#endif
+#ifdef IN_CORE
+     if (firstView) {
+       firstView = 0;
+       if(ReadWholeInputData(adccntlp, adccntlp->inpf)) {
+          fprintf(stderr, "ReadWholeInputData failed.\n");
+          return ADC_INTERNAL_ERROR;   
+       }
+     }
+     smp = noneParent;
+#endif
+
+     if (smp != noneParent)
+     GetRegTupleFromParent(binRepTuple, 
+                           adccntlp->parBinRepTuple, 
+                           adccntlp->selection,
+                           adccntlp->nTopDims);
+     InitAdcViewCntl(adccntlp, nSelectedDims, 
+                     adccntlp->selection, (smp == noneParent)?0:1);
+#ifdef IN_CORE
+      if((retCode = ComputeMemoryFittedView(adccntlp))) {
+         fprintf(stderr, "ComputeMemoryFittedView failed.\n");
+         return retCode;
+      }
+#else
+#ifdef OPTIMIZATION
+     if (smp == prefixedParent){
+        if ((retCode = PrefixedAggregate(adccntlp, adccntlp->viewFile))) {
+           fprintf(stderr, 
+	     "ComputeGivenGroupbys.PrefixedAggregate failed.\n");
+           return retCode;
+        }
+        adccntlp->numberOfPrefixedGroupbys++;
+     }else if (smp == sharedSortParent) {
+        if ((retCode = SharedSortAggregate(adccntlp))) {
+           fprintf(stderr, 
+	     "ComputeGivenGroupbys.SharedSortAggregate failed.\n");
+           return retCode;
+        }
+        adccntlp->numberOfSharedSortGroupbys++;
+     }else
+#endif /* OPTIMIZATION */     
+     { 
+        if( smp != noneParent ) {
+	  retCode = RunFormation(adccntlp, adccntlp->viewFile);
+          if(retCode!=ADC_OK){
+              fprintf(stderr, 
+	  	  "ComputeGivenGroupbys.RunFormation failed.\n");
+              return retCode; 
+            }
+	  }else{
+	    if ((retCode=RunFormation (adccntlp, adccntlp->inpf)) != ADC_OK){
+              fprintf(stderr, 
+	  	  "ComputeGivenGroupbys.RunFormation failed.\n");
+              return retCode;
+            }
+	    adccntlp->numberOfViewsMadeFromInput++;
+	  }
+        if(!adccntlp->numberOfChunks){
+          uint64 ordern=0;
+          adccntlp->nViewRows        = adccntlp->tree->count;
+          adccntlp->totalOfViewRows += adccntlp->nViewRows;
+	  retCode=WriteViewToDiskCS(adccntlp,adccntlp->tree->root.left,&ordern);
+	  if(retCode!=ADC_OK){
+            fprintf(stderr,
+	            "ComputeGivenGroupbys.WriteViewToDisk: Write error.\n");
+	    return ADC_WRITE_FAILED;
+	  }
+        }else { 
+          retCode=MultiWayMerge(adccntlp);
+          if(retCode!=ADC_OK) {
+	     fprintf(stderr,"ComputeGivenGroupbys.MultiWayMerge failed.\n");
+	     return retCode;
+	  } 
+        } 
+      }
+     
+     JobPoolUpdate(adccntlp);
+
+     adccntlp->accViewFileOffset += 
+       (int64)(adccntlp->nViewRows*adccntlp->outRecSize);
+     FSEEK(adccntlp->fileOfChunks, 0L, SEEK_SET);
+     FSEEK(adccntlp->inpf, 0L, SEEK_SET);
+#endif /* IN_CORE */
+     for( i = 0; i < adccntlp->nm; i++) 
+       adccntlp->totchs[i]+=adccntlp->checksums[i];
+     selection_viewsize[1] = adccntlp->nViewRows;
+     fwrite(selection_viewsize, 8, 1, adccntlp->viewSizesFile);
+     adccntlp->totalViewFileSize += 
+                            adccntlp->outRecSize*adccntlp->nViewRows;
+     sprintf(ttout, "%7d ", nViews);
+     WriteOne32Tuple(ttout, adccntlp->groupby, 
+                     adccntlp->nTopDims, adccntlp->logf);
+     fprintf(adccntlp->logf, " |  %15d | ", adccntlp->nViewRows); 
+     for ( i = 0; i < adccntlp->nm; i++ ){ 
+        fprintf(adccntlp->logf, " %20lld", adccntlp->checksums[i]);
+     }
+     fprintf(adccntlp->logf, " | %5d", adccntlp->numberOfChunks);
+   }
+   adccntlp->numberOfMadeViews = nViews;  
+   if(ViewSizesVerification(adccntlp)) return ADC_VERIFICATION_FAILED;
+   return ADC_OK;
+}
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/DC/jobcntl.c b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/DC/jobcntl.c
new file mode 100644
index 0000000..c8e7dcd
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/DC/jobcntl.c
@@ -0,0 +1,563 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include "adc.h"
+#include "macrodef.h"
+
+#ifdef UNIX
+#include <fcntl.h>
+#include <sys/file.h>
+/*#include <sys/resource.h>*/
+#include <unistd.h>
+#endif
+
+uint32 NumberOfOnes(uint64 s);
+void swap8(void *a);
+void SetOneBit(uint64 *s, int32 pos){ uint64 ob = MLB; ob >>= pos; *s |= ob;}
+void SetOneBit32(uint32 *s, uint32 pos){ 
+   uint32 ob = 0x80000000;
+   ob >>= pos; 
+   *s |= ob;
+}
+uint32 Mlo32(uint32 x){
+   uint32 om = 0x80000000;
+   uint32 i;
+   uint32 k;
+              
+   for ( k = 0, i = 0; i < 32; i++ ) {
+       if (om&x) break;
+       om >>= 1;
+       k++;
+   } 
+   return(k);   
+}
+int32 mro32(uint32 x){
+   uint32 om = 0x00000001;
+   uint32 i;
+   uint32 k;
+              
+   for ( k = 32, i = 0; i < 32; i++ ) {
+       if (om&x) break;
+       om <<= 1;
+       k--;
+   } 
+   return(k);   
+}
+uint32 setLeadingOnes32(uint32 n){
+   uint32 om = 0x80000000;
+   uint32 x;
+   uint32 i;
+         
+   for ( x = 0, i = 0; i < n; i++ ) {
+         x |= om;
+         om >>= 1;
+   } 
+   return (x);
+}
+int32 DeleteOneFile(const char * file_name) {
+#  ifdef WINNT
+      return(remove(file_name));
+#  else
+      return(unlink(file_name));
+#  endif
+}
+void WriteOne32Tuple(char * t, uint32 s, uint32 l, FILE * logf) {
+  uint64 ob = MLB32;
+  uint32 i;
+            
+  fprintf(logf, "\n %s", t);
+  for ( i = 0; i < l; i++ ) {
+    if (s&ob) fprintf(logf, "1"); else fprintf(logf, "0");
+    ob >>= 1;
+  }
+}
+uint32 NumOfCombsFromNbyK( uint32 n, uint32 k ){
+  uint32 l, combsNbyK;
+  if ( k > n ) return 0;
+  for(combsNbyK=1, l=1;l<=k;l++)combsNbyK = combsNbyK*(n-l+1)/l;
+  return  combsNbyK;
+}
+void JobPoolUpdate(ADC_VIEW_CNTL *avp){
+   uint32 l = avp->nv;
+   uint32 k;
+  
+   k = avp->lpp[l].layerIndex + avp->lpp[l].layerCurrentPopulation;
+   avp->jpp[k].grpb = avp->groupby;
+   avp->jpp[k].nv = l;
+   avp->jpp[k].nRows = avp->nViewRows;
+   avp->jpp[k].viewOffset = avp->accViewFileOffset;
+   avp->lpp[l].layerCurrentPopulation++;
+} 
+int32 GetParent(ADC_VIEW_CNTL *avp, uint32 binRepTuple){
+   uint32 level, levelPop, i;
+   uint32 ig;
+   uint32 igOfSmallestParent;
+   uint32 igOfPrefixedParent;
+   uint32 igOfSharedSortParent;
+   uint32 spMinNumOfRows;
+   uint32 pfMinNumOfRows;
+   uint32 ssMinNumOfRows;
+   uint32 tgrpb;
+   uint32 pg;
+   uint32 pfm;
+   uint32 mlo = 0;
+   uint32 lom;
+   uint32 l = NumberOfOnes(binRepTuple);
+   uint32 spFound;
+   uint32 pfFound;
+   uint32 ssFound;
+   uint32 found;
+   uint32 spFt;
+   uint32 pfFt;   
+   uint32 ssFt;
+
+   found = noneParent;
+   pfm = setLeadingOnes32(mro32(avp->groupby));
+   SetOneBit32(&mlo, Mlo32(avp->groupby));
+   lom = setLeadingOnes32(Mlo32(avp->groupby)); 
+
+   for(spFound=pfFound=ssFound=0, level=l;level<=avp->nTopDims;level++){
+      levelPop = avp->lpp[level].layerCurrentPopulation;
+      
+      if(levelPop != 0)
+      {
+           for ( spFt = pfFt = ssFt = 1, ig = avp->lpp[level].layerIndex,
+                 i = 0; i < levelPop; i++ )
+           {
+               tgrpb = avp->jpp[ig].grpb;
+               if ( (avp->groupby & tgrpb) == avp->groupby ) { 
+                  spFound = 1;
+                  if (spFt) { spMinNumOfRows = avp->jpp[ig].nRows; 
+                              igOfSmallestParent = ig; spFt = 0; }
+                  else   if ( spMinNumOfRows > avp->jpp[ig].nRows ) 
+                            { spMinNumOfRows = avp->jpp[ig].nRows; 
+                              igOfSmallestParent = ig; }
+
+				  pg = tgrpb & pfm;
+				  if (pg == binRepTuple) {
+                     pfFound = 1;
+                     if (pfFt) { pfMinNumOfRows = avp->jpp[ig].nRows; 
+                                 igOfPrefixedParent = ig; pfFt = 0; }
+                     else   if ( pfMinNumOfRows > avp->jpp[ig].nRows) 
+                               { pfMinNumOfRows = avp->jpp[ig].nRows; 
+                                 igOfPrefixedParent = ig; }
+				  }
+
+				  if ( (tgrpb & mlo) && !(tgrpb & lom)) {
+                     ssFound = 1;
+                     if (ssFt) { ssMinNumOfRows = avp->jpp[ig].nRows; 
+                                 igOfSharedSortParent = ig; ssFt = 0; }
+                     else   if ( ssMinNumOfRows > avp->jpp[ig].nRows) 
+                               { ssMinNumOfRows = avp->jpp[ig].nRows; 
+                                 igOfSharedSortParent = ig; }
+				  }
+               }
+               ig++;
+           }
+      }
+      if (pfFound) found = prefixedParent;
+      else if (ssFound) found = sharedSortParent;
+           else if (spFound) found = smallestParent;
+
+      switch(found){
+         case prefixedParent:
+           avp->smallestParentLevel = level;
+           avp->viewOffset      = avp->jpp[igOfPrefixedParent].viewOffset;
+           avp->nParentViewRows = avp->jpp[igOfPrefixedParent].nRows;
+           avp->parBinRepTuple  = avp->jpp[igOfPrefixedParent].grpb;
+           break;
+         case sharedSortParent:
+           avp->smallestParentLevel = level;
+           avp->viewOffset	    = avp->jpp[igOfSharedSortParent].viewOffset;
+           avp->nParentViewRows = avp->jpp[igOfSharedSortParent].nRows;
+           avp->parBinRepTuple  = avp->jpp[igOfSharedSortParent].grpb;
+           break;
+         case smallestParent:
+           avp->smallestParentLevel = level;
+           avp->viewOffset	    = avp->jpp[igOfSmallestParent].viewOffset;
+           avp->nParentViewRows = avp->jpp[igOfSmallestParent].nRows;
+           avp->parBinRepTuple  = avp->jpp[igOfSmallestParent].grpb;
+           break;
+         default: break;
+      }
+      if(   found == prefixedParent 
+         || found == sharedSortParent 
+	 || found == smallestParent) break;
+   }
+  return found;
+} 
+uint32 GetSmallestParent(ADC_VIEW_CNTL *avp, uint32 binRepTuple){
+   uint32 found, level, levelPop, i, ig, igOfSmallestParent;
+   uint32 minNumOfRows;
+   uint32 tgrpb;
+   uint32 ft;
+   uint32 l = NumberOfOnes(binRepTuple);
+  
+   for(found=0, level=l; level<=avp->nTopDims;level++){
+      levelPop = avp->lpp[level].layerCurrentPopulation;
+      if(levelPop){
+        for(ft=1, ig=avp->lpp[level].layerIndex, i=0;i<levelPop;i++){
+          tgrpb = avp->jpp[ig].grpb;
+          if ( (avp->groupby & tgrpb) == avp->groupby ) { 
+            found = 1;
+            if(ft){
+	      minNumOfRows=avp->jpp[ig].nRows;
+	      igOfSmallestParent = ig; 
+	      ft = 0;
+	    }else if(minNumOfRows > avp->jpp[ig].nRows){ 
+	      minNumOfRows = avp->jpp[ig].nRows;
+	      igOfSmallestParent = ig;
+	    }
+          }
+          ig++;
+        }
+      }
+      if( found ){      
+         avp->smallestParentLevel = level;
+         avp->viewOffset = avp->jpp[igOfSmallestParent].viewOffset;
+         avp->nParentViewRows = avp->jpp[igOfSmallestParent].nRows;
+         avp->parBinRepTuple = avp->jpp[igOfSmallestParent].grpb;
+         break;
+      }
+   }
+   return found;
+} 
+int32 GetPrefixedParent(ADC_VIEW_CNTL *avp, uint32 binRepTuple){
+   uint32 found, level, levelPop, i, ig, igOfSmallestParent;
+   uint32 minNumOfRows;
+   uint32 tgrpb;
+   uint32 ft;
+   uint32 pg, tm;
+   uint32 l = NumberOfOnes(binRepTuple);
+   
+   tm = setLeadingOnes32(mro32(avp->groupby));
+
+   for(found=0, level=l; level<=avp->nTopDims; level++){
+      levelPop = avp->lpp[level].layerCurrentPopulation;
+  
+      if (levelPop != 0)
+      {
+           for(ft = 1, ig = avp->lpp[level].layerIndex, 
+                i = 0; i < levelPop; i++ ) {
+               tgrpb = avp->jpp[ig].grpb;
+               if ( (avp->groupby & tgrpb) == avp->groupby ) { 
+				  pg = tgrpb & tm;
+				  if (pg == binRepTuple) {
+                     found = 1;
+                     if (ft) { minNumOfRows = avp->jpp[ig].nRows; 
+                               igOfSmallestParent = ig; ft = 0; }
+                     else if ( minNumOfRows > avp->jpp[ig].nRows) 
+                             { minNumOfRows = avp->jpp[ig].nRows; 
+                               igOfSmallestParent = ig; }
+				  }
+               }
+               ig++;
+           }
+      }
+      if ( found ) {      
+         avp->smallestParentLevel = level;
+         avp->viewOffset = avp->jpp[igOfSmallestParent].viewOffset;
+         avp->nParentViewRows = avp->jpp[igOfSmallestParent].nRows;
+         avp->parBinRepTuple = avp->jpp[igOfSmallestParent].grpb;
+         break;
+      }
+   }
+  return found;
+} 
+void JobPoolInit(JOB_POOL *jpp, uint32 n, uint32 nd){
+  uint32 i;
+
+  for ( i = 0; i < n; i++ ) {
+      jpp[i].grpb = 0;
+	  jpp[i].nv = 0;  
+      jpp[i].nRows = 0;
+      jpp[i].viewOffset = 0;
+  }    
+}
+void WriteOne64Tuple(char * t, uint64 s, uint32 l, FILE * logf){
+   uint64 ob = MLB;
+   uint32 i;
+            
+   fprintf(logf, "\n %s", t);
+   for ( i = 0; i < l; i++ ) {
+      if (s&ob) fprintf(logf, "1"); else fprintf(logf, "0");
+      ob >>= 1;
+   }
+}
+uint32 NumberOfOnes(uint64 s){
+   uint64 ob = MLB;
+   uint32 i;
+   uint32 nOnes;
+
+   for ( nOnes = 0, i = 0; i < 64; i++ ) {
+      if (s&ob) nOnes++;
+      ob >>= 1;
+   }
+   return nOnes;
+}
+void GetRegTupleFromBin64(
+           uint64 binRepTuple, 
+	       uint32 *selTuple,
+	       uint32 numDims, 
+	       uint32 *numOfUnits){
+   uint64 oc = MLB;
+   uint32 i;
+   uint32 j;
+  
+   *numOfUnits = 0;  
+   for( j = 0, i = 0; i < numDims; i++ ) {
+     if (binRepTuple & oc) { selTuple[j++] = i+1; (*numOfUnits)++;}  
+     oc >>= 1;
+   }    
+}
+void getRegTupleFromBin32(
+           uint32 binRepTuple, 
+	       uint32 *selTuple,
+	       uint32 numDims, 
+	       uint32 *numOfUnits){
+   uint32 oc = MLB32;
+   uint32 i;
+   uint32 j;
+  
+   *numOfUnits = 0;
+   for( j = 0, i = 0; i < numDims; i++ ) {
+     if (binRepTuple & oc) { selTuple[j++] = i+1; (*numOfUnits)++;}  
+     oc >>= 1;
+   }    
+}
+void GetRegTupleFromParent(
+               uint64 bin64RepTuple,
+               uint32 bin32RepTuple, 
+	       uint32 *selTuple,
+	       uint32 nd){
+   uint32 oc = MLB32;
+   uint32 i, j, k;
+   uint32 ut32; 
+  
+   ut32 = (uint32)(bin64RepTuple>>(64-nd)); 
+   ut32 <<= (32-nd);
+   
+   for ( j = 0, k = 0, i = 0; i < nd; i++ ) {
+     if (bin32RepTuple & oc) k++;
+     if (bin32RepTuple & oc && ut32 & oc) selTuple[j++] = k; 
+     oc >>= 1;
+   }    
+}
+void CreateBinTuple(uint64 *binRepTuple, uint32 *selTuple, uint32 numDims){
+   uint32 i;
+
+   *binRepTuple = 0;
+   for(i = 0; i < numDims; i++ ){
+     SetOneBit( binRepTuple, selTuple[i]-1 );
+   }    
+}
+void d32v( char * t, uint32 *v, uint32 n){
+   uint32 i;
+   
+   fprintf(stderr,"\n%s ", t);
+   for ( i = 0; i < n; i++ ) fprintf(stderr," %u", v[i]);
+}
+void WriteOne64Tuple(char * t, uint64 s, uint32 l, FILE * logf);
+int32 Comp8gbuf(const void *a, const void *b){
+   /* Compare the pointed-to 64-bit values, not the pointers themselves. */
+   const uint64 x = *(const uint64*)a, y = *(const uint64*)b;
+   return (x > y) - (x < y);
+}
+void restore(TUPLE_VIEWSIZE x[], uint32 f, uint32 l ){ 
+   uint32 j, m, tj, mm1, jm1, hl;
+   uint64 iW;
+   uint64 iW64;
+
+   j = f;
+   hl = l>>1;
+   while( j <= hl ) {
+      tj = j*2;
+      if (tj < l && x[tj-1].viewsize < x[tj].viewsize) m = tj+1;
+      else m = tj;
+      mm1 = m - 1;
+      jm1 = j - 1;
+      if ( x[mm1].viewsize > x[jm1].viewsize ) {
+         iW = x[mm1].viewsize; 
+	 x[mm1].viewsize = x[jm1].viewsize; 
+	 x[jm1].viewsize = iW;  
+         iW64 = x[mm1].tuple; 
+	 x[mm1].tuple = x[jm1].tuple; 
+	 x[jm1].tuple = iW64;  
+         j = m;
+      }else j = l;
+   }
+}
+void vszsort( TUPLE_VIEWSIZE x[], uint32 n){
+  int32 i, im1;
+  uint64 iW;
+  uint64 iW64;
+  
+  for ( i = n>>1; i >= 1; i-- ) restore( x, i, n );
+  for ( i = n; i >= 2; i-- ) {
+     im1 = i - 1;
+     iW = x[0].viewsize; x[0].viewsize = x[im1].viewsize; x[im1].viewsize = iW;  
+     iW64 = x[0].tuple; x[0].tuple = x[im1].tuple; x[im1].tuple = iW64;  
+     restore( x, 1, im1);
+  }
+}
+uint32 countTupleOnes(uint64 binRepTuple, uint32 numDims){
+  uint32 i, cnt = 0;
+  uint64 ob = 0x0000000000000001; 
+
+  for(i = 0; i < numDims; i++ ){
+    if ( binRepTuple&ob) cnt++;
+    ob <<= 1;
+  }    
+  return cnt;
+}
+void restoreo( TUPLE_ONES x[], uint32 f, uint32 l ){ 
+   uint32 j, m, tj, mm1, jm1, hl;
+   uint32 iW;
+   uint64 iW64;
+
+   j = f;
+   hl = l>>1;
+   while( j <= hl ) {
+      tj = j*2;
+      if (tj < l && x[tj-1].nOnes < x[tj].nOnes) m = tj+1;
+      else m = tj;
+      mm1 = m - 1; jm1 = j - 1;
+      if ( x[mm1].nOnes > x[jm1].nOnes ){
+         iW = x[mm1].nOnes;
+	     x[mm1].nOnes = x[jm1].nOnes; 
+	     x[jm1].nOnes = iW;  
+         iW64 = x[mm1].tuple; 
+	     x[mm1].tuple = x[jm1].tuple; 
+	     x[jm1].tuple = iW64;  
+         j = m;
+      }else j = l;
+   }
+}
+void onessort( TUPLE_ONES x[], uint32 n){
+   int32 i, im1;
+  uint32 iW;
+  uint64 iW64;
+  
+  for ( i = n>>1; i >= 1; i-- ) restoreo( x, i, n );
+  for ( i = n; i >= 2; i-- ) {
+     im1 = i - 1;
+     iW = x[0].nOnes; 
+     x[0].nOnes = x[im1].nOnes; 
+     x[im1].nOnes = iW;  
+     iW64 = x[0].tuple; 
+     x[0].tuple = x[im1].tuple; 
+     x[im1].tuple = iW64;  
+     restoreo( x, 1, im1);
+  }
+}
+uint32 MultiFileProcJobs( TUPLE_VIEWSIZE *tuplesAndSizes, 
+		                          uint32 nViews, 
+                           ADC_VIEW_CNTL *avp ){
+   uint32 i;
+    int32 ii; /* it should be int */
+   uint32 j;
+   uint32 pn;
+   uint32 direction = 0;
+   uint32 dChange = 0;
+   uint32 gbi;
+   uint32 maxn;
+   uint64 *gbuf;
+   uint64      vszs[MAX_NUMBER_OF_TASKS];
+   uint32 nGroupbys[MAX_NUMBER_OF_TASKS];
+   TUPLE_ONES *toptr;
+
+   gbuf = (uint64*) &avp->memPool[0];
+
+   for(i = 0; i < avp->nTasks; i++ ){ nGroupbys[i] = 0; vszs[i] = 0; }
+
+   for(pn = 0, gbi = 0, ii = nViews-1; ii >= 0; ii-- ){
+     if(pn == avp->taskNumber) gbuf[gbi++]=tuplesAndSizes[ii].tuple;
+     nGroupbys[pn]++;
+     vszs[pn] += tuplesAndSizes[ii].viewsize; 
+     if(direction == 0 && pn == avp->nTasks-1 ) { 
+       direction = 1; 
+       dChange = 1; 
+     }
+     if(direction == 1 && pn == 0 ){ 
+       direction = 0; 
+       dChange = 1; 
+     }
+     if (!dChange){ if (direction) pn--; else pn++;}
+     dChange = 0;
+   }
+   for(maxn = 0, i = 0; i < avp->nTasks; i++) 
+     if (nGroupbys[i] > maxn) maxn = nGroupbys[i];
+
+   toptr = (TUPLE_ONES*) malloc(sizeof(TUPLE_ONES)*maxn);
+   if(!toptr) return 1; 
+
+   for(i = 0; i < avp->nTasks; i++ ){
+     if(i == avp->taskNumber){
+       for(j = 0; j < nGroupbys[i]; j++ ){
+         toptr[j].tuple = gbuf[j];
+         toptr[j].nOnes  = countTupleOnes(gbuf[j], avp->nTopDims);
+       }
+       qsort((void*)gbuf,  nGroupbys[i], 8, Comp8gbuf );
+       onessort(toptr, nGroupbys[i]);
+
+       for(j = 0; j < nGroupbys[i]; j++){
+         toptr[nGroupbys[i]-1-j].tuple <<= (64-avp->nTopDims);
+         swap8(&toptr[nGroupbys[i]-1-j].tuple);
+         fwrite(&toptr[nGroupbys[i]-1-j].tuple, 8, 1, avp->groupbyFile);
+       }
+     }
+   }
+   FSEEK(avp->groupbyFile, 0L, SEEK_SET);
+   if (toptr) free(toptr);
+   return 0;
+}
+int32 PartitionCube(ADC_VIEW_CNTL *avp){
+    TUPLE_VIEWSIZE *tuplesAndSizes;
+    uint32 it = 0;
+    uint64 sz;
+    uint32 sel[64];
+    uint32 k;
+    uint64 tx;
+    uint32 i;
+      char inps[256];
+      
+    tuplesAndSizes = 
+       (TUPLE_VIEWSIZE*) malloc(avp->nViewLimit*sizeof(TUPLE_VIEWSIZE));
+    if(tuplesAndSizes == NULL){
+       fprintf(stderr," PartitionCube(): memory allocation failure\n");
+       return ADC_MEMORY_ALLOCATION_FAILURE;
+    }
+    k = 0;
+    while( fscanf(avp->adcViewSizesFile, "%s", inps) != EOF ){
+       if( strcmp(inps, "Selection:") == 0 ) {
+         while ( fscanf(avp->adcViewSizesFile, "%s", inps) == 1 ) {
+           if ( strcmp(inps, "View") == 0 ) break; 
+           sel[k++] = atoi(inps);	
+         }
+       }
+       if( strcmp(inps, "Size:") == 0 ){
+         fscanf(avp->adcViewSizesFile, "%s", inps);
+         sz = atoi(inps);
+         CreateBinTuple(&tx, sel, k);
+         if (sz > avp->nInputRecs) sz = avp->nInputRecs;
+         tuplesAndSizes[it].viewsize = sz;
+         tuplesAndSizes[it].tuple = tx; 
+         it++;
+         k = 0;
+       }  
+    }
+    vszsort(tuplesAndSizes, it);
+    for( i = 0; i < it; i++){
+        tuplesAndSizes[i].tuple >>= (64-avp->nTopDims);
+    }
+    if(MultiFileProcJobs( tuplesAndSizes, it, avp )){
+       fprintf(stderr, "MultiFileProcJobs() failed.\n");
+       fprintf(avp->logf, "MultiFileProcJobs() failed.\n");
+       fflush(avp->logf);
+       free(tuplesAndSizes); return 1;
+    }
+    FSEEK(avp->adcViewSizesFile, 0L, SEEK_SET);
+    free(tuplesAndSizes);
+    return 0;
+}
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/DC/macrodef.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/DC/macrodef.h
new file mode 100644
index 0000000..ce67695
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/DC/macrodef.h
@@ -0,0 +1,14 @@
+#define PutErrMsg(msg) {fprintf(stderr," %s, errno = %d\n", msg, errno);}
+
+#define WriteToFile(ptr,size,nitems,stream,logf) if( fwrite(ptr,size,nitems,stream) != nitems )\
+       {\
+        fprintf(stderr,"\n Write error from WriteToFile()\n"); return ADC_WRITE_FAILED; \
+       }
+
+#ifdef WINNT
+#define FSEEK(stream,offset,whence)  fseek(stream, (long)offset,whence);
+#else
+#define FSEEK(stream,offset,whence)  fseek(stream,offset,whence); 
+#endif
+
+#define GetRecSize(nd,nm) (DIM_FSZ*nd+MSR_FSZ*nm)
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/DC/protots.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/DC/protots.h
new file mode 100644
index 0000000..6ff92a7
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/DC/protots.h
@@ -0,0 +1,100 @@
+ int32 ReadWholeInputData(ADC_VIEW_CNTL *avp, FILE *inpf);
+ 
+ int32 ComputeMemoryFittedView (ADC_VIEW_CNTL *avp);
+
+ int32 MultiWayMerge(ADC_VIEW_CNTL *avp);
+
+ int32 GetPrefixedParent(ADC_VIEW_CNTL *avp, uint32 binRepTuple);
+
+ int32 WriteChunkToDisk(
+       uint32     recordSize, 
+       FILE      *fileOfChunks, 
+       treeNode  *t, 
+       FILE      *logFile);
+
+ int32 DeleteOneFile(const char * file_name);
+
+  void WriteOne64Tuple(char * t, uint64 s, uint32 l, FILE * logf);
+
+ int32 ViewSizesVerification(ADC_VIEW_CNTL *adccntlp);
+
+  void CreateBinTuple(
+       uint64  *binRepTuple, 
+       uint32  *selTuple, 
+       uint32   numDims);
+
+  void AdcCntlLog(ADC_VIEW_CNTL *adccntlp);
+
+  void swap8(void *a);
+
+  void WriteOne32Tuple(char * t, uint32 s, uint32 l, FILE * logf);
+
+  void JobPoolUpdate(ADC_VIEW_CNTL *avp);
+
+ int32 WriteViewToDisk(ADC_VIEW_CNTL *avp, treeNode *t);
+
+uint32 GetSmallestParent(ADC_VIEW_CNTL *avp, uint32 binRepTuple);
+
+ int32 GetParent(ADC_VIEW_CNTL *avp, uint32 binRepTuple);
+
+  void GetRegTupleFromBin64(
+       uint64   binRepTuple, 
+       uint32  *selTuple, 
+       uint32   numDims, 
+       uint32  *numOfUnits); 
+
+  void GetRegTupleFromParent(
+       uint64   bin64RepTuple,
+       uint32   bin32RepTuple,
+       uint32  *selTuple,
+       uint32   nd);
+
+  void JobPoolInit(JOB_POOL *jpp, uint32 n, uint32 nd);
+
+uint32 NumOfCombsFromNbyK (uint32 n, uint32 k);
+
+  void InitializeTree(RBTree *tree, uint32 nd, uint32 nm);
+
+ int32 CheckTree(
+       treeNode  *t , 
+       uint32    *px, 
+       uint32     nv, 
+       uint32     nm, 
+       FILE      *logFile);
+
+ int32 KeyComp(const uint32 *a, const uint32 *b, uint32 n);
+
+ int32 TreeInsert(RBTree *tree, uint32 *attrs);
+
+  void InitializeTree(RBTree *tree, uint32 nd, uint32 nm);
+
+ int32 WriteChunkToDisk(
+       uint32     recordSize, 
+       FILE      *fileOfChunks, 
+       treeNode  *t, 
+       FILE      *logFile);
+
+  void SelectToView(
+       uint32  *ib, 
+       uint32  *ix, 
+       uint32  *viewBuf, 
+       uint32   nd, 
+       uint32   nm, 
+       uint32   nv);
+
+ int32 MultiWayBufferSnap(
+       uint32   nv, 
+       uint32   nm,  
+       uint32  *multiChunkBuffer, 
+       uint32	numberOfChunks, 
+       uint32	regSubChunkSize, 
+       uint32	nRecords);
+
+ RBTree *CreateEmptyTree(
+       uint32          nd, 
+       uint32          nm, 
+       uint32          memoryLimit, 
+       unsigned char  *memPool);
+
+int32 PrefixedAggregate(ADC_VIEW_CNTL *avp, FILE *iof);
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/DC/rbt.c b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/DC/rbt.c
new file mode 100644
index 0000000..ae96e45
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/DC/rbt.c
@@ -0,0 +1,240 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include "adc.h"
+#include "macrodef.h"
+
+int32 KeyComp( const uint32 *a, const uint32 *b, uint32 n ) {
+  uint32 i;
+  for ( i = 0; i < n; i++ ) {
+    if (a[i] < b[i]) return(-1);
+    else if (a[i] > b[i]) return(1);
+  }
+  return(0);
+}
+int32 TreeInsert(RBTree *tree, uint32 *attrs){
+   uint32  sl = 1;			    	
+   uint32 *attrsP;
+    int32  cmpres;
+ treeNode *xNd, *yNd, *tmp;
+
+  tmp = &tree->root;
+  xNd = tmp->left;
+
+  if (xNd == NULL){
+    tree->count++;
+    NEW_TREE_NODE(tree->mp,tree->memPool,
+        	      tree->memaddr,tree->treeNodeSize,
+        	      tree->freeNodeCounter,tree->memoryIsFull)
+    xNd = tmp->left = tree->mp;
+    memcpy(&(xNd->nodeMemPool[0]), &attrs[0], tree->nodeDataSize);
+    xNd->left = xNd->right = NULL;
+    xNd->clr = BLACK;
+    return 0;
+  }
+
+  tree->drcts[0] = 0;
+  tree->nodes[0] = &tree->root;
+
+  while(1){
+    attrsP = (uint32*) &(xNd->nodeMemPool[tree->nm]);
+    cmpres = KeyComp( &attrs[tree->nm<<1], attrsP, tree->nd );
+
+    if (cmpres < 0){
+      tree->nodes[sl] = xNd;
+      tree->drcts[sl++] = 0;
+      yNd = xNd->left;
+
+      if(yNd == NULL){
+	    NEW_TREE_NODE(tree->mp,tree->memPool,
+	  	              tree->memaddr,tree->treeNodeSize,
+	  	              tree->freeNodeCounter,tree->memoryIsFull)
+        xNd = xNd->left = tree->mp;
+        break;
+      }
+    }else if (cmpres > 0){
+      tree->nodes[sl] = xNd;
+      tree->drcts[sl++] = 1;
+      yNd = xNd->right;
+      if(yNd == NULL){
+        NEW_TREE_NODE(tree->mp,tree->memPool,
+		              tree->memaddr,tree->treeNodeSize,
+		              tree->freeNodeCounter,tree->memoryIsFull)
+        xNd = xNd->right = tree->mp; 
+        break;
+      }
+    }else{  
+      uint64 ii; 
+      int64 *mx;
+      mx = (int64*) &attrs[0];
+      for ( ii = 0; ii < tree->nm; ii++ ) xNd->nodeMemPool[ii] += mx[ii];
+      return 0; 
+    }
+    xNd = yNd;
+  }
+  tree->count++;
+  memcpy(&(xNd->nodeMemPool[0]), &attrs[0], tree->nodeDataSize);
+  xNd->left = xNd->right = NULL;
+  xNd->clr  = RED;
+
+  while(1){
+    if ( tree->nodes[sl-1]->clr != RED || sl<3 ) break;
+      
+    if (tree->drcts[sl-2] == 0){
+      yNd = tree->nodes[sl-2]->right;
+      if (yNd != NULL && yNd->clr == RED){
+        tree->nodes[sl-1]->clr = BLACK;
+        yNd->clr = BLACK;
+        tree->nodes[sl-2]->clr = RED;
+        sl -= 2;
+      }else{
+        if (tree->drcts[sl-1] == 1){
+	      xNd = tree->nodes[sl-1];
+	      yNd = xNd->right;
+	      xNd->right = yNd->left;
+	      yNd->left  = xNd;
+	      tree->nodes[sl-2]->left = yNd;
+        }else
+          yNd = tree->nodes[sl-1];
+	  
+        xNd = tree->nodes[sl-2];
+        xNd->clr = RED;
+        yNd->clr = BLACK;
+
+        xNd->left  = yNd->right;
+        yNd->right = xNd;
+
+        if(tree->drcts[sl-3])
+          tree->nodes[sl-3]->right = yNd;
+	    else  
+          tree->nodes[sl-3]->left = yNd;
+        break;
+      }
+    }else{
+      yNd = tree->nodes[sl-2]->left;
+      if (yNd != NULL && yNd->clr == RED){
+         tree->nodes[sl-1]->clr = BLACK;
+         yNd->clr = BLACK;
+         tree->nodes[sl-2]->clr = RED;
+         sl -= 2;
+      }else{
+    	if(tree->drcts[sl-1] == 0){
+          xNd = tree->nodes[sl-1];
+          yNd = xNd->left;
+          xNd->left  = yNd->right;
+          yNd->right = xNd;
+          tree->nodes[sl-2]->right = yNd;
+   	    }else
+          yNd = tree->nodes[sl-1];
+
+   	    xNd = tree->nodes[sl-2];
+     	xNd->clr = RED;
+    	yNd->clr = BLACK;
+
+    	xNd->right = yNd->left;
+    	yNd->left  = xNd;
+
+   	    if (tree->drcts[sl-3])
+   	      tree->nodes[sl-3]->right = yNd;
+     	else  
+   	      tree->nodes[sl-3]->left  = yNd;
+   	    break;
+      }
+    }
+  }
+  tree->root.left->clr = BLACK;
+  return 0;
+}
+int32 WriteViewToDisk(ADC_VIEW_CNTL *avp, treeNode *t){
+  uint32 i;
+  if(!t) return ADC_OK;
+  if(WriteViewToDisk( avp, t->left)) return ADC_WRITE_FAILED;
+  for(i=0;i<avp->nm;i++){
+    avp->mSums[i] += t->nodeMemPool[i];  
+  }	   
+  WriteToFile(t->nodeMemPool,avp->outRecSize,1,avp->viewFile,avp->logf);
+  if(WriteViewToDisk( avp, t->right)) return ADC_WRITE_FAILED;
+  return ADC_OK;
+}
+int32 WriteViewToDiskCS(ADC_VIEW_CNTL *avp, treeNode *t,uint64 *ordern){
+  uint32 i;
+  if(!t) return ADC_OK;
+  if(WriteViewToDiskCS( avp, t->left,ordern)) return ADC_WRITE_FAILED;
+  for(i=0;i<avp->nm;i++){
+    avp->mSums[i] += t->nodeMemPool[i];  
+    avp->checksums[i] += (++(*ordern))*t->nodeMemPool[i]%measbound;
+  }	   
+  WriteToFile(t->nodeMemPool,avp->outRecSize,1,avp->viewFile,avp->logf);
+  if(WriteViewToDiskCS( avp, t->right,ordern)) return ADC_WRITE_FAILED;
+  return ADC_OK;
+}
+int32 computeChecksum(ADC_VIEW_CNTL *avp, treeNode *t,uint64 *ordern){
+  uint32 i;
+  if(!t) return ADC_OK;
+  if(computeChecksum(avp,t->left,ordern)) return ADC_WRITE_FAILED;
+  for(i=0;i<avp->nm;i++){
+    avp->checksums[i] += (++(*ordern))*t->nodeMemPool[i]%measbound;
+  }	   
+  if(computeChecksum(avp,t->right,ordern)) return ADC_WRITE_FAILED;
+  return ADC_OK;
+}
+int32 WriteChunkToDisk(uint32 recordSize,FILE *fileOfChunks,
+		       treeNode *t, FILE *logFile){   
+  if(!t) return ADC_OK;
+  if(WriteChunkToDisk( recordSize, fileOfChunks, t->left, logFile)) 
+    return ADC_WRITE_FAILED; 
+  WriteToFile( t->nodeMemPool, recordSize, 1, fileOfChunks, logFile);
+  if(WriteChunkToDisk( recordSize, fileOfChunks, t->right, logFile)) 
+    return ADC_WRITE_FAILED;
+  return ADC_OK;
+}
+RBTree * CreateEmptyTree(uint32 nd, uint32 nm, 
+                         uint32 memoryLimit, unsigned char * memPool){
+  RBTree *tree = (RBTree*)  malloc(sizeof(RBTree));
+  if (!tree) return NULL;
+
+  tree->root.left = NULL;    
+  tree->root.right = NULL;     
+  tree->count = 0;
+  tree->memaddr = 0;
+  tree->treeNodeSize = sizeof(struct treeNode) + DIM_FSZ*(nd-1)+MSR_FSZ*nm;
+  if (tree->treeNodeSize%8 != 0) tree->treeNodeSize += 4;
+  tree->memoryLimit = memoryLimit;
+  tree->memoryIsFull = 0;
+  tree->nodeDataSize = DIM_FSZ*nd + MSR_FSZ*nm;
+  tree->mp = NULL;
+  tree->nNodesLimit = tree->memoryLimit/tree->treeNodeSize;
+  tree->freeNodeCounter = tree->nNodesLimit;
+  tree->nd = nd;
+  tree->nm = nm;
+  tree->memPool = memPool;
+  tree->nodes = (treeNode**) malloc(sizeof(treeNode*)*MAX_TREE_HEIGHT);
+  if (!(tree->nodes)) { free(tree); return NULL; }
+  tree->drcts = (uint32*) malloc( sizeof(uint32)*MAX_TREE_HEIGHT);
+  if (!(tree->drcts)) { free(tree->nodes); free(tree); return NULL; }
+  return tree;
+}
+void InitializeTree(RBTree *tree, uint32 nd, uint32 nm){
+  tree->root.left = NULL;    
+  tree->root.right = NULL;     
+  tree->count = 0;
+  tree->memaddr = 0;
+  tree->treeNodeSize = sizeof(struct treeNode) + DIM_FSZ*(nd-1)+MSR_FSZ*nm;
+  if (tree->treeNodeSize%8 != 0) tree->treeNodeSize += 4;
+  tree->memoryIsFull = 0;
+  tree->nodeDataSize = DIM_FSZ*nd + MSR_FSZ*nm;
+  tree->mp = NULL;
+  tree->nNodesLimit = tree->memoryLimit/tree->treeNodeSize;
+  tree->freeNodeCounter = tree->nNodesLimit;
+  tree->nd = nd;
+  tree->nm = nm;
+}
+int32 DestroyTree(RBTree *tree) {
+  if (tree==NULL) return ADC_TREE_DESTROY_FAILURE;
+  if (tree->memPool!=NULL) free(tree->memPool);
+  if (tree->nodes) free(tree->nodes);
+  if (tree->drcts) free(tree->drcts);
+  free(tree);
+  return ADC_OK;
+}
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/DC/rbt.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/DC/rbt.h
new file mode 100644
index 0000000..de4f997
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/DC/rbt.h
@@ -0,0 +1,43 @@
+#ifndef _ADC_PARVIEW_TREE_DEF_H_
+#define _ADC_PARVIEW_TREE_DEF_H_
+
+#define MAX_TREE_HEIGHT	64
+enum{BLACK,RED};
+
+typedef struct treeNode{
+  struct treeNode *left;
+  struct treeNode *right;
+  uint32 clr;
+  int64 nodeMemPool[1];
+} treeNode;
+
+typedef struct RBTree{
+  treeNode root;	
+  treeNode * mp;
+  uint32 count;       
+  uint32 treeNodeSize;
+  uint32 nodeDataSize;
+  uint32 memoryLimit; 
+  uint32 memaddr;
+  uint32 memoryIsFull;
+  uint32 freeNodeCounter;
+  uint32 nNodesLimit;
+  uint32 nd;
+  uint32 nm;
+  uint32   *drcts;
+  treeNode **nodes;
+  unsigned char * memPool;
+} RBTree;
+
+#define NEW_TREE_NODE(node_ptr,memPool,memaddr,treeNodeSize, \
+ freeNodeCounter,memoryIsFull) \
+ node_ptr=(struct treeNode*)(memPool+memaddr); \
+ memaddr+=treeNodeSize; \
+ (freeNodeCounter)--; \
+ if( freeNodeCounter == 0 ) { \
+     memoryIsFull = 1; \
+ }
+
+int32 TreeInsert(RBTree *tree, uint32 *attrs);
+
+#endif /* _ADC_PARVIEW_TREE_DEF_H_ */
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/EP/Makefile b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/EP/Makefile
new file mode 100644
index 0000000..e763d62
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/EP/Makefile
@@ -0,0 +1,24 @@
+SHELL=/bin/sh
+BENCHMARK=ep
+BENCHMARKU=EP
+
+include ../config/make.def
+
+OBJS = ep.o ${COMMON}/print_results.o ${COMMON}/${RAND}.o \
+       ${COMMON}/timers.o ${COMMON}/wtime.o
+
+include ../sys/make.common
+
+${PROGRAM}: config ${OBJS}
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} ${F_LIB}
+
+
+ep.o:		ep.f npbparams.h
+	${FCOMPILE} ep.f
+
+clean:
+	- rm -f *.o *~ 
+	- rm -f npbparams.h core
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/EP/README b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/EP/README
new file mode 100644
index 0000000..0ca487c
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/EP/README
@@ -0,0 +1,4 @@
+This code implements the random-number generator described in the
+NAS Parallel Benchmark document RNR Technical Report RNR-94-007.
+The code is "embarrassingly" parallel in that no communication is
+required for the generation of the random numbers itself. 
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/EP/ep.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/EP/ep.f
new file mode 100644
index 0000000..a8f0d74
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/EP/ep.f
@@ -0,0 +1,272 @@
+!-------------------------------------------------------------------------!
+!                                                                         !
+!        N  A  S     P A R A L L E L     B E N C H M A R K S  3.3         !
+!                                                                         !
+!                      S E R I A L     V E R S I O N                      !
+!                                                                         !
+!                                   E P                                   !
+!                                                                         !
+!-------------------------------------------------------------------------!
+!                                                                         !
+!    This benchmark is a serial version of the NPB EP code.               !
+!    Refer to NAS Technical Reports 95-020 for details.                   !
+!                                                                         !
+!    Permission to use, copy, distribute and modify this software         !
+!    for any purpose with or without fee is hereby granted.  We           !
+!    request, however, that all derived work reference the NAS            !
+!    Parallel Benchmarks 3.3. This software is provided "as is"           !
+!    without express or implied warranty.                                 !
+!                                                                         !
+!    Information on NPB 3.3, including the technical report, the          !
+!    original specifications, source code, results and information        !
+!    on how to submit new results, is available at:                       !
+!                                                                         !
+!           http://www.nas.nasa.gov/Software/NPB/                         !
+!                                                                         !
+!    Send comments or suggestions to  npb@nas.nasa.gov                    !
+!                                                                         !
+!          NAS Parallel Benchmarks Group                                  !
+!          NASA Ames Research Center                                      !
+!          Mail Stop: T27A-1                                              !
+!          Moffett Field, CA   94035-1000                                 !
+!                                                                         !
+!          E-mail:  npb@nas.nasa.gov                                      !
+!          Fax:     (650) 604-3957                                        !
+!                                                                         !
+!-------------------------------------------------------------------------!
+
+
+c---------------------------------------------------------------------
+c
+c Author: P. O. Frederickson 
+c         D. H. Bailey
+c         A. C. Woo
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+      program EMBAR
+c---------------------------------------------------------------------
+C
+c   This is the serial version of the APP Benchmark 1,
+c   the "embarrassingly parallel" benchmark.
+c
+c
+c   M is the Log_2 of the number of complex pairs of uniform (0, 1) random
+c   numbers.  MK is the Log_2 of the size of each batch of uniform random
+c   numbers.  MK can be set for convenience on a given system, since it does
+c   not affect the results.
+
+      implicit none
+
+      include 'npbparams.h'
+
+      double precision Mops, epsilon, a, s, t1, t2, t3, t4, x, x1, 
+     >                 x2, q, sx, sy, tm, an, tt, gc, dum(3)
+      double precision sx_verify_value, sy_verify_value, sx_err, sy_err
+      integer          mk, mm, nn, nk, nq, np, 
+     >                 i, ik, kk, l, k, nit,
+     >                 k_offset, j, fstatus
+      logical          verified, timers_enabled
+      external         randlc, timer_read
+      double precision randlc, timer_read
+      character*15     size
+
+      parameter (mk = 16, mm = m - mk, nn = 2 ** mm,
+     >           nk = 2 ** mk, nq = 10, epsilon=1.d-8,
+     >           a = 1220703125.d0, s = 271828183.d0)
+
+      common/storage/ x(2*nk), q(0:nq-1)
+      data             dum /1.d0, 1.d0, 1.d0/
+
+
+      open(unit=2, file='timer.flag', status='old', iostat=fstatus)
+      if (fstatus .eq. 0) then
+         timers_enabled = .true.
+         close(2)
+      else
+         timers_enabled = .false.
+      endif
+
+c   Because the size of the problem is too large to store in a 32-bit
+c   integer for some classes, we put it into a string (for printing).
+c   Have to strip off the decimal point put in there by the floating
+c   point print statement (internal file)
+
+      write(*, 1000)
+      write(size, '(f15.0)' ) 2.d0**(m+1)
+      j = 15
+      if (size(j:j) .eq. '.') j = j - 1
+      write (*,1001) size(1:j)
+      write (*,*)
+
+ 1000 format(//,' NAS Parallel Benchmarks (NPB3.3-SER)',
+     >          ' - EP Benchmark', /)
+ 1001 format(' Number of random numbers generated: ', a15)
+
+      verified = .false.
+
+c   Compute the number of "batches" of random number pairs generated 
+c   per processor. Adjust if the number of processors does not evenly 
+c   divide the total number
+
+      np = nn 
+
+
+c   Call the random number generator functions and initialize
+c   the x-array to reduce the effects of paging on the timings.
+c   Also, call all mathematical functions that are used. Make
+c   sure these initializations cannot be eliminated as dead code.
+
+      call vranlc(0, dum(1), dum(2), dum(3))
+      dum(1) = randlc(dum(2), dum(3))
+      do 5    i = 1, 2*nk
+         x(i) = -1.d99
+ 5    continue
+      Mops = log(sqrt(abs(max(1.d0,1.d0))))
+
+      
+      call timer_clear(1)
+      call timer_clear(2)
+      call timer_clear(3)
+      call timer_start(1)
+
+      t1 = a
+      call vranlc(0, t1, a, x)
+
+c   Compute AN = A ^ (2 * NK) (mod 2^46).
+
+      t1 = a
+
+      do 100 i = 1, mk + 1
+         t2 = randlc(t1, t1)
+ 100  continue
+
+      an = t1
+      tt = s
+      gc = 0.d0
+      sx = 0.d0
+      sy = 0.d0
+
+      do 110 i = 0, nq - 1
+         q(i) = 0.d0
+ 110  continue
+
+c   Each instance of this loop may be performed independently. We compute
+c   the k offsets separately to take into account the fact that some nodes
+c   have more numbers to generate than others
+
+      k_offset = -1
+
+      do 150 k = 1, np
+         kk = k_offset + k 
+         t1 = s
+         t2 = an
+
+c        Find starting seed t1 for this kk.
+
+         do 120 i = 1, 100
+            ik = kk / 2
+            if (2 * ik .ne. kk) t3 = randlc(t1, t2)
+            if (ik .eq. 0) goto 130
+            t3 = randlc(t2, t2)
+            kk = ik
+ 120     continue
+
+c        Compute uniform pseudorandom numbers.
+ 130     continue
+
+         if (timers_enabled) call timer_start(3)
+         call vranlc(2 * nk, t1, a, x)
+         if (timers_enabled) call timer_stop(3)
+
+c        Compute Gaussian deviates by acceptance-rejection method and 
+c        tally counts in concentric square annuli.  This loop is not 
+c        vectorizable. 
+
+         if (timers_enabled) call timer_start(2)
+
+         do 140 i = 1, nk
+            x1 = 2.d0 * x(2*i-1) - 1.d0
+            x2 = 2.d0 * x(2*i) - 1.d0
+            t1 = x1 ** 2 + x2 ** 2
+            if (t1 .le. 1.d0) then
+               t2   = sqrt(-2.d0 * log(t1) / t1)
+               t3   = (x1 * t2)
+               t4   = (x2 * t2)
+               l    = max(abs(t3), abs(t4))
+               q(l) = q(l) + 1.d0
+               sx   = sx + t3
+               sy   = sy + t4
+            endif
+ 140     continue
+
+         if (timers_enabled) call timer_stop(2)
+
+ 150  continue
+
+
+      do 160 i = 0, nq - 1
+        gc = gc + q(i)
+ 160  continue
+
+      call timer_stop(1)
+      tm  = timer_read(1)
+
+      nit=0
+      verified = .true.
+      if (m.eq.24) then
+         sx_verify_value = -3.247834652034740D+3
+         sy_verify_value = -6.958407078382297D+3
+      elseif (m.eq.25) then
+         sx_verify_value = -2.863319731645753D+3
+         sy_verify_value = -6.320053679109499D+3
+      elseif (m.eq.28) then
+         sx_verify_value = -4.295875165629892D+3
+         sy_verify_value = -1.580732573678431D+4
+      elseif (m.eq.30) then
+         sx_verify_value =  4.033815542441498D+4
+         sy_verify_value = -2.660669192809235D+4
+      elseif (m.eq.32) then
+         sx_verify_value =  4.764367927995374D+4
+         sy_verify_value = -8.084072988043731D+4
+      elseif (m.eq.36) then
+         sx_verify_value =  1.982481200946593D+5
+         sy_verify_value = -1.020596636361769D+5
+      elseif (m.eq.40) then
+         sx_verify_value = -5.319717441530D+05
+         sy_verify_value = -3.688834557731D+05
+      else
+         verified = .false.
+      endif
+      if (verified) then
+         sx_err = abs((sx - sx_verify_value)/sx_verify_value)
+         sy_err = abs((sy - sy_verify_value)/sy_verify_value)
+         verified = ((sx_err.le.epsilon) .and. (sy_err.le.epsilon))
+      endif
+      Mops = 2.d0**(m+1)/tm/1000000.d0
+
+      write (6,11) tm, m, gc, sx, sy, (i, q(i), i = 0, nq - 1)
+ 11   format ('EP Benchmark Results:'//'CPU Time =',f10.4/'N = 2^',
+     >        i5/'No. Gaussian Pairs =',f15.0/'Sums = ',1p,2d25.15/
+     >        'Counts:'/(i3,0p,f15.0))
+
+      call print_results('EP', class, m+1, 0, 0, nit,
+     >                   tm, Mops, 
+     >                   'Random numbers generated', 
+     >                   verified, npbversion, compiletime, cs1,
+     >                   cs2, cs3, cs4, cs5, cs6, cs7)
+
+
+      if (timers_enabled) then
+         if (tm .le. 0.d0) tm = 1.0
+         tt = timer_read(1)
+         print 810, 'Total time:    ', tt, tt*100./tm
+         tt = timer_read(2)
+         print 810, 'Gaussian pairs:', tt, tt*100./tm
+         tt = timer_read(3)
+         print 810, 'Random numbers:', tt, tt*100./tm
+810      format(1x,a,f9.3,' (',f6.2,'%)')
+      endif
+
+
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/FT/Makefile b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/FT/Makefile
new file mode 100644
index 0000000..116d55d
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/FT/Makefile
@@ -0,0 +1,29 @@
+SHELL=/bin/sh
+BENCHMARK=ft
+BENCHMARKU=FT
+
+include ../config/make.def
+
+include ../sys/make.common
+
+OBJS = appft.o auxfnct.o fft3d.o mainft.o verify.o \
+       ${COMMON}/${RAND}.o ${COMMON}/print_results.o \
+       ${COMMON}/timers.o ${COMMON}/wtime.o
+
+${PROGRAM}: config ${OBJS}
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} ${F_LIB}
+
+
+
+.f.o:
+	${FCOMPILE} $<
+
+appft.o:	appft.f  global.h npbparams.h
+auxfnct.o:	auxfnct.f  global.h npbparams.h
+fft3d.o:	fft3d.f  global.h npbparams.h
+mainft.o:	mainft.f  global.h npbparams.h
+verify.o:	verify.f  global.h npbparams.h
+
+clean:
+	- rm -f *.o *~ mputil*
+	- rm -f ft npbparams.h core
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/FT/appft.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/FT/appft.f
new file mode 100644
index 0000000..cdde43b
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/FT/appft.f
@@ -0,0 +1,101 @@
+         subroutine appft (niter, total_time, verified)
+         implicit none
+         include  'global.h'
+         integer niter
+         double precision total_time
+         logical verified
+!
+! Local variables
+!
+         integer i, j, k, kt, n12, n22, n32, ii, jj, kk, ii2, ik2
+         double precision ap
+         double precision twiddle(nx+1,ny,nz)
+        
+         double complex xnt(nx+1,ny,nz),y(nx+1,ny,nz),
+     &                  pad1(128),pad2(128)
+         common /mainarrays/ xnt,pad1,y,pad2,twiddle
+
+         double complex exp1(nx), exp2(ny), exp3(nz)
+
+         do i=1,15
+           call timer_clear(i)
+         end do         
+
+         call timer_start(2)         
+         call compute_initial_conditions(xnt,nx,ny,nz)
+         
+         call CompExp( nx, exp1 )
+         call CompExp( ny, exp2 )
+         call CompExp( nz, exp3 )           
+         call fftXYZ(1,xnt,y,exp1,exp2,exp3,nx,ny,nz)
+         call timer_stop(2)      
+
+         call timer_start(1)
+         if (timers_enabled) call timer_start(13)
+         
+         n12 = nx/2
+         n22 = ny/2
+         n32 = nz/2
+         ap = - 4.d0 * alpha * pi ** 2
+         do i = 1, nz
+           ii = i-1-((i-1)/n32)*nz
+           ii2 = ii*ii
+           do k = 1, ny
+             kk = k-1-((k-1)/n22)*ny
+             ik2 = ii2 + kk*kk
+             do j = 1, nx
+                 jj = j-1-((j-1)/n12)*nx
+                 twiddle(j,k,i) = exp(ap*dble(jj*jj + ik2))
+               end do
+            end do
+         end do
+         if (timers_enabled) call timer_stop(13)      
+         
+         if (timers_enabled) call timer_start(12)
+         call compute_initial_conditions(xnt,nx,ny,nz)             
+         if (timers_enabled) call timer_stop(12)      
+         if (timers_enabled) call timer_start(15)      
+         call fftXYZ(1,xnt,y,exp1,exp2,exp3,nx,ny,nz)
+         if (timers_enabled) call timer_stop(15)      
+
+         do kt = 1, niter
+           if (timers_enabled) call timer_start(11)      
+           call evolve(xnt,y,twiddle,nx,ny,nz)
+           if (timers_enabled) call timer_stop(11)      
+           if (timers_enabled) call timer_start(15)      
+           call fftXYZ(-1,xnt,xnt,exp1,exp2,exp3,nx,ny,nz)
+           if (timers_enabled) call timer_stop(15)      
+           if (timers_enabled) call timer_start(10)      
+           call CalculateChecksum(sums(kt),kt,xnt,nx,ny,nz)           
+           if (timers_enabled) call timer_stop(10)      
+         end do
+!
+! Verification test.
+!
+         if (timers_enabled) call timer_start(14)      
+         call verify(nx, ny, nz, niter, sums, verified)
+         if (timers_enabled) call timer_stop(14)      
+         call timer_stop(1)
+
+         total_time = timer_read(1)
+         if (.not.timers_enabled) return
+
+         print*,'FT subroutine timers '    
+         write(*,40) 'FT total                  ', timer_read(1)
+         write(*,40) 'WarmUp time               ', timer_read(2)
+         write(*,40) 'fftXYZ body               ', timer_read(3)
+         write(*,40) 'Swarztrauber              ', timer_read(4)
+         write(*,40) 'X time                    ', timer_read(7)
+         write(*,40) 'Y time                    ', timer_read(8)
+         write(*,40) 'Z time                    ', timer_read(9)
+         write(*,40) 'CalculateChecksum         ', timer_read(10)
+         write(*,40) 'evolve                    ', timer_read(11)
+         write(*,40) 'compute_initial_conditions', timer_read(12)
+         write(*,40) 'twiddle                   ', timer_read(13)
+         write(*,40) 'verify                    ', timer_read(14)
+         write(*,40) 'fftXYZ                    ', timer_read(15)
+         write(*,40) 'Benchmark time            ', total_time
+   40    format(' ',A26,' =',F9.4)
+
+         return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/FT/auxfnct.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/FT/auxfnct.f
new file mode 100644
index 0000000..80ede7b
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/FT/auxfnct.f
@@ -0,0 +1,180 @@
+c---------------------------------------------------------------------
+c Compute the roots-of-unity array that will be used for subsequent FFTs.
+c---------------------------------------------------------------------
+      subroutine CompExp (n, exponent)
+
+      implicit none
+      integer n
+      double complex exponent(n) 
+      integer ilog2
+      external ilog2      
+     
+      integer m,nu,ku,i,j,ln
+      double precision t, ti, pi 
+      data pi /3.141592653589793238d0/
+
+      nu = n
+      m = ilog2(n)
+      exponent(1) = m
+      ku = 2
+      ln = 1
+      do j = 1, m
+         t = pi / ln
+         do i = 0, ln - 1
+            ti = i * t
+            exponent(i+ku) = dcmplx(cos(ti),sin(ti))
+         enddo        
+         ku = ku + ln
+         ln = 2 * ln
+      enddo
+            
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      integer function ilog2(n)
+      implicit none
+      integer n
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------  
+      integer nn, lg
+      if (n .eq. 1) then
+         ilog2=0
+         return
+      endif
+      lg = 1
+      nn = 2
+      do while (nn .lt. n)
+         nn = nn*2
+         lg = lg+1
+      end do
+      ilog2 = lg
+      return
+      end
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine ipow46(a, exponent, result)
+c---------------------------------------------------------------------
+c compute a^exponent mod 2^46
+c---------------------------------------------------------------------
+
+      implicit none
+      double precision a, result, dummy, q, r
+      integer exponent, n, n2
+      external randlc
+      double precision randlc
+c---------------------------------------------------------------------
+c Use
+c   a^n = a^(n/2)*a^(n/2) if n even else
+c   a^n = a*a^(n-1)       if n odd
+c---------------------------------------------------------------------
+      result = 1
+      if (exponent .eq. 0) return
+      q = a
+      r = 1
+      n = exponent
+
+      do while (n .gt. 1)
+         n2 = n/2
+         if (n2 * 2 .eq. n) then
+            dummy = randlc(q, q) 
+            n = n2
+         else
+            dummy = randlc(r, q)
+            n = n-1
+         endif
+      end do
+      dummy = randlc(r, q)
+      result = r
+      return
+      end
+c---------------------------------------------------------------------
+      subroutine CalculateChecksum(csum,iterN,u,d1,d2,d3)
+        implicit none
+        integer iterN
+        integer d1,d2,d3
+        double complex csum
+        double complex u(d1+1,d2,d3)
+        integer i, i1, ii, ji, ki
+        csum = dcmplx (0.0, 0.0)
+        do i = 1, 1024
+          i1 = i
+          ii = mod (i1, d1) + 1
+          ji = mod (3 * i1, d2) + 1
+          ki = mod (5 * i1, d3) + 1
+          csum = csum + u(ii,ji,ki)
+        end do
+        csum = csum/dble(d1*d2*d3)
+        write(*,30) iterN, csum
+ 30     format (' T =',I5,5X,'Checksum =',1P2D22.12)
+      return
+      end
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      subroutine compute_initial_conditions(u0,d1,d2,d3)
+
+      implicit none
+      include 'npbparams.h'
+      integer d1,d2,d3
+      double complex u0(d1+1,d2,d3), tmp(maxdim)
+      double precision x0, start, an, dummy
+      double precision RanStarts(maxdim)
+
+      integer i,j,k
+      double precision seed, a
+      parameter (seed = 314159265.d0, a = 1220703125.d0)
+      external randlc
+      double precision randlc
+      
+      start = seed                                    
+c---------------------------------------------------------------------
+c Jump to the starting element for our first plane.
+c---------------------------------------------------------------------
+      call ipow46(a, 0, an)
+      dummy = randlc(start, an)
+      call ipow46(a, 2*d1*d2, an)
+c---------------------------------------------------------------------
+c Go through by z planes filling in one square at a time.
+c---------------------------------------------------------------------
+      RanStarts(1) = start
+      do k = 2, d3 
+         dummy = randlc(start, an)
+         RanStarts(k) = start
+      end do
+      
+      do k = 1, d3 
+         x0 = RanStarts(k)
+         do j = 1, d2 
+           call vranlc(2*d1, x0, a, tmp)
+           do i = 1, d1 
+             u0(i,j,k)=tmp(i)
+           end do
+         end do
+      end do
+
+      return
+      end
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      subroutine evolve(x,y,twiddle,nx,ny,nz)
+      implicit none
+      integer nx,ny,nz
+      double complex x(nx+1,ny,nz),y(nx+1,ny,nz)
+      real*8 twiddle(nx+1,ny,nz)
+      integer i,j,k
+           do i = 1, nz
+             do k = 1, ny
+               do j = 1, nx
+                   y(j,k,i)=y(j,k,i)*twiddle(j,k,i)
+                   x(j,k,i)=y(j,k,i)
+                 end do
+              end do
+           end do
+      
+      return
+      end
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/FT/fft3d.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/FT/fft3d.f
new file mode 100644
index 0000000..254ad52
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/FT/fft3d.f
@@ -0,0 +1,174 @@
+c---------------------------------------------------------------------   
+      subroutine Swarztrauber(is,m,vlen,n,x,xd1,exponent)
+
+      implicit none
+      include 'global.h'
+c---------------------------------------------------------------------
+c   Computes VLEN N-point complex-to-complex FFTs of X using an algorithm
+c   due to Swarztrauber.  X is both the input and the output array, while
+c   SCR (declared in global.h) is a scratch array.  It is assumed that
+c   N = 2^M.  Before calling Swarztrauber to perform FFTs, the EXPONENT
+c   array must be initialized by calling CompExp.
+c---------------------------------------------------------------------
+      integer is,m,vlen,n,xd1
+      double complex x(xd1,n), exponent(n)      
+
+      integer i,j,l
+      double complex u1,x11,x21
+      integer k, n1,li,lj,lk,ku,i11,i12,i21,i22
+      
+      if (timers_enabled) call timer_start(4)
+c---------------------------------------------------------------------
+c   Perform one variant of the Stockham FFT.
+c---------------------------------------------------------------------
+      n1 = n / 2
+      lj = 1
+      li = 2 ** m
+      do l = 1, m, 2
+        lk = lj
+        lj = 2 * lk
+        li = li / 2
+        ku = li + 1
+
+        do i = 0, li - 1
+          i11 = i * lk + 1
+          i12 = i11 + n1
+          i21 = i * lj + 1
+          i22 = i21 + lk
+        
+          if (is .ge. 1) then
+            u1 = exponent(ku+i)
+          else
+            u1 = dconjg (exponent(ku+i))
+          endif
+          do k = 0, lk - 1
+            do j = 1, vlen
+              x11 = x(j,i11+k)
+              x21 = x(j,i12+k)
+              scr(j,i21+k) = x11 + x21
+              scr(j,i22+k) = u1 * (x11 - x21)
+            end do
+          end do
+        end do
+
+        if (l .eq. m) then
+          do k = 1, n
+            do j = 1, vlen
+              x(j,k) = scr(j,k)
+            enddo
+          enddo
+        else
+          lk = lj
+          lj = 2 * lk
+          li = li / 2
+          ku = li + 1
+
+          do i = 0, li - 1
+            i11 = i * lk + 1
+            i12 = i11 + n1
+            i21 = i * lj + 1
+            i22 = i21 + lk
+        
+            if (is .ge. 1) then
+              u1 = exponent(ku+i)
+            else
+              u1 = dconjg (exponent(ku+i))
+            endif
+            do k = 0, lk - 1
+              do j = 1, vlen
+                x11 = scr(j,i11+k)
+                x21 = scr(j,i12+k)
+                x(j,i21+k) = x11 + x21
+                x(j,i22+k) = u1 * (x11 - x21)
+              end do
+            end do
+          end do
+        endif
+      end do
+      if (timers_enabled) call timer_stop(4)
+
+      return
+      end
+
+c---------------------------------------------------------------------
+      subroutine fftXYZ(sign,x,xout,exp1,exp2,exp3,n1,n2,n3)
+        implicit none
+        include 'global.h'
+        integer sign,n1,n2,n3
+        double complex x(n1+1,n2,n3)
+        double complex xout((n1+1)*n2*n3)
+        double complex exp1(n1), exp2(n2), exp3(n3)
+        integer i, j, k, log
+        integer bls,ble
+        integer len
+        integer blkp
+
+        if (timers_enabled) call timer_start(3)
+
+        fftblock=CacheSize/n1
+        if(fftblock.ge.BlockMax) fftblock=BlockMax
+        blkp=fftblock+1
+        log = ilog2( n1)
+        if (timers_enabled) call timer_start(7)
+        do k = 1, n3
+          do bls = 1, n2, fftblock
+            ble = bls + fftblock - 1
+            if ( ble .gt. n2) ble = n2
+            len=ble-bls+1
+            do j = bls, ble
+            do i = 1, n1
+              plane(j-bls+1+blkp*(i-1)) = x(i,j,k)
+            end do
+            end do
+            call Swarztrauber(sign,log,len,n1,plane,blkp,exp1)     
+            do j = bls, ble
+            do i = 1, n1
+              x(i,j,k)=plane(j-bls+1+blkp*(i-1))
+            end do
+            end do
+          end do
+        end do
+        if (timers_enabled) call timer_stop(7)
+
+        fftblock=CacheSize/n2
+        if(fftblock.ge.BlockMax) fftblock=BlockMax
+        blkp=fftblock+1
+        log = ilog2( n2 )
+        if (timers_enabled) call timer_start(8)
+        do k = 1, n3
+          do bls = 1, n1, fftblock
+            ble = bls + fftblock - 1
+            if ( ble .gt. n1) ble = n1
+            len=ble-bls+1
+            call Swarztrauber(sign,log,len,n2,x(bls,1,k),n1+1,exp2)
+          enddo
+        end do
+        if (timers_enabled) call timer_stop(8)
+
+        fftblock=CacheSize/n3
+        if(fftblock.ge.BlockMax) fftblock=BlockMax
+        blkp=fftblock+1
+        log = ilog2(n3)
+        if (timers_enabled) call timer_start(9)
+        do k = 1, n2
+          do bls = 1, n1, fftblock
+            ble = bls + fftblock - 1
+            if ( ble .gt. n1) ble = n1
+            len=ble-bls+1
+            do i = 1,n3
+            do j = bls, ble
+              plane(j-bls+1+blkp*(i-1)) = x(j,k,i)
+            end do
+            end do
+            call Swarztrauber(sign,log,len,n3,plane,blkp,exp3)
+            do i = 0,n3-1
+            do j = bls, ble
+              xout(j+(n1+1)*(k-1+n2*i)) = plane(j-bls+1+blkp*i)
+            end do
+            end do
+          end do
+        end do
+        if (timers_enabled) call timer_stop(9)
+        if (timers_enabled) call timer_stop(3)
+        return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/FT/global.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/FT/global.h
new file mode 100644
index 0000000..1851da2
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/FT/global.h
@@ -0,0 +1,45 @@
+      include 'npbparams.h'
+
+c Cache blocking params. These values are good for most
+c RISC processors.  
+c FFT parameters:
+c  fftblock controls how many FFTs are done at a time.
+c  The default is appropriate for most cache-based machines.
+c  On vector machines, the FFT can be vectorized with vector
+c  length equal to the block size, so the block size should
+c  be as large as possible. This is the size of the smallest
+c  dimension of the problem: 128 for class A, 256 for class B and
+c  512 for class C.
+
+      integer fftblock_default, fftblockpad_default, 
+     &        CacheSize,BlockMax
+      parameter (fftblock_default=16, 
+     &           fftblockpad_default=18,
+     &           CacheSize=8192,
+     &           BlockMax=32)
+      
+      integer fftblock, fftblockpad
+      common /blockinfo/ fftblock, fftblockpad
+
+      external timer_read
+      double precision timer_read
+      external ilog2
+      integer ilog2
+
+      external randlc
+      double precision randlc
+      
+      double complex  plane((BlockMax+1)*maxdim),pad(128),
+     &                scr(BlockMax+1,maxdim)
+      common /workarr/ plane,pad,scr
+
+      double precision seed, a, pi, alpha
+      parameter (seed = 314159265.d0, a = 1220703125.d0, 
+     .  pi = 3.141592653589793238d0, alpha=1.0d-6)
+
+c for checksum data
+      double complex sums(0:niter_default)
+      common /sumcomm/ sums
+
+      logical timers_enabled
+      common /timerscomm/ timers_enabled
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/FT/mainft.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/FT/mainft.f
new file mode 100644
index 0000000..7068a35
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/FT/mainft.f
@@ -0,0 +1,128 @@
+!-------------------------------------------------------------------------!
+!                                                                         !
+!        N  A  S     P A R A L L E L     B E N C H M A R K S  3.3         !
+!                                                                         !
+!                      S E R I A L     V E R S I O N                      !
+!                                                                         !
+!                                   F T                                   !
+!                                                                         !
+!-------------------------------------------------------------------------!
+!                                                                         !
+!    This benchmark is a serial version of the NPB FT code.               !
+!    Refer to NAS Technical Reports 95-020 for details.                   !
+!                                                                         !
+!    Permission to use, copy, distribute and modify this software         !
+!    for any purpose with or without fee is hereby granted.  We           !
+!    request, however, that all derived work reference the NAS            !
+!    Parallel Benchmarks 3.3. This software is provided "as is"           !
+!    without express or implied warranty.                                 !
+!                                                                         !
+!    Information on NPB 3.3, including the technical report, the          !
+!    original specifications, source code, results and information        !
+!    on how to submit new results, is available at:                       !
+!                                                                         !
+!           http://www.nas.nasa.gov/Software/NPB/                         !
+!                                                                         !
+!    Send comments or suggestions to  npb@nas.nasa.gov                    !
+!                                                                         !
+!          NAS Parallel Benchmarks Group                                  !
+!          NASA Ames Research Center                                      !
+!          Mail Stop: T27A-1                                              !
+!          Moffett Field, CA   94035-1000                                 !
+!                                                                         !
+!          E-mail:  npb@nas.nasa.gov                                      !
+!          Fax:     (650) 604-3957                                        !
+!                                                                         !
+!-------------------------------------------------------------------------!
+
+c---------------------------------------------------------------------
+c
+c Authors: D. Bailey
+c          W. Saphir
+c
+c          M. Frumkin
+c
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c FT benchmark
+c---------------------------------------------------------------------
+      program mainft
+         implicit none
+         include 'global.h'
+
+         integer i, niter, fstatus
+         character class
+         double precision total_time, mflops
+         logical verified
+
+         open (unit=2,file='timer.flag',status='old',iostat=fstatus)
+         if (fstatus .eq. 0) then
+            timers_enabled = .true.
+            close(2)
+         else
+            timers_enabled = .false.
+         endif
+
+         niter=niter_default
+
+         write(*, 1000)
+         write(*, 1001) nx, ny, nz
+         write(*, 1002) niter
+         write(*, *)
+
+ 1000    format(//,' NAS Parallel Benchmarks (NPB3.3-SER)',
+     >          ' - FT Benchmark', /)
+ 1001    format(' Size                : ', i4, 'x', i4, 'x', i4)
+ 1002    format(' Iterations          :     ', i10)
+
+         call getclass(class)
+!
+         call appft (niter, total_time, verified)
+!
+         if( total_time .ne. 0. ) then
+           mflops = 1.0d-6*float(ntotal) *
+     >             (14.8157+7.19641*log(float(ntotal))
+     >          +  (5.23518+7.21113*log(float(ntotal)))*niter)
+     >                 /total_time
+         else
+           mflops = 0.0
+         endif
+         call print_results('FT', class, nx, ny, nz, niter,
+     >      total_time, mflops, '          floating point', verified, 
+     >      npbversion, compiletime, cs1, cs2, cs3, cs4, 
+     >      cs5, cs6, cs7)
+!
+      end
+      
+      subroutine getclass(class)
+        implicit none
+        include 'npbparams.h'
+        character class
+        if ((nx .eq. 64) .and. (ny .eq. 64) .and.                 
+     &      (nz .eq. 64) .and. (niter_default .eq. 6)) then
+          class='S'
+        else if ((nx .eq. 128) .and. (ny .eq. 128) .and.
+     &           (nz .eq. 32) .and. (niter_default .eq. 6)) then
+          class='W'
+        else if ((nx .eq. 256) .and. (ny .eq. 256) .and.
+     &           (nz .eq. 128) .and. (niter_default .eq. 6)) then
+          class='A'
+        else if ((nx .eq. 512) .and. (ny .eq. 256) .and.
+     &           (nz .eq. 256) .and. (niter_default .eq. 20)) then
+          class='B'
+        else if ((nx .eq. 512) .and. (ny .eq. 512) .and.
+     &           (nz .eq. 512) .and. (niter_default .eq. 20)) then
+          class='C'
+        else if ((nx .eq. 2048) .and. (ny .eq. 1024) .and.
+     &           (nz .eq. 1024) .and. (niter_default .eq. 25)) then
+          class='D'
+        else
+          class='U'
+        endif
+
+        return
+      end
+      
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/FT/verify.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/FT/verify.f
new file mode 100644
index 0000000..51f94b0
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/FT/verify.f
@@ -0,0 +1,207 @@
+!
+! FT verification routine.
+!
+      subroutine verify(n1, n2, n3, nt, cksum, verified)
+        implicit none
+        include 'npbparams.h'
+!
+! Arguments.
+!
+
+         integer n1, n2, n3, nt
+         double complex cksum(0:nt)
+         logical verified
+
+!
+! Local variables.
+!
+         integer kt
+         double complex cexpd(25)
+         real*8 epsilon, err
+!
+! Initialize tolerance level and success flag.
+!
+         epsilon = 1.0d-12
+         verified = .true.
+!
+         if ((n1 .eq. 64) .and. (n2 .eq. 64) .and.                 
+     &            (n3 .eq. 64) .and. (nt .eq. 6)) then
+!
+! Class S reference values.
+!
+            cexpd(1) = dcmplx(554.6087004964D0, 484.5363331978D0)
+            cexpd(2) = dcmplx(554.6385409189D0, 486.5304269511D0)
+            cexpd(3) = dcmplx(554.6148406171D0, 488.3910722336D0)
+            cexpd(4) = dcmplx(554.5423607415D0, 490.1273169046D0)
+            cexpd(5) = dcmplx(554.4255039624D0, 491.7475857993D0)
+            cexpd(6) = dcmplx(554.2683411902D0, 493.2597244941D0)
+            
+         else if ((n1 .eq. 128) .and. (n2 .eq. 128) .and.                 
+     &            (n3 .eq. 32) .and. (nt .eq. 6)) then
+!
+! Class W reference values.
+!
+            cexpd(1) = dcmplx(567.3612178944D0, 529.3246849175D0)
+            cexpd(2) = dcmplx(563.1436885271D0, 528.2149986629D0)
+            cexpd(3) = dcmplx(559.4024089970D0, 527.0996558037D0)
+            cexpd(4) = dcmplx(556.0698047020D0, 526.0027904925D0)
+            cexpd(5) = dcmplx(553.0898991250D0, 524.9400845633D0)
+            cexpd(6) = dcmplx(550.4159734538D0, 523.9212247086D0)
+!
+         else if ((n1 .eq. 256) .and. (n2 .eq. 256) .and.               
+     &            (n3 .eq. 128) .and. (nt .eq. 6)) then
+!
+! Class A reference values.
+!
+            cexpd(1) = dcmplx(504.6735008193D0, 511.4047905510D0)
+            cexpd(2) = dcmplx(505.9412319734D0, 509.8809666433D0)
+            cexpd(3) = dcmplx(506.9376896287D0, 509.8144042213D0)
+            cexpd(4) = dcmplx(507.7892868474D0, 510.1336130759D0)
+            cexpd(5) = dcmplx(508.5233095391D0, 510.4914655194D0)
+            cexpd(6) = dcmplx(509.1487099959D0, 510.7917842803D0)
+!
+         else if ((n1 .eq. 512) .and. (n2 .eq. 256) .and.               
+     &            (n3 .eq. 256) .and. (nt .eq. 20)) then
+!
+! Class B reference values.
+!
+            cexpd(1)  = dcmplx(517.7643571579D0, 507.7803458597D0)
+            cexpd(2)  = dcmplx(515.4521291263D0, 508.8249431599D0)
+            cexpd(3)  = dcmplx(514.6409228649D0, 509.6208912659D0)
+            cexpd(4)  = dcmplx(514.2378756213D0, 510.1023387619D0)
+            cexpd(5)  = dcmplx(513.9626667737D0, 510.3976610617D0)
+            cexpd(6)  = dcmplx(513.7423460082D0, 510.5948019802D0)
+            cexpd(7)  = dcmplx(513.5547056878D0, 510.7404165783D0)
+            cexpd(8)  = dcmplx(513.3910925466D0, 510.8576573661D0)
+            cexpd(9)  = dcmplx(513.2470705390D0, 510.9577278523D0)
+            cexpd(10) = dcmplx(513.1197729984D0, 511.0460304483D0)
+            cexpd(11) = dcmplx(513.0070319283D0, 511.1252433800D0)
+            cexpd(12) = dcmplx(512.9070537032D0, 511.1968077718D0)
+            cexpd(13) = dcmplx(512.8182883502D0, 511.2616233064D0)
+            cexpd(14) = dcmplx(512.7393733383D0, 511.3203605551D0)
+            cexpd(15) = dcmplx(512.6691062020D0, 511.3735928093D0)
+            cexpd(16) = dcmplx(512.6064276004D0, 511.4218460548D0)
+            cexpd(17) = dcmplx(512.5504076570D0, 511.4656139760D0)
+            cexpd(18) = dcmplx(512.5002331720D0, 511.5053595966D0)
+            cexpd(19) = dcmplx(512.4551951846D0, 511.5415130407D0)
+            cexpd(20) = dcmplx(512.4146770029D0, 511.5744692211D0)
+!
+         else if ((n1 .eq. 512) .and. (n2 .eq. 512) .and.               
+     &            (n3 .eq. 512) .and. (nt .eq. 20)) then
+!
+! Class C reference values.
+!
+            cexpd(1)  = dcmplx(519.5078707457D0, 514.9019699238D0)
+            cexpd(2)  = dcmplx(515.5422171134D0, 512.7578201997D0)
+            cexpd(3)  = dcmplx(514.4678022222D0, 512.2251847514D0)
+            cexpd(4)  = dcmplx(514.0150594328D0, 512.1090289018D0)
+            cexpd(5)  = dcmplx(513.7550426810D0, 512.1143685824D0)
+            cexpd(6)  = dcmplx(513.5811056728D0, 512.1496764568D0)
+            cexpd(7)  = dcmplx(513.4569343165D0, 512.1870921893D0)
+            cexpd(8)  = dcmplx(513.3651975661D0, 512.2193250322D0)
+            cexpd(9)  = dcmplx(513.2955192805D0, 512.2454735794D0)
+            cexpd(10) = dcmplx(513.2410471738D0, 512.2663649603D0)
+            cexpd(11) = dcmplx(513.1971141679D0, 512.2830879827D0)
+            cexpd(12) = dcmplx(513.1605205716D0, 512.2965869718D0)
+            cexpd(13) = dcmplx(513.1290734194D0, 512.3075927445D0)
+            cexpd(14) = dcmplx(513.1012720314D0, 512.3166486553D0)
+            cexpd(15) = dcmplx(513.0760908195D0, 512.3241541685D0)
+            cexpd(16) = dcmplx(513.0528295923D0, 512.3304037599D0)
+            cexpd(17) = dcmplx(513.0310107773D0, 512.3356167976D0)
+            cexpd(18) = dcmplx(513.0103090133D0, 512.3399592211D0)
+            cexpd(19) = dcmplx(512.9905029333D0, 512.3435588985D0)
+            cexpd(20) = dcmplx(512.9714421109D0, 512.3465164008D0)
+!
+         else if ((n1 .eq. 2048) .and. (n2 .eq. 1024) .and.               
+     &            (n3 .eq. 1024) .and. (nt .eq. 25)) then
+!
+! Class D reference values.
+!
+            cexpd(1)  = dcmplx(512.2230065252D0, 511.8534037109D0)
+            cexpd(2)  = dcmplx(512.0463975765D0, 511.7061181082D0)
+            cexpd(3)  = dcmplx(511.9865766760D0, 511.7096364601D0)
+            cexpd(4)  = dcmplx(511.9518799488D0, 511.7373863950D0)
+            cexpd(5)  = dcmplx(511.9269088223D0, 511.7680347632D0)
+            cexpd(6)  = dcmplx(511.9082416858D0, 511.7967875532D0)
+            cexpd(7)  = dcmplx(511.8943814638D0, 511.8225281841D0)
+            cexpd(8)  = dcmplx(511.8842385057D0, 511.8451629348D0)
+            cexpd(9)  = dcmplx(511.8769435632D0, 511.8649119387D0)
+            cexpd(10) = dcmplx(511.8718203448D0, 511.8820803844D0)
+            cexpd(11) = dcmplx(511.8683569061D0, 511.8969781011D0)
+            cexpd(12) = dcmplx(511.8661708593D0, 511.9098918835D0)
+            cexpd(13) = dcmplx(511.8649768950D0, 511.9210777066D0)
+            cexpd(14) = dcmplx(511.8645605626D0, 511.9307604484D0)
+            cexpd(15) = dcmplx(511.8647586618D0, 511.9391362671D0)
+            cexpd(16) = dcmplx(511.8654451572D0, 511.9463757241D0)
+            cexpd(17) = dcmplx(511.8665212451D0, 511.9526269238D0)
+            cexpd(18) = dcmplx(511.8679083821D0, 511.9580184108D0)
+            cexpd(19) = dcmplx(511.8695433664D0, 511.9626617538D0)
+            cexpd(20) = dcmplx(511.8713748264D0, 511.9666538138D0)
+            cexpd(21) = dcmplx(511.8733606701D0, 511.9700787219D0)
+            cexpd(22) = dcmplx(511.8754661974D0, 511.9730095953D0)
+            cexpd(23) = dcmplx(511.8776626738D0, 511.9755100241D0)
+            cexpd(24) = dcmplx(511.8799262314D0, 511.9776353561D0)
+            cexpd(25) = dcmplx(511.8822370068D0, 511.9794338060D0)
+!
+         else if ((n1 .eq. 4096) .and. (n2 .eq. 2048) .and.               
+     &            (n3 .eq. 2048) .and. (nt .eq. 25)) then
+!
+! Class E reference values.
+!
+            cexpd(1)  = dcmplx(512.1601045346D0, 511.7395998266D0)
+            cexpd(2)  = dcmplx(512.0905403678D0, 511.8614716182D0)
+            cexpd(3)  = dcmplx(512.0623229306D0, 511.9074203747D0)
+            cexpd(4)  = dcmplx(512.0438418997D0, 511.9345900733D0)
+            cexpd(5)  = dcmplx(512.0311521872D0, 511.9551325550D0)
+            cexpd(6)  = dcmplx(512.0226088809D0, 511.9720179919D0)
+            cexpd(7)  = dcmplx(512.0169296534D0, 511.9861371665D0)
+            cexpd(8)  = dcmplx(512.0131225172D0, 511.9979364402D0)
+            cexpd(9)  = dcmplx(512.0104767108D0, 512.0077674092D0)
+            cexpd(10) = dcmplx(512.0085127969D0, 512.0159443121D0)
+            cexpd(11) = dcmplx(512.0069224127D0, 512.0227453670D0)
+            cexpd(12) = dcmplx(512.0055158164D0, 512.0284096041D0)
+            cexpd(13) = dcmplx(512.0041820159D0, 512.0331373793D0)
+            cexpd(14) = dcmplx(512.0028605402D0, 512.0370938679D0)
+            cexpd(15) = dcmplx(512.0015223011D0, 512.0404138831D0)
+            cexpd(16) = dcmplx(512.0001570022D0, 512.0432068837D0)
+            cexpd(17) = dcmplx(511.9987650555D0, 512.0455615860D0)
+            cexpd(18) = dcmplx(511.9973525091D0, 512.0475499442D0)
+            cexpd(19) = dcmplx(511.9959279472D0, 512.0492304629D0)
+            cexpd(20) = dcmplx(511.9945006558D0, 512.0506508902D0)
+            cexpd(21) = dcmplx(511.9930795911D0, 512.0518503782D0)
+            cexpd(22) = dcmplx(511.9916728462D0, 512.0528612016D0)
+            cexpd(23) = dcmplx(511.9902874185D0, 512.0537101195D0)
+            cexpd(24) = dcmplx(511.9889291565D0, 512.0544194514D0)
+            cexpd(25) = dcmplx(511.9876028049D0, 512.0550079284D0)
+!
+         else
+!
+            write (*,    120) 'not performed'
+            verified = .false.
+!
+         end if
+!
+! Verification test for results.
+!
+         if (verified) then
+
+            do kt = 1, nt
+              err = abs((cksum(kt)-cexpd(kt))/cexpd(kt))
+              if (.not.(err.le.epsilon)) then
+                verified = .false.
+                goto 100
+              endif     
+            end do
+  100       continue
+
+            if (verified) then
+               write (*,    120) 'successful'
+            else
+               write (*,    120) 'failed'
+            end if
+
+  120       format (' Verification test for FT ', a)
+         end if
+!
+         return
+      end
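Note on the FT verification step above: each computed checksum is compared with its class reference value by relative error, and the run fails at the first checksum that exceeds the tolerance. The C sketch below illustrates the same test. The `struct cplx` type and the `checksums_verified` helper are hypothetical names introduced for this note and are not part of NPB; the tolerance is left as a parameter. The comparison is done on squared magnitudes, which is mathematically equivalent to the relative-error test but may round slightly differently at the tolerance boundary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical complex type for this sketch (not an NPB type). */
struct cplx { double re, im; };

/* Returns true when every computed checksum lies within the relative
   tolerance epsilon of its reference value, i.e. when
   |cksum - cexpd| <= epsilon * |cexpd| for all entries. */
bool checksums_verified(const struct cplx *cksum,
                        const struct cplx *cexpd,
                        size_t nt, double epsilon)
{
    assert(cksum != NULL && cexpd != NULL && epsilon >= 0.0);

    for (size_t kt = 0; kt < nt; kt++) {
        double dr = cksum[kt].re - cexpd[kt].re;
        double di = cksum[kt].im - cexpd[kt].im;
        double diff2 = dr * dr + di * di;
        double ref2  = cexpd[kt].re * cexpd[kt].re
                     + cexpd[kt].im * cexpd[kt].im;

        /* Written as !(a <= b) so that a NaN checksum also fails,
           mirroring the .not.(err.le.epsilon) form of the Fortran test. */
        if (!(diff2 <= epsilon * epsilon * ref2))
            return false;
    }
    return true;
}
```

A caller would fill `cexpd` from the table for the selected class and pass the per-iteration checksums in `cksum`.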
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/IS/Makefile b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/IS/Makefile
new file mode 100644
index 0000000..30e474d
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/IS/Makefile
@@ -0,0 +1,26 @@
+SHELL=/bin/sh
+BENCHMARK=is
+BENCHMARKU=IS
+
+include ../config/make.def
+
+include ../sys/make.common
+
+OBJS = is.o \
+       ${COMMON}/c_print_results.o \
+       ${COMMON}/c_timers.o \
+       ${COMMON}/c_wtime.o
+
+
+${PROGRAM}: config ${OBJS}
+	${CLINK} ${CLINKFLAGS} -o ${PROGRAM} ${OBJS} ${C_LIB}
+
+.c.o:
+	${CCOMPILE} $<
+
+is.o:             is.c  npbparams.h
+
+
+clean:
+	- rm -f *.o *~ mputil*
+	- rm -f npbparams.h core
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/IS/README.carefully b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/IS/README.carefully
new file mode 100644
index 0000000..e782185
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/IS/README.carefully
@@ -0,0 +1,39 @@
+Please note:  The IS code in this directory known as is.c is the most
+compact serial version of the NPB2.3 parallel IS that it is possible
+to devise.  As such, it is completely unnecessary to have any notion 
+of buckets at all in order to correctly solve the specified NPB1 IS 
+benchmark problem.  
+
+Nevertheless, it is possible to turn on bucketing via #ifdef'ed code.
+Then, the sort first rearranges the keys into buckets by range (the
+bucket's ranges evenly subdivide the total key range), and then
+ranks the contents of each bucket.  This results in key transfers
+first into contiguous elements of buckets.  This is relatively
+cache efficient, since there are a relatively small number of buckets.
+Then the key counting that occurs accesses contiguous array elements.
+Once again, accesses reuse cache lines efficiently.  Finally, the 
+accumulation of key multiplicities (the key count) which gives the key
+ranks also reuses cache lines efficiently.
+
+But using the buckets more than doubles the amount of computational
+work that must be performed.  On machines with very large caches, the 
+aforementioned benefits may not exist, and the extra processing looks
+expensive. These examples apply to both CLASS A and B problems:
+
+    SP2-66MhzWN:  50% speedup with buckets                          
+    SGI Indy5000: 50% slowdown with buckets             
+    SGI O2000:   400% slowdown with buckets (Wow!)                
+
+
+Default setting is 
+
+    #define USE_BUCKETS
+
+i.e., buckets turned on!  To switch the setting off, simply comment
+out this line.
+
+It is a conjecture that cache access is the underlying mechanism 
+causing these variations.
+
+Note: If reporting timing results, either of these modes may be used 
+      without penalty.
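Note on the two modes described in README.carefully: the C sketch below shows, under simplifying assumptions, why both modes give the same ranks. The function names `rank_direct` and `rank_bucketed` are hypothetical and do not appear in is.c; keys are assumed to lie in the range [0, 2^max_key_log2). The bucketed variant first groups keys by their high-order bits and then counts from the grouped copy, so the final cumulative counts match the direct variant and only the memory access pattern differs.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Count each key, then accumulate so count[k] = number of keys <= k. */
static void count_and_accumulate(const int *keys, size_t n,
                                 int *count, size_t max_key)
{
    memset(count, 0, max_key * sizeof *count);
    for (size_t i = 0; i < n; i++)
        count[keys[i]]++;
    for (size_t k = 1; k < max_key; k++)
        count[k] += count[k - 1];
}

/* Ranking without buckets: count straight from the input order. */
void rank_direct(const int *keys, size_t n, int *count, size_t max_key)
{
    count_and_accumulate(keys, n, count, max_key);
}

/* Ranking with buckets: group keys by range first, then count.
   Returns 0 on success, -1 if memory allocation fails. */
int rank_bucketed(const int *keys, size_t n, int *count,
                  int max_key_log2, int num_buckets_log2)
{
    assert(num_buckets_log2 >= 0 && max_key_log2 >= num_buckets_log2);
    assert(max_key_log2 < 31);

    size_t max_key     = (size_t)1 << max_key_log2;
    size_t num_buckets = (size_t)1 << num_buckets_log2;
    int    shift       = max_key_log2 - num_buckets_log2;

    size_t *bucket_ptr = calloc(num_buckets + 1, sizeof *bucket_ptr);
    int    *grouped    = malloc((n ? n : 1) * sizeof *grouped);
    if (bucket_ptr == NULL || grouped == NULL) {
        free(bucket_ptr);
        free(grouped);
        return -1;
    }

    /* Bucket sizes, stored one slot up so that the prefix sum below
       leaves bucket_ptr[b] holding the start offset of bucket b. */
    for (size_t i = 0; i < n; i++)
        bucket_ptr[(size_t)(keys[i] >> shift) + 1]++;
    for (size_t b = 1; b <= num_buckets; b++)
        bucket_ptr[b] += bucket_ptr[b - 1];

    /* Scatter each key into the next free slot of its bucket. */
    for (size_t i = 0; i < n; i++)
        grouped[bucket_ptr[keys[i] >> shift]++] = keys[i];

    count_and_accumulate(grouped, n, count, max_key);

    free(bucket_ptr);
    free(grouped);
    return 0;
}
```

The extra scatter pass is the additional work the README refers to; whether it pays off depends on how the cache handles the counting pass on a given machine.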
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/IS/is.c b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/IS/is.c
new file mode 100644
index 0000000..a89dccd
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/IS/is.c
@@ -0,0 +1,804 @@
+/*************************************************************************
+ *                                                                       * 
+ *       N  A  S     P A R A L L E L     B E N C H M A R K S  3.3        *
+ *                                                                       *
+ *                       S E R I A L    V E R S I O N                    * 
+ *                                                                       * 
+ *                                  I S                                  * 
+ *                                                                       * 
+ ************************************************************************* 
+ *                                                                       * 
+ *   This benchmark is a serial version of the NPB IS code.              *
+ *   Refer to NAS Technical Reports 95-020 for details.                  *
+ *                                                                       *
+ *   Permission to use, copy, distribute and modify this software        *
+ *   for any purpose with or without fee is hereby granted.  We          *
+ *   request, however, that all derived work reference the NAS           *
+ *   Parallel Benchmarks 3.3. This software is provided "as is"          *
+ *   without express or implied warranty.                                *
+ *                                                                       *
+ *   Information on NPB 3.3, including the technical report, the         *
+ *   original specifications, source code, results and information       *
+ *   on how to submit new results, is available at:                      *
+ *                                                                       *
+ *          http://www.nas.nasa.gov/Software/NPB/                        *
+ *                                                                       *
+ *   Send comments or suggestions to  npb@nas.nasa.gov                   *
+ *                                                                       *
+ *         NAS Parallel Benchmarks Group                                 *
+ *         NASA Ames Research Center                                     *
+ *         Mail Stop: T27A-1                                             *
+ *         Moffett Field, CA   94035-1000                                *
+ *                                                                       *
+ *         E-mail:  npb@nas.nasa.gov                                     *
+ *         Fax:     (650) 604-3957                                       *
+ *                                                                       *
+ ************************************************************************* 
+ *                                                                       * 
+ *   Author: M. Yarrow                                                   * 
+ *           H. Jin                                                      * 
+ *                                                                       * 
+ *************************************************************************/
+
+#include "npbparams.h"
+#include <stdlib.h>
+#include <stdio.h>
+
+
+/*****************************************************************/
+/* For serial IS, buckets are not really req'd to solve NPB1 IS  */
+/* spec, but their use on some machines improves performance, on */
+/* other machines the use of buckets compromises performance,    */
+/* probably because it is extra computation which is not req'd.  */
+/* (Note: Mechanism not understood, probably cache related)      */
+/* Example:  SP2-66MhzWN:  50% speedup with buckets              */
+/* Example:  SGI Indy5000: 50% slowdown with buckets             */
+/* Example:  SGI O2000:   400% slowdown with buckets (Wow!)      */
+/*****************************************************************/
+/* To disable the use of buckets, comment out the following line */
+#define USE_BUCKETS
+
+
+/******************/
+/* default values */
+/******************/
+#ifndef CLASS
+#define CLASS 'S'
+#endif
+
+
+/*************/
+/*  CLASS S  */
+/*************/
+#if CLASS == 'S'
+#define  TOTAL_KEYS_LOG_2    16
+#define  MAX_KEY_LOG_2       11
+#define  NUM_BUCKETS_LOG_2   9
+#endif
+
+
+/*************/
+/*  CLASS W  */
+/*************/
+#if CLASS == 'W'
+#define  TOTAL_KEYS_LOG_2    20
+#define  MAX_KEY_LOG_2       16
+#define  NUM_BUCKETS_LOG_2   10
+#endif
+
+/*************/
+/*  CLASS A  */
+/*************/
+#if CLASS == 'A'
+#define  TOTAL_KEYS_LOG_2    23
+#define  MAX_KEY_LOG_2       19
+#define  NUM_BUCKETS_LOG_2   10
+#endif
+
+
+/*************/
+/*  CLASS B  */
+/*************/
+#if CLASS == 'B'
+#define  TOTAL_KEYS_LOG_2    25
+#define  MAX_KEY_LOG_2       21
+#define  NUM_BUCKETS_LOG_2   10
+#endif
+
+
+/*************/
+/*  CLASS C  */
+/*************/
+#if CLASS == 'C'
+#define  TOTAL_KEYS_LOG_2    27
+#define  MAX_KEY_LOG_2       23
+#define  NUM_BUCKETS_LOG_2   10
+#endif
+
+
+/*************/
+/*  CLASS D  */
+/*************/
+#if CLASS == 'D'
+#define  TOTAL_KEYS_LOG_2    31
+#define  MAX_KEY_LOG_2       27
+#define  NUM_BUCKETS_LOG_2   10
+#endif
+
+
+#if CLASS == 'D'
+#define  TOTAL_KEYS          (1L << TOTAL_KEYS_LOG_2)
+#else
+#define  TOTAL_KEYS          (1 << TOTAL_KEYS_LOG_2)
+#endif
+#define  MAX_KEY             (1 << MAX_KEY_LOG_2)
+#define  NUM_BUCKETS         (1 << NUM_BUCKETS_LOG_2)
+#define  NUM_KEYS            TOTAL_KEYS
+#define  SIZE_OF_BUFFERS     NUM_KEYS  
+                                           
+
+#define  MAX_ITERATIONS      10
+#define  TEST_ARRAY_SIZE     5
+
+
+/*************************************/
+/* Typedef: if necessary, change the */
+/* size of int here by changing the  */
+/* int type to, say, long            */
+/*************************************/
+#if CLASS == 'D'
+typedef  long INT_TYPE;
+#else
+typedef  int  INT_TYPE;
+#endif
+
+
+/********************/
+/* Some global info */
+/********************/
+INT_TYPE *key_buff_ptr_global;         /* used by full_verify to get */
+                                       /* copies of rank info        */
+
+int      passed_verification;
+                                 
+
+/************************************/
+/* These are the three main arrays. */
+/* See SIZE_OF_BUFFERS def above    */
+/************************************/
+INT_TYPE key_array[SIZE_OF_BUFFERS],    
+         key_buff1[MAX_KEY],    
+         key_buff2[SIZE_OF_BUFFERS],
+         partial_verify_vals[TEST_ARRAY_SIZE];
+
+#ifdef USE_BUCKETS
+INT_TYPE bucket_size[NUM_BUCKETS],                    
+         bucket_ptrs[NUM_BUCKETS];
+#endif
+
+
+/**********************/
+/* Partial verif info */
+/**********************/
+INT_TYPE test_index_array[TEST_ARRAY_SIZE],
+         test_rank_array[TEST_ARRAY_SIZE],
+
+         S_test_index_array[TEST_ARRAY_SIZE] = 
+                             {48427,17148,23627,62548,4431},
+         S_test_rank_array[TEST_ARRAY_SIZE] = 
+                             {0,18,346,64917,65463},
+
+         W_test_index_array[TEST_ARRAY_SIZE] = 
+                             {357773,934767,875723,898999,404505},
+         W_test_rank_array[TEST_ARRAY_SIZE] = 
+                             {1249,11698,1039987,1043896,1048018},
+
+         A_test_index_array[TEST_ARRAY_SIZE] = 
+                             {2112377,662041,5336171,3642833,4250760},
+         A_test_rank_array[TEST_ARRAY_SIZE] = 
+                             {104,17523,123928,8288932,8388264},
+
+         B_test_index_array[TEST_ARRAY_SIZE] = 
+                             {41869,812306,5102857,18232239,26860214},
+         B_test_rank_array[TEST_ARRAY_SIZE] = 
+                             {33422937,10244,59149,33135281,99}, 
+
+         C_test_index_array[TEST_ARRAY_SIZE] = 
+                             {44172927,72999161,74326391,129606274,21736814},
+         C_test_rank_array[TEST_ARRAY_SIZE] = 
+                             {61147,882988,266290,133997595,133525895},
+
+         D_test_index_array[TEST_ARRAY_SIZE] = 
+                             {1317351170,995930646,1157283250,1503301535,1453734525},
+         D_test_rank_array[TEST_ARRAY_SIZE] = 
+                             {1,36538729,1978098519,2145192618,2147425337};
+
+
+
+/***********************/
+/* function prototypes */
+/***********************/
+double	randlc( double *X, double *A );
+
+void full_verify( void );
+
+void c_print_results( char   *name,
+                      char   class,
+                      int    n1, 
+                      int    n2,
+                      int    n3,
+                      int    niter,
+                      double t,
+                      double mops,
+		      char   *optype,
+                      int    passed_verification,
+                      char   *npbversion,
+                      char   *compiletime,
+                      char   *cc,
+                      char   *clink,
+                      char   *c_lib,
+                      char   *c_inc,
+                      char   *cflags,
+                      char   *clinkflags );
+
+
+void    timer_clear( int n );
+void    timer_start( int n );
+void    timer_stop( int n );
+double  timer_read( int n );
+
+
+/*
+ *    FUNCTION RANDLC (X, A)
+ *
+ *  This routine returns a uniform pseudorandom double precision number in the
+ *  range (0, 1) by using the linear congruential generator
+ *
+ *  x_{k+1} = a x_k  (mod 2^46)
+ *
+ *  where 0 < x_k < 2^46 and 0 < a < 2^46.  This scheme generates 2^44 numbers
+ *  before repeating.  The argument A is the same as 'a' in the above formula,
+ *  and X is the same as x_0.  A and X must be odd double precision integers
+ *  in the range (1, 2^46).  The returned value RANDLC is normalized to be
+ *  between 0 and 1, i.e. RANDLC = 2^(-46) * x_1.  X is updated to contain
+ *  the new seed x_1, so that subsequent calls to RANDLC using the same
+ *  arguments will generate a continuous sequence.
+ *
+ *  This routine should produce the same results on any computer with at least
+ *  48 mantissa bits in double precision floating point data.  On Cray systems,
+ *  double precision should be disabled.
+ *
+ *  David H. Bailey     October 26, 1990
+ *
+ *     IMPLICIT DOUBLE PRECISION (A-H, O-Z)
+ *     SAVE KS, R23, R46, T23, T46
+ *     DATA KS/0/
+ *
+ *  If this is the first call to RANDLC, compute R23 = 2 ^ -23, R46 = 2 ^ -46,
+ *  T23 = 2 ^ 23, and T46 = 2 ^ 46.  These are computed in loops, rather than
+ *  by merely using the ** operator, in order to ensure that the results are
+ *  exact on all systems.  This code assumes that 0.5D0 is represented exactly.
+ */
+
+
+/*****************************************************************/
+/*************           R  A  N  D  L  C             ************/
+/*************                                        ************/
+/*************    portable random number generator    ************/
+/*****************************************************************/
+
+double	randlc( double *X, double *A )
+{
+      static int        KS=0;
+      static double	R23, R46, T23, T46;
+      double		T1, T2, T3, T4;
+      double		A1;
+      double		A2;
+      double		X1;
+      double		X2;
+      double		Z;
+      int     		i, j;
+
+      if (KS == 0) 
+      {
+        R23 = 1.0;
+        R46 = 1.0;
+        T23 = 1.0;
+        T46 = 1.0;
+    
+        for (i=1; i<=23; i++)
+        {
+          R23 = 0.50 * R23;
+          T23 = 2.0 * T23;
+        }
+        for (i=1; i<=46; i++)
+        {
+          R46 = 0.50 * R46;
+          T46 = 2.0 * T46;
+        }
+        KS = 1;
+      }
+
+/*  Break A into two parts such that A = 2^23 * A1 + A2 and set X = N.  */
+
+      T1 = R23 * *A;
+      j  = T1;
+      A1 = j;
+      A2 = *A - T23 * A1;
+
+/*  Break X into two parts such that X = 2^23 * X1 + X2, compute
+    Z = A1 * X2 + A2 * X1  (mod 2^23), and then
+    X = 2^23 * Z + A2 * X2  (mod 2^46).                            */
+
+      T1 = R23 * *X;
+      j  = T1;
+      X1 = j;
+      X2 = *X - T23 * X1;
+      T1 = A1 * X2 + A2 * X1;
+      
+      j  = R23 * T1;
+      T2 = j;
+      Z = T1 - T23 * T2;
+      T3 = T23 * Z + A2 * X2;
+      j  = R46 * T3;
+      T4 = j;
+      *X = T3 - T46 * T4;
+      return(R46 * *X);
+} 
+
+
+
+
+/*****************************************************************/
+/*************      C  R  E  A  T  E  _  S  E  Q      ************/
+/*****************************************************************/
+
+void	create_seq( double seed, double a )
+{
+	double x;
+	int    i, k;
+
+        k = MAX_KEY/4;
+
+	for (i=0; i<NUM_KEYS; i++)
+	{
+	    x = randlc(&seed, &a);
+	    x += randlc(&seed, &a);
+    	    x += randlc(&seed, &a);
+	    x += randlc(&seed, &a);  
+
+            key_array[i] = k*x;
+	}
+}
+
+
+
+
+/*****************************************************************/
+/*************    F  U  L  L  _  V  E  R  I  F  Y     ************/
+/*****************************************************************/
+
+
+void full_verify( void )
+{
+    INT_TYPE    i, j;
+
+
+    
+/*  Now, finally, sort the keys:  */
+
+#ifdef USE_BUCKETS
+
+    /* key_buff2[] already has the proper information, so do nothing */
+
+#else
+
+/*  Copy keys into work array; keys in key_array will be reassigned. */
+    for( i=0; i<NUM_KEYS; i++ )
+        key_buff2[i] = key_array[i];
+
+#endif
+
+    for( i=0; i<NUM_KEYS; i++ )
+        key_array[--key_buff_ptr_global[key_buff2[i]]] = key_buff2[i];
+
+
+/*  Confirm keys correctly sorted: count incorrectly sorted keys, if any */
+
+    j = 0;
+    for( i=1; i<NUM_KEYS; i++ )
+        if( key_array[i-1] > key_array[i] )
+            j++;
+
+
+    if( j != 0 )
+    {
+        printf( "Full_verify: number of keys out of sort: %ld\n",
+                (long)j );
+    }
+    else
+        passed_verification++;
+           
+
+}
+
+
+
+
+/*****************************************************************/
+/*************             R  A  N  K             ****************/
+/*****************************************************************/
+
+
+void rank( int iteration )
+{
+
+    INT_TYPE    i, k;
+
+    INT_TYPE    *key_buff_ptr, *key_buff_ptr2;
+
+#ifdef USE_BUCKETS
+    int shift = MAX_KEY_LOG_2 - NUM_BUCKETS_LOG_2;
+    INT_TYPE    key;
+#endif
+
+
+    key_array[iteration] = iteration;
+    key_array[iteration+MAX_ITERATIONS] = MAX_KEY - iteration;
+
+
+/*  Determine where the partial verify test keys are, load into  */
+/*  partial_verify_vals                                          */
+    for( i=0; i<TEST_ARRAY_SIZE; i++ )
+        partial_verify_vals[i] = key_array[test_index_array[i]];
+
+#ifdef USE_BUCKETS
+
+/*  Initialize */
+    for( i=0; i<NUM_BUCKETS; i++ )  
+        bucket_size[i] = 0;
+
+/*  Determine the number of keys in each bucket */
+    for( i=0; i<NUM_KEYS; i++ )
+        bucket_size[key_array[i] >> shift]++;
+
+
+/*  Cumulative bucket sizes are the bucket pointers */
+    bucket_ptrs[0] = 0;
+    for( i=1; i< NUM_BUCKETS; i++ )  
+        bucket_ptrs[i] = bucket_ptrs[i-1] + bucket_size[i-1];
+
+
+/*  Sort into appropriate bucket */
+    for( i=0; i<NUM_KEYS; i++ )  
+    {
+        key = key_array[i];
+        key_buff2[bucket_ptrs[key >> shift]++] = key;
+    }
+
+    key_buff_ptr2 = key_buff2;
+
+#else
+
+    key_buff_ptr2 = key_array;
+
+#endif
+
+/*  Clear the work array */
+    for( i=0; i<MAX_KEY; i++ )
+        key_buff1[i] = 0;
+
+
+/*  Ranking of all keys occurs in this section:                 */
+
+    key_buff_ptr = key_buff1;
+
+/*  In this section, the keys themselves are used as their 
+    own indexes to determine how many of each there are: their
+    individual population                                       */
+
+    for( i=0; i<NUM_KEYS; i++ )
+        key_buff_ptr[key_buff_ptr2[i]]++;  /* Now they have individual key   */
+                                       /* population                     */
+
+/*  To obtain ranks of each key, successively add the individual key
+    population                                                  */
+
+
+    for( i=0; i<MAX_KEY-1; i++ )   
+        key_buff_ptr[i+1] += key_buff_ptr[i];  
+
+
+/* This is the partial verify test section */
+/* Observe that test_rank_array vals are   */
+/* shifted differently for different cases */
+    for( i=0; i<TEST_ARRAY_SIZE; i++ )
+    {                                             
+        k = partial_verify_vals[i];          /* test vals were put here */
+        if( 0 < k  &&  k <= NUM_KEYS-1 )
+        {
+            INT_TYPE key_rank = key_buff_ptr[k-1];
+            int failed = 0;
+
+            switch( CLASS )
+            {
+                case 'S':
+                    if( i <= 2 )
+                    {
+                        if( key_rank != test_rank_array[i]+iteration )
+                            failed = 1;
+                        else
+                            passed_verification++;
+                    }
+                    else
+                    {
+                        if( key_rank != test_rank_array[i]-iteration )
+                            failed = 1;
+                        else
+                            passed_verification++;
+                    }
+                    break;
+                case 'W':
+                    if( i < 2 )
+                    {
+                        if( key_rank != test_rank_array[i]+(iteration-2) )
+                            failed = 1;
+                        else
+                            passed_verification++;
+                    }
+                    else
+                    {
+                        if( key_rank != test_rank_array[i]-iteration )
+                            failed = 1;
+                        else
+                            passed_verification++;
+                    }
+                    break;
+                case 'A':
+                    if( i <= 2 )
+        	    {
+                        if( key_rank != test_rank_array[i]+(iteration-1) )
+                            failed = 1;
+                        else
+                            passed_verification++;
+        	    }
+                    else
+                    {
+                        if( key_rank != test_rank_array[i]-(iteration-1) )
+                            failed = 1;
+                        else
+                            passed_verification++;
+                    }
+                    break;
+                case 'B':
+                    if( i == 1 || i == 2 || i == 4 )
+        	    {
+                        if( key_rank != test_rank_array[i]+iteration )
+                            failed = 1;
+                        else
+                            passed_verification++;
+        	    }
+                    else
+                    {
+                        if( key_rank != test_rank_array[i]-iteration )
+                            failed = 1;
+                        else
+                            passed_verification++;
+                    }
+                    break;
+                case 'C':
+                    if( i <= 2 )
+        	    {
+                        if( key_rank != test_rank_array[i]+iteration )
+                            failed = 1;
+                        else
+                            passed_verification++;
+        	    }
+                    else
+                    {
+                        if( key_rank != test_rank_array[i]-iteration )
+                            failed = 1;
+                        else
+                            passed_verification++;
+                    }
+                    break;
+                case 'D':
+                    if( i < 2 )
+        	    {
+                        if( key_rank != test_rank_array[i]+iteration )
+                            failed = 1;
+                        else
+                            passed_verification++;
+        	    }
+                    else
+                    {
+                        if( key_rank != test_rank_array[i]-iteration )
+                            failed = 1;
+                        else
+                            passed_verification++;
+                    }
+                    break;
+            }
+            if( failed == 1 )
+                printf( "Failed partial verification: "
+                        "iteration %d, test key %d\n", 
+                         iteration, (int)i );
+        }
+    }
+
+
+
+
+/*  Make copies of rank info for use by full_verify: these variables
+    in rank are local; making them global slows down the code, probably
+    since they cannot be made register by compiler                        */
+
+    if( iteration == MAX_ITERATIONS ) 
+        key_buff_ptr_global = key_buff_ptr;
+
+}      
+
+
+/*****************************************************************/
+/*************             M  A  I  N             ****************/
+/*****************************************************************/
+
+int main( int argc, char **argv )
+{
+
+    int             i, iteration, timer_on;
+
+    double          timecounter;
+
+    FILE            *fp;
+
+
+/*  Initialize timers  */
+    timer_on = 0;            
+    if ((fp = fopen("timer.flag", "r")) != NULL) {
+        fclose(fp);
+        timer_on = 1;
+    }
+    timer_clear( 0 );
+    if (timer_on) {
+        timer_clear( 1 );
+        timer_clear( 2 );
+        timer_clear( 3 );
+    }
+
+    if (timer_on) timer_start( 3 );
+
+
+/*  Initialize the verification arrays if a valid class */
+    for( i=0; i<TEST_ARRAY_SIZE; i++ )
+        switch( CLASS )
+        {
+            case 'S':
+                test_index_array[i] = S_test_index_array[i];
+                test_rank_array[i]  = S_test_rank_array[i];
+                break;
+            case 'A':
+                test_index_array[i] = A_test_index_array[i];
+                test_rank_array[i]  = A_test_rank_array[i];
+                break;
+            case 'W':
+                test_index_array[i] = W_test_index_array[i];
+                test_rank_array[i]  = W_test_rank_array[i];
+                break;
+            case 'B':
+                test_index_array[i] = B_test_index_array[i];
+                test_rank_array[i]  = B_test_rank_array[i];
+                break;
+            case 'C':
+                test_index_array[i] = C_test_index_array[i];
+                test_rank_array[i]  = C_test_rank_array[i];
+                break;
+            case 'D':
+                test_index_array[i] = D_test_index_array[i];
+                test_rank_array[i]  = D_test_rank_array[i];
+                break;
+        };
+
+        
+
+/*  Printout initial NPB info */
+    printf
+      ( "\n\n NAS Parallel Benchmarks (NPB3.3-SER) - IS Benchmark\n\n" );
+    printf( " Size:  %ld  (class %c)\n", (long)TOTAL_KEYS, CLASS );
+    printf( " Iterations:   %d\n", MAX_ITERATIONS );
+
+    if (timer_on) timer_start( 1 );
+
+/*  Generate random number sequence and subsequent keys on all procs */
+    create_seq( 314159265.00,                    /* Random number gen seed */
+                1220703125.00 );                 /* Random number gen mult */
+    if (timer_on) timer_stop( 1 );
+
+
+/*  Do one iteration for free (i.e., untimed) to guarantee initialization of
+    all data and code pages and respective tables */
+    rank( 1 );  
+
+/*  Start verification counter */
+    passed_verification = 0;
+
+    if( CLASS != 'S' ) printf( "\n   iteration\n" );
+
+/*  Start timer  */             
+    timer_start( 0 );
+
+
+/*  This is the main iteration */
+    for( iteration=1; iteration<=MAX_ITERATIONS; iteration++ )
+    {
+        if( CLASS != 'S' ) printf( "        %d\n", iteration );
+        rank( iteration );
+    }
+
+
+/*  End of timing, obtain elapsed time */
+    timer_stop( 0 );
+    timecounter = timer_read( 0 );
+
+
+/*  This tests that keys are in sequence: sorting of last ranked key seq
+    occurs here, but is an untimed operation                             */
+    if (timer_on) timer_start( 2 );
+    full_verify();
+    if (timer_on) timer_stop( 2 );
+
+    if (timer_on) timer_stop( 3 );
+
+
+/*  The final printout  */
+    if( passed_verification != 5*MAX_ITERATIONS + 1 )
+        passed_verification = 0;
+    c_print_results( "IS",
+                     CLASS,
+                     (int)(TOTAL_KEYS/64),
+                     64,
+                     0,
+                     MAX_ITERATIONS,
+                     timecounter,
+                     ((double) (MAX_ITERATIONS*TOTAL_KEYS))
+                                                  /timecounter/1000000.,
+                     "keys ranked", 
+                     passed_verification,
+                     NPBVERSION,
+                     COMPILETIME,
+                     CC,
+                     CLINK,
+                     C_LIB,
+                     C_INC,
+                     CFLAGS,
+                     CLINKFLAGS );
+
+
+/*  Print additional timers  */
+    if (timer_on) {
+       double t_total, t_percent;
+
+       t_total = timer_read( 3 );
+       printf("\nAdditional timers -\n");
+       printf(" Total execution: %8.3f\n", t_total);
+       if (t_total == 0.0) t_total = 1.0;
+       timecounter = timer_read(1);
+       t_percent = timecounter/t_total * 100.;
+       printf(" Initialization : %8.3f (%5.2f%%)\n", timecounter, t_percent);
+       timecounter = timer_read(0);
+       t_percent = timecounter/t_total * 100.;
+       printf(" Benchmarking   : %8.3f (%5.2f%%)\n", timecounter, t_percent);
+       timecounter = timer_read(2);
+       t_percent = timecounter/t_total * 100.;
+       printf(" Sorting        : %8.3f (%5.2f%%)\n", timecounter, t_percent);
+    }
+
+
+    return 0;
+         /**************************/
+}        /*  E N D  P R O G R A M  */
+         /**************************/
+
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/Makefile b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/Makefile
new file mode 100644
index 0000000..5fa7a3c
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/Makefile
@@ -0,0 +1,64 @@
+SHELL=/bin/sh
+BENCHMARK=lu
+BENCHMARKU=LU
+VEC=
+
+include ../config/make.def
+
+OBJS = lu.o read_input.o \
+       domain.o setcoeff.o setbv.o exact.o setiv.o \
+       erhs.o ssor$(VEC).o rhs$(VEC).o l2norm.o \
+       jacld.o blts$(VEC).o jacu.o buts$(VEC).o error.o \
+       pintgr.o verify.o ${COMMON}/print_results.o \
+       ${COMMON}/timers.o ${COMMON}/wtime.o
+
+include ../sys/make.common
+
+
+# npbparams.h is included by applu.incl
+# The following rule should do the trick but many make programs (not gmake)
+# will do the wrong thing and rebuild the world every time (because the
+# mod time on header.h is not changed). One solution would be to
+# touch header.h but this might cause confusion if someone has
+# accidentally deleted it. Instead, make the dependency on npbparams.h
+# explicit in all the lines below (even though dependence is indirect). 
+
+# applu.incl: npbparams.h
+
+${PROGRAM}: config
+	@if [ x$(VERSION) = xvec ] ; then	\
+		${MAKE} VEC=_vec exec;		\
+	elif [ x$(VERSION) = xVEC ] ; then	\
+		${MAKE} VEC=_vec exec;		\
+	else					\
+		${MAKE} exec;			\
+	fi
+
+exec: $(OBJS)
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} ${F_LIB}
+
+.f.o :
+	${FCOMPILE} $<
+
+lu.o:		lu.f applu.incl npbparams.h
+blts$(VEC).o:	blts$(VEC).f
+buts$(VEC).o:	buts$(VEC).f	
+erhs.o:		erhs.f applu.incl npbparams.h
+error.o:	error.f applu.incl npbparams.h
+exact.o:	exact.f applu.incl npbparams.h
+jacld.o:	jacld.f applu.incl npbparams.h
+jacu.o:		jacu.f applu.incl npbparams.h
+l2norm.o:	l2norm.f
+pintgr.o:	pintgr.f applu.incl npbparams.h
+read_input.o:	read_input.f applu.incl npbparams.h
+rhs$(VEC).o:	rhs$(VEC).f applu.incl npbparams.h
+setbv.o:	setbv.f applu.incl npbparams.h
+setiv.o:	setiv.f applu.incl npbparams.h
+setcoeff.o:	setcoeff.f applu.incl npbparams.h
+ssor$(VEC).o:	ssor$(VEC).f applu.incl npbparams.h
+domain.o:	domain.f applu.incl npbparams.h
+verify.o:	verify.f applu.incl npbparams.h
+
+clean:
+	- /bin/rm -f npbparams.h
+	- /bin/rm -f *.o *~
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/applu.incl b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/applu.incl
new file mode 100644
index 0000000..3bf2009
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/applu.incl
@@ -0,0 +1,158 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+c---  applu.incl   
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   npbparams.h defines parameters that depend on the class and 
+c   number of nodes
+c---------------------------------------------------------------------
+
+      include 'npbparams.h'
+
+c---------------------------------------------------------------------
+c   parameters which can be overridden in runtime config file
+c   isiz1,isiz2,isiz3 give the maximum size
+c   ipr = 1 to print out verbose information
+c   omega = 1.2 is correct for all classes
+c   tolrsd gives the tolerance levels for steady-state residuals
+c---------------------------------------------------------------------
+      integer ipr_default
+      parameter (ipr_default = 1)
+      double precision omega_default
+      parameter (omega_default = 1.2d0)
+      double precision tolrsd1_def, tolrsd2_def, tolrsd3_def, 
+     >                 tolrsd4_def, tolrsd5_def
+      parameter (tolrsd1_def=1.0e-08, 
+     >          tolrsd2_def=1.0e-08, tolrsd3_def=1.0e-08, 
+     >          tolrsd4_def=1.0e-08, tolrsd5_def=1.0e-08)
+
+      double precision c1, c2, c3, c4, c5
+      parameter( c1 = 1.40d+00, c2 = 0.40d+00,
+     >           c3 = 1.00d-01, c4 = 1.00d+00,
+     >           c5 = 1.40d+00 )
+
+c---------------------------------------------------------------------
+c   grid
+c---------------------------------------------------------------------
+      integer nx, ny, nz
+      integer nx0, ny0, nz0
+      integer ist, iend
+      integer jst, jend
+      integer ii1, ii2
+      integer ji1, ji2
+      integer ki1, ki2
+      double precision  dxi, deta, dzeta
+      double precision  tx1, tx2, tx3
+      double precision  ty1, ty2, ty3
+      double precision  tz1, tz2, tz3
+
+      common/cgcon/ dxi, deta, dzeta,
+     >              tx1, tx2, tx3,
+     >              ty1, ty2, ty3,
+     >              tz1, tz2, tz3,
+     >              nx, ny, nz, 
+     >              nx0, ny0, nz0,
+     >              ist, iend,
+     >              jst, jend,
+     >              ii1, ii2, 
+     >              ji1, ji2, 
+     >              ki1, ki2
+
+c---------------------------------------------------------------------
+c   dissipation
+c---------------------------------------------------------------------
+      double precision dx1, dx2, dx3, dx4, dx5
+      double precision dy1, dy2, dy3, dy4, dy5
+      double precision dz1, dz2, dz3, dz4, dz5
+      double precision dssp
+
+      common/disp/ dx1,dx2,dx3,dx4,dx5,
+     >             dy1,dy2,dy3,dy4,dy5,
+     >             dz1,dz2,dz3,dz4,dz5,
+     >             dssp
+
+c---------------------------------------------------------------------
+c   field variables and residuals
+c   to improve cache performance, second two dimensions padded by 1 
+c   for even number sizes only.
+c   Note: corresponding array (called "v") in routines blts, buts, 
+c   and l2norm are similarly padded
+c---------------------------------------------------------------------
+      double precision u(5,isiz1/2*2+1,
+     >                     isiz2/2*2+1,
+     >                     isiz3),
+     >                 rsd(5,isiz1/2*2+1,
+     >                       isiz2/2*2+1,
+     >                       isiz3),
+     >                 frct(5,isiz1/2*2+1,
+     >                        isiz2/2*2+1,
+     >                        isiz3),
+     >                 flux(5,isiz1),
+     >                 qs(isiz1/2*2+1,isiz2/2*2+1,isiz3),
+     >                 rho_i(isiz1/2*2+1,isiz2/2*2+1,isiz3)
+
+      common/cvar/ u, rsd, frct, flux,
+     >             qs, rho_i
+
+
+c---------------------------------------------------------------------
+c   output control parameters
+c---------------------------------------------------------------------
+      integer ipr, inorm
+
+      common/cprcon/ ipr, inorm
+
+c---------------------------------------------------------------------
+c   newton-raphson iteration control parameters
+c---------------------------------------------------------------------
+      integer itmax, invert
+      double precision  dt, omega, tolrsd(5),
+     >        rsdnm(5), errnm(5), frc, ttotal
+
+      common/ctscon/ dt, omega, tolrsd,
+     >               rsdnm, errnm, frc, ttotal,
+     >               itmax, invert
+
+      double precision a(5,5,isiz1/2*2+1,isiz2),
+     >                 b(5,5,isiz1/2*2+1,isiz2),
+     >                 c(5,5,isiz1/2*2+1,isiz2),
+     >                 d(5,5,isiz1/2*2+1,isiz2)
+
+      common/cjac/ a, b, c, d
+
+c---------------------------------------------------------------------
+c   coefficients of the exact solution
+c---------------------------------------------------------------------
+      double precision ce(5,13)
+
+      common/cexact/ ce
+
+c---------------------------------------------------------------------
+c   timers
+c---------------------------------------------------------------------
+      integer t_rhsx,t_rhsy,t_rhsz,t_rhs,t_jacld,t_blts,
+     >        t_jacu,t_buts,t_add,t_l2norm,t_last,t_total
+      parameter (t_total = 1)
+      parameter (t_rhsx = 2)
+      parameter (t_rhsy = 3)
+      parameter (t_rhsz = 4)
+      parameter (t_rhs = 5)
+      parameter (t_jacld = 6)
+      parameter (t_blts = 7)
+      parameter (t_jacu = 8)
+      parameter (t_buts = 9)
+      parameter (t_add = 10)
+      parameter (t_l2norm = 11)
+      parameter (t_last = 11)
+      logical timeron
+      double precision maxtime
+
+      common/timer/maxtime,timeron
+
+
+c---------------------------------------------------------------------
+c   end of include file
+c---------------------------------------------------------------------
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/blts.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/blts.f
new file mode 100644
index 0000000..d6faf52
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/blts.f
@@ -0,0 +1,251 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine blts ( ldmx, ldmy, ldmz,
+     >                  nx, ny, nz, k,
+     >                  omega,
+     >                  v, 
+     >                  ldz, ldy, ldx, d,
+     >                  ist, iend, jst, jend,
+     >                  nx0, ny0 )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c
+c   compute the regular-sparse, block lower triangular solution:
+c
+c                     v <-- ( L-inv ) * v
+c
+c---------------------------------------------------------------------
+
+      implicit none
+
+c---------------------------------------------------------------------
+c  input parameters
+c---------------------------------------------------------------------
+      integer ldmx, ldmy, ldmz
+      integer nx, ny, nz
+      integer k
+      double precision  omega
+c---------------------------------------------------------------------
+c   To improve cache performance, second two dimensions padded by 1 
+c   for even number sizes only.  Only needed in v.
+c---------------------------------------------------------------------
+      double precision  v( 5, ldmx/2*2+1, ldmy/2*2+1, *),
+     >        ldz( 5, 5, ldmx/2*2+1, ldmy),
+     >        ldy( 5, 5, ldmx/2*2+1, ldmy),
+     >        ldx( 5, 5, ldmx/2*2+1, ldmy),
+     >        d( 5, 5, ldmx/2*2+1, ldmy)
+      integer ist, iend
+      integer jst, jend
+      integer nx0, ny0
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j, m
+      double precision  tmp, tmp1
+      double precision  tmat(5,5), tv(5)
+
+
+
+      do j = jst, jend
+         do i = ist, iend
+            do m = 1, 5
+
+                  v( m, i, j, k ) =  v( m, i, j, k )
+     >    - omega * (  ldz( m, 1, i, j ) * v( 1, i, j, k-1 )
+     >               + ldz( m, 2, i, j ) * v( 2, i, j, k-1 )
+     >               + ldz( m, 3, i, j ) * v( 3, i, j, k-1 )
+     >               + ldz( m, 4, i, j ) * v( 4, i, j, k-1 )
+     >               + ldz( m, 5, i, j ) * v( 5, i, j, k-1 )  )
+
+            end do
+         end do
+      end do
+
+
+      do j = jst, jend
+         do i = ist, iend
+            do m = 1, 5
+
+                  tv( m ) =  v( m, i, j, k )
+     > - omega * ( ldy( m, 1, i, j ) * v( 1, i, j-1, k )
+     >           + ldx( m, 1, i, j ) * v( 1, i-1, j, k )
+     >           + ldy( m, 2, i, j ) * v( 2, i, j-1, k )
+     >           + ldx( m, 2, i, j ) * v( 2, i-1, j, k )
+     >           + ldy( m, 3, i, j ) * v( 3, i, j-1, k )
+     >           + ldx( m, 3, i, j ) * v( 3, i-1, j, k )
+     >           + ldy( m, 4, i, j ) * v( 4, i, j-1, k )
+     >           + ldx( m, 4, i, j ) * v( 4, i-1, j, k )
+     >           + ldy( m, 5, i, j ) * v( 5, i, j-1, k )
+     >           + ldx( m, 5, i, j ) * v( 5, i-1, j, k ) )
+
+            end do
+       
+c---------------------------------------------------------------------
+c   diagonal block inversion
+c
+c   forward elimination
+c---------------------------------------------------------------------
+            do m = 1, 5
+               tmat( m, 1 ) = d( m, 1, i, j )
+               tmat( m, 2 ) = d( m, 2, i, j )
+               tmat( m, 3 ) = d( m, 3, i, j )
+               tmat( m, 4 ) = d( m, 4, i, j )
+               tmat( m, 5 ) = d( m, 5, i, j )
+            end do
+
+            tmp1 = 1.0d+00 / tmat( 1, 1 )
+            tmp = tmp1 * tmat( 2, 1 )
+            tmat( 2, 2 ) =  tmat( 2, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 2, 3 ) =  tmat( 2, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 2, 4 ) =  tmat( 2, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 2, 5 ) =  tmat( 2, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 2 ) = tv( 2 )
+     >        - tv( 1 ) * tmp
+
+            tmp = tmp1 * tmat( 3, 1 )
+            tmat( 3, 2 ) =  tmat( 3, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 3, 3 ) =  tmat( 3, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 3, 4 ) =  tmat( 3, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 3, 5 ) =  tmat( 3, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 3 ) = tv( 3 )
+     >        - tv( 1 ) * tmp
+
+            tmp = tmp1 * tmat( 4, 1 )
+            tmat( 4, 2 ) =  tmat( 4, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 4, 3 ) =  tmat( 4, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 4 ) = tv( 4 )
+     >        - tv( 1 ) * tmp
+
+            tmp = tmp1 * tmat( 5, 1 )
+            tmat( 5, 2 ) =  tmat( 5, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 5, 3 ) =  tmat( 5, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 5 ) = tv( 5 )
+     >        - tv( 1 ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 2, 2 )
+            tmp = tmp1 * tmat( 3, 2 )
+            tmat( 3, 3 ) =  tmat( 3, 3 )
+     >           - tmp * tmat( 2, 3 )
+            tmat( 3, 4 ) =  tmat( 3, 4 )
+     >           - tmp * tmat( 2, 4 )
+            tmat( 3, 5 ) =  tmat( 3, 5 )
+     >           - tmp * tmat( 2, 5 )
+            tv( 3 ) = tv( 3 )
+     >        - tv( 2 ) * tmp
+
+            tmp = tmp1 * tmat( 4, 2 )
+            tmat( 4, 3 ) =  tmat( 4, 3 )
+     >           - tmp * tmat( 2, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )
+     >           - tmp * tmat( 2, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )
+     >           - tmp * tmat( 2, 5 )
+            tv( 4 ) = tv( 4 )
+     >        - tv( 2 ) * tmp
+
+            tmp = tmp1 * tmat( 5, 2 )
+            tmat( 5, 3 ) =  tmat( 5, 3 )
+     >           - tmp * tmat( 2, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )
+     >           - tmp * tmat( 2, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 2, 5 )
+            tv( 5 ) = tv( 5 )
+     >        - tv( 2 ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 3, 3 )
+            tmp = tmp1 * tmat( 4, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )
+     >           - tmp * tmat( 3, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )
+     >           - tmp * tmat( 3, 5 )
+            tv( 4 ) = tv( 4 )
+     >        - tv( 3 ) * tmp
+
+            tmp = tmp1 * tmat( 5, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )
+     >           - tmp * tmat( 3, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 3, 5 )
+            tv( 5 ) = tv( 5 )
+     >        - tv( 3 ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 4, 4 )
+            tmp = tmp1 * tmat( 5, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 4, 5 )
+            tv( 5 ) = tv( 5 )
+     >        - tv( 4 ) * tmp
+
+c---------------------------------------------------------------------
+c   back substitution
+c---------------------------------------------------------------------
+            v( 5, i, j, k ) = tv( 5 )
+     >                      / tmat( 5, 5 )
+
+            tv( 4 ) = tv( 4 )
+     >           - tmat( 4, 5 ) * v( 5, i, j, k )
+            v( 4, i, j, k ) = tv( 4 )
+     >                      / tmat( 4, 4 )
+
+            tv( 3 ) = tv( 3 )
+     >           - tmat( 3, 4 ) * v( 4, i, j, k )
+     >           - tmat( 3, 5 ) * v( 5, i, j, k )
+            v( 3, i, j, k ) = tv( 3 )
+     >                      / tmat( 3, 3 )
+
+            tv( 2 ) = tv( 2 )
+     >           - tmat( 2, 3 ) * v( 3, i, j, k )
+     >           - tmat( 2, 4 ) * v( 4, i, j, k )
+     >           - tmat( 2, 5 ) * v( 5, i, j, k )
+            v( 2, i, j, k ) = tv( 2 )
+     >                      / tmat( 2, 2 )
+
+            tv( 1 ) = tv( 1 )
+     >           - tmat( 1, 2 ) * v( 2, i, j, k )
+     >           - tmat( 1, 3 ) * v( 3, i, j, k )
+     >           - tmat( 1, 4 ) * v( 4, i, j, k )
+     >           - tmat( 1, 5 ) * v( 5, i, j, k )
+            v( 1, i, j, k ) = tv( 1 )
+     >                      / tmat( 1, 1 )
+
+
+        enddo
+      enddo
+
+
+      return
+      end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/blts_vec.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/blts_vec.f
new file mode 100644
index 0000000..a66f5b8
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/blts_vec.f
@@ -0,0 +1,326 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine blts ( ldmx, ldmy, ldmz,
+     >                  nx, ny, nz, k,
+     >                  omega,
+     >                  v, 
+     >                  ldz, ldy, ldx, d,
+     >                  ist, iend, jst, jend,
+     >                  lst, lend )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c
+c   compute the regular-sparse, block lower triangular solution:
+c
+c                     v <-- ( L-inv ) * v
+c
+c---------------------------------------------------------------------
+
+      implicit none
+
+c---------------------------------------------------------------------
+c  input parameters
+c---------------------------------------------------------------------
+      integer ldmx, ldmy, ldmz
+      integer nx, ny, nz
+      integer k
+      double precision  omega
+c---------------------------------------------------------------------
+c   To improve cache performance, second two dimensions padded by 1 
+c   for even number sizes only.  Only needed in v.
+c---------------------------------------------------------------------
+      double precision  v( 5, ldmx/2*2+1, ldmy/2*2+1, *),
+     >        ldz( 5, 5, ldmx/2*2+1, ldmy),
+     >        ldy( 5, 5, ldmx/2*2+1, ldmy),
+     >        ldx( 5, 5, ldmx/2*2+1, ldmy),
+     >        d( 5, 5, ldmx/2*2+1, ldmy)
+      integer ist, iend
+      integer jst, jend
+      integer lst, lend
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j, m, l, istp, iendp
+      double precision  tmp, tmp1
+      double precision  tmat(5,5), tv(5)
+
+
+
+      do j = jst, jend
+         do i = ist, iend
+            do m = 1, 5
+
+                  v( m, i, j, k ) =  v( m, i, j, k )
+     >    - omega * (  ldz( m, 1, i, j ) * v( 1, i, j, k-1 )
+     >               + ldz( m, 2, i, j ) * v( 2, i, j, k-1 )
+     >               + ldz( m, 3, i, j ) * v( 3, i, j, k-1 )
+     >               + ldz( m, 4, i, j ) * v( 4, i, j, k-1 )
+     >               + ldz( m, 5, i, j ) * v( 5, i, j, k-1 )  )
+
+            end do
+         enddo
+      enddo
+
+
+      do l = lst, lend
+         istp  = max(l - jend, ist)
+         iendp = min(l - jst, iend)
+
+!dir$ ivdep
+         do i = istp, iendp
+            j = l - i
+
+
+!!dir$ unroll 5
+!   manually unroll the loop
+!            do m = 1, 5
+
+                  tv( 1 ) =  v( 1, i, j, k )
+     > - omega * ( ldy( 1, 1, i, j ) * v( 1, i, j-1, k )
+     >           + ldx( 1, 1, i, j ) * v( 1, i-1, j, k )
+     >           + ldy( 1, 2, i, j ) * v( 2, i, j-1, k )
+     >           + ldx( 1, 2, i, j ) * v( 2, i-1, j, k )
+     >           + ldy( 1, 3, i, j ) * v( 3, i, j-1, k )
+     >           + ldx( 1, 3, i, j ) * v( 3, i-1, j, k )
+     >           + ldy( 1, 4, i, j ) * v( 4, i, j-1, k )
+     >           + ldx( 1, 4, i, j ) * v( 4, i-1, j, k )
+     >           + ldy( 1, 5, i, j ) * v( 5, i, j-1, k )
+     >           + ldx( 1, 5, i, j ) * v( 5, i-1, j, k ) )
+                  tv( 2 ) =  v( 2, i, j, k )
+     > - omega * ( ldy( 2, 1, i, j ) * v( 1, i, j-1, k )
+     >           + ldx( 2, 1, i, j ) * v( 1, i-1, j, k )
+     >           + ldy( 2, 2, i, j ) * v( 2, i, j-1, k )
+     >           + ldx( 2, 2, i, j ) * v( 2, i-1, j, k )
+     >           + ldy( 2, 3, i, j ) * v( 3, i, j-1, k )
+     >           + ldx( 2, 3, i, j ) * v( 3, i-1, j, k )
+     >           + ldy( 2, 4, i, j ) * v( 4, i, j-1, k )
+     >           + ldx( 2, 4, i, j ) * v( 4, i-1, j, k )
+     >           + ldy( 2, 5, i, j ) * v( 5, i, j-1, k )
+     >           + ldx( 2, 5, i, j ) * v( 5, i-1, j, k ) )
+                  tv( 3 ) =  v( 3, i, j, k )
+     > - omega * ( ldy( 3, 1, i, j ) * v( 1, i, j-1, k )
+     >           + ldx( 3, 1, i, j ) * v( 1, i-1, j, k )
+     >           + ldy( 3, 2, i, j ) * v( 2, i, j-1, k )
+     >           + ldx( 3, 2, i, j ) * v( 2, i-1, j, k )
+     >           + ldy( 3, 3, i, j ) * v( 3, i, j-1, k )
+     >           + ldx( 3, 3, i, j ) * v( 3, i-1, j, k )
+     >           + ldy( 3, 4, i, j ) * v( 4, i, j-1, k )
+     >           + ldx( 3, 4, i, j ) * v( 4, i-1, j, k )
+     >           + ldy( 3, 5, i, j ) * v( 5, i, j-1, k )
+     >           + ldx( 3, 5, i, j ) * v( 5, i-1, j, k ) )
+                  tv( 4 ) =  v( 4, i, j, k )
+     > - omega * ( ldy( 4, 1, i, j ) * v( 1, i, j-1, k )
+     >           + ldx( 4, 1, i, j ) * v( 1, i-1, j, k )
+     >           + ldy( 4, 2, i, j ) * v( 2, i, j-1, k )
+     >           + ldx( 4, 2, i, j ) * v( 2, i-1, j, k )
+     >           + ldy( 4, 3, i, j ) * v( 3, i, j-1, k )
+     >           + ldx( 4, 3, i, j ) * v( 3, i-1, j, k )
+     >           + ldy( 4, 4, i, j ) * v( 4, i, j-1, k )
+     >           + ldx( 4, 4, i, j ) * v( 4, i-1, j, k )
+     >           + ldy( 4, 5, i, j ) * v( 5, i, j-1, k )
+     >           + ldx( 4, 5, i, j ) * v( 5, i-1, j, k ) )
+                  tv( 5 ) =  v( 5, i, j, k )
+     > - omega * ( ldy( 5, 1, i, j ) * v( 1, i, j-1, k )
+     >           + ldx( 5, 1, i, j ) * v( 1, i-1, j, k )
+     >           + ldy( 5, 2, i, j ) * v( 2, i, j-1, k )
+     >           + ldx( 5, 2, i, j ) * v( 2, i-1, j, k )
+     >           + ldy( 5, 3, i, j ) * v( 3, i, j-1, k )
+     >           + ldx( 5, 3, i, j ) * v( 3, i-1, j, k )
+     >           + ldy( 5, 4, i, j ) * v( 4, i, j-1, k )
+     >           + ldx( 5, 4, i, j ) * v( 4, i-1, j, k )
+     >           + ldy( 5, 5, i, j ) * v( 5, i, j-1, k )
+     >           + ldx( 5, 5, i, j ) * v( 5, i-1, j, k ) )
+
+!            end do
+       
+c---------------------------------------------------------------------
+c   diagonal block inversion
+c
+c   forward elimination
+c---------------------------------------------------------------------
+!!dir$ unroll 5
+!   manually unroll the loop
+!            do m = 1, 5
+               tmat( 1, 1 ) = d( 1, 1, i, j )
+               tmat( 1, 2 ) = d( 1, 2, i, j )
+               tmat( 1, 3 ) = d( 1, 3, i, j )
+               tmat( 1, 4 ) = d( 1, 4, i, j )
+               tmat( 1, 5 ) = d( 1, 5, i, j )
+               tmat( 2, 1 ) = d( 2, 1, i, j )
+               tmat( 2, 2 ) = d( 2, 2, i, j )
+               tmat( 2, 3 ) = d( 2, 3, i, j )
+               tmat( 2, 4 ) = d( 2, 4, i, j )
+               tmat( 2, 5 ) = d( 2, 5, i, j )
+               tmat( 3, 1 ) = d( 3, 1, i, j )
+               tmat( 3, 2 ) = d( 3, 2, i, j )
+               tmat( 3, 3 ) = d( 3, 3, i, j )
+               tmat( 3, 4 ) = d( 3, 4, i, j )
+               tmat( 3, 5 ) = d( 3, 5, i, j )
+               tmat( 4, 1 ) = d( 4, 1, i, j )
+               tmat( 4, 2 ) = d( 4, 2, i, j )
+               tmat( 4, 3 ) = d( 4, 3, i, j )
+               tmat( 4, 4 ) = d( 4, 4, i, j )
+               tmat( 4, 5 ) = d( 4, 5, i, j )
+               tmat( 5, 1 ) = d( 5, 1, i, j )
+               tmat( 5, 2 ) = d( 5, 2, i, j )
+               tmat( 5, 3 ) = d( 5, 3, i, j )
+               tmat( 5, 4 ) = d( 5, 4, i, j )
+               tmat( 5, 5 ) = d( 5, 5, i, j )
+!            end do
+
+            tmp1 = 1.0d+00 / tmat( 1, 1 )
+            tmp = tmp1 * tmat( 2, 1 )
+            tmat( 2, 2 ) =  tmat( 2, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 2, 3 ) =  tmat( 2, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 2, 4 ) =  tmat( 2, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 2, 5 ) =  tmat( 2, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 2 ) = tv( 2 )
+     >        - tv( 1 ) * tmp
+
+            tmp = tmp1 * tmat( 3, 1 )
+            tmat( 3, 2 ) =  tmat( 3, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 3, 3 ) =  tmat( 3, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 3, 4 ) =  tmat( 3, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 3, 5 ) =  tmat( 3, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 3 ) = tv( 3 )
+     >        - tv( 1 ) * tmp
+
+            tmp = tmp1 * tmat( 4, 1 )
+            tmat( 4, 2 ) =  tmat( 4, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 4, 3 ) =  tmat( 4, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 4 ) = tv( 4 )
+     >        - tv( 1 ) * tmp
+
+            tmp = tmp1 * tmat( 5, 1 )
+            tmat( 5, 2 ) =  tmat( 5, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 5, 3 ) =  tmat( 5, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 5 ) = tv( 5 )
+     >        - tv( 1 ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 2, 2 )
+            tmp = tmp1 * tmat( 3, 2 )
+            tmat( 3, 3 ) =  tmat( 3, 3 )
+     >           - tmp * tmat( 2, 3 )
+            tmat( 3, 4 ) =  tmat( 3, 4 )
+     >           - tmp * tmat( 2, 4 )
+            tmat( 3, 5 ) =  tmat( 3, 5 )
+     >           - tmp * tmat( 2, 5 )
+            tv( 3 ) = tv( 3 )
+     >        - tv( 2 ) * tmp
+
+            tmp = tmp1 * tmat( 4, 2 )
+            tmat( 4, 3 ) =  tmat( 4, 3 )
+     >           - tmp * tmat( 2, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )
+     >           - tmp * tmat( 2, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )
+     >           - tmp * tmat( 2, 5 )
+            tv( 4 ) = tv( 4 )
+     >        - tv( 2 ) * tmp
+
+            tmp = tmp1 * tmat( 5, 2 )
+            tmat( 5, 3 ) =  tmat( 5, 3 )
+     >           - tmp * tmat( 2, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )
+     >           - tmp * tmat( 2, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 2, 5 )
+            tv( 5 ) = tv( 5 )
+     >        - tv( 2 ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 3, 3 )
+            tmp = tmp1 * tmat( 4, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )
+     >           - tmp * tmat( 3, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )
+     >           - tmp * tmat( 3, 5 )
+            tv( 4 ) = tv( 4 )
+     >        - tv( 3 ) * tmp
+
+            tmp = tmp1 * tmat( 5, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )
+     >           - tmp * tmat( 3, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 3, 5 )
+            tv( 5 ) = tv( 5 )
+     >        - tv( 3 ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 4, 4 )
+            tmp = tmp1 * tmat( 5, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 4, 5 )
+            tv( 5 ) = tv( 5 )
+     >        - tv( 4 ) * tmp
+
+c---------------------------------------------------------------------
+c   back substitution
+c---------------------------------------------------------------------
+            v( 5, i, j, k ) = tv( 5 )
+     >                      / tmat( 5, 5 )
+
+            tv( 4 ) = tv( 4 )
+     >           - tmat( 4, 5 ) * v( 5, i, j, k )
+            v( 4, i, j, k ) = tv( 4 )
+     >                      / tmat( 4, 4 )
+
+            tv( 3 ) = tv( 3 )
+     >           - tmat( 3, 4 ) * v( 4, i, j, k )
+     >           - tmat( 3, 5 ) * v( 5, i, j, k )
+            v( 3, i, j, k ) = tv( 3 )
+     >                      / tmat( 3, 3 )
+
+            tv( 2 ) = tv( 2 )
+     >           - tmat( 2, 3 ) * v( 3, i, j, k )
+     >           - tmat( 2, 4 ) * v( 4, i, j, k )
+     >           - tmat( 2, 5 ) * v( 5, i, j, k )
+            v( 2, i, j, k ) = tv( 2 )
+     >                      / tmat( 2, 2 )
+
+            tv( 1 ) = tv( 1 )
+     >           - tmat( 1, 2 ) * v( 2, i, j, k )
+     >           - tmat( 1, 3 ) * v( 3, i, j, k )
+     >           - tmat( 1, 4 ) * v( 4, i, j, k )
+     >           - tmat( 1, 5 ) * v( 5, i, j, k )
+            v( 1, i, j, k ) = tv( 1 )
+     >                      / tmat( 1, 1 )
+
+
+        enddo
+      enddo
+
+
+      return
+      end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/buts.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/buts.f
new file mode 100644
index 0000000..80174dd
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/buts.f
@@ -0,0 +1,249 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine buts( ldmx, ldmy, ldmz,
+     >                 nx, ny, nz, k,
+     >                 omega,
+     >                 v, tv,
+     >                 d, udx, udy, udz,
+     >                 ist, iend, jst, jend,
+     >                 nx0, ny0 )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c
+c   compute the regular-sparse, block upper triangular solution:
+c
+c                     v <-- ( U-inv ) * v
+c
+c---------------------------------------------------------------------
+
+      implicit none
+
+c---------------------------------------------------------------------
+c  input parameters
+c---------------------------------------------------------------------
+      integer ldmx, ldmy, ldmz
+      integer nx, ny, nz
+      integer k
+      double precision  omega
+c---------------------------------------------------------------------
+c   To improve cache performance, the second and third dimensions are
+c   padded by 1 when their size is even.  Only needed in v.
+c---------------------------------------------------------------------
+      double precision  v( 5,ldmx/2*2+1, ldmy/2*2+1, *), 
+     >        tv( 5, ldmx/2*2+1, ldmy),
+     >        d( 5, 5, ldmx/2*2+1, ldmy),
+     >        udx( 5, 5, ldmx/2*2+1, ldmy),
+     >        udy( 5, 5, ldmx/2*2+1, ldmy),
+     >        udz( 5, 5, ldmx/2*2+1, ldmy )
+      integer ist, iend
+      integer jst, jend
+      integer nx0, ny0
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j, m
+      double precision  tmp, tmp1
+      double precision  tmat(5,5)
+
+
+
+      do j = jend, jst, -1
+         do i = iend, ist, -1
+            do m = 1, 5
+                  tv( m, i, j ) = 
+     >      omega * (  udz( m, 1, i, j ) * v( 1, i, j, k+1 )
+     >               + udz( m, 2, i, j ) * v( 2, i, j, k+1 )
+     >               + udz( m, 3, i, j ) * v( 3, i, j, k+1 )
+     >               + udz( m, 4, i, j ) * v( 4, i, j, k+1 )
+     >               + udz( m, 5, i, j ) * v( 5, i, j, k+1 ) )
+            end do
+         end do
+      end do
+
+
+      do j = jend, jst, -1
+         do i = iend, ist, -1
+            do m = 1, 5
+                  tv( m, i, j ) = tv( m, i, j )
+     > + omega * ( udy( m, 1, i, j ) * v( 1, i, j+1, k )
+     >           + udx( m, 1, i, j ) * v( 1, i+1, j, k )
+     >           + udy( m, 2, i, j ) * v( 2, i, j+1, k )
+     >           + udx( m, 2, i, j ) * v( 2, i+1, j, k )
+     >           + udy( m, 3, i, j ) * v( 3, i, j+1, k )
+     >           + udx( m, 3, i, j ) * v( 3, i+1, j, k )
+     >           + udy( m, 4, i, j ) * v( 4, i, j+1, k )
+     >           + udx( m, 4, i, j ) * v( 4, i+1, j, k )
+     >           + udy( m, 5, i, j ) * v( 5, i, j+1, k )
+     >           + udx( m, 5, i, j ) * v( 5, i+1, j, k ) )
+            end do
+
+c---------------------------------------------------------------------
+c   diagonal block inversion
+c---------------------------------------------------------------------
+            do m = 1, 5
+               tmat( m, 1 ) = d( m, 1, i, j )
+               tmat( m, 2 ) = d( m, 2, i, j )
+               tmat( m, 3 ) = d( m, 3, i, j )
+               tmat( m, 4 ) = d( m, 4, i, j )
+               tmat( m, 5 ) = d( m, 5, i, j )
+            end do
+
+            tmp1 = 1.0d+00 / tmat( 1, 1 )
+            tmp = tmp1 * tmat( 2, 1 )
+            tmat( 2, 2 ) =  tmat( 2, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 2, 3 ) =  tmat( 2, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 2, 4 ) =  tmat( 2, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 2, 5 ) =  tmat( 2, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 2, i, j ) = tv( 2, i, j )
+     >        - tv( 1, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 3, 1 )
+            tmat( 3, 2 ) =  tmat( 3, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 3, 3 ) =  tmat( 3, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 3, 4 ) =  tmat( 3, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 3, 5 ) =  tmat( 3, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 3, i, j ) = tv( 3, i, j )
+     >        - tv( 1, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 4, 1 )
+            tmat( 4, 2 ) =  tmat( 4, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 4, 3 ) =  tmat( 4, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 4, i, j ) = tv( 4, i, j )
+     >        - tv( 1, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 5, 1 )
+            tmat( 5, 2 ) =  tmat( 5, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 5, 3 ) =  tmat( 5, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 5, i, j ) = tv( 5, i, j )
+     >        - tv( 1, i, j ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 2, 2 )
+            tmp = tmp1 * tmat( 3, 2 )
+            tmat( 3, 3 ) =  tmat( 3, 3 )
+     >           - tmp * tmat( 2, 3 )
+            tmat( 3, 4 ) =  tmat( 3, 4 )
+     >           - tmp * tmat( 2, 4 )
+            tmat( 3, 5 ) =  tmat( 3, 5 )
+     >           - tmp * tmat( 2, 5 )
+            tv( 3, i, j ) = tv( 3, i, j )
+     >        - tv( 2, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 4, 2 )
+            tmat( 4, 3 ) =  tmat( 4, 3 )
+     >           - tmp * tmat( 2, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )
+     >           - tmp * tmat( 2, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )
+     >           - tmp * tmat( 2, 5 )
+            tv( 4, i, j ) = tv( 4, i, j )
+     >        - tv( 2, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 5, 2 )
+            tmat( 5, 3 ) =  tmat( 5, 3 )
+     >           - tmp * tmat( 2, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )
+     >           - tmp * tmat( 2, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 2, 5 )
+            tv( 5, i, j ) = tv( 5, i, j )
+     >        - tv( 2, i, j ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 3, 3 )
+            tmp = tmp1 * tmat( 4, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )
+     >           - tmp * tmat( 3, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )
+     >           - tmp * tmat( 3, 5 )
+            tv( 4, i, j ) = tv( 4, i, j )
+     >        - tv( 3, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 5, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )
+     >           - tmp * tmat( 3, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 3, 5 )
+            tv( 5, i, j ) = tv( 5, i, j )
+     >        - tv( 3, i, j ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 4, 4 )
+            tmp = tmp1 * tmat( 5, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 4, 5 )
+            tv( 5, i, j ) = tv( 5, i, j )
+     >        - tv( 4, i, j ) * tmp
+
+c---------------------------------------------------------------------
+c   back substitution
+c---------------------------------------------------------------------
+            tv( 5, i, j ) = tv( 5, i, j )
+     >                      / tmat( 5, 5 )
+
+            tv( 4, i, j ) = tv( 4, i, j )
+     >           - tmat( 4, 5 ) * tv( 5, i, j )
+            tv( 4, i, j ) = tv( 4, i, j )
+     >                      / tmat( 4, 4 )
+
+            tv( 3, i, j ) = tv( 3, i, j )
+     >           - tmat( 3, 4 ) * tv( 4, i, j )
+     >           - tmat( 3, 5 ) * tv( 5, i, j )
+            tv( 3, i, j ) = tv( 3, i, j )
+     >                      / tmat( 3, 3 )
+
+            tv( 2, i, j ) = tv( 2, i, j )
+     >           - tmat( 2, 3 ) * tv( 3, i, j )
+     >           - tmat( 2, 4 ) * tv( 4, i, j )
+     >           - tmat( 2, 5 ) * tv( 5, i, j )
+            tv( 2, i, j ) = tv( 2, i, j )
+     >                      / tmat( 2, 2 )
+
+            tv( 1, i, j ) = tv( 1, i, j )
+     >           - tmat( 1, 2 ) * tv( 2, i, j )
+     >           - tmat( 1, 3 ) * tv( 3, i, j )
+     >           - tmat( 1, 4 ) * tv( 4, i, j )
+     >           - tmat( 1, 5 ) * tv( 5, i, j )
+            tv( 1, i, j ) = tv( 1, i, j )
+     >                      / tmat( 1, 1 )
+
+            v( 1, i, j, k ) = v( 1, i, j, k ) - tv( 1, i, j )
+            v( 2, i, j, k ) = v( 2, i, j, k ) - tv( 2, i, j )
+            v( 3, i, j, k ) = v( 3, i, j, k ) - tv( 3, i, j )
+            v( 4, i, j, k ) = v( 4, i, j, k ) - tv( 4, i, j )
+            v( 5, i, j, k ) = v( 5, i, j, k ) - tv( 5, i, j )
+
+        enddo
+      end do
+
+ 
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/buts_vec.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/buts_vec.f
new file mode 100644
index 0000000..e66750d
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/buts_vec.f
@@ -0,0 +1,323 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine buts( ldmx, ldmy, ldmz,
+     >                 nx, ny, nz, k,
+     >                 omega,
+     >                 v, tv,
+     >                 d, udx, udy, udz,
+     >                 ist, iend, jst, jend,
+     >                 lst, lend )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c
+c   compute the regular-sparse, block upper triangular solution:
+c
+c                     v <-- ( U-inv ) * v
+c
+c---------------------------------------------------------------------
+
+      implicit none
+
+c---------------------------------------------------------------------
+c  input parameters
+c---------------------------------------------------------------------
+      integer ldmx, ldmy, ldmz
+      integer nx, ny, nz
+      integer k
+      double precision  omega
+c---------------------------------------------------------------------
+c   To improve cache performance, the second and third dimensions are
+c   padded by 1 when their size is even.  Only needed in v.
+c---------------------------------------------------------------------
+      double precision  v( 5,ldmx/2*2+1, ldmy/2*2+1, *), 
+     >        tv( 5, ldmx/2*2+1, ldmy),
+     >        d( 5, 5, ldmx/2*2+1, ldmy),
+     >        udx( 5, 5, ldmx/2*2+1, ldmy),
+     >        udy( 5, 5, ldmx/2*2+1, ldmy),
+     >        udz( 5, 5, ldmx/2*2+1, ldmy )
+      integer ist, iend
+      integer jst, jend
+      integer lst, lend
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j, m, l, istp, iendp
+      double precision  tmp, tmp1
+      double precision  tmat(5,5)
+
+
+
+      do j = jend, jst, -1
+         do i = iend, ist, -1
+            do m = 1, 5
+                  tv( m, i, j ) = 
+     >      omega * (  udz( m, 1, i, j ) * v( 1, i, j, k+1 )
+     >               + udz( m, 2, i, j ) * v( 2, i, j, k+1 )
+     >               + udz( m, 3, i, j ) * v( 3, i, j, k+1 )
+     >               + udz( m, 4, i, j ) * v( 4, i, j, k+1 )
+     >               + udz( m, 5, i, j ) * v( 5, i, j, k+1 ) )
+            end do
+         end do
+      end do
+
+
+      do l = lend, lst, -1
+         istp  = max(l - jend, ist)
+         iendp = min(l - jst, iend)
+
+!dir$ ivdep
+         do i = istp, iendp
+            j = l - i
+
+!!dir$ unroll 5
+!   manually unroll the loop
+!            do m = 1, 5
+                  tv( 1, i, j ) = tv( 1, i, j )
+     > + omega * ( udy( 1, 1, i, j ) * v( 1, i, j+1, k )
+     >           + udx( 1, 1, i, j ) * v( 1, i+1, j, k )
+     >           + udy( 1, 2, i, j ) * v( 2, i, j+1, k )
+     >           + udx( 1, 2, i, j ) * v( 2, i+1, j, k )
+     >           + udy( 1, 3, i, j ) * v( 3, i, j+1, k )
+     >           + udx( 1, 3, i, j ) * v( 3, i+1, j, k )
+     >           + udy( 1, 4, i, j ) * v( 4, i, j+1, k )
+     >           + udx( 1, 4, i, j ) * v( 4, i+1, j, k )
+     >           + udy( 1, 5, i, j ) * v( 5, i, j+1, k )
+     >           + udx( 1, 5, i, j ) * v( 5, i+1, j, k ) )
+                  tv( 2, i, j ) = tv( 2, i, j )
+     > + omega * ( udy( 2, 1, i, j ) * v( 1, i, j+1, k )
+     >           + udx( 2, 1, i, j ) * v( 1, i+1, j, k )
+     >           + udy( 2, 2, i, j ) * v( 2, i, j+1, k )
+     >           + udx( 2, 2, i, j ) * v( 2, i+1, j, k )
+     >           + udy( 2, 3, i, j ) * v( 3, i, j+1, k )
+     >           + udx( 2, 3, i, j ) * v( 3, i+1, j, k )
+     >           + udy( 2, 4, i, j ) * v( 4, i, j+1, k )
+     >           + udx( 2, 4, i, j ) * v( 4, i+1, j, k )
+     >           + udy( 2, 5, i, j ) * v( 5, i, j+1, k )
+     >           + udx( 2, 5, i, j ) * v( 5, i+1, j, k ) )
+                  tv( 3, i, j ) = tv( 3, i, j )
+     > + omega * ( udy( 3, 1, i, j ) * v( 1, i, j+1, k )
+     >           + udx( 3, 1, i, j ) * v( 1, i+1, j, k )
+     >           + udy( 3, 2, i, j ) * v( 2, i, j+1, k )
+     >           + udx( 3, 2, i, j ) * v( 2, i+1, j, k )
+     >           + udy( 3, 3, i, j ) * v( 3, i, j+1, k )
+     >           + udx( 3, 3, i, j ) * v( 3, i+1, j, k )
+     >           + udy( 3, 4, i, j ) * v( 4, i, j+1, k )
+     >           + udx( 3, 4, i, j ) * v( 4, i+1, j, k )
+     >           + udy( 3, 5, i, j ) * v( 5, i, j+1, k )
+     >           + udx( 3, 5, i, j ) * v( 5, i+1, j, k ) )
+                  tv( 4, i, j ) = tv( 4, i, j )
+     > + omega * ( udy( 4, 1, i, j ) * v( 1, i, j+1, k )
+     >           + udx( 4, 1, i, j ) * v( 1, i+1, j, k )
+     >           + udy( 4, 2, i, j ) * v( 2, i, j+1, k )
+     >           + udx( 4, 2, i, j ) * v( 2, i+1, j, k )
+     >           + udy( 4, 3, i, j ) * v( 3, i, j+1, k )
+     >           + udx( 4, 3, i, j ) * v( 3, i+1, j, k )
+     >           + udy( 4, 4, i, j ) * v( 4, i, j+1, k )
+     >           + udx( 4, 4, i, j ) * v( 4, i+1, j, k )
+     >           + udy( 4, 5, i, j ) * v( 5, i, j+1, k )
+     >           + udx( 4, 5, i, j ) * v( 5, i+1, j, k ) )
+                  tv( 5, i, j ) = tv( 5, i, j )
+     > + omega * ( udy( 5, 1, i, j ) * v( 1, i, j+1, k )
+     >           + udx( 5, 1, i, j ) * v( 1, i+1, j, k )
+     >           + udy( 5, 2, i, j ) * v( 2, i, j+1, k )
+     >           + udx( 5, 2, i, j ) * v( 2, i+1, j, k )
+     >           + udy( 5, 3, i, j ) * v( 3, i, j+1, k )
+     >           + udx( 5, 3, i, j ) * v( 3, i+1, j, k )
+     >           + udy( 5, 4, i, j ) * v( 4, i, j+1, k )
+     >           + udx( 5, 4, i, j ) * v( 4, i+1, j, k )
+     >           + udy( 5, 5, i, j ) * v( 5, i, j+1, k )
+     >           + udx( 5, 5, i, j ) * v( 5, i+1, j, k ) )
+!            end do
+
+c---------------------------------------------------------------------
+c   diagonal block inversion
+c---------------------------------------------------------------------
+!!dir$ unroll 5
+!   manually unroll the loop
+!            do m = 1, 5
+               tmat( 1, 1 ) = d( 1, 1, i, j )
+               tmat( 1, 2 ) = d( 1, 2, i, j )
+               tmat( 1, 3 ) = d( 1, 3, i, j )
+               tmat( 1, 4 ) = d( 1, 4, i, j )
+               tmat( 1, 5 ) = d( 1, 5, i, j )
+               tmat( 2, 1 ) = d( 2, 1, i, j )
+               tmat( 2, 2 ) = d( 2, 2, i, j )
+               tmat( 2, 3 ) = d( 2, 3, i, j )
+               tmat( 2, 4 ) = d( 2, 4, i, j )
+               tmat( 2, 5 ) = d( 2, 5, i, j )
+               tmat( 3, 1 ) = d( 3, 1, i, j )
+               tmat( 3, 2 ) = d( 3, 2, i, j )
+               tmat( 3, 3 ) = d( 3, 3, i, j )
+               tmat( 3, 4 ) = d( 3, 4, i, j )
+               tmat( 3, 5 ) = d( 3, 5, i, j )
+               tmat( 4, 1 ) = d( 4, 1, i, j )
+               tmat( 4, 2 ) = d( 4, 2, i, j )
+               tmat( 4, 3 ) = d( 4, 3, i, j )
+               tmat( 4, 4 ) = d( 4, 4, i, j )
+               tmat( 4, 5 ) = d( 4, 5, i, j )
+               tmat( 5, 1 ) = d( 5, 1, i, j )
+               tmat( 5, 2 ) = d( 5, 2, i, j )
+               tmat( 5, 3 ) = d( 5, 3, i, j )
+               tmat( 5, 4 ) = d( 5, 4, i, j )
+               tmat( 5, 5 ) = d( 5, 5, i, j )
+!            end do
+
+            tmp1 = 1.0d+00 / tmat( 1, 1 )
+            tmp = tmp1 * tmat( 2, 1 )
+            tmat( 2, 2 ) =  tmat( 2, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 2, 3 ) =  tmat( 2, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 2, 4 ) =  tmat( 2, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 2, 5 ) =  tmat( 2, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 2, i, j ) = tv( 2, i, j )
+     >        - tv( 1, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 3, 1 )
+            tmat( 3, 2 ) =  tmat( 3, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 3, 3 ) =  tmat( 3, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 3, 4 ) =  tmat( 3, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 3, 5 ) =  tmat( 3, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 3, i, j ) = tv( 3, i, j )
+     >        - tv( 1, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 4, 1 )
+            tmat( 4, 2 ) =  tmat( 4, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 4, 3 ) =  tmat( 4, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 4, i, j ) = tv( 4, i, j )
+     >        - tv( 1, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 5, 1 )
+            tmat( 5, 2 ) =  tmat( 5, 2 )
+     >           - tmp * tmat( 1, 2 )
+            tmat( 5, 3 ) =  tmat( 5, 3 )
+     >           - tmp * tmat( 1, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )
+     >           - tmp * tmat( 1, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 1, 5 )
+            tv( 5, i, j ) = tv( 5, i, j )
+     >        - tv( 1, i, j ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 2, 2 )
+            tmp = tmp1 * tmat( 3, 2 )
+            tmat( 3, 3 ) =  tmat( 3, 3 )
+     >           - tmp * tmat( 2, 3 )
+            tmat( 3, 4 ) =  tmat( 3, 4 )
+     >           - tmp * tmat( 2, 4 )
+            tmat( 3, 5 ) =  tmat( 3, 5 )
+     >           - tmp * tmat( 2, 5 )
+            tv( 3, i, j ) = tv( 3, i, j )
+     >        - tv( 2, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 4, 2 )
+            tmat( 4, 3 ) =  tmat( 4, 3 )
+     >           - tmp * tmat( 2, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )
+     >           - tmp * tmat( 2, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )
+     >           - tmp * tmat( 2, 5 )
+            tv( 4, i, j ) = tv( 4, i, j )
+     >        - tv( 2, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 5, 2 )
+            tmat( 5, 3 ) =  tmat( 5, 3 )
+     >           - tmp * tmat( 2, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )
+     >           - tmp * tmat( 2, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 2, 5 )
+            tv( 5, i, j ) = tv( 5, i, j )
+     >        - tv( 2, i, j ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 3, 3 )
+            tmp = tmp1 * tmat( 4, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )
+     >           - tmp * tmat( 3, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )
+     >           - tmp * tmat( 3, 5 )
+            tv( 4, i, j ) = tv( 4, i, j )
+     >        - tv( 3, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 5, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )
+     >           - tmp * tmat( 3, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 3, 5 )
+            tv( 5, i, j ) = tv( 5, i, j )
+     >        - tv( 3, i, j ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 4, 4 )
+            tmp = tmp1 * tmat( 5, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )
+     >           - tmp * tmat( 4, 5 )
+            tv( 5, i, j ) = tv( 5, i, j )
+     >        - tv( 4, i, j ) * tmp
+
+c---------------------------------------------------------------------
+c   back substitution
+c---------------------------------------------------------------------
+            tv( 5, i, j ) = tv( 5, i, j )
+     >                      / tmat( 5, 5 )
+
+            tv( 4, i, j ) = tv( 4, i, j )
+     >           - tmat( 4, 5 ) * tv( 5, i, j )
+            tv( 4, i, j ) = tv( 4, i, j )
+     >                      / tmat( 4, 4 )
+
+            tv( 3, i, j ) = tv( 3, i, j )
+     >           - tmat( 3, 4 ) * tv( 4, i, j )
+     >           - tmat( 3, 5 ) * tv( 5, i, j )
+            tv( 3, i, j ) = tv( 3, i, j )
+     >                      / tmat( 3, 3 )
+
+            tv( 2, i, j ) = tv( 2, i, j )
+     >           - tmat( 2, 3 ) * tv( 3, i, j )
+     >           - tmat( 2, 4 ) * tv( 4, i, j )
+     >           - tmat( 2, 5 ) * tv( 5, i, j )
+            tv( 2, i, j ) = tv( 2, i, j )
+     >                      / tmat( 2, 2 )
+
+            tv( 1, i, j ) = tv( 1, i, j )
+     >           - tmat( 1, 2 ) * tv( 2, i, j )
+     >           - tmat( 1, 3 ) * tv( 3, i, j )
+     >           - tmat( 1, 4 ) * tv( 4, i, j )
+     >           - tmat( 1, 5 ) * tv( 5, i, j )
+            tv( 1, i, j ) = tv( 1, i, j )
+     >                      / tmat( 1, 1 )
+
+            v( 1, i, j, k ) = v( 1, i, j, k ) - tv( 1, i, j )
+            v( 2, i, j, k ) = v( 2, i, j, k ) - tv( 2, i, j )
+            v( 3, i, j, k ) = v( 3, i, j, k ) - tv( 3, i, j )
+            v( 4, i, j, k ) = v( 4, i, j, k ) - tv( 4, i, j )
+            v( 5, i, j, k ) = v( 5, i, j, k ) - tv( 5, i, j )
+
+        enddo
+      end do
+
+ 
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/domain.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/domain.f
new file mode 100644
index 0000000..679ac61
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/domain.f
@@ -0,0 +1,68 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine domain
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+
+
+      nx = nx0
+      ny = ny0
+      nz = nz0
+
+c---------------------------------------------------------------------
+c   check the sub-domain size
+c---------------------------------------------------------------------
+      if ( ( nx .lt. 4 ) .or.
+     >     ( ny .lt. 4 ) .or.
+     >     ( nz .lt. 4 ) ) then
+         write (*,2001) nx, ny, nz
+ 2001    format (5x,'SUBDOMAIN SIZE IS TOO SMALL - ',
+     >        /5x,'ADJUST PROBLEM SIZE OR NUMBER OF PROCESSORS',
+     >        /5x,'SO THAT NX, NY AND NZ ARE GREATER THAN OR EQUAL',
+     >        /5x,'TO 4 THEY ARE CURRENTLY', 3I3)
+         stop
+      end if
+
+      if ( ( nx .gt. isiz1 ) .or.
+     >     ( ny .gt. isiz2 ) .or.
+     >     ( nz .gt. isiz3 ) ) then
+         write (*,2002) nx, ny, nz
+ 2002    format (5x,'SUBDOMAIN SIZE IS TOO LARGE - ',
+     >        /5x,'ADJUST PROBLEM SIZE OR NUMBER OF PROCESSORS',
+     >        /5x,'SO THAT NX, NY AND NZ ARE LESS THAN OR EQUAL TO ',
+     >        /5x,'ISIZ1, ISIZ2 AND ISIZ3 RESPECTIVELY.  THEY ARE',
+     >        /5x,'CURRENTLY', 3I4)
+         stop
+      end if
+
+c---------------------------------------------------------------------
+c   set up the start and end in i and j extents for all processors
+c---------------------------------------------------------------------
+      ist = 2
+      iend = nx - 1
+
+      jst = 2
+      jend = ny - 1
+
+      ii1 = 2
+      ii2 = nx0 - 1
+      ji1 = 2
+      ji2 = ny0 - 2
+      ki1 = 3
+      ki2 = nz0 - 1
+
+      return
+      end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/erhs.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/erhs.f
new file mode 100644
index 0000000..2d6ea1e
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/erhs.f
@@ -0,0 +1,434 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine erhs
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c
+c   compute the right hand side based on exact solution
+c
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j, k, m
+      double precision  xi, eta, zeta
+      double precision  q
+      double precision  u21, u31, u41
+      double precision  tmp
+      double precision  u21i, u31i, u41i, u51i
+      double precision  u21j, u31j, u41j, u51j
+      double precision  u21k, u31k, u41k, u51k
+      double precision  u21im1, u31im1, u41im1, u51im1
+      double precision  u21jm1, u31jm1, u41jm1, u51jm1
+      double precision  u21km1, u31km1, u41km1, u51km1
+
+
+      do k = 1, nz
+         do j = 1, ny
+            do i = 1, nx
+               do m = 1, 5
+                  frct( m, i, j, k ) = 0.0d+00
+               end do
+            end do
+         end do
+      end do
+
+      do k = 1, nz
+         zeta = ( dble(k-1) ) / ( nz - 1 )
+         do j = 1, ny
+            eta = ( dble(j-1) ) / ( ny0 - 1 )
+            do i = 1, nx
+               xi = ( dble(i-1) ) / ( nx0 - 1 )
+               do m = 1, 5
+                  rsd(m,i,j,k) =  ce(m,1)
+     >                 + (ce(m,2)
+     >                 + (ce(m,5)
+     >                 + (ce(m,8)
+     >                 +  ce(m,11) * xi) * xi) * xi) * xi
+     >                 + (ce(m,3)
+     >                 + (ce(m,6)
+     >                 + (ce(m,9)
+     >                 +  ce(m,12) * eta) * eta) * eta) * eta
+     >                 + (ce(m,4)
+     >                 + (ce(m,7)
+     >                 + (ce(m,10)
+     >                 +  ce(m,13) * zeta) * zeta) * zeta) * zeta
+               end do
+            end do
+         end do
+      end do
+
+c---------------------------------------------------------------------
+c   xi-direction flux differences
+c---------------------------------------------------------------------
+
+      do k = 2, nz - 1
+         do j = jst, jend
+            do i = 1, nx
+               flux(1,i) = rsd(2,i,j,k)
+               u21 = rsd(2,i,j,k) / rsd(1,i,j,k)
+               q = 0.50d+00 * (  rsd(2,i,j,k) * rsd(2,i,j,k)
+     >                         + rsd(3,i,j,k) * rsd(3,i,j,k)
+     >                         + rsd(4,i,j,k) * rsd(4,i,j,k) )
+     >                      / rsd(1,i,j,k)
+               flux(2,i) = rsd(2,i,j,k) * u21 + c2 * 
+     >                         ( rsd(5,i,j,k) - q )
+               flux(3,i) = rsd(3,i,j,k) * u21
+               flux(4,i) = rsd(4,i,j,k) * u21
+               flux(5,i) = ( c1 * rsd(5,i,j,k) - c2 * q ) * u21
+            end do
+
+            do i = ist, iend
+               do m = 1, 5
+                  frct(m,i,j,k) =  frct(m,i,j,k)
+     >                   - tx2 * ( flux(m,i+1) - flux(m,i-1) )
+               end do
+            end do
+            do i = ist, nx
+               tmp = 1.0d+00 / rsd(1,i,j,k)
+
+               u21i = tmp * rsd(2,i,j,k)
+               u31i = tmp * rsd(3,i,j,k)
+               u41i = tmp * rsd(4,i,j,k)
+               u51i = tmp * rsd(5,i,j,k)
+
+               tmp = 1.0d+00 / rsd(1,i-1,j,k)
+
+               u21im1 = tmp * rsd(2,i-1,j,k)
+               u31im1 = tmp * rsd(3,i-1,j,k)
+               u41im1 = tmp * rsd(4,i-1,j,k)
+               u51im1 = tmp * rsd(5,i-1,j,k)
+
+               flux(2,i) = (4.0d+00/3.0d+00) * tx3 * 
+     >                        ( u21i - u21im1 )
+               flux(3,i) = tx3 * ( u31i - u31im1 )
+               flux(4,i) = tx3 * ( u41i - u41im1 )
+               flux(5,i) = 0.50d+00 * ( 1.0d+00 - c1*c5 )
+     >              * tx3 * ( ( u21i  **2 + u31i  **2 + u41i  **2 )
+     >                      - ( u21im1**2 + u31im1**2 + u41im1**2 ) )
+     >              + (1.0d+00/6.0d+00)
+     >              * tx3 * ( u21i**2 - u21im1**2 )
+     >              + c1 * c5 * tx3 * ( u51i - u51im1 )
+            end do
+
+            do i = ist, iend
+               frct(1,i,j,k) = frct(1,i,j,k)
+     >              + dx1 * tx1 * (            rsd(1,i-1,j,k)
+     >                             - 2.0d+00 * rsd(1,i,j,k)
+     >                             +           rsd(1,i+1,j,k) )
+               frct(2,i,j,k) = frct(2,i,j,k)
+     >           + tx3 * c3 * c4 * ( flux(2,i+1) - flux(2,i) )
+     >              + dx2 * tx1 * (            rsd(2,i-1,j,k)
+     >                             - 2.0d+00 * rsd(2,i,j,k)
+     >                             +           rsd(2,i+1,j,k) )
+               frct(3,i,j,k) = frct(3,i,j,k)
+     >           + tx3 * c3 * c4 * ( flux(3,i+1) - flux(3,i) )
+     >              + dx3 * tx1 * (            rsd(3,i-1,j,k)
+     >                             - 2.0d+00 * rsd(3,i,j,k)
+     >                             +           rsd(3,i+1,j,k) )
+               frct(4,i,j,k) = frct(4,i,j,k)
+     >            + tx3 * c3 * c4 * ( flux(4,i+1) - flux(4,i) )
+     >              + dx4 * tx1 * (            rsd(4,i-1,j,k)
+     >                             - 2.0d+00 * rsd(4,i,j,k)
+     >                             +           rsd(4,i+1,j,k) )
+               frct(5,i,j,k) = frct(5,i,j,k)
+     >           + tx3 * c3 * c4 * ( flux(5,i+1) - flux(5,i) )
+     >              + dx5 * tx1 * (            rsd(5,i-1,j,k)
+     >                             - 2.0d+00 * rsd(5,i,j,k)
+     >                             +           rsd(5,i+1,j,k) )
+            end do
+
+c---------------------------------------------------------------------
+c   Fourth-order dissipation
+c---------------------------------------------------------------------
+            do m = 1, 5
+               frct(m,2,j,k) = frct(m,2,j,k)
+     >           - dssp * ( + 5.0d+00 * rsd(m,2,j,k)
+     >                       - 4.0d+00 * rsd(m,3,j,k)
+     >                       +           rsd(m,4,j,k) )
+               frct(m,3,j,k) = frct(m,3,j,k)
+     >           - dssp * ( - 4.0d+00 * rsd(m,2,j,k)
+     >                       + 6.0d+00 * rsd(m,3,j,k)
+     >                       - 4.0d+00 * rsd(m,4,j,k)
+     >                       +           rsd(m,5,j,k) )
+            end do
+
+            do i = 4, nx - 3
+               do m = 1, 5
+                  frct(m,i,j,k) = frct(m,i,j,k)
+     >              - dssp * (            rsd(m,i-2,j,k)
+     >                         - 4.0d+00 * rsd(m,i-1,j,k)
+     >                         + 6.0d+00 * rsd(m,i,j,k)
+     >                         - 4.0d+00 * rsd(m,i+1,j,k)
+     >                         +           rsd(m,i+2,j,k) )
+               end do
+            end do
+
+            do m = 1, 5
+               frct(m,nx-2,j,k) = frct(m,nx-2,j,k)
+     >           - dssp * (             rsd(m,nx-4,j,k)
+     >                       - 4.0d+00 * rsd(m,nx-3,j,k)
+     >                       + 6.0d+00 * rsd(m,nx-2,j,k)
+     >                       - 4.0d+00 * rsd(m,nx-1,j,k)  )
+               frct(m,nx-1,j,k) = frct(m,nx-1,j,k)
+     >           - dssp * (             rsd(m,nx-3,j,k)
+     >                       - 4.0d+00 * rsd(m,nx-2,j,k)
+     >                       + 5.0d+00 * rsd(m,nx-1,j,k) )
+            end do
+
+         end do
+      end do
+
+c---------------------------------------------------------------------
+c   eta-direction flux differences
+c---------------------------------------------------------------------
+
+      do k = 2, nz - 1
+         do i = ist, iend
+            do j = 1, ny
+               flux(1,j) = rsd(3,i,j,k)
+               u31 = rsd(3,i,j,k) / rsd(1,i,j,k)
+               q = 0.50d+00 * (  rsd(2,i,j,k) * rsd(2,i,j,k)
+     >                         + rsd(3,i,j,k) * rsd(3,i,j,k)
+     >                         + rsd(4,i,j,k) * rsd(4,i,j,k) )
+     >                      / rsd(1,i,j,k)
+               flux(2,j) = rsd(2,i,j,k) * u31 
+               flux(3,j) = rsd(3,i,j,k) * u31 + c2 * 
+     >                       ( rsd(5,i,j,k) - q )
+               flux(4,j) = rsd(4,i,j,k) * u31
+               flux(5,j) = ( c1 * rsd(5,i,j,k) - c2 * q ) * u31
+            end do
+
+            do j = jst, jend
+               do m = 1, 5
+                  frct(m,i,j,k) =  frct(m,i,j,k)
+     >                 - ty2 * ( flux(m,j+1) - flux(m,j-1) )
+               end do
+            end do
+
+            do j = jst, ny
+               tmp = 1.0d+00 / rsd(1,i,j,k)
+
+               u21j = tmp * rsd(2,i,j,k)
+               u31j = tmp * rsd(3,i,j,k)
+               u41j = tmp * rsd(4,i,j,k)
+               u51j = tmp * rsd(5,i,j,k)
+
+               tmp = 1.0d+00 / rsd(1,i,j-1,k)
+
+               u21jm1 = tmp * rsd(2,i,j-1,k)
+               u31jm1 = tmp * rsd(3,i,j-1,k)
+               u41jm1 = tmp * rsd(4,i,j-1,k)
+               u51jm1 = tmp * rsd(5,i,j-1,k)
+
+               flux(2,j) = ty3 * ( u21j - u21jm1 )
+               flux(3,j) = (4.0d+00/3.0d+00) * ty3 * 
+     >                       ( u31j - u31jm1 )
+               flux(4,j) = ty3 * ( u41j - u41jm1 )
+               flux(5,j) = 0.50d+00 * ( 1.0d+00 - c1*c5 )
+     >              * ty3 * ( ( u21j  **2 + u31j  **2 + u41j  **2 )
+     >                      - ( u21jm1**2 + u31jm1**2 + u41jm1**2 ) )
+     >              + (1.0d+00/6.0d+00)
+     >              * ty3 * ( u31j**2 - u31jm1**2 )
+     >              + c1 * c5 * ty3 * ( u51j - u51jm1 )
+            end do
+
+            do j = jst, jend
+               frct(1,i,j,k) = frct(1,i,j,k)
+     >              + dy1 * ty1 * (            rsd(1,i,j-1,k)
+     >                             - 2.0d+00 * rsd(1,i,j,k)
+     >                             +           rsd(1,i,j+1,k) )
+               frct(2,i,j,k) = frct(2,i,j,k)
+     >          + ty3 * c3 * c4 * ( flux(2,j+1) - flux(2,j) )
+     >              + dy2 * ty1 * (            rsd(2,i,j-1,k)
+     >                             - 2.0d+00 * rsd(2,i,j,k)
+     >                             +           rsd(2,i,j+1,k) )
+               frct(3,i,j,k) = frct(3,i,j,k)
+     >          + ty3 * c3 * c4 * ( flux(3,j+1) - flux(3,j) )
+     >              + dy3 * ty1 * (            rsd(3,i,j-1,k)
+     >                             - 2.0d+00 * rsd(3,i,j,k)
+     >                             +           rsd(3,i,j+1,k) )
+               frct(4,i,j,k) = frct(4,i,j,k)
+     >          + ty3 * c3 * c4 * ( flux(4,j+1) - flux(4,j) )
+     >              + dy4 * ty1 * (            rsd(4,i,j-1,k)
+     >                             - 2.0d+00 * rsd(4,i,j,k)
+     >                             +           rsd(4,i,j+1,k) )
+               frct(5,i,j,k) = frct(5,i,j,k)
+     >          + ty3 * c3 * c4 * ( flux(5,j+1) - flux(5,j) )
+     >              + dy5 * ty1 * (            rsd(5,i,j-1,k)
+     >                             - 2.0d+00 * rsd(5,i,j,k)
+     >                             +           rsd(5,i,j+1,k) )
+            end do
+
+c---------------------------------------------------------------------
+c   fourth-order dissipation
+c---------------------------------------------------------------------
+            do m = 1, 5
+               frct(m,i,2,k) = frct(m,i,2,k)
+     >           - dssp * ( + 5.0d+00 * rsd(m,i,2,k)
+     >                       - 4.0d+00 * rsd(m,i,3,k)
+     >                       +           rsd(m,i,4,k) )
+               frct(m,i,3,k) = frct(m,i,3,k)
+     >           - dssp * ( - 4.0d+00 * rsd(m,i,2,k)
+     >                       + 6.0d+00 * rsd(m,i,3,k)
+     >                       - 4.0d+00 * rsd(m,i,4,k)
+     >                       +           rsd(m,i,5,k) )
+            end do
+
+            do j = 4, ny - 3
+               do m = 1, 5
+                  frct(m,i,j,k) = frct(m,i,j,k)
+     >              - dssp * (            rsd(m,i,j-2,k)
+     >                        - 4.0d+00 * rsd(m,i,j-1,k)
+     >                        + 6.0d+00 * rsd(m,i,j,k)
+     >                        - 4.0d+00 * rsd(m,i,j+1,k)
+     >                        +           rsd(m,i,j+2,k) )
+               end do
+            end do
+
+            do m = 1, 5
+               frct(m,i,ny-2,k) = frct(m,i,ny-2,k)
+     >           - dssp * (             rsd(m,i,ny-4,k)
+     >                       - 4.0d+00 * rsd(m,i,ny-3,k)
+     >                       + 6.0d+00 * rsd(m,i,ny-2,k)
+     >                       - 4.0d+00 * rsd(m,i,ny-1,k)  )
+               frct(m,i,ny-1,k) = frct(m,i,ny-1,k)
+     >           - dssp * (             rsd(m,i,ny-3,k)
+     >                       - 4.0d+00 * rsd(m,i,ny-2,k)
+     >                       + 5.0d+00 * rsd(m,i,ny-1,k)  )
+            end do
+
+         end do
+      end do
+
+c---------------------------------------------------------------------
+c   zeta-direction flux differences
+c---------------------------------------------------------------------
+      do j = jst, jend
+         do i = ist, iend
+            do k = 1, nz
+               flux(1,k) = rsd(4,i,j,k)
+               u41 = rsd(4,i,j,k) / rsd(1,i,j,k)
+               q = 0.50d+00 * (  rsd(2,i,j,k) * rsd(2,i,j,k)
+     >                         + rsd(3,i,j,k) * rsd(3,i,j,k)
+     >                         + rsd(4,i,j,k) * rsd(4,i,j,k) )
+     >                      / rsd(1,i,j,k)
+               flux(2,k) = rsd(2,i,j,k) * u41 
+               flux(3,k) = rsd(3,i,j,k) * u41 
+               flux(4,k) = rsd(4,i,j,k) * u41 + c2 * 
+     >                         ( rsd(5,i,j,k) - q )
+               flux(5,k) = ( c1 * rsd(5,i,j,k) - c2 * q ) * u41
+            end do
+
+            do k = 2, nz - 1
+               do m = 1, 5
+                  frct(m,i,j,k) =  frct(m,i,j,k)
+     >                  - tz2 * ( flux(m,k+1) - flux(m,k-1) )
+               end do
+            end do
+
+            do k = 2, nz
+               tmp = 1.0d+00 / rsd(1,i,j,k)
+
+               u21k = tmp * rsd(2,i,j,k)
+               u31k = tmp * rsd(3,i,j,k)
+               u41k = tmp * rsd(4,i,j,k)
+               u51k = tmp * rsd(5,i,j,k)
+
+               tmp = 1.0d+00 / rsd(1,i,j,k-1)
+
+               u21km1 = tmp * rsd(2,i,j,k-1)
+               u31km1 = tmp * rsd(3,i,j,k-1)
+               u41km1 = tmp * rsd(4,i,j,k-1)
+               u51km1 = tmp * rsd(5,i,j,k-1)
+
+               flux(2,k) = tz3 * ( u21k - u21km1 )
+               flux(3,k) = tz3 * ( u31k - u31km1 )
+               flux(4,k) = (4.0d+00/3.0d+00) * tz3 * ( u41k 
+     >                       - u41km1 )
+               flux(5,k) = 0.50d+00 * ( 1.0d+00 - c1*c5 )
+     >              * tz3 * ( ( u21k  **2 + u31k  **2 + u41k  **2 )
+     >                      - ( u21km1**2 + u31km1**2 + u41km1**2 ) )
+     >              + (1.0d+00/6.0d+00)
+     >              * tz3 * ( u41k**2 - u41km1**2 )
+     >              + c1 * c5 * tz3 * ( u51k - u51km1 )
+            end do
+
+            do k = 2, nz - 1
+               frct(1,i,j,k) = frct(1,i,j,k)
+     >              + dz1 * tz1 * (            rsd(1,i,j,k+1)
+     >                             - 2.0d+00 * rsd(1,i,j,k)
+     >                             +           rsd(1,i,j,k-1) )
+               frct(2,i,j,k) = frct(2,i,j,k)
+     >          + tz3 * c3 * c4 * ( flux(2,k+1) - flux(2,k) )
+     >              + dz2 * tz1 * (            rsd(2,i,j,k+1)
+     >                             - 2.0d+00 * rsd(2,i,j,k)
+     >                             +           rsd(2,i,j,k-1) )
+               frct(3,i,j,k) = frct(3,i,j,k)
+     >          + tz3 * c3 * c4 * ( flux(3,k+1) - flux(3,k) )
+     >              + dz3 * tz1 * (            rsd(3,i,j,k+1)
+     >                             - 2.0d+00 * rsd(3,i,j,k)
+     >                             +           rsd(3,i,j,k-1) )
+               frct(4,i,j,k) = frct(4,i,j,k)
+     >          + tz3 * c3 * c4 * ( flux(4,k+1) - flux(4,k) )
+     >              + dz4 * tz1 * (            rsd(4,i,j,k+1)
+     >                             - 2.0d+00 * rsd(4,i,j,k)
+     >                             +           rsd(4,i,j,k-1) )
+               frct(5,i,j,k) = frct(5,i,j,k)
+     >          + tz3 * c3 * c4 * ( flux(5,k+1) - flux(5,k) )
+     >              + dz5 * tz1 * (            rsd(5,i,j,k+1)
+     >                             - 2.0d+00 * rsd(5,i,j,k)
+     >                             +           rsd(5,i,j,k-1) )
+            end do
+
+c---------------------------------------------------------------------
+c   fourth-order dissipation
+c---------------------------------------------------------------------
+            do m = 1, 5
+               frct(m,i,j,2) = frct(m,i,j,2)
+     >           - dssp * ( + 5.0d+00 * rsd(m,i,j,2)
+     >                       - 4.0d+00 * rsd(m,i,j,3)
+     >                       +           rsd(m,i,j,4) )
+               frct(m,i,j,3) = frct(m,i,j,3)
+     >           - dssp * (- 4.0d+00 * rsd(m,i,j,2)
+     >                      + 6.0d+00 * rsd(m,i,j,3)
+     >                      - 4.0d+00 * rsd(m,i,j,4)
+     >                      +           rsd(m,i,j,5) )
+            end do
+
+            do k = 4, nz - 3
+               do m = 1, 5
+                  frct(m,i,j,k) = frct(m,i,j,k)
+     >              - dssp * (           rsd(m,i,j,k-2)
+     >                        - 4.0d+00 * rsd(m,i,j,k-1)
+     >                        + 6.0d+00 * rsd(m,i,j,k)
+     >                        - 4.0d+00 * rsd(m,i,j,k+1)
+     >                        +           rsd(m,i,j,k+2) )
+               end do
+            end do
+
+            do m = 1, 5
+               frct(m,i,j,nz-2) = frct(m,i,j,nz-2)
+     >           - dssp * (            rsd(m,i,j,nz-4)
+     >                      - 4.0d+00 * rsd(m,i,j,nz-3)
+     >                      + 6.0d+00 * rsd(m,i,j,nz-2)
+     >                      - 4.0d+00 * rsd(m,i,j,nz-1)  )
+               frct(m,i,j,nz-1) = frct(m,i,j,nz-1)
+     >           - dssp * (             rsd(m,i,j,nz-3)
+     >                       - 4.0d+00 * rsd(m,i,j,nz-2)
+     >                       + 5.0d+00 * rsd(m,i,j,nz-1)  )
+            end do
+         end do
+      end do
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/error.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/error.f
new file mode 100644
index 0000000..1f9ad22
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/error.f
@@ -0,0 +1,62 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine error
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c
+c   compute the solution error
+c
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j, k, m
+      double precision  tmp
+      double precision  u000ijk(5)
+
+
+
+      do m = 1, 5
+         errnm(m) = 0.0d+00
+      end do
+
+      do k = 2, nz-1
+         do j = jst, jend
+            do i = ist, iend
+               call exact( i, j, k, u000ijk )
+               do m = 1, 5
+                  tmp = ( u000ijk(m) - u(m,i,j,k) )
+                  errnm(m) = errnm(m) + tmp ** 2
+               end do
+            end do
+         end do
+      end do
+
+      do m = 1, 5
+         errnm(m) = sqrt ( errnm(m) / ( (nx0-2)*(ny0-2)*(nz0-2) ) )
+      end do
+
+c        write (*,1002) ( errnm(m), m = 1, 5 )
+
+ 1002 format (1x/1x,'RMS-norm of error in soln. to ',
+     > 'first pde  = ',1pe12.5/,
+     > 1x,'RMS-norm of error in soln. to ',
+     > 'second pde = ',1pe12.5/,
+     > 1x,'RMS-norm of error in soln. to ',
+     > 'third pde  = ',1pe12.5/,
+     > 1x,'RMS-norm of error in soln. to ',
+     > 'fourth pde = ',1pe12.5/,
+     > 1x,'RMS-norm of error in soln. to ',
+     > 'fifth pde  = ',1pe12.5)
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/exact.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/exact.f
new file mode 100644
index 0000000..5a7c958
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/exact.f
@@ -0,0 +1,53 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine exact( i, j, k, u000ijk )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c
+c   compute the exact solution at (i,j,k)
+c
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  input parameters
+c---------------------------------------------------------------------
+      integer i, j, k
+      double precision u000ijk(*)
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer m
+      double precision xi, eta, zeta
+
+      xi  = ( dble ( i - 1 ) ) / ( nx0 - 1 )
+      eta  = ( dble ( j - 1 ) ) / ( ny0 - 1 )
+      zeta = ( dble ( k - 1 ) ) / ( nz - 1 )
+
+
+      do m = 1, 5
+         u000ijk(m) =  ce(m,1)
+     >        + (ce(m,2)
+     >        + (ce(m,5)
+     >        + (ce(m,8)
+     >        +  ce(m,11) * xi) * xi) * xi) * xi
+     >        + (ce(m,3)
+     >        + (ce(m,6)
+     >        + (ce(m,9)
+     >        +  ce(m,12) * eta) * eta) * eta) * eta
+     >        + (ce(m,4)
+     >        + (ce(m,7)
+     >        + (ce(m,10)
+     >        +  ce(m,13) * zeta) * zeta) * zeta) * zeta
+      end do
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/inputlu.data.sample b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/inputlu.data.sample
new file mode 100644
index 0000000..9ef5a7b
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/inputlu.data.sample
@@ -0,0 +1,24 @@
+c
+c***controls printing of the progress of iterations: ipr    inorm
+                                                      1      250
+c
+c***the maximum no. of pseudo-time steps to be performed: nitmax
+                                                             250
+c
+c***magnitude of the time step: dt 
+                               2.0e+00
+c
+c***relaxation factor for SSOR iterations: omega
+                                            1.2
+c
+c***tolerance levels for steady-state residuals: tolnwt(m),m=1,5
+                             1.0e-08   1.0e-08   1.0e-08  1.0e-08  1.0e-08 
+c
+c***number of grid points in xi and eta and zeta directions: nx   ny   nz
+                                                            64  64  64
+c
+
+
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/jacld.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/jacld.f
new file mode 100644
index 0000000..e71b706
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/jacld.f
@@ -0,0 +1,356 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine jacld(k)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+
+c---------------------------------------------------------------------
+c   compute the lower triangular part of the jacobian matrix
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  input parameters
+c---------------------------------------------------------------------
+      integer k
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j
+      double precision  r43
+      double precision  c1345
+      double precision  c34
+      double precision  tmp1, tmp2, tmp3
+
+
+
+      r43 = ( 4.0d+00 / 3.0d+00 )
+      c1345 = c1 * c3 * c4 * c5
+      c34 = c3 * c4
+
+         do j = jst, jend
+            do i = ist, iend
+
+c---------------------------------------------------------------------
+c   form the block diagonal
+c---------------------------------------------------------------------
+               tmp1 = rho_i(i,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               d(1,1,i,j) =  1.0d+00
+     >                       + dt * 2.0d+00 * (   tx1 * dx1
+     >                                          + ty1 * dy1
+     >                                          + tz1 * dz1 )
+               d(1,2,i,j) =  0.0d+00
+               d(1,3,i,j) =  0.0d+00
+               d(1,4,i,j) =  0.0d+00
+               d(1,5,i,j) =  0.0d+00
+
+               d(2,1,i,j) = -dt * 2.0d+00
+     >          * (  tx1 * r43 + ty1 + tz1  )
+     >          * c34 * tmp2 * u(2,i,j,k)
+               d(2,2,i,j) =  1.0d+00
+     >          + dt * 2.0d+00 * c34 * tmp1 
+     >          * (  tx1 * r43 + ty1 + tz1 )
+     >          + dt * 2.0d+00 * (   tx1 * dx2
+     >                             + ty1 * dy2
+     >                             + tz1 * dz2  )
+               d(2,3,i,j) = 0.0d+00
+               d(2,4,i,j) = 0.0d+00
+               d(2,5,i,j) = 0.0d+00
+
+               d(3,1,i,j) = -dt * 2.0d+00
+     >           * (  tx1 + ty1 * r43 + tz1  )
+     >           * c34 * tmp2 * u(3,i,j,k)
+               d(3,2,i,j) = 0.0d+00
+               d(3,3,i,j) = 1.0d+00
+     >         + dt * 2.0d+00 * c34 * tmp1
+     >              * (  tx1 + ty1 * r43 + tz1 )
+     >         + dt * 2.0d+00 * (  tx1 * dx3
+     >                           + ty1 * dy3
+     >                           + tz1 * dz3 )
+               d(3,4,i,j) = 0.0d+00
+               d(3,5,i,j) = 0.0d+00
+
+               d(4,1,i,j) = -dt * 2.0d+00
+     >           * (  tx1 + ty1 + tz1 * r43  )
+     >           * c34 * tmp2 * u(4,i,j,k)
+               d(4,2,i,j) = 0.0d+00
+               d(4,3,i,j) = 0.0d+00
+               d(4,4,i,j) = 1.0d+00
+     >         + dt * 2.0d+00 * c34 * tmp1
+     >              * (  tx1 + ty1 + tz1 * r43 )
+     >         + dt * 2.0d+00 * (  tx1 * dx4
+     >                           + ty1 * dy4
+     >                           + tz1 * dz4 )
+               d(4,5,i,j) = 0.0d+00
+
+               d(5,1,i,j) = -dt * 2.0d+00
+     >  * ( ( ( tx1 * ( r43*c34 - c1345 )
+     >     + ty1 * ( c34 - c1345 )
+     >     + tz1 * ( c34 - c1345 ) ) * ( u(2,i,j,k) ** 2 )
+     >   + ( tx1 * ( c34 - c1345 )
+     >     + ty1 * ( r43*c34 - c1345 )
+     >     + tz1 * ( c34 - c1345 ) ) * ( u(3,i,j,k) ** 2 )
+     >   + ( tx1 * ( c34 - c1345 )
+     >     + ty1 * ( c34 - c1345 )
+     >     + tz1 * ( r43*c34 - c1345 ) ) * ( u(4,i,j,k) ** 2 )
+     >      ) * tmp3
+     >   + ( tx1 + ty1 + tz1 ) * c1345 * tmp2 * u(5,i,j,k) )
+
+               d(5,2,i,j) = dt * 2.0d+00 * tmp2 * u(2,i,j,k)
+     > * ( tx1 * ( r43*c34 - c1345 )
+     >   + ty1 * (     c34 - c1345 )
+     >   + tz1 * (     c34 - c1345 ) )
+               d(5,3,i,j) = dt * 2.0d+00 * tmp2 * u(3,i,j,k)
+     > * ( tx1 * ( c34 - c1345 )
+     >   + ty1 * ( r43*c34 -c1345 )
+     >   + tz1 * ( c34 - c1345 ) )
+               d(5,4,i,j) = dt * 2.0d+00 * tmp2 * u(4,i,j,k)
+     > * ( tx1 * ( c34 - c1345 )
+     >   + ty1 * ( c34 - c1345 )
+     >   + tz1 * ( r43*c34 - c1345 ) )
+               d(5,5,i,j) = 1.0d+00
+     >   + dt * 2.0d+00 * ( tx1  + ty1 + tz1 ) * c1345 * tmp1
+     >   + dt * 2.0d+00 * (  tx1 * dx5
+     >                    +  ty1 * dy5
+     >                    +  tz1 * dz5 )
+
+c---------------------------------------------------------------------
+c   form the first block sub-diagonal
+c---------------------------------------------------------------------
+               tmp1 = rho_i(i,j,k-1)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               a(1,1,i,j) = - dt * tz1 * dz1
+               a(1,2,i,j) =   0.0d+00
+               a(1,3,i,j) =   0.0d+00
+               a(1,4,i,j) = - dt * tz2
+               a(1,5,i,j) =   0.0d+00
+
+               a(2,1,i,j) = - dt * tz2
+     >           * ( - ( u(2,i,j,k-1)*u(4,i,j,k-1) ) * tmp2 )
+     >           - dt * tz1 * ( - c34 * tmp2 * u(2,i,j,k-1) )
+               a(2,2,i,j) = - dt * tz2 * ( u(4,i,j,k-1) * tmp1 )
+     >           - dt * tz1 * c34 * tmp1
+     >           - dt * tz1 * dz2 
+               a(2,3,i,j) = 0.0d+00
+               a(2,4,i,j) = - dt * tz2 * ( u(2,i,j,k-1) * tmp1 )
+               a(2,5,i,j) = 0.0d+00
+
+               a(3,1,i,j) = - dt * tz2
+     >           * ( - ( u(3,i,j,k-1)*u(4,i,j,k-1) ) * tmp2 )
+     >           - dt * tz1 * ( - c34 * tmp2 * u(3,i,j,k-1) )
+               a(3,2,i,j) = 0.0d+00
+               a(3,3,i,j) = - dt * tz2 * ( u(4,i,j,k-1) * tmp1 )
+     >           - dt * tz1 * ( c34 * tmp1 )
+     >           - dt * tz1 * dz3
+               a(3,4,i,j) = - dt * tz2 * ( u(3,i,j,k-1) * tmp1 )
+               a(3,5,i,j) = 0.0d+00
+
+               a(4,1,i,j) = - dt * tz2
+     >        * ( - ( u(4,i,j,k-1) * tmp1 ) ** 2
+     >            + c2 * qs(i,j,k-1) * tmp1 )
+     >        - dt * tz1 * ( - r43 * c34 * tmp2 * u(4,i,j,k-1) )
+               a(4,2,i,j) = - dt * tz2
+     >             * ( - c2 * ( u(2,i,j,k-1) * tmp1 ) )
+               a(4,3,i,j) = - dt * tz2
+     >             * ( - c2 * ( u(3,i,j,k-1) * tmp1 ) )
+               a(4,4,i,j) = - dt * tz2 * ( 2.0d+00 - c2 )
+     >             * ( u(4,i,j,k-1) * tmp1 )
+     >             - dt * tz1 * ( r43 * c34 * tmp1 )
+     >             - dt * tz1 * dz4
+               a(4,5,i,j) = - dt * tz2 * c2
+
+               a(5,1,i,j) = - dt * tz2
+     >       * ( ( c2 * 2.0d0 * qs(i,j,k-1)
+     >       - c1 * u(5,i,j,k-1) )
+     >            * u(4,i,j,k-1) * tmp2 )
+     >       - dt * tz1
+     >       * ( - ( c34 - c1345 ) * tmp3 * (u(2,i,j,k-1)**2)
+     >           - ( c34 - c1345 ) * tmp3 * (u(3,i,j,k-1)**2)
+     >           - ( r43*c34 - c1345 )* tmp3 * (u(4,i,j,k-1)**2)
+     >          - c1345 * tmp2 * u(5,i,j,k-1) )
+               a(5,2,i,j) = - dt * tz2
+     >       * ( - c2 * ( u(2,i,j,k-1)*u(4,i,j,k-1) ) * tmp2 )
+     >       - dt * tz1 * ( c34 - c1345 ) * tmp2 * u(2,i,j,k-1)
+               a(5,3,i,j) = - dt * tz2
+     >       * ( - c2 * ( u(3,i,j,k-1)*u(4,i,j,k-1) ) * tmp2 )
+     >       - dt * tz1 * ( c34 - c1345 ) * tmp2 * u(3,i,j,k-1)
+               a(5,4,i,j) = - dt * tz2
+     >       * ( c1 * ( u(5,i,j,k-1) * tmp1 )
+     >       - c2
+     >       * ( qs(i,j,k-1) * tmp1
+     >            + u(4,i,j,k-1)*u(4,i,j,k-1) * tmp2 ) )
+     >       - dt * tz1 * ( r43*c34 - c1345 ) * tmp2 * u(4,i,j,k-1)
+               a(5,5,i,j) = - dt * tz2
+     >       * ( c1 * ( u(4,i,j,k-1) * tmp1 ) )
+     >       - dt * tz1 * c1345 * tmp1
+     >       - dt * tz1 * dz5
+
+c---------------------------------------------------------------------
+c   form the second block sub-diagonal
+c---------------------------------------------------------------------
+               tmp1 = rho_i(i,j-1,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               b(1,1,i,j) = - dt * ty1 * dy1
+               b(1,2,i,j) =   0.0d+00
+               b(1,3,i,j) = - dt * ty2
+               b(1,4,i,j) =   0.0d+00
+               b(1,5,i,j) =   0.0d+00
+
+               b(2,1,i,j) = - dt * ty2
+     >           * ( - ( u(2,i,j-1,k)*u(3,i,j-1,k) ) * tmp2 )
+     >           - dt * ty1 * ( - c34 * tmp2 * u(2,i,j-1,k) )
+               b(2,2,i,j) = - dt * ty2 * ( u(3,i,j-1,k) * tmp1 )
+     >          - dt * ty1 * ( c34 * tmp1 )
+     >          - dt * ty1 * dy2
+               b(2,3,i,j) = - dt * ty2 * ( u(2,i,j-1,k) * tmp1 )
+               b(2,4,i,j) = 0.0d+00
+               b(2,5,i,j) = 0.0d+00
+
+               b(3,1,i,j) = - dt * ty2
+     >           * ( - ( u(3,i,j-1,k) * tmp1 ) ** 2
+     >       + c2 * ( qs(i,j-1,k) * tmp1 ) )
+     >       - dt * ty1 * ( - r43 * c34 * tmp2 * u(3,i,j-1,k) )
+               b(3,2,i,j) = - dt * ty2
+     >                   * ( - c2 * ( u(2,i,j-1,k) * tmp1 ) )
+               b(3,3,i,j) = - dt * ty2 * ( ( 2.0d+00 - c2 )
+     >                   * ( u(3,i,j-1,k) * tmp1 ) )
+     >       - dt * ty1 * ( r43 * c34 * tmp1 )
+     >       - dt * ty1 * dy3
+               b(3,4,i,j) = - dt * ty2
+     >                   * ( - c2 * ( u(4,i,j-1,k) * tmp1 ) )
+               b(3,5,i,j) = - dt * ty2 * c2
+
+               b(4,1,i,j) = - dt * ty2
+     >              * ( - ( u(3,i,j-1,k)*u(4,i,j-1,k) ) * tmp2 )
+     >       - dt * ty1 * ( - c34 * tmp2 * u(4,i,j-1,k) )
+               b(4,2,i,j) = 0.0d+00
+               b(4,3,i,j) = - dt * ty2 * ( u(4,i,j-1,k) * tmp1 )
+               b(4,4,i,j) = - dt * ty2 * ( u(3,i,j-1,k) * tmp1 )
+     >                        - dt * ty1 * ( c34 * tmp1 )
+     >                        - dt * ty1 * dy4
+               b(4,5,i,j) = 0.0d+00
+
+               b(5,1,i,j) = - dt * ty2
+     >          * ( ( c2 * 2.0d0 * qs(i,j-1,k)
+     >               - c1 * u(5,i,j-1,k) )
+     >          * ( u(3,i,j-1,k) * tmp2 ) )
+     >          - dt * ty1
+     >          * ( - (     c34 - c1345 )*tmp3*(u(2,i,j-1,k)**2)
+     >              - ( r43*c34 - c1345 )*tmp3*(u(3,i,j-1,k)**2)
+     >              - (     c34 - c1345 )*tmp3*(u(4,i,j-1,k)**2)
+     >              - c1345*tmp2*u(5,i,j-1,k) )
+               b(5,2,i,j) = - dt * ty2
+     >          * ( - c2 * ( u(2,i,j-1,k)*u(3,i,j-1,k) ) * tmp2 )
+     >          - dt * ty1
+     >          * ( c34 - c1345 ) * tmp2 * u(2,i,j-1,k)
+               b(5,3,i,j) = - dt * ty2
+     >          * ( c1 * ( u(5,i,j-1,k) * tmp1 )
+     >          - c2 
+     >          * ( qs(i,j-1,k) * tmp1
+     >               + u(3,i,j-1,k)*u(3,i,j-1,k) * tmp2 ) )
+     >          - dt * ty1
+     >          * ( r43*c34 - c1345 ) * tmp2 * u(3,i,j-1,k)
+               b(5,4,i,j) = - dt * ty2
+     >          * ( - c2 * ( u(3,i,j-1,k)*u(4,i,j-1,k) ) * tmp2 )
+     >          - dt * ty1 * ( c34 - c1345 ) * tmp2 * u(4,i,j-1,k)
+               b(5,5,i,j) = - dt * ty2
+     >          * ( c1 * ( u(3,i,j-1,k) * tmp1 ) )
+     >          - dt * ty1 * c1345 * tmp1
+     >          - dt * ty1 * dy5
+
+c---------------------------------------------------------------------
+c   form the third block sub-diagonal
+c---------------------------------------------------------------------
+               tmp1 = rho_i(i-1,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               c(1,1,i,j) = - dt * tx1 * dx1
+               c(1,2,i,j) = - dt * tx2
+               c(1,3,i,j) =   0.0d+00
+               c(1,4,i,j) =   0.0d+00
+               c(1,5,i,j) =   0.0d+00
+
+               c(2,1,i,j) = - dt * tx2
+     >          * ( - ( u(2,i-1,j,k) * tmp1 ) ** 2
+     >       + c2 * qs(i-1,j,k) * tmp1 )
+     >          - dt * tx1 * ( - r43 * c34 * tmp2 * u(2,i-1,j,k) )
+               c(2,2,i,j) = - dt * tx2
+     >          * ( ( 2.0d+00 - c2 ) * ( u(2,i-1,j,k) * tmp1 ) )
+     >          - dt * tx1 * ( r43 * c34 * tmp1 )
+     >          - dt * tx1 * dx2
+               c(2,3,i,j) = - dt * tx2
+     >              * ( - c2 * ( u(3,i-1,j,k) * tmp1 ) )
+               c(2,4,i,j) = - dt * tx2
+     >              * ( - c2 * ( u(4,i-1,j,k) * tmp1 ) )
+               c(2,5,i,j) = - dt * tx2 * c2 
+
+               c(3,1,i,j) = - dt * tx2
+     >              * ( - ( u(2,i-1,j,k) * u(3,i-1,j,k) ) * tmp2 )
+     >         - dt * tx1 * ( - c34 * tmp2 * u(3,i-1,j,k) )
+               c(3,2,i,j) = - dt * tx2 * ( u(3,i-1,j,k) * tmp1 )
+               c(3,3,i,j) = - dt * tx2 * ( u(2,i-1,j,k) * tmp1 )
+     >          - dt * tx1 * ( c34 * tmp1 )
+     >          - dt * tx1 * dx3
+               c(3,4,i,j) = 0.0d+00
+               c(3,5,i,j) = 0.0d+00
+
+               c(4,1,i,j) = - dt * tx2
+     >          * ( - ( u(2,i-1,j,k)*u(4,i-1,j,k) ) * tmp2 )
+     >          - dt * tx1 * ( - c34 * tmp2 * u(4,i-1,j,k) )
+               c(4,2,i,j) = - dt * tx2 * ( u(4,i-1,j,k) * tmp1 )
+               c(4,3,i,j) = 0.0d+00
+               c(4,4,i,j) = - dt * tx2 * ( u(2,i-1,j,k) * tmp1 )
+     >          - dt * tx1 * ( c34 * tmp1 )
+     >          - dt * tx1 * dx4
+               c(4,5,i,j) = 0.0d+00
+
+               c(5,1,i,j) = - dt * tx2
+     >          * ( ( c2 * 2.0d0 * qs(i-1,j,k)
+     >              - c1 * u(5,i-1,j,k) )
+     >          * u(2,i-1,j,k) * tmp2 )
+     >          - dt * tx1
+     >          * ( - ( r43*c34 - c1345 ) * tmp3 * ( u(2,i-1,j,k)**2 )
+     >              - (     c34 - c1345 ) * tmp3 * ( u(3,i-1,j,k)**2 )
+     >              - (     c34 - c1345 ) * tmp3 * ( u(4,i-1,j,k)**2 )
+     >              - c1345 * tmp2 * u(5,i-1,j,k) )
+               c(5,2,i,j) = - dt * tx2
+     >          * ( c1 * ( u(5,i-1,j,k) * tmp1 )
+     >             - c2
+     >             * ( u(2,i-1,j,k)*u(2,i-1,j,k) * tmp2
+     >                  + qs(i-1,j,k) * tmp1 ) )
+     >           - dt * tx1
+     >           * ( r43*c34 - c1345 ) * tmp2 * u(2,i-1,j,k)
+               c(5,3,i,j) = - dt * tx2
+     >           * ( - c2 * ( u(3,i-1,j,k)*u(2,i-1,j,k) ) * tmp2 )
+     >           - dt * tx1
+     >           * (  c34 - c1345 ) * tmp2 * u(3,i-1,j,k)
+               c(5,4,i,j) = - dt * tx2
+     >           * ( - c2 * ( u(4,i-1,j,k)*u(2,i-1,j,k) ) * tmp2 )
+     >           - dt * tx1
+     >           * (  c34 - c1345 ) * tmp2 * u(4,i-1,j,k)
+               c(5,5,i,j) = - dt * tx2
+     >           * ( c1 * ( u(2,i-1,j,k) * tmp1 ) )
+     >           - dt * tx1 * c1345 * tmp1
+     >           - dt * tx1 * dx5
+
+            end do
+         end do
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/jacu.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/jacu.f
new file mode 100644
index 0000000..d7666b1
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/jacu.f
@@ -0,0 +1,356 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine jacu(k)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   compute the upper triangular part of the jacobian matrix
+c---------------------------------------------------------------------
+
+
+      implicit none
+
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  input parameters
+c---------------------------------------------------------------------
+      integer k
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j
+      double precision  r43
+      double precision  c1345
+      double precision  c34
+      double precision  tmp1, tmp2, tmp3
+
+
+
+      r43 = ( 4.0d+00 / 3.0d+00 )
+      c1345 = c1 * c3 * c4 * c5
+      c34 = c3 * c4
+
+         do j = jst, jend
+            do i = ist, iend
+
+c---------------------------------------------------------------------
+c   form the block diagonal
+c---------------------------------------------------------------------
+               tmp1 = rho_i(i,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               d(1,1,i,j) =  1.0d+00
+     >                       + dt * 2.0d+00 * (   tx1 * dx1
+     >                                          + ty1 * dy1
+     >                                          + tz1 * dz1 )
+               d(1,2,i,j) =  0.0d+00
+               d(1,3,i,j) =  0.0d+00
+               d(1,4,i,j) =  0.0d+00
+               d(1,5,i,j) =  0.0d+00
+
+               d(2,1,i,j) =  dt * 2.0d+00
+     >           * ( - tx1 * r43 - ty1 - tz1 )
+     >           * ( c34 * tmp2 * u(2,i,j,k) )
+               d(2,2,i,j) =  1.0d+00
+     >          + dt * 2.0d+00 * c34 * tmp1 
+     >          * (  tx1 * r43 + ty1 + tz1 )
+     >          + dt * 2.0d+00 * (   tx1 * dx2
+     >                             + ty1 * dy2
+     >                             + tz1 * dz2  )
+               d(2,3,i,j) = 0.0d+00
+               d(2,4,i,j) = 0.0d+00
+               d(2,5,i,j) = 0.0d+00
+
+               d(3,1,i,j) = dt * 2.0d+00
+     >           * ( - tx1 - ty1 * r43 - tz1 )
+     >           * ( c34 * tmp2 * u(3,i,j,k) )
+               d(3,2,i,j) = 0.0d+00
+               d(3,3,i,j) = 1.0d+00
+     >         + dt * 2.0d+00 * c34 * tmp1
+     >              * (  tx1 + ty1 * r43 + tz1 )
+     >         + dt * 2.0d+00 * (  tx1 * dx3
+     >                           + ty1 * dy3
+     >                           + tz1 * dz3 )
+               d(3,4,i,j) = 0.0d+00
+               d(3,5,i,j) = 0.0d+00
+
+               d(4,1,i,j) = dt * 2.0d+00
+     >           * ( - tx1 - ty1 - tz1 * r43 )
+     >           * ( c34 * tmp2 * u(4,i,j,k) )
+               d(4,2,i,j) = 0.0d+00
+               d(4,3,i,j) = 0.0d+00
+               d(4,4,i,j) = 1.0d+00
+     >         + dt * 2.0d+00 * c34 * tmp1
+     >              * (  tx1 + ty1 + tz1 * r43 )
+     >         + dt * 2.0d+00 * (  tx1 * dx4
+     >                           + ty1 * dy4
+     >                           + tz1 * dz4 )
+               d(4,5,i,j) = 0.0d+00
+
+               d(5,1,i,j) = -dt * 2.0d+00
+     >  * ( ( ( tx1 * ( r43*c34 - c1345 )
+     >     + ty1 * ( c34 - c1345 )
+     >     + tz1 * ( c34 - c1345 ) ) * ( u(2,i,j,k) ** 2 )
+     >   + ( tx1 * ( c34 - c1345 )
+     >     + ty1 * ( r43*c34 - c1345 )
+     >     + tz1 * ( c34 - c1345 ) ) * ( u(3,i,j,k) ** 2 )
+     >   + ( tx1 * ( c34 - c1345 )
+     >     + ty1 * ( c34 - c1345 )
+     >     + tz1 * ( r43*c34 - c1345 ) ) * ( u(4,i,j,k) ** 2 )
+     >      ) * tmp3
+     >   + ( tx1 + ty1 + tz1 ) * c1345 * tmp2 * u(5,i,j,k) )
+
+               d(5,2,i,j) = dt * 2.0d+00
+     > * ( tx1 * ( r43*c34 - c1345 )
+     >   + ty1 * (     c34 - c1345 )
+     >   + tz1 * (     c34 - c1345 ) ) * tmp2 * u(2,i,j,k)
+               d(5,3,i,j) = dt * 2.0d+00
+     > * ( tx1 * ( c34 - c1345 )
+     >   + ty1 * ( r43*c34 -c1345 )
+     >   + tz1 * ( c34 - c1345 ) ) * tmp2 * u(3,i,j,k)
+               d(5,4,i,j) = dt * 2.0d+00
+     > * ( tx1 * ( c34 - c1345 )
+     >   + ty1 * ( c34 - c1345 )
+     >   + tz1 * ( r43*c34 - c1345 ) ) * tmp2 * u(4,i,j,k)
+               d(5,5,i,j) = 1.0d+00
+     >   + dt * 2.0d+00 * ( tx1 + ty1 + tz1 ) * c1345 * tmp1
+     >   + dt * 2.0d+00 * (  tx1 * dx5
+     >                    +  ty1 * dy5
+     >                    +  tz1 * dz5 )
+
+c---------------------------------------------------------------------
+c   form the first block super-diagonal
+c---------------------------------------------------------------------
+               tmp1 = rho_i(i+1,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               a(1,1,i,j) = - dt * tx1 * dx1
+               a(1,2,i,j) =   dt * tx2
+               a(1,3,i,j) =   0.0d+00
+               a(1,4,i,j) =   0.0d+00
+               a(1,5,i,j) =   0.0d+00
+
+               a(2,1,i,j) =  dt * tx2
+     >          * ( - ( u(2,i+1,j,k) * tmp1 ) ** 2
+     >     + c2 * qs(i+1,j,k) * tmp1 )
+     >          - dt * tx1 * ( - r43 * c34 * tmp2 * u(2,i+1,j,k) )
+               a(2,2,i,j) =  dt * tx2
+     >          * ( ( 2.0d+00 - c2 ) * ( u(2,i+1,j,k) * tmp1 ) )
+     >          - dt * tx1 * ( r43 * c34 * tmp1 )
+     >          - dt * tx1 * dx2
+               a(2,3,i,j) =  dt * tx2
+     >              * ( - c2 * ( u(3,i+1,j,k) * tmp1 ) )
+               a(2,4,i,j) =  dt * tx2
+     >              * ( - c2 * ( u(4,i+1,j,k) * tmp1 ) )
+               a(2,5,i,j) =  dt * tx2 * c2 
+
+               a(3,1,i,j) =  dt * tx2
+     >              * ( - ( u(2,i+1,j,k) * u(3,i+1,j,k) ) * tmp2 )
+     >         - dt * tx1 * ( - c34 * tmp2 * u(3,i+1,j,k) )
+               a(3,2,i,j) =  dt * tx2 * ( u(3,i+1,j,k) * tmp1 )
+               a(3,3,i,j) =  dt * tx2 * ( u(2,i+1,j,k) * tmp1 )
+     >          - dt * tx1 * ( c34 * tmp1 )
+     >          - dt * tx1 * dx3
+               a(3,4,i,j) = 0.0d+00
+               a(3,5,i,j) = 0.0d+00
+
+               a(4,1,i,j) = dt * tx2
+     >          * ( - ( u(2,i+1,j,k)*u(4,i+1,j,k) ) * tmp2 )
+     >          - dt * tx1 * ( - c34 * tmp2 * u(4,i+1,j,k) )
+               a(4,2,i,j) = dt * tx2 * ( u(4,i+1,j,k) * tmp1 )
+               a(4,3,i,j) = 0.0d+00
+               a(4,4,i,j) = dt * tx2 * ( u(2,i+1,j,k) * tmp1 )
+     >          - dt * tx1 * ( c34 * tmp1 )
+     >          - dt * tx1 * dx4
+               a(4,5,i,j) = 0.0d+00
+
+               a(5,1,i,j) = dt * tx2
+     >          * ( ( c2 * 2.0d0 * qs(i+1,j,k)
+     >              - c1 * u(5,i+1,j,k) )
+     >          * ( u(2,i+1,j,k) * tmp2 ) )
+     >          - dt * tx1
+     >          * ( - ( r43*c34 - c1345 ) * tmp3 * ( u(2,i+1,j,k)**2 )
+     >              - (     c34 - c1345 ) * tmp3 * ( u(3,i+1,j,k)**2 )
+     >              - (     c34 - c1345 ) * tmp3 * ( u(4,i+1,j,k)**2 )
+     >              - c1345 * tmp2 * u(5,i+1,j,k) )
+               a(5,2,i,j) = dt * tx2
+     >          * ( c1 * ( u(5,i+1,j,k) * tmp1 )
+     >             - c2
+     >             * (  u(2,i+1,j,k)*u(2,i+1,j,k) * tmp2
+     >                  + qs(i+1,j,k) * tmp1 ) )
+     >           - dt * tx1
+     >           * ( r43*c34 - c1345 ) * tmp2 * u(2,i+1,j,k)
+               a(5,3,i,j) = dt * tx2
+     >           * ( - c2 * ( u(3,i+1,j,k)*u(2,i+1,j,k) ) * tmp2 )
+     >           - dt * tx1
+     >           * (  c34 - c1345 ) * tmp2 * u(3,i+1,j,k)
+               a(5,4,i,j) = dt * tx2
+     >           * ( - c2 * ( u(4,i+1,j,k)*u(2,i+1,j,k) ) * tmp2 )
+     >           - dt * tx1
+     >           * (  c34 - c1345 ) * tmp2 * u(4,i+1,j,k)
+               a(5,5,i,j) = dt * tx2
+     >           * ( c1 * ( u(2,i+1,j,k) * tmp1 ) )
+     >           - dt * tx1 * c1345 * tmp1
+     >           - dt * tx1 * dx5
+
+c---------------------------------------------------------------------
+c   form the second block super-diagonal
+c---------------------------------------------------------------------
+               tmp1 = rho_i(i,j+1,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               b(1,1,i,j) = - dt * ty1 * dy1
+               b(1,2,i,j) =   0.0d+00
+               b(1,3,i,j) =  dt * ty2
+               b(1,4,i,j) =   0.0d+00
+               b(1,5,i,j) =   0.0d+00
+
+               b(2,1,i,j) =  dt * ty2
+     >           * ( - ( u(2,i,j+1,k)*u(3,i,j+1,k) ) * tmp2 )
+     >           - dt * ty1 * ( - c34 * tmp2 * u(2,i,j+1,k) )
+               b(2,2,i,j) =  dt * ty2 * ( u(3,i,j+1,k) * tmp1 )
+     >          - dt * ty1 * ( c34 * tmp1 )
+     >          - dt * ty1 * dy2
+               b(2,3,i,j) =  dt * ty2 * ( u(2,i,j+1,k) * tmp1 )
+               b(2,4,i,j) = 0.0d+00
+               b(2,5,i,j) = 0.0d+00
+
+               b(3,1,i,j) =  dt * ty2
+     >           * ( - ( u(3,i,j+1,k) * tmp1 ) ** 2
+     >      + c2 * ( qs(i,j+1,k) * tmp1 ) )
+     >       - dt * ty1 * ( - r43 * c34 * tmp2 * u(3,i,j+1,k) )
+               b(3,2,i,j) =  dt * ty2
+     >                   * ( - c2 * ( u(2,i,j+1,k) * tmp1 ) )
+               b(3,3,i,j) =  dt * ty2 * ( ( 2.0d+00 - c2 )
+     >                   * ( u(3,i,j+1,k) * tmp1 ) )
+     >       - dt * ty1 * ( r43 * c34 * tmp1 )
+     >       - dt * ty1 * dy3
+               b(3,4,i,j) =  dt * ty2
+     >                   * ( - c2 * ( u(4,i,j+1,k) * tmp1 ) )
+               b(3,5,i,j) =  dt * ty2 * c2
+
+               b(4,1,i,j) =  dt * ty2
+     >              * ( - ( u(3,i,j+1,k)*u(4,i,j+1,k) ) * tmp2 )
+     >       - dt * ty1 * ( - c34 * tmp2 * u(4,i,j+1,k) )
+               b(4,2,i,j) = 0.0d+00
+               b(4,3,i,j) =  dt * ty2 * ( u(4,i,j+1,k) * tmp1 )
+               b(4,4,i,j) =  dt * ty2 * ( u(3,i,j+1,k) * tmp1 )
+     >                        - dt * ty1 * ( c34 * tmp1 )
+     >                        - dt * ty1 * dy4
+               b(4,5,i,j) = 0.0d+00
+
+               b(5,1,i,j) =  dt * ty2
+     >          * ( ( c2 * 2.0d0 * qs(i,j+1,k)
+     >               - c1 * u(5,i,j+1,k) )
+     >          * ( u(3,i,j+1,k) * tmp2 ) )
+     >          - dt * ty1
+     >          * ( - (     c34 - c1345 )*tmp3*(u(2,i,j+1,k)**2)
+     >              - ( r43*c34 - c1345 )*tmp3*(u(3,i,j+1,k)**2)
+     >              - (     c34 - c1345 )*tmp3*(u(4,i,j+1,k)**2)
+     >              - c1345*tmp2*u(5,i,j+1,k) )
+               b(5,2,i,j) =  dt * ty2
+     >          * ( - c2 * ( u(2,i,j+1,k)*u(3,i,j+1,k) ) * tmp2 )
+     >          - dt * ty1
+     >          * ( c34 - c1345 ) * tmp2 * u(2,i,j+1,k)
+               b(5,3,i,j) =  dt * ty2
+     >          * ( c1 * ( u(5,i,j+1,k) * tmp1 )
+     >          - c2 
+     >          * ( qs(i,j+1,k) * tmp1
+     >               + u(3,i,j+1,k)*u(3,i,j+1,k) * tmp2 ) )
+     >          - dt * ty1
+     >          * ( r43*c34 - c1345 ) * tmp2 * u(3,i,j+1,k)
+               b(5,4,i,j) =  dt * ty2
+     >          * ( - c2 * ( u(3,i,j+1,k)*u(4,i,j+1,k) ) * tmp2 )
+     >          - dt * ty1 * ( c34 - c1345 ) * tmp2 * u(4,i,j+1,k)
+               b(5,5,i,j) =  dt * ty2
+     >          * ( c1 * ( u(3,i,j+1,k) * tmp1 ) )
+     >          - dt * ty1 * c1345 * tmp1
+     >          - dt * ty1 * dy5
+
+c---------------------------------------------------------------------
+c   form the third block super-diagonal
+c---------------------------------------------------------------------
+               tmp1 = rho_i(i,j,k+1)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               c(1,1,i,j) = - dt * tz1 * dz1
+               c(1,2,i,j) =   0.0d+00
+               c(1,3,i,j) =   0.0d+00
+               c(1,4,i,j) = dt * tz2
+               c(1,5,i,j) =   0.0d+00
+
+               c(2,1,i,j) = dt * tz2
+     >           * ( - ( u(2,i,j,k+1)*u(4,i,j,k+1) ) * tmp2 )
+     >           - dt * tz1 * ( - c34 * tmp2 * u(2,i,j,k+1) )
+               c(2,2,i,j) = dt * tz2 * ( u(4,i,j,k+1) * tmp1 )
+     >           - dt * tz1 * c34 * tmp1
+     >           - dt * tz1 * dz2 
+               c(2,3,i,j) = 0.0d+00
+               c(2,4,i,j) = dt * tz2 * ( u(2,i,j,k+1) * tmp1 )
+               c(2,5,i,j) = 0.0d+00
+
+               c(3,1,i,j) = dt * tz2
+     >           * ( - ( u(3,i,j,k+1)*u(4,i,j,k+1) ) * tmp2 )
+     >           - dt * tz1 * ( - c34 * tmp2 * u(3,i,j,k+1) )
+               c(3,2,i,j) = 0.0d+00
+               c(3,3,i,j) = dt * tz2 * ( u(4,i,j,k+1) * tmp1 )
+     >           - dt * tz1 * ( c34 * tmp1 )
+     >           - dt * tz1 * dz3
+               c(3,4,i,j) = dt * tz2 * ( u(3,i,j,k+1) * tmp1 )
+               c(3,5,i,j) = 0.0d+00
+
+               c(4,1,i,j) = dt * tz2
+     >        * ( - ( u(4,i,j,k+1) * tmp1 ) ** 2
+     >            + c2 * ( qs(i,j,k+1) * tmp1 ) )
+     >        - dt * tz1 * ( - r43 * c34 * tmp2 * u(4,i,j,k+1) )
+               c(4,2,i,j) = dt * tz2
+     >             * ( - c2 * ( u(2,i,j,k+1) * tmp1 ) )
+               c(4,3,i,j) = dt * tz2
+     >             * ( - c2 * ( u(3,i,j,k+1) * tmp1 ) )
+               c(4,4,i,j) = dt * tz2 * ( 2.0d+00 - c2 )
+     >             * ( u(4,i,j,k+1) * tmp1 )
+     >             - dt * tz1 * ( r43 * c34 * tmp1 )
+     >             - dt * tz1 * dz4
+               c(4,5,i,j) = dt * tz2 * c2
+
+               c(5,1,i,j) = dt * tz2
+     >     * ( ( c2 * 2.0d0 * qs(i,j,k+1)
+     >       - c1 * u(5,i,j,k+1) )
+     >            * ( u(4,i,j,k+1) * tmp2 ) )
+     >       - dt * tz1
+     >       * ( - ( c34 - c1345 ) * tmp3 * (u(2,i,j,k+1)**2)
+     >           - ( c34 - c1345 ) * tmp3 * (u(3,i,j,k+1)**2)
+     >           - ( r43*c34 - c1345 )* tmp3 * (u(4,i,j,k+1)**2)
+     >          - c1345 * tmp2 * u(5,i,j,k+1) )
+               c(5,2,i,j) = dt * tz2
+     >       * ( - c2 * ( u(2,i,j,k+1)*u(4,i,j,k+1) ) * tmp2 )
+     >       - dt * tz1 * ( c34 - c1345 ) * tmp2 * u(2,i,j,k+1)
+               c(5,3,i,j) = dt * tz2
+     >       * ( - c2 * ( u(3,i,j,k+1)*u(4,i,j,k+1) ) * tmp2 )
+     >       - dt * tz1 * ( c34 - c1345 ) * tmp2 * u(3,i,j,k+1)
+               c(5,4,i,j) = dt * tz2
+     >       * ( c1 * ( u(5,i,j,k+1) * tmp1 )
+     >       - c2
+     >       * ( qs(i,j,k+1) * tmp1
+     >            + u(4,i,j,k+1)*u(4,i,j,k+1) * tmp2 ) )
+     >       - dt * tz1 * ( r43*c34 - c1345 ) * tmp2 * u(4,i,j,k+1)
+               c(5,5,i,j) = dt * tz2
+     >       * ( c1 * ( u(4,i,j,k+1) * tmp1 ) )
+     >       - dt * tz1 * c1345 * tmp1
+     >       - dt * tz1 * dz5
+
+            end do
+         end do
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/l2norm.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/l2norm.f
new file mode 100644
index 0000000..f3bec22
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/l2norm.f
@@ -0,0 +1,57 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      subroutine l2norm ( ldx, ldy, ldz, 
+     >                    nx0, ny0, nz0,
+     >                    ist, iend, 
+     >                    jst, jend,
+     >                    v, sum )
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   to compute the l2-norm of vector v.
+c---------------------------------------------------------------------
+
+      implicit none
+
+
+c---------------------------------------------------------------------
+c  input parameters
+c---------------------------------------------------------------------
+      integer ldx, ldy, ldz
+      integer nx0, ny0, nz0
+      integer ist, iend
+      integer jst, jend
+c---------------------------------------------------------------------
+c   To improve cache performance, the second and third dimensions
+c   are padded by 1 when their size is even.  Only needed in v.
+c---------------------------------------------------------------------
+      double precision  v(5,ldx/2*2+1,ldy/2*2+1,*), sum(5)
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j, k, m
+
+
+      do m = 1, 5
+         sum(m) = 0.0d+00
+      end do
+
+      do k = 2, nz0-1
+         do j = jst, jend
+            do i = ist, iend
+               do m = 1, 5
+                  sum(m) = sum(m) + v(m,i,j,k) * v(m,i,j,k)
+               end do
+            end do
+         end do
+      end do
+
+      do m = 1, 5
+         sum(m) = sqrt ( sum(m) / ( (nx0-2)*(ny0-2)*(nz0-2) ) )
+      end do
+
+      return
+      end
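Note on `l2norm.f` (not part of the patch): the routine accumulates the squares of each of the five solution components over the points `k = 2..nz0-1`, `j = jst..jend`, `i = ist..iend`, divides each sum by `(nx0-2)*(ny0-2)*(nz0-2)`, and takes the square root. The Python sketch below mirrors that arithmetic. The function name and the dictionary `v`, keyed by 1-based `(m, i, j, k)` tuples in place of the padded Fortran array, are illustrative assumptions.

```python
import math


def l2norm(v, nx0, ny0, nz0, ist, iend, jst, jend):
    """Return the five per-component norms computed as in l2norm.f.

    v is a hypothetical dict mapping 1-based (m, i, j, k) tuples to floats.
    """
    sums = [0.0] * 5
    for k in range(2, nz0):                 # Fortran: do k = 2, nz0-1
        for j in range(jst, jend + 1):      # Fortran: do j = jst, jend
            for i in range(ist, iend + 1):  # Fortran: do i = ist, iend
                for m in range(1, 6):       # Fortran: do m = 1, 5
                    sums[m - 1] += v[(m, i, j, k)] ** 2
    npts = (nx0 - 2) * (ny0 - 2) * (nz0 - 2)
    return [math.sqrt(s / npts) for s in sums]
```

For a field that is constant over the summed points, each returned norm equals the absolute value of that constant, provided `ist..iend` and `jst..jend` span exactly the `nx0-2` and `ny0-2` interior points.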
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/lu.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/lu.f
new file mode 100644
index 0000000..4bc09f9
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/lu.f
@@ -0,0 +1,195 @@
+!-------------------------------------------------------------------------!
+!                                                                         !
+!        N  A  S     P A R A L L E L     B E N C H M A R K S  3.3         !
+!                                                                         !
+!                      S E R I A L     V E R S I O N                      !
+!                                                                         !
+!                                   L U                                   !
+!                                                                         !
+!-------------------------------------------------------------------------!
+!                                                                         !
+!    This benchmark is a serial version of the NPB LU code.               !
+!    Refer to NAS Technical Reports 95-020 for details.                   !
+!                                                                         !
+!    Permission to use, copy, distribute and modify this software         !
+!    for any purpose with or without fee is hereby granted.  We           !
+!    request, however, that all derived work reference the NAS            !
+!    Parallel Benchmarks 3.3. This software is provided "as is"           !
+!    without express or implied warranty.                                 !
+!                                                                         !
+!    Information on NPB 3.3, including the technical report, the          !
+!    original specifications, source code, results and information        !
+!    on how to submit new results, is available at:                       !
+!                                                                         !
+!           http://www.nas.nasa.gov/Software/NPB/                         !
+!                                                                         !
+!    Send comments or suggestions to  npb@nas.nasa.gov                    !
+!                                                                         !
+!          NAS Parallel Benchmarks Group                                  !
+!          NASA Ames Research Center                                      !
+!          Mail Stop: T27A-1                                              !
+!          Moffett Field, CA   94035-1000                                 !
+!                                                                         !
+!          E-mail:  npb@nas.nasa.gov                                      !
+!          Fax:     (650) 604-3957                                        !
+!                                                                         !
+!-------------------------------------------------------------------------!
+
+c---------------------------------------------------------------------
+c
+c Authors: S. Weeratunga
+c          V. Venkatakrishnan
+c          E. Barszcz
+c          M. Yarrow
+c
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+      program applu
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c
+c   driver for the performance evaluation of the solver for
+c   five coupled parabolic/elliptic partial differential equations.
+c
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'applu.incl'
+      character class
+      logical verified
+      double precision mflops
+
+      double precision t, tmax, timer_read, trecs(t_last)
+      external timer_read
+      integer i, fstatus
+      character t_names(t_last)*8
+
+c---------------------------------------------------------------------
+c     Setup info for timers
+c---------------------------------------------------------------------
+
+      open (unit=2,file='timer.flag',status='old',iostat=fstatus)
+      if (fstatus .eq. 0) then
+         timeron = .true.
+         t_names(t_total) = 'total'
+         t_names(t_rhsx) = 'rhsx'
+         t_names(t_rhsy) = 'rhsy'
+         t_names(t_rhsz) = 'rhsz'
+         t_names(t_rhs) = 'rhs'
+         t_names(t_jacld) = 'jacld'
+         t_names(t_blts) = 'blts'
+         t_names(t_jacu) = 'jacu'
+         t_names(t_buts) = 'buts'
+         t_names(t_add) = 'add'
+         t_names(t_l2norm) = 'l2norm'
+         close(2)
+      else
+         timeron = .false.
+      endif
+
+c---------------------------------------------------------------------
+c   read input data
+c---------------------------------------------------------------------
+      call read_input()
+
+c---------------------------------------------------------------------
+c   set up domain sizes
+c---------------------------------------------------------------------
+      call domain()
+
+c---------------------------------------------------------------------
+c   set up coefficients
+c---------------------------------------------------------------------
+      call setcoeff()
+
+c---------------------------------------------------------------------
+c   set the boundary values for dependent variables
+c---------------------------------------------------------------------
+      call setbv()
+
+c---------------------------------------------------------------------
+c   set the initial values for dependent variables
+c---------------------------------------------------------------------
+      call setiv()
+
+c---------------------------------------------------------------------
+c   compute the forcing term based on prescribed exact solution
+c---------------------------------------------------------------------
+      call erhs()
+
+c---------------------------------------------------------------------
+c   perform one SSOR iteration to touch all pages
+c---------------------------------------------------------------------
+      call ssor(1)
+
+c---------------------------------------------------------------------
+c   reset the boundary and initial values
+c---------------------------------------------------------------------
+      call setbv()
+      call setiv()
+
+c---------------------------------------------------------------------
+c   perform the SSOR iterations
+c---------------------------------------------------------------------
+      call ssor(itmax)
+
+c---------------------------------------------------------------------
+c   compute the solution error
+c---------------------------------------------------------------------
+      call error()
+
+c---------------------------------------------------------------------
+c   compute the surface integral
+c---------------------------------------------------------------------
+      call pintgr()
+
+c---------------------------------------------------------------------
+c   verification test
+c---------------------------------------------------------------------
+      call verify ( rsdnm, errnm, frc, class, verified )
+      mflops = float(itmax)*(1984.77*float( nx0 )
+     >     *float( ny0 )
+     >     *float( nz0 )
+     >     -10923.3*(float( nx0+ny0+nz0 )/3.)**2 
+     >     +27770.9* float( nx0+ny0+nz0 )/3.
+     >     -144010.)
+     >     / (maxtime*1000000.)
+
+      call print_results('LU', class, nx0,
+     >  ny0, nz0, itmax,
+     >  maxtime, mflops, '          floating point', verified, 
+     >  npbversion, compiletime, cs1, cs2, cs3, cs4, cs5, cs6, 
+     >  '(none)')
+
+c---------------------------------------------------------------------
+c      More timers
+c---------------------------------------------------------------------
+      if (.not.timeron) goto 999
+
+      do i=1, t_last
+         trecs(i) = timer_read(i)
+      end do
+      tmax = maxtime
+      if ( tmax .eq. 0. ) tmax = 1.0
+
+      write(*,800)
+ 800  format('  SECTION     Time (secs)')
+      do i=1, t_last
+         write(*,810) t_names(i), trecs(i), trecs(i)*100./tmax
+         if (i.eq.t_rhs) then
+            t = trecs(t_rhsx) + trecs(t_rhsy) + trecs(t_rhsz)
+            write(*,820) 'sub-rhs', t, t*100./tmax
+            t = trecs(i) - t
+            write(*,820) 'rest-rhs', t, t*100./tmax
+         endif
+ 810     format(2x,a8,':',f9.3,'  (',f6.2,'%)')
+ 820     format(5x,'--> ',a8,':',f9.3,'  (',f6.2,'%)')
+      end do
+
+ 999  continue
+      end
+
+
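Note on the `mflops` expression in `lu.f` (not part of the patch): the operation count is a polynomial in the grid dimensions, scaled by `itmax` and divided by the elapsed time in microseconds. The Python sketch below restates that expression so it is easier to check by hand. The function name `lu_mflops` is hypothetical, and Python evaluates in double precision where the Fortran uses single-precision `float()`, so the trailing digits may differ.

```python
def lu_mflops(itmax, nx0, ny0, nz0, maxtime):
    """Restate the MFLOPS estimate from lu.f; maxtime is in seconds."""
    avg = (nx0 + ny0 + nz0) / 3.0
    flops = itmax * (
        1984.77 * nx0 * ny0 * nz0   # volume term
        - 10923.3 * avg ** 2        # quadratic term in the mean dimension
        + 27770.9 * avg             # linear term in the mean dimension
        - 144010.0                  # constant offset
    )
    return flops / (maxtime * 1.0e6)
```

As in the Fortran, there is no guard against `maxtime` being zero here; a caller would need to handle that case separately.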
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/pintgr.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/pintgr.f
new file mode 100644
index 0000000..2c59337
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/pintgr.f
@@ -0,0 +1,195 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine pintgr
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j, k
+      integer ibeg, ifin, ifin1
+      integer jbeg, jfin, jfin1
+      double precision  phi1(0:isiz2+1,0:isiz3+1),
+     >                  phi2(0:isiz2+1,0:isiz3+1)
+      double precision  frc1, frc2, frc3
+
+
+
+c---------------------------------------------------------------------
+c   set up the sub-domains for integration in each processor
+c---------------------------------------------------------------------
+      ibeg = ii1
+      ifin = ii2
+      jbeg = ji1
+      jfin = ji2
+      ifin1 = ifin - 1
+      jfin1 = jfin - 1
+
+c---------------------------------------------------------------------
+c   initialize
+c---------------------------------------------------------------------
+      do i = 0,isiz2+1
+        do k = 0,isiz3+1
+          phi1(i,k) = 0.
+          phi2(i,k) = 0.
+        end do
+      end do
+
+      do j = jbeg,jfin
+         do i = ibeg,ifin
+
+            k = ki1
+
+            phi1(i,j) = c2*(  u(5,i,j,k)
+     >           - 0.50d+00 * (  u(2,i,j,k) ** 2
+     >                         + u(3,i,j,k) ** 2
+     >                         + u(4,i,j,k) ** 2 )
+     >                        / u(1,i,j,k) )
+
+            k = ki2
+
+            phi2(i,j) = c2*(  u(5,i,j,k)
+     >           - 0.50d+00 * (  u(2,i,j,k) ** 2
+     >                         + u(3,i,j,k) ** 2
+     >                         + u(4,i,j,k) ** 2 )
+     >                        / u(1,i,j,k) )
+         end do
+      end do
+
+
+      frc1 = 0.0d+00
+
+      do j = jbeg,jfin1
+         do i = ibeg, ifin1
+            frc1 = frc1 + (  phi1(i,j)
+     >                     + phi1(i+1,j)
+     >                     + phi1(i,j+1)
+     >                     + phi1(i+1,j+1)
+     >                     + phi2(i,j)
+     >                     + phi2(i+1,j)
+     >                     + phi2(i,j+1)
+     >                     + phi2(i+1,j+1) )
+         end do
+      end do
+
+
+      frc1 = dxi * deta * frc1
+
+c---------------------------------------------------------------------
+c   initialize
+c---------------------------------------------------------------------
+      do i = 0,isiz2+1
+        do k = 0,isiz3+1
+          phi1(i,k) = 0.
+          phi2(i,k) = 0.
+        end do
+      end do
+      if (jbeg.eq.ji1) then
+        do k = ki1, ki2
+           do i = ibeg, ifin
+              phi1(i,k) = c2*(  u(5,i,jbeg,k)
+     >             - 0.50d+00 * (  u(2,i,jbeg,k) ** 2
+     >                           + u(3,i,jbeg,k) ** 2
+     >                           + u(4,i,jbeg,k) ** 2 )
+     >                          / u(1,i,jbeg,k) )
+           end do
+        end do
+      end if
+
+      if (jfin.eq.ji2) then
+        do k = ki1, ki2
+           do i = ibeg, ifin
+              phi2(i,k) = c2*(  u(5,i,jfin,k)
+     >             - 0.50d+00 * (  u(2,i,jfin,k) ** 2
+     >                           + u(3,i,jfin,k) ** 2
+     >                           + u(4,i,jfin,k) ** 2 )
+     >                          / u(1,i,jfin,k) )
+           end do
+        end do
+      end if
+
+
+      frc2 = 0.0d+00
+      do k = ki1, ki2-1
+         do i = ibeg, ifin1
+            frc2 = frc2 + (  phi1(i,k)
+     >                     + phi1(i+1,k)
+     >                     + phi1(i,k+1)
+     >                     + phi1(i+1,k+1)
+     >                     + phi2(i,k)
+     >                     + phi2(i+1,k)
+     >                     + phi2(i,k+1)
+     >                     + phi2(i+1,k+1) )
+         end do
+      end do
+
+
+      frc2 = dxi * dzeta * frc2
+
+c---------------------------------------------------------------------
+c   initialize
+c---------------------------------------------------------------------
+      do i = 0,isiz2+1
+        do k = 0,isiz3+1
+          phi1(i,k) = 0.
+          phi2(i,k) = 0.
+        end do
+      end do
+      if (ibeg.eq.ii1) then
+        do k = ki1, ki2
+           do j = jbeg, jfin
+              phi1(j,k) = c2*(  u(5,ibeg,j,k)
+     >             - 0.50d+00 * (  u(2,ibeg,j,k) ** 2
+     >                           + u(3,ibeg,j,k) ** 2
+     >                           + u(4,ibeg,j,k) ** 2 )
+     >                          / u(1,ibeg,j,k) )
+           end do
+        end do
+      end if
+
+      if (ifin.eq.ii2) then
+        do k = ki1, ki2
+           do j = jbeg, jfin
+              phi2(j,k) = c2*(  u(5,ifin,j,k)
+     >             - 0.50d+00 * (  u(2,ifin,j,k) ** 2
+     >                           + u(3,ifin,j,k) ** 2
+     >                           + u(4,ifin,j,k) ** 2 )
+     >                          / u(1,ifin,j,k) )
+           end do
+        end do
+      end if
+
+
+      frc3 = 0.0d+00
+
+      do k = ki1, ki2-1
+         do j = jbeg, jfin1
+            frc3 = frc3 + (  phi1(j,k)
+     >                     + phi1(j+1,k)
+     >                     + phi1(j,k+1)
+     >                     + phi1(j+1,k+1)
+     >                     + phi2(j,k)
+     >                     + phi2(j+1,k)
+     >                     + phi2(j,k+1)
+     >                     + phi2(j+1,k+1) )
+         end do
+      end do
+
+
+      frc3 = deta * dzeta * frc3
+      frc = 0.25d+00 * ( frc1 + frc2 + frc3 )
+c      write (*,1001) frc
+
+      return
+
+c 1001 format (//5x,'surface integral = ',1pe12.5//)
+
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/read_input.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/read_input.f
new file mode 100644
index 0000000..312fccd
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/read_input.f
@@ -0,0 +1,114 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine read_input
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'applu.incl'
+
+      integer  fstatus
+
+
+c---------------------------------------------------------------------
+c    if input file does not exist, it uses defaults
+c       ipr = 1 for detailed progress output
+c       inorm = how often the norm is printed (once every inorm iterations)
+c       itmax = number of pseudo time steps
+c       dt = time step
+c       omega = over-relaxation factor for SSOR
+c       tolrsd = steady state residual tolerance levels
+c       nx, ny, nz = number of grid points in x, y, z directions
+c---------------------------------------------------------------------
+
+         write(*, 1000)
+
+         open (unit=3,file='inputlu.data',status='old',
+     >         access='sequential',form='formatted', iostat=fstatus)
+         if (fstatus .eq. 0) then
+
+            write(*, *) 'Reading from input file inputlu.data'
+c           each value line follows two header lines, skipped below
+            read (3,*)
+            read (3,*)
+            read (3,*) ipr, inorm
+            read (3,*)
+            read (3,*)
+            read (3,*) itmax
+            read (3,*)
+            read (3,*)
+            read (3,*) dt
+            read (3,*)
+            read (3,*)
+            read (3,*) omega
+            read (3,*)
+            read (3,*)
+            read (3,*) tolrsd(1),tolrsd(2),tolrsd(3),tolrsd(4),tolrsd(5)
+            read (3,*)
+            read (3,*)
+            read (3,*) nx0, ny0, nz0
+            close(3)
+         else
+            ipr = ipr_default
+            inorm = inorm_default
+            itmax = itmax_default
+            dt = dt_default
+            omega = omega_default
+            tolrsd(1) = tolrsd1_def
+            tolrsd(2) = tolrsd2_def
+            tolrsd(3) = tolrsd3_def
+            tolrsd(4) = tolrsd4_def
+            tolrsd(5) = tolrsd5_def
+            nx0 = isiz1
+            ny0 = isiz2
+            nz0 = isiz3
+         endif
+
+c---------------------------------------------------------------------
+c   check problem size
+c---------------------------------------------------------------------
+
+         if ( ( nx0 .lt. 5 ) .or.
+     >        ( ny0 .lt. 5 ) .or.
+     >        ( nz0 .lt. 5 ) ) then
+
+            write (*,2001)
+ 2001       format (5x,'PROBLEM SIZE IS TOO SMALL - ',
+     >           /5x,'SET EACH OF NX, NY AND NZ AT LEAST EQUAL TO 5')
+            stop
+
+         end if
+
+         if ( ( nx0 .gt. isiz1 ) .or.
+     >        ( ny0 .gt. isiz2 ) .or.
+     >        ( nz0 .gt. isiz3 ) ) then
+
+            write (*,2002)
+ 2002       format (5x,'PROBLEM SIZE IS TOO LARGE - ',
+     >           /5x,'NX, NY AND NZ SHOULD NOT EXCEED ',
+     >           /5x,'ISIZ1, ISIZ2 AND ISIZ3 RESPECTIVELY')
+            stop
+
+         end if
+
+
+         write(*, 1001) nx0, ny0, nz0
+         write(*, 1002) itmax
+         write(*, *)
+
+
+ 1000 format(//,' NAS Parallel Benchmarks (NPB3.3-SER)',
+     >          ' - LU Benchmark', /)
+ 1001    format(' Size: ', i4, 'x', i4, 'x', i4)
+ 1002    format(' Iterations: ', i4)
+
+
+
+      return
+      end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/rhs.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/rhs.f
new file mode 100644
index 0000000..ebb91a4
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/rhs.f
@@ -0,0 +1,434 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine rhs
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   compute the right hand sides
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j, k, m
+      double precision  q
+      double precision  tmp, utmp(6,isiz3), rtmp(5,isiz3)
+      double precision  u21, u31, u41
+      double precision  u21i, u31i, u41i, u51i
+      double precision  u21j, u31j, u41j, u51j
+      double precision  u21k, u31k, u41k, u51k
+      double precision  u21im1, u31im1, u41im1, u51im1
+      double precision  u21jm1, u31jm1, u41jm1, u51jm1
+      double precision  u21km1, u31km1, u41km1, u51km1
+
+
+
+      if (timeron) call timer_start(t_rhs)
+      do k = 1, nz
+         do j = 1, ny
+            do i = 1, nx
+               do m = 1, 5
+                  rsd(m,i,j,k) = - frct(m,i,j,k)
+               end do
+               tmp = 1.0d+00 / u(1,i,j,k)
+               rho_i(i,j,k) = tmp
+               qs(i,j,k) = 0.50d+00 * (  u(2,i,j,k) * u(2,i,j,k)
+     >                         + u(3,i,j,k) * u(3,i,j,k)
+     >                         + u(4,i,j,k) * u(4,i,j,k) )
+     >                      * tmp
+            end do
+         end do
+      end do
+
+      if (timeron) call timer_start(t_rhsx)
+c---------------------------------------------------------------------
+c   xi-direction flux differences
+c---------------------------------------------------------------------
+
+      do k = 2, nz - 1
+         do j = jst, jend
+            do i = 1, nx
+               flux(1,i) = u(2,i,j,k)
+               u21 = u(2,i,j,k) * rho_i(i,j,k)
+
+               q = qs(i,j,k)
+
+               flux(2,i) = u(2,i,j,k) * u21 + c2 * 
+     >                        ( u(5,i,j,k) - q )
+               flux(3,i) = u(3,i,j,k) * u21
+               flux(4,i) = u(4,i,j,k) * u21
+               flux(5,i) = ( c1 * u(5,i,j,k) - c2 * q ) * u21
+            end do
+
+            do i = ist, iend
+               do m = 1, 5
+                  rsd(m,i,j,k) =  rsd(m,i,j,k)
+     >                 - tx2 * ( flux(m,i+1) - flux(m,i-1) )
+               end do
+            end do
+
+            do i = ist, nx
+               tmp = rho_i(i,j,k)
+
+               u21i = tmp * u(2,i,j,k)
+               u31i = tmp * u(3,i,j,k)
+               u41i = tmp * u(4,i,j,k)
+               u51i = tmp * u(5,i,j,k)
+
+               tmp = rho_i(i-1,j,k)
+
+               u21im1 = tmp * u(2,i-1,j,k)
+               u31im1 = tmp * u(3,i-1,j,k)
+               u41im1 = tmp * u(4,i-1,j,k)
+               u51im1 = tmp * u(5,i-1,j,k)
+
+               flux(2,i) = (4.0d+00/3.0d+00) * tx3 * (u21i-u21im1)
+               flux(3,i) = tx3 * ( u31i - u31im1 )
+               flux(4,i) = tx3 * ( u41i - u41im1 )
+               flux(5,i) = 0.50d+00 * ( 1.0d+00 - c1*c5 )
+     >              * tx3 * ( ( u21i  **2 + u31i  **2 + u41i  **2 )
+     >                      - ( u21im1**2 + u31im1**2 + u41im1**2 ) )
+     >              + (1.0d+00/6.0d+00)
+     >              * tx3 * ( u21i**2 - u21im1**2 )
+     >              + c1 * c5 * tx3 * ( u51i - u51im1 )
+            end do
+
+            do i = ist, iend
+               rsd(1,i,j,k) = rsd(1,i,j,k)
+     >              + dx1 * tx1 * (            u(1,i-1,j,k)
+     >                             - 2.0d+00 * u(1,i,j,k)
+     >                             +           u(1,i+1,j,k) )
+               rsd(2,i,j,k) = rsd(2,i,j,k)
+     >          + tx3 * c3 * c4 * ( flux(2,i+1) - flux(2,i) )
+     >              + dx2 * tx1 * (            u(2,i-1,j,k)
+     >                             - 2.0d+00 * u(2,i,j,k)
+     >                             +           u(2,i+1,j,k) )
+               rsd(3,i,j,k) = rsd(3,i,j,k)
+     >          + tx3 * c3 * c4 * ( flux(3,i+1) - flux(3,i) )
+     >              + dx3 * tx1 * (            u(3,i-1,j,k)
+     >                             - 2.0d+00 * u(3,i,j,k)
+     >                             +           u(3,i+1,j,k) )
+               rsd(4,i,j,k) = rsd(4,i,j,k)
+     >          + tx3 * c3 * c4 * ( flux(4,i+1) - flux(4,i) )
+     >              + dx4 * tx1 * (            u(4,i-1,j,k)
+     >                             - 2.0d+00 * u(4,i,j,k)
+     >                             +           u(4,i+1,j,k) )
+               rsd(5,i,j,k) = rsd(5,i,j,k)
+     >          + tx3 * c3 * c4 * ( flux(5,i+1) - flux(5,i) )
+     >              + dx5 * tx1 * (            u(5,i-1,j,k)
+     >                             - 2.0d+00 * u(5,i,j,k)
+     >                             +           u(5,i+1,j,k) )
+            end do
+
+c---------------------------------------------------------------------
+c   Fourth-order dissipation
+c---------------------------------------------------------------------
+            do m = 1, 5
+               rsd(m,2,j,k) = rsd(m,2,j,k)
+     >           - dssp * ( + 5.0d+00 * u(m,2,j,k)
+     >                      - 4.0d+00 * u(m,3,j,k)
+     >                      +           u(m,4,j,k) )
+               rsd(m,3,j,k) = rsd(m,3,j,k)
+     >           - dssp * ( - 4.0d+00 * u(m,2,j,k)
+     >                      + 6.0d+00 * u(m,3,j,k)
+     >                      - 4.0d+00 * u(m,4,j,k)
+     >                      +           u(m,5,j,k) )
+            end do
+
+            do i = 4, nx - 3
+               do m = 1, 5
+                  rsd(m,i,j,k) = rsd(m,i,j,k)
+     >              - dssp * (            u(m,i-2,j,k)
+     >                        - 4.0d+00 * u(m,i-1,j,k)
+     >                        + 6.0d+00 * u(m,i,j,k)
+     >                        - 4.0d+00 * u(m,i+1,j,k)
+     >                        +           u(m,i+2,j,k) )
+               end do
+            end do
+
+
+            do m = 1, 5
+               rsd(m,nx-2,j,k) = rsd(m,nx-2,j,k)
+     >           - dssp * (             u(m,nx-4,j,k)
+     >                      - 4.0d+00 * u(m,nx-3,j,k)
+     >                      + 6.0d+00 * u(m,nx-2,j,k)
+     >                      - 4.0d+00 * u(m,nx-1,j,k)  )
+               rsd(m,nx-1,j,k) = rsd(m,nx-1,j,k)
+     >           - dssp * (             u(m,nx-3,j,k)
+     >                      - 4.0d+00 * u(m,nx-2,j,k)
+     >                      + 5.0d+00 * u(m,nx-1,j,k) )
+            end do
+
+         end do
+      end do
+      if (timeron) call timer_stop(t_rhsx)
+
+      if (timeron) call timer_start(t_rhsy)
+c---------------------------------------------------------------------
+c   eta-direction flux differences
+c---------------------------------------------------------------------
+      do k = 2, nz - 1
+         do i = ist, iend
+            do j = 1, ny
+               flux(1,j) = u(3,i,j,k)
+               u31 = u(3,i,j,k) * rho_i(i,j,k)
+
+               q = qs(i,j,k)
+
+               flux(2,j) = u(2,i,j,k) * u31 
+               flux(3,j) = u(3,i,j,k) * u31 + c2 * (u(5,i,j,k)-q)
+               flux(4,j) = u(4,i,j,k) * u31
+               flux(5,j) = ( c1 * u(5,i,j,k) - c2 * q ) * u31
+            end do
+
+            do j = jst, jend
+               do m = 1, 5
+                  rsd(m,i,j,k) =  rsd(m,i,j,k)
+     >                   - ty2 * ( flux(m,j+1) - flux(m,j-1) )
+               end do
+            end do
+
+            do j = jst, ny
+               tmp = rho_i(i,j,k)
+
+               u21j = tmp * u(2,i,j,k)
+               u31j = tmp * u(3,i,j,k)
+               u41j = tmp * u(4,i,j,k)
+               u51j = tmp * u(5,i,j,k)
+
+               tmp = rho_i(i,j-1,k)
+               u21jm1 = tmp * u(2,i,j-1,k)
+               u31jm1 = tmp * u(3,i,j-1,k)
+               u41jm1 = tmp * u(4,i,j-1,k)
+               u51jm1 = tmp * u(5,i,j-1,k)
+
+               flux(2,j) = ty3 * ( u21j - u21jm1 )
+               flux(3,j) = (4.0d+00/3.0d+00) * ty3 * (u31j-u31jm1)
+               flux(4,j) = ty3 * ( u41j - u41jm1 )
+               flux(5,j) = 0.50d+00 * ( 1.0d+00 - c1*c5 )
+     >              * ty3 * ( ( u21j  **2 + u31j  **2 + u41j  **2 )
+     >                      - ( u21jm1**2 + u31jm1**2 + u41jm1**2 ) )
+     >              + (1.0d+00/6.0d+00)
+     >              * ty3 * ( u31j**2 - u31jm1**2 )
+     >              + c1 * c5 * ty3 * ( u51j - u51jm1 )
+            end do
+
+            do j = jst, jend
+
+               rsd(1,i,j,k) = rsd(1,i,j,k)
+     >              + dy1 * ty1 * (            u(1,i,j-1,k)
+     >                             - 2.0d+00 * u(1,i,j,k)
+     >                             +           u(1,i,j+1,k) )
+
+               rsd(2,i,j,k) = rsd(2,i,j,k)
+     >          + ty3 * c3 * c4 * ( flux(2,j+1) - flux(2,j) )
+     >              + dy2 * ty1 * (            u(2,i,j-1,k)
+     >                             - 2.0d+00 * u(2,i,j,k)
+     >                             +           u(2,i,j+1,k) )
+
+               rsd(3,i,j,k) = rsd(3,i,j,k)
+     >          + ty3 * c3 * c4 * ( flux(3,j+1) - flux(3,j) )
+     >              + dy3 * ty1 * (            u(3,i,j-1,k)
+     >                             - 2.0d+00 * u(3,i,j,k)
+     >                             +           u(3,i,j+1,k) )
+
+               rsd(4,i,j,k) = rsd(4,i,j,k)
+     >          + ty3 * c3 * c4 * ( flux(4,j+1) - flux(4,j) )
+     >              + dy4 * ty1 * (            u(4,i,j-1,k)
+     >                             - 2.0d+00 * u(4,i,j,k)
+     >                             +           u(4,i,j+1,k) )
+
+               rsd(5,i,j,k) = rsd(5,i,j,k)
+     >          + ty3 * c3 * c4 * ( flux(5,j+1) - flux(5,j) )
+     >              + dy5 * ty1 * (            u(5,i,j-1,k)
+     >                             - 2.0d+00 * u(5,i,j,k)
+     >                             +           u(5,i,j+1,k) )
+
+            end do
+         end do
+
+c---------------------------------------------------------------------
+c   fourth-order dissipation
+c---------------------------------------------------------------------
+         do i = ist, iend
+            do m = 1, 5
+               rsd(m,i,2,k) = rsd(m,i,2,k)
+     >           - dssp * ( + 5.0d+00 * u(m,i,2,k)
+     >                      - 4.0d+00 * u(m,i,3,k)
+     >                      +           u(m,i,4,k) )
+               rsd(m,i,3,k) = rsd(m,i,3,k)
+     >           - dssp * ( - 4.0d+00 * u(m,i,2,k)
+     >                      + 6.0d+00 * u(m,i,3,k)
+     >                      - 4.0d+00 * u(m,i,4,k)
+     >                      +           u(m,i,5,k) )
+            end do
+         end do
+
+         do j = 4, ny - 3
+            do i = ist, iend
+               do m = 1, 5
+                  rsd(m,i,j,k) = rsd(m,i,j,k)
+     >              - dssp * (            u(m,i,j-2,k)
+     >                        - 4.0d+00 * u(m,i,j-1,k)
+     >                        + 6.0d+00 * u(m,i,j,k)
+     >                        - 4.0d+00 * u(m,i,j+1,k)
+     >                        +           u(m,i,j+2,k) )
+               end do
+            end do
+         end do
+
+         do i = ist, iend
+            do m = 1, 5
+               rsd(m,i,ny-2,k) = rsd(m,i,ny-2,k)
+     >           - dssp * (             u(m,i,ny-4,k)
+     >                      - 4.0d+00 * u(m,i,ny-3,k)
+     >                      + 6.0d+00 * u(m,i,ny-2,k)
+     >                      - 4.0d+00 * u(m,i,ny-1,k)  )
+               rsd(m,i,ny-1,k) = rsd(m,i,ny-1,k)
+     >           - dssp * (             u(m,i,ny-3,k)
+     >                      - 4.0d+00 * u(m,i,ny-2,k)
+     >                      + 5.0d+00 * u(m,i,ny-1,k) )
+            end do
+         end do
+
+      end do
+      if (timeron) call timer_stop(t_rhsy)
+
+      if (timeron) call timer_start(t_rhsz)
+c---------------------------------------------------------------------
+c   zeta-direction flux differences
+c---------------------------------------------------------------------
+      do j = jst, jend
+         do i = ist, iend
+            do k = 1, nz
+               utmp(1,k) = u(1,i,j,k)
+               utmp(2,k) = u(2,i,j,k)
+               utmp(3,k) = u(3,i,j,k)
+               utmp(4,k) = u(4,i,j,k)
+               utmp(5,k) = u(5,i,j,k)
+               utmp(6,k) = rho_i(i,j,k)
+            end do
+            do k = 1, nz
+               flux(1,k) = utmp(4,k)
+               u41 = utmp(4,k) * utmp(6,k)
+
+               q = qs(i,j,k)
+
+               flux(2,k) = utmp(2,k) * u41 
+               flux(3,k) = utmp(3,k) * u41 
+               flux(4,k) = utmp(4,k) * u41 + c2 * (utmp(5,k)-q)
+               flux(5,k) = ( c1 * utmp(5,k) - c2 * q ) * u41
+            end do
+
+            do k = 2, nz - 1
+               do m = 1, 5
+                  rtmp(m,k) =  rsd(m,i,j,k)
+     >                - tz2 * ( flux(m,k+1) - flux(m,k-1) )
+               end do
+            end do
+
+            do k = 2, nz
+               tmp = utmp(6,k)
+
+               u21k = tmp * utmp(2,k)
+               u31k = tmp * utmp(3,k)
+               u41k = tmp * utmp(4,k)
+               u51k = tmp * utmp(5,k)
+
+               tmp = utmp(6,k-1)
+
+               u21km1 = tmp * utmp(2,k-1)
+               u31km1 = tmp * utmp(3,k-1)
+               u41km1 = tmp * utmp(4,k-1)
+               u51km1 = tmp * utmp(5,k-1)
+
+               flux(2,k) = tz3 * ( u21k - u21km1 )
+               flux(3,k) = tz3 * ( u31k - u31km1 )
+               flux(4,k) = (4.0d+00/3.0d+00) * tz3 * (u41k-u41km1)
+               flux(5,k) = 0.50d+00 * ( 1.0d+00 - c1*c5 )
+     >              * tz3 * ( ( u21k  **2 + u31k  **2 + u41k  **2 )
+     >                      - ( u21km1**2 + u31km1**2 + u41km1**2 ) )
+     >              + (1.0d+00/6.0d+00)
+     >              * tz3 * ( u41k**2 - u41km1**2 )
+     >              + c1 * c5 * tz3 * ( u51k - u51km1 )
+            end do
+
+            do k = 2, nz - 1
+               rtmp(1,k) = rtmp(1,k)
+     >              + dz1 * tz1 * (            utmp(1,k-1)
+     >                             - 2.0d+00 * utmp(1,k)
+     >                             +           utmp(1,k+1) )
+               rtmp(2,k) = rtmp(2,k)
+     >          + tz3 * c3 * c4 * ( flux(2,k+1) - flux(2,k) )
+     >              + dz2 * tz1 * (            utmp(2,k-1)
+     >                             - 2.0d+00 * utmp(2,k)
+     >                             +           utmp(2,k+1) )
+               rtmp(3,k) = rtmp(3,k)
+     >          + tz3 * c3 * c4 * ( flux(3,k+1) - flux(3,k) )
+     >              + dz3 * tz1 * (            utmp(3,k-1)
+     >                             - 2.0d+00 * utmp(3,k)
+     >                             +           utmp(3,k+1) )
+               rtmp(4,k) = rtmp(4,k)
+     >          + tz3 * c3 * c4 * ( flux(4,k+1) - flux(4,k) )
+     >              + dz4 * tz1 * (            utmp(4,k-1)
+     >                             - 2.0d+00 * utmp(4,k)
+     >                             +           utmp(4,k+1) )
+               rtmp(5,k) = rtmp(5,k)
+     >          + tz3 * c3 * c4 * ( flux(5,k+1) - flux(5,k) )
+     >              + dz5 * tz1 * (            utmp(5,k-1)
+     >                             - 2.0d+00 * utmp(5,k)
+     >                             +           utmp(5,k+1) )
+            end do
+
+c---------------------------------------------------------------------
+c   fourth-order dissipation
+c---------------------------------------------------------------------
+            do m = 1, 5
+               rsd(m,i,j,2) = rtmp(m,2)
+     >           - dssp * ( + 5.0d+00 * utmp(m,2)
+     >                      - 4.0d+00 * utmp(m,3)
+     >                      +           utmp(m,4) )
+               rsd(m,i,j,3) = rtmp(m,3)
+     >           - dssp * ( - 4.0d+00 * utmp(m,2)
+     >                      + 6.0d+00 * utmp(m,3)
+     >                      - 4.0d+00 * utmp(m,4)
+     >                      +           utmp(m,5) )
+            end do
+
+            do k = 4, nz - 3
+               do m = 1, 5
+                  rsd(m,i,j,k) = rtmp(m,k)
+     >              - dssp * (            utmp(m,k-2)
+     >                        - 4.0d+00 * utmp(m,k-1)
+     >                        + 6.0d+00 * utmp(m,k)
+     >                        - 4.0d+00 * utmp(m,k+1)
+     >                        +           utmp(m,k+2) )
+               end do
+            end do
+
+            do m = 1, 5
+               rsd(m,i,j,nz-2) = rtmp(m,nz-2)
+     >           - dssp * (             utmp(m,nz-4)
+     >                      - 4.0d+00 * utmp(m,nz-3)
+     >                      + 6.0d+00 * utmp(m,nz-2)
+     >                      - 4.0d+00 * utmp(m,nz-1)  )
+               rsd(m,i,j,nz-1) = rtmp(m,nz-1)
+     >           - dssp * (             utmp(m,nz-3)
+     >                      - 4.0d+00 * utmp(m,nz-2)
+     >                      + 5.0d+00 * utmp(m,nz-1) )
+            end do
+         end do
+      end do
+      if (timeron) call timer_stop(t_rhsz)
+      if (timeron) call timer_stop(t_rhs)
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/rhs_vec.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/rhs_vec.f
new file mode 100644
index 0000000..fb6d12a
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/rhs_vec.f
@@ -0,0 +1,438 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine rhs
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   compute the right hand sides
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j, k, m
+      double precision  q
+      double precision  tmp
+      double precision  u21, u31, u41
+      double precision  u21i, u31i, u41i, u51i
+      double precision  u21j, u31j, u41j, u51j
+      double precision  u21k, u31k, u41k, u51k
+      double precision  u21im1, u31im1, u41im1, u51im1
+      double precision  u21jm1, u31jm1, u41jm1, u51jm1
+      double precision  u21km1, u31km1, u41km1, u51km1
+
+
+
+      if (timeron) call timer_start(t_rhs)
+      do k = 1, nz
+         do j = 1, ny
+            do i = 1, nx
+               do m = 1, 5
+                  rsd(m,i,j,k) = - frct(m,i,j,k)
+               end do
+               tmp = 1.0d+00 / u(1,i,j,k)
+               rho_i(i,j,k) = tmp
+               qs(i,j,k) = 0.50d+00 * (  u(2,i,j,k) * u(2,i,j,k)
+     >                         + u(3,i,j,k) * u(3,i,j,k)
+     >                         + u(4,i,j,k) * u(4,i,j,k) )
+     >                      * tmp
+            end do
+         end do
+      end do
+
+      if (timeron) call timer_start(t_rhsx)
+c---------------------------------------------------------------------
+c   xi-direction flux differences
+c---------------------------------------------------------------------
+
+      do k = 2, nz - 1
+         do j = jst, jend
+            do i = 1, nx
+               flux(1,i) = u(2,i,j,k)
+               u21 = u(2,i,j,k) * rho_i(i,j,k)
+
+               q = qs(i,j,k)
+
+               flux(2,i) = u(2,i,j,k) * u21 + c2 * 
+     >                        ( u(5,i,j,k) - q )
+               flux(3,i) = u(3,i,j,k) * u21
+               flux(4,i) = u(4,i,j,k) * u21
+               flux(5,i) = ( c1 * u(5,i,j,k) - c2 * q ) * u21
+            end do
+
+            do i = ist, iend
+               do m = 1, 5
+                  rsd(m,i,j,k) =  rsd(m,i,j,k)
+     >                 - tx2 * ( flux(m,i+1) - flux(m,i-1) )
+               end do
+            end do
+
+            do i = ist, nx
+               tmp = rho_i(i,j,k)
+
+               u21i = tmp * u(2,i,j,k)
+               u31i = tmp * u(3,i,j,k)
+               u41i = tmp * u(4,i,j,k)
+               u51i = tmp * u(5,i,j,k)
+
+               tmp = rho_i(i-1,j,k)
+
+               u21im1 = tmp * u(2,i-1,j,k)
+               u31im1 = tmp * u(3,i-1,j,k)
+               u41im1 = tmp * u(4,i-1,j,k)
+               u51im1 = tmp * u(5,i-1,j,k)
+
+               flux(2,i) = (4.0d+00/3.0d+00) * tx3 * (u21i-u21im1)
+               flux(3,i) = tx3 * ( u31i - u31im1 )
+               flux(4,i) = tx3 * ( u41i - u41im1 )
+               flux(5,i) = 0.50d+00 * ( 1.0d+00 - c1*c5 )
+     >              * tx3 * ( ( u21i  **2 + u31i  **2 + u41i  **2 )
+     >                      - ( u21im1**2 + u31im1**2 + u41im1**2 ) )
+     >              + (1.0d+00/6.0d+00)
+     >              * tx3 * ( u21i**2 - u21im1**2 )
+     >              + c1 * c5 * tx3 * ( u51i - u51im1 )
+            end do
+
+            do i = ist, iend
+               rsd(1,i,j,k) = rsd(1,i,j,k)
+     >              + dx1 * tx1 * (            u(1,i-1,j,k)
+     >                             - 2.0d+00 * u(1,i,j,k)
+     >                             +           u(1,i+1,j,k) )
+               rsd(2,i,j,k) = rsd(2,i,j,k)
+     >          + tx3 * c3 * c4 * ( flux(2,i+1) - flux(2,i) )
+     >              + dx2 * tx1 * (            u(2,i-1,j,k)
+     >                             - 2.0d+00 * u(2,i,j,k)
+     >                             +           u(2,i+1,j,k) )
+               rsd(3,i,j,k) = rsd(3,i,j,k)
+     >          + tx3 * c3 * c4 * ( flux(3,i+1) - flux(3,i) )
+     >              + dx3 * tx1 * (            u(3,i-1,j,k)
+     >                             - 2.0d+00 * u(3,i,j,k)
+     >                             +           u(3,i+1,j,k) )
+               rsd(4,i,j,k) = rsd(4,i,j,k)
+     >          + tx3 * c3 * c4 * ( flux(4,i+1) - flux(4,i) )
+     >              + dx4 * tx1 * (            u(4,i-1,j,k)
+     >                             - 2.0d+00 * u(4,i,j,k)
+     >                             +           u(4,i+1,j,k) )
+               rsd(5,i,j,k) = rsd(5,i,j,k)
+     >          + tx3 * c3 * c4 * ( flux(5,i+1) - flux(5,i) )
+     >              + dx5 * tx1 * (            u(5,i-1,j,k)
+     >                             - 2.0d+00 * u(5,i,j,k)
+     >                             +           u(5,i+1,j,k) )
+            end do
+         end do
+
+c---------------------------------------------------------------------
+c   Fourth-order dissipation
+c---------------------------------------------------------------------
+         do j = jst, jend
+            do m = 1, 5
+               rsd(m,2,j,k) = rsd(m,2,j,k)
+     >           - dssp * ( + 5.0d+00 * u(m,2,j,k)
+     >                      - 4.0d+00 * u(m,3,j,k)
+     >                      +           u(m,4,j,k) )
+               rsd(m,3,j,k) = rsd(m,3,j,k)
+     >           - dssp * ( - 4.0d+00 * u(m,2,j,k)
+     >                      + 6.0d+00 * u(m,3,j,k)
+     >                      - 4.0d+00 * u(m,4,j,k)
+     >                      +           u(m,5,j,k) )
+            end do
+         end do
+
+         do j = jst, jend
+            do i = 4, nx - 3
+               do m = 1, 5
+                  rsd(m,i,j,k) = rsd(m,i,j,k)
+     >              - dssp * (            u(m,i-2,j,k)
+     >                        - 4.0d+00 * u(m,i-1,j,k)
+     >                        + 6.0d+00 * u(m,i,j,k)
+     >                        - 4.0d+00 * u(m,i+1,j,k)
+     >                        +           u(m,i+2,j,k) )
+               end do
+            end do
+         end do
+
+
+         do j = jst, jend
+            do m = 1, 5
+               rsd(m,nx-2,j,k) = rsd(m,nx-2,j,k)
+     >           - dssp * (             u(m,nx-4,j,k)
+     >                      - 4.0d+00 * u(m,nx-3,j,k)
+     >                      + 6.0d+00 * u(m,nx-2,j,k)
+     >                      - 4.0d+00 * u(m,nx-1,j,k)  )
+               rsd(m,nx-1,j,k) = rsd(m,nx-1,j,k)
+     >           - dssp * (             u(m,nx-3,j,k)
+     >                      - 4.0d+00 * u(m,nx-2,j,k)
+     >                      + 5.0d+00 * u(m,nx-1,j,k) )
+            end do
+         end do
+
+      end do
+      if (timeron) call timer_stop(t_rhsx)
+
+      if (timeron) call timer_start(t_rhsy)
+c---------------------------------------------------------------------
+c   eta-direction flux differences
+c---------------------------------------------------------------------
+      do k = 2, nz - 1
+         do i = ist, iend
+            do j = 1, ny
+               flux(1,j) = u(3,i,j,k)
+               u31 = u(3,i,j,k) * rho_i(i,j,k)
+
+               q = qs(i,j,k)
+
+               flux(2,j) = u(2,i,j,k) * u31 
+               flux(3,j) = u(3,i,j,k) * u31 + c2 * (u(5,i,j,k)-q)
+               flux(4,j) = u(4,i,j,k) * u31
+               flux(5,j) = ( c1 * u(5,i,j,k) - c2 * q ) * u31
+            end do
+
+            do j = jst, jend
+               do m = 1, 5
+                  rsd(m,i,j,k) =  rsd(m,i,j,k)
+     >                   - ty2 * ( flux(m,j+1) - flux(m,j-1) )
+               end do
+            end do
+
+            do j = jst, ny
+               tmp = rho_i(i,j,k)
+
+               u21j = tmp * u(2,i,j,k)
+               u31j = tmp * u(3,i,j,k)
+               u41j = tmp * u(4,i,j,k)
+               u51j = tmp * u(5,i,j,k)
+
+               tmp = rho_i(i,j-1,k)
+               u21jm1 = tmp * u(2,i,j-1,k)
+               u31jm1 = tmp * u(3,i,j-1,k)
+               u41jm1 = tmp * u(4,i,j-1,k)
+               u51jm1 = tmp * u(5,i,j-1,k)
+
+               flux(2,j) = ty3 * ( u21j - u21jm1 )
+               flux(3,j) = (4.0d+00/3.0d+00) * ty3 * (u31j-u31jm1)
+               flux(4,j) = ty3 * ( u41j - u41jm1 )
+               flux(5,j) = 0.50d+00 * ( 1.0d+00 - c1*c5 )
+     >              * ty3 * ( ( u21j  **2 + u31j  **2 + u41j  **2 )
+     >                      - ( u21jm1**2 + u31jm1**2 + u41jm1**2 ) )
+     >              + (1.0d+00/6.0d+00)
+     >              * ty3 * ( u31j**2 - u31jm1**2 )
+     >              + c1 * c5 * ty3 * ( u51j - u51jm1 )
+            end do
+
+            do j = jst, jend
+
+               rsd(1,i,j,k) = rsd(1,i,j,k)
+     >              + dy1 * ty1 * (            u(1,i,j-1,k)
+     >                             - 2.0d+00 * u(1,i,j,k)
+     >                             +           u(1,i,j+1,k) )
+
+               rsd(2,i,j,k) = rsd(2,i,j,k)
+     >          + ty3 * c3 * c4 * ( flux(2,j+1) - flux(2,j) )
+     >              + dy2 * ty1 * (            u(2,i,j-1,k)
+     >                             - 2.0d+00 * u(2,i,j,k)
+     >                             +           u(2,i,j+1,k) )
+
+               rsd(3,i,j,k) = rsd(3,i,j,k)
+     >          + ty3 * c3 * c4 * ( flux(3,j+1) - flux(3,j) )
+     >              + dy3 * ty1 * (            u(3,i,j-1,k)
+     >                             - 2.0d+00 * u(3,i,j,k)
+     >                             +           u(3,i,j+1,k) )
+
+               rsd(4,i,j,k) = rsd(4,i,j,k)
+     >          + ty3 * c3 * c4 * ( flux(4,j+1) - flux(4,j) )
+     >              + dy4 * ty1 * (            u(4,i,j-1,k)
+     >                             - 2.0d+00 * u(4,i,j,k)
+     >                             +           u(4,i,j+1,k) )
+
+               rsd(5,i,j,k) = rsd(5,i,j,k)
+     >          + ty3 * c3 * c4 * ( flux(5,j+1) - flux(5,j) )
+     >              + dy5 * ty1 * (            u(5,i,j-1,k)
+     >                             - 2.0d+00 * u(5,i,j,k)
+     >                             +           u(5,i,j+1,k) )
+
+            end do
+         end do
+
+c---------------------------------------------------------------------
+c   fourth-order dissipation
+c---------------------------------------------------------------------
+         do i = ist, iend
+            do m = 1, 5
+               rsd(m,i,2,k) = rsd(m,i,2,k)
+     >           - dssp * ( + 5.0d+00 * u(m,i,2,k)
+     >                      - 4.0d+00 * u(m,i,3,k)
+     >                      +           u(m,i,4,k) )
+               rsd(m,i,3,k) = rsd(m,i,3,k)
+     >           - dssp * ( - 4.0d+00 * u(m,i,2,k)
+     >                      + 6.0d+00 * u(m,i,3,k)
+     >                      - 4.0d+00 * u(m,i,4,k)
+     >                      +           u(m,i,5,k) )
+            end do
+         end do
+
+         do j = 4, ny - 3
+            do i = ist, iend
+               do m = 1, 5
+                  rsd(m,i,j,k) = rsd(m,i,j,k)
+     >              - dssp * (            u(m,i,j-2,k)
+     >                        - 4.0d+00 * u(m,i,j-1,k)
+     >                        + 6.0d+00 * u(m,i,j,k)
+     >                        - 4.0d+00 * u(m,i,j+1,k)
+     >                        +           u(m,i,j+2,k) )
+               end do
+            end do
+         end do
+
+         do i = ist, iend
+            do m = 1, 5
+               rsd(m,i,ny-2,k) = rsd(m,i,ny-2,k)
+     >           - dssp * (             u(m,i,ny-4,k)
+     >                      - 4.0d+00 * u(m,i,ny-3,k)
+     >                      + 6.0d+00 * u(m,i,ny-2,k)
+     >                      - 4.0d+00 * u(m,i,ny-1,k)  )
+               rsd(m,i,ny-1,k) = rsd(m,i,ny-1,k)
+     >           - dssp * (             u(m,i,ny-3,k)
+     >                      - 4.0d+00 * u(m,i,ny-2,k)
+     >                      + 5.0d+00 * u(m,i,ny-1,k) )
+            end do
+         end do
+
+      end do
+      if (timeron) call timer_stop(t_rhsy)
+
+      if (timeron) call timer_start(t_rhsz)
+c---------------------------------------------------------------------
+c   zeta-direction flux differences
+c---------------------------------------------------------------------
+      do j = jst, jend
+         do i = ist, iend
+            do k = 1, nz
+               flux(1,k) = u(4,i,j,k)
+               u41 = u(4,i,j,k) * rho_i(i,j,k)
+
+               q = qs(i,j,k)
+
+               flux(2,k) = u(2,i,j,k) * u41 
+               flux(3,k) = u(3,i,j,k) * u41 
+               flux(4,k) = u(4,i,j,k) * u41 + c2 * (u(5,i,j,k)-q)
+               flux(5,k) = ( c1 * u(5,i,j,k) - c2 * q ) * u41
+            end do
+
+            do k = 2, nz - 1
+               do m = 1, 5
+                  rsd(m,i,j,k) =  rsd(m,i,j,k)
+     >                - tz2 * ( flux(m,k+1) - flux(m,k-1) )
+               end do
+            end do
+
+            do k = 2, nz
+               tmp = rho_i(i,j,k)
+
+               u21k = tmp * u(2,i,j,k)
+               u31k = tmp * u(3,i,j,k)
+               u41k = tmp * u(4,i,j,k)
+               u51k = tmp * u(5,i,j,k)
+
+               tmp = rho_i(i,j,k-1)
+
+               u21km1 = tmp * u(2,i,j,k-1)
+               u31km1 = tmp * u(3,i,j,k-1)
+               u41km1 = tmp * u(4,i,j,k-1)
+               u51km1 = tmp * u(5,i,j,k-1)
+
+               flux(2,k) = tz3 * ( u21k - u21km1 )
+               flux(3,k) = tz3 * ( u31k - u31km1 )
+               flux(4,k) = (4.0d+00/3.0d+00) * tz3 * (u41k-u41km1)
+               flux(5,k) = 0.50d+00 * ( 1.0d+00 - c1*c5 )
+     >              * tz3 * ( ( u21k  **2 + u31k  **2 + u41k  **2 )
+     >                      - ( u21km1**2 + u31km1**2 + u41km1**2 ) )
+     >              + (1.0d+00/6.0d+00)
+     >              * tz3 * ( u41k**2 - u41km1**2 )
+     >              + c1 * c5 * tz3 * ( u51k - u51km1 )
+            end do
+
+            do k = 2, nz - 1
+               rsd(1,i,j,k) = rsd(1,i,j,k)
+     >              + dz1 * tz1 * (            u(1,i,j,k-1)
+     >                             - 2.0d+00 * u(1,i,j,k)
+     >                             +           u(1,i,j,k+1) )
+               rsd(2,i,j,k) = rsd(2,i,j,k)
+     >          + tz3 * c3 * c4 * ( flux(2,k+1) - flux(2,k) )
+     >              + dz2 * tz1 * (            u(2,i,j,k-1)
+     >                             - 2.0d+00 * u(2,i,j,k)
+     >                             +           u(2,i,j,k+1) )
+               rsd(3,i,j,k) = rsd(3,i,j,k)
+     >          + tz3 * c3 * c4 * ( flux(3,k+1) - flux(3,k) )
+     >              + dz3 * tz1 * (            u(3,i,j,k-1)
+     >                             - 2.0d+00 * u(3,i,j,k)
+     >                             +           u(3,i,j,k+1) )
+               rsd(4,i,j,k) = rsd(4,i,j,k)
+     >          + tz3 * c3 * c4 * ( flux(4,k+1) - flux(4,k) )
+     >              + dz4 * tz1 * (            u(4,i,j,k-1)
+     >                             - 2.0d+00 * u(4,i,j,k)
+     >                             +           u(4,i,j,k+1) )
+               rsd(5,i,j,k) = rsd(5,i,j,k)
+     >          + tz3 * c3 * c4 * ( flux(5,k+1) - flux(5,k) )
+     >              + dz5 * tz1 * (            u(5,i,j,k-1)
+     >                             - 2.0d+00 * u(5,i,j,k)
+     >                             +           u(5,i,j,k+1) )
+            end do
+         end do
+
+c---------------------------------------------------------------------
+c   fourth-order dissipation
+c---------------------------------------------------------------------
+         do i = ist, iend
+            do m = 1, 5
+               rsd(m,i,j,2) = rsd(m,i,j,2)
+     >           - dssp * ( + 5.0d+00 * u(m,i,j,2)
+     >                      - 4.0d+00 * u(m,i,j,3)
+     >                      +           u(m,i,j,4) )
+               rsd(m,i,j,3) = rsd(m,i,j,3)
+     >           - dssp * ( - 4.0d+00 * u(m,i,j,2)
+     >                      + 6.0d+00 * u(m,i,j,3)
+     >                      - 4.0d+00 * u(m,i,j,4)
+     >                      +           u(m,i,j,5) )
+            end do
+         end do
+
+         do k = 4, nz - 3
+            do i = ist, iend
+               do m = 1, 5
+                  rsd(m,i,j,k) = rsd(m,i,j,k)
+     >              - dssp * (            u(m,i,j,k-2)
+     >                        - 4.0d+00 * u(m,i,j,k-1)
+     >                        + 6.0d+00 * u(m,i,j,k)
+     >                        - 4.0d+00 * u(m,i,j,k+1)
+     >                        +           u(m,i,j,k+2) )
+               end do
+            end do
+         end do
+
+         do i = ist, iend
+            do m = 1, 5
+               rsd(m,i,j,nz-2) = rsd(m,i,j,nz-2)
+     >           - dssp * (             u(m,i,j,nz-4)
+     >                      - 4.0d+00 * u(m,i,j,nz-3)
+     >                      + 6.0d+00 * u(m,i,j,nz-2)
+     >                      - 4.0d+00 * u(m,i,j,nz-1)  )
+               rsd(m,i,j,nz-1) = rsd(m,i,j,nz-1)
+     >           - dssp * (             u(m,i,j,nz-3)
+     >                      - 4.0d+00 * u(m,i,j,nz-2)
+     >                      + 5.0d+00 * u(m,i,j,nz-1) )
+            end do
+         end do
+      end do
+      if (timeron) call timer_stop(t_rhsz)
+      if (timeron) call timer_stop(t_rhs)
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/setbv.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/setbv.f
new file mode 100644
index 0000000..54407d9
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/setbv.f
@@ -0,0 +1,67 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine setbv
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   set the boundary values of dependent variables
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c   local variables
+c---------------------------------------------------------------------
+      integer i, j, k, m
+      double precision temp1(5), temp2(5)
+
+c---------------------------------------------------------------------
+c   set the dependent variable values along the top and bottom faces
+c---------------------------------------------------------------------
+      do j = 1, ny
+         do i = 1, nx
+            call exact( i, j, 1, temp1 )
+            call exact( i, j, nz, temp2 )
+            do m = 1, 5
+               u( m, i, j, 1 ) = temp1(m)
+               u( m, i, j, nz ) = temp2(m)
+            end do
+         end do
+      end do
+
+c---------------------------------------------------------------------
+c   set the dependent variable values along north and south faces
+c---------------------------------------------------------------------
+      do k = 1, nz
+         do i = 1, nx
+            call exact( i, 1, k, temp1 )
+            call exact( i, ny, k, temp2 )
+            do m = 1, 5
+               u( m, i, 1, k ) = temp1(m)
+               u( m, i, ny, k ) = temp2(m)
+            end do
+         end do
+      end do
+
+c---------------------------------------------------------------------
+c   set the dependent variable values along east and west faces
+c---------------------------------------------------------------------
+      do k = 1, nz
+         do j = 1, ny
+            call exact( 1, j, k, temp1 )
+            call exact( nx, j, k, temp2 )
+            do m = 1, 5
+               u( m, 1, j, k ) = temp1(m)
+               u( m, nx, j, k ) = temp2(m)
+            end do
+         end do
+      end do
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/setcoeff.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/setcoeff.f
new file mode 100644
index 0000000..a1fb473
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/setcoeff.f
@@ -0,0 +1,152 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine setcoeff
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+
+
+c---------------------------------------------------------------------
+c   set up coefficients
+c---------------------------------------------------------------------
+      dxi = 1.0d+00 / ( nx0 - 1 )
+      deta = 1.0d+00 / ( ny0 - 1 )
+      dzeta = 1.0d+00 / ( nz0 - 1 )
+
+      tx1 = 1.0d+00 / ( dxi * dxi )
+      tx2 = 1.0d+00 / ( 2.0d+00 * dxi )
+      tx3 = 1.0d+00 / dxi
+
+      ty1 = 1.0d+00 / ( deta * deta )
+      ty2 = 1.0d+00 / ( 2.0d+00 * deta )
+      ty3 = 1.0d+00 / deta
+
+      tz1 = 1.0d+00 / ( dzeta * dzeta )
+      tz2 = 1.0d+00 / ( 2.0d+00 * dzeta )
+      tz3 = 1.0d+00 / dzeta
+
+c---------------------------------------------------------------------
+c   diffusion coefficients
+c---------------------------------------------------------------------
+      dx1 = 0.75d+00
+      dx2 = dx1
+      dx3 = dx1
+      dx4 = dx1
+      dx5 = dx1
+
+      dy1 = 0.75d+00
+      dy2 = dy1
+      dy3 = dy1
+      dy4 = dy1
+      dy5 = dy1
+
+      dz1 = 1.00d+00
+      dz2 = dz1
+      dz3 = dz1
+      dz4 = dz1
+      dz5 = dz1
+
+c---------------------------------------------------------------------
+c   fourth difference dissipation
+c---------------------------------------------------------------------
+      dssp = ( max (dx1, dy1, dz1 ) ) / 4.0d+00
+
+c---------------------------------------------------------------------
+c   coefficients of the exact solution to the first pde
+c---------------------------------------------------------------------
+      ce(1,1) = 2.0d+00
+      ce(1,2) = 0.0d+00
+      ce(1,3) = 0.0d+00
+      ce(1,4) = 4.0d+00
+      ce(1,5) = 5.0d+00
+      ce(1,6) = 3.0d+00
+      ce(1,7) = 5.0d-01
+      ce(1,8) = 2.0d-02
+      ce(1,9) = 1.0d-02
+      ce(1,10) = 3.0d-02
+      ce(1,11) = 5.0d-01
+      ce(1,12) = 4.0d-01
+      ce(1,13) = 3.0d-01
+
+c---------------------------------------------------------------------
+c   coefficients of the exact solution to the second pde
+c---------------------------------------------------------------------
+      ce(2,1) = 1.0d+00
+      ce(2,2) = 0.0d+00
+      ce(2,3) = 0.0d+00
+      ce(2,4) = 0.0d+00
+      ce(2,5) = 1.0d+00
+      ce(2,6) = 2.0d+00
+      ce(2,7) = 3.0d+00
+      ce(2,8) = 1.0d-02
+      ce(2,9) = 3.0d-02
+      ce(2,10) = 2.0d-02
+      ce(2,11) = 4.0d-01
+      ce(2,12) = 3.0d-01
+      ce(2,13) = 5.0d-01
+
+c---------------------------------------------------------------------
+c   coefficients of the exact solution to the third pde
+c---------------------------------------------------------------------
+      ce(3,1) = 2.0d+00
+      ce(3,2) = 2.0d+00
+      ce(3,3) = 0.0d+00
+      ce(3,4) = 0.0d+00
+      ce(3,5) = 0.0d+00
+      ce(3,6) = 2.0d+00
+      ce(3,7) = 3.0d+00
+      ce(3,8) = 4.0d-02
+      ce(3,9) = 3.0d-02
+      ce(3,10) = 5.0d-02
+      ce(3,11) = 3.0d-01
+      ce(3,12) = 5.0d-01
+      ce(3,13) = 4.0d-01
+
+c---------------------------------------------------------------------
+c   coefficients of the exact solution to the fourth pde
+c---------------------------------------------------------------------
+      ce(4,1) = 2.0d+00
+      ce(4,2) = 2.0d+00
+      ce(4,3) = 0.0d+00
+      ce(4,4) = 0.0d+00
+      ce(4,5) = 0.0d+00
+      ce(4,6) = 2.0d+00
+      ce(4,7) = 3.0d+00
+      ce(4,8) = 3.0d-02
+      ce(4,9) = 5.0d-02
+      ce(4,10) = 4.0d-02
+      ce(4,11) = 2.0d-01
+      ce(4,12) = 1.0d-01
+      ce(4,13) = 3.0d-01
+
+c---------------------------------------------------------------------
+c   coefficients of the exact solution to the fifth pde
+c---------------------------------------------------------------------
+      ce(5,1) = 5.0d+00
+      ce(5,2) = 4.0d+00
+      ce(5,3) = 3.0d+00
+      ce(5,4) = 2.0d+00
+      ce(5,5) = 1.0d-01
+      ce(5,6) = 4.0d-01
+      ce(5,7) = 3.0d-01
+      ce(5,8) = 5.0d-02
+      ce(5,9) = 4.0d-02
+      ce(5,10) = 3.0d-02
+      ce(5,11) = 1.0d-01
+      ce(5,12) = 3.0d-01
+      ce(5,13) = 2.0d-01
+
+      return
+      end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/setiv.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/setiv.f
new file mode 100644
index 0000000..03c99bf
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/setiv.f
@@ -0,0 +1,60 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      subroutine setiv
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c
+c   set the initial values of dependent variables based on tri-linear
+c   interpolation of boundary values in the computational space.
+c
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j, k, m
+      double precision  xi, eta, zeta
+      double precision  pxi, peta, pzeta
+      double precision  ue_1jk(5),ue_nx0jk(5),ue_i1k(5),
+     >        ue_iny0k(5),ue_ij1(5),ue_ijnz(5)
+
+
+      do k = 2, nz - 1
+         zeta = ( dble (k-1) ) / (nz-1)
+         do j = 2, ny - 1
+            eta = ( dble (j-1) ) / (ny0-1)
+            do i = 2, nx - 1
+               xi = ( dble (i-1) ) / (nx0-1)
+               call exact (1,j,k,ue_1jk)
+               call exact (nx0,j,k,ue_nx0jk)
+               call exact (i,1,k,ue_i1k)
+               call exact (i,ny0,k,ue_iny0k)
+               call exact (i,j,1,ue_ij1)
+               call exact (i,j,nz,ue_ijnz)
+               do m = 1, 5
+                  pxi =   ( 1.0d+00 - xi ) * ue_1jk(m)
+     >                              + xi   * ue_nx0jk(m)
+                  peta =  ( 1.0d+00 - eta ) * ue_i1k(m)
+     >                              + eta   * ue_iny0k(m)
+                  pzeta = ( 1.0d+00 - zeta ) * ue_ij1(m)
+     >                              + zeta   * ue_ijnz(m)
+
+                  u( m, i, j, k ) = pxi + peta + pzeta
+     >                 - pxi * peta - peta * pzeta - pzeta * pxi
+     >                 + pxi * peta * pzeta
+
+               end do
+            end do
+         end do
+      end do
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/ssor.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/ssor.f
new file mode 100644
index 0000000..ce1e344
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/ssor.f
@@ -0,0 +1,260 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine ssor(niter)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   to perform pseudo-time stepping SSOR iterations
+c   for five nonlinear pde's.
+c---------------------------------------------------------------------
+
+      implicit none
+      integer niter
+
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j, k, m, n
+      integer istep
+      double precision  tmp, tv(5*isiz1*isiz2)
+      double precision  delunm(5)
+
+      external timer_read
+      double precision timer_read
+
+
+ 
+c---------------------------------------------------------------------
+c   begin pseudo-time stepping iterations
+c---------------------------------------------------------------------
+      tmp = 1.0d+00 / ( omega * ( 2.0d+00 - omega ) ) 
+
+c---------------------------------------------------------------------
+c   initialize a,b,c,d to zero (guarantees that page tables have been
+c   formed, if applicable on given architecture, before timestepping).
+c---------------------------------------------------------------------
+      do j=1,isiz2
+         do i=1,isiz1
+            do n=1,5
+               do m=1,5
+                  a(m,n,i,j) = 0.d0
+                  b(m,n,i,j) = 0.d0
+                  c(m,n,i,j) = 0.d0
+                  d(m,n,i,j) = 0.d0
+               enddo
+            enddo
+         enddo
+      enddo
+      do i = 1, t_last
+         call timer_clear(i)
+      end do
+
+c---------------------------------------------------------------------
+c   compute the steady-state residuals
+c---------------------------------------------------------------------
+      call rhs
+ 
+c---------------------------------------------------------------------
+c   compute the L2 norms of newton iteration residuals
+c---------------------------------------------------------------------
+      call l2norm( isiz1, isiz2, isiz3, nx0, ny0, nz0,
+     >             ist, iend, jst, jend,
+     >             rsd, rsdnm )
+
+
+c      if ( ipr .eq. 1 ) then
+c         write (*,*) '          Initial residual norms'
+c         write (*,*)
+c         write (*,1007) ( rsdnm(m), m = 1, 5 )
+c         write (*,'(/a)') 'Iteration RMS-residual of 5th PDE'
+c      end if
+ 
+ 
+      do i = 1, t_last
+         call timer_clear(i)
+      end do
+      call timer_start(1)
+ 
+c---------------------------------------------------------------------
+c   the timestep loop
+c---------------------------------------------------------------------
+      do istep = 1, niter
+
+         
+c         if ( ( mod ( istep, inorm ) .eq. 0 ) .and.
+c     >          ipr .eq. 1 ) then
+c             write ( *, 1001 ) istep
+c         end if
+         if (mod ( istep, 20) .eq. 0 .or.
+     >         istep .eq. itmax .or.
+     >         istep .eq. 1) then
+            if (niter .gt. 1) write( *, 200) istep
+ 200        format(' Time step ', i4)
+         endif
+ 
+c---------------------------------------------------------------------
+c   perform SSOR iteration
+c---------------------------------------------------------------------
+         if (timeron) call timer_start(t_rhs)
+         do k = 2, nz - 1
+            do j = jst, jend
+               do i = ist, iend
+                  do m = 1, 5
+                     rsd(m,i,j,k) = dt * rsd(m,i,j,k)
+                  end do
+               end do
+            end do
+         end do
+         if (timeron) call timer_stop(t_rhs)
+ 
+         do k = 2, nz -1 
+c---------------------------------------------------------------------
+c   form the lower triangular part of the jacobian matrix
+c---------------------------------------------------------------------
+            if (timeron) call timer_start(t_jacld)
+            call jacld(k)
+            if (timeron) call timer_stop(t_jacld)
+ 
+c---------------------------------------------------------------------
+c   perform the lower triangular solution
+c---------------------------------------------------------------------
+            if (timeron) call timer_start(t_blts)
+            call blts( isiz1, isiz2, isiz3,
+     >                 nx, ny, nz, k,
+     >                 omega,
+     >                 rsd, 
+     >                 a, b, c, d,
+     >                 ist, iend, jst, jend, 
+     >                 nx0, ny0 )
+            if (timeron) call timer_stop(t_blts)
+         end do
+ 
+         do k = nz - 1, 2, -1
+c---------------------------------------------------------------------
+c   form the strictly upper triangular part of the jacobian matrix
+c---------------------------------------------------------------------
+            if (timeron) call timer_start(t_jacu)
+            call jacu(k)
+            if (timeron) call timer_stop(t_jacu)
+
+c---------------------------------------------------------------------
+c   perform the upper triangular solution
+c---------------------------------------------------------------------
+            if (timeron) call timer_start(t_buts)
+            call buts( isiz1, isiz2, isiz3,
+     >                 nx, ny, nz, k,
+     >                 omega,
+     >                 rsd, tv,
+     >                 d, a, b, c,
+     >                 ist, iend, jst, jend,
+     >                 nx0, ny0 )
+            if (timeron) call timer_stop(t_buts)
+         end do
+ 
+c---------------------------------------------------------------------
+c   update the variables
+c---------------------------------------------------------------------
+
+         if (timeron) call timer_start(t_add)
+         do k = 2, nz-1
+            do j = jst, jend
+               do i = ist, iend
+                  do m = 1, 5
+                     u( m, i, j, k ) = u( m, i, j, k )
+     >                    + tmp * rsd( m, i, j, k )
+                  end do
+               end do
+            end do
+         end do
+         if (timeron) call timer_stop(t_add)
+ 
+c---------------------------------------------------------------------
+c   compute the L2 norms of newton iteration corrections
+c---------------------------------------------------------------------
+         if ( mod ( istep, inorm ) .eq. 0 ) then
+            if (timeron) call timer_start(t_l2norm)
+            call l2norm( isiz1, isiz2, isiz3, nx0, ny0, nz0,
+     >                   ist, iend, jst, jend,
+     >                   rsd, delunm )
+            if (timeron) call timer_stop(t_l2norm)
+c            if ( ipr .eq. 1 ) then
+c                write (*,1006) ( delunm(m), m = 1, 5 )
+c            else if ( ipr .eq. 2 ) then
+c                write (*,'(i5,f15.6)') istep,delunm(5)
+c            end if
+         end if
+ 
+c---------------------------------------------------------------------
+c   compute the steady-state residuals
+c---------------------------------------------------------------------
+         call rhs
+ 
+c---------------------------------------------------------------------
+c   compute the L2 norms of newton iteration residuals
+c---------------------------------------------------------------------
+         if ( ( mod ( istep, inorm ) .eq. 0 ) .or.
+     >        ( istep .eq. itmax ) ) then
+            if (timeron) call timer_start(t_l2norm)
+            call l2norm( isiz1, isiz2, isiz3, nx0, ny0, nz0,
+     >                   ist, iend, jst, jend,
+     >                   rsd, rsdnm )
+            if (timeron) call timer_stop(t_l2norm)
+c            if ( ipr .eq. 1 ) then
+c                write (*,1007) ( rsdnm(m), m = 1, 5 )
+c            end if
+         end if
+
+c---------------------------------------------------------------------
+c   check the newton-iteration residuals against the tolerance levels
+c---------------------------------------------------------------------
+         if ( ( rsdnm(1) .lt. tolrsd(1) ) .and.
+     >        ( rsdnm(2) .lt. tolrsd(2) ) .and.
+     >        ( rsdnm(3) .lt. tolrsd(3) ) .and.
+     >        ( rsdnm(4) .lt. tolrsd(4) ) .and.
+     >        ( rsdnm(5) .lt. tolrsd(5) ) ) then
+c            if (ipr .eq. 1 ) then
+               write (*,1004) istep
+c            end if
+            go to 900
+         end if
+ 
+      end do
+  900 continue
+ 
+      call timer_stop(1)
+      maxtime= timer_read(1)
+ 
+
+
+      return
+      
+ 1001 format (1x/5x,'pseudo-time SSOR iteration no.=',i4/)
+ 1004 format (1x/1x,'convergence was achieved after ',i4,
+     >   ' pseudo-time steps' )
+ 1006 format (1x/1x,'RMS-norm of SSOR-iteration correction ',
+     > 'for first pde  = ',1pe12.5/,
+     > 1x,'RMS-norm of SSOR-iteration correction ',
+     > 'for second pde = ',1pe12.5/,
+     > 1x,'RMS-norm of SSOR-iteration correction ',
+     > 'for third pde  = ',1pe12.5/,
+     > 1x,'RMS-norm of SSOR-iteration correction ',
+     > 'for fourth pde = ',1pe12.5/,
+     > 1x,'RMS-norm of SSOR-iteration correction ',
+     > 'for fifth pde  = ',1pe12.5)
+ 1007 format (1x/1x,'RMS-norm of steady-state residual for ',
+     > 'first pde  = ',1pe12.5/,
+     > 1x,'RMS-norm of steady-state residual for ',
+     > 'second pde = ',1pe12.5/,
+     > 1x,'RMS-norm of steady-state residual for ',
+     > 'third pde  = ',1pe12.5/,
+     > 1x,'RMS-norm of steady-state residual for ',
+     > 'fourth pde = ',1pe12.5/,
+     > 1x,'RMS-norm of steady-state residual for ',
+     > 'fifth pde  = ',1pe12.5)
+ 
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/ssor_vec.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/ssor_vec.f
new file mode 100644
index 0000000..f5c805e
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/ssor_vec.f
@@ -0,0 +1,263 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine ssor(niter)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   to perform pseudo-time stepping SSOR iterations
+c   for five nonlinear pde's.
+c---------------------------------------------------------------------
+
+      implicit none
+      integer niter
+
+      include 'applu.incl'
+
+c---------------------------------------------------------------------
+c  local variables
+c---------------------------------------------------------------------
+      integer i, j, k, m, n, lst, lend
+      integer istep
+      double precision  tmp, tv(5*isiz1*isiz2)
+      double precision  delunm(5)
+
+      external timer_read
+      double precision timer_read
+
+
+ 
+c---------------------------------------------------------------------
+c   begin pseudo-time stepping iterations
+c---------------------------------------------------------------------
+      tmp = 1.0d+00 / ( omega * ( 2.0d+00 - omega ) ) 
+
+c---------------------------------------------------------------------
+c   initialize a,b,c,d to zero (guarantees that page tables have been
+c   formed, if applicable on given architecture, before timestepping).
+c---------------------------------------------------------------------
+      do j=1,isiz2
+         do i=1,isiz1
+            do n=1,5
+               do m=1,5
+                  a(m,n,i,j) = 0.d0
+                  b(m,n,i,j) = 0.d0
+                  c(m,n,i,j) = 0.d0
+                  d(m,n,i,j) = 0.d0
+               enddo
+            enddo
+         enddo
+      enddo
+      do i = 1, t_last
+         call timer_clear(i)
+      end do
+
+c---------------------------------------------------------------------
+c   compute the steady-state residuals
+c---------------------------------------------------------------------
+      call rhs
+ 
+c---------------------------------------------------------------------
+c   compute the L2 norms of newton iteration residuals
+c---------------------------------------------------------------------
+      call l2norm( isiz1, isiz2, isiz3, nx0, ny0, nz0,
+     >             ist, iend, jst, jend,
+     >             rsd, rsdnm )
+
+
+c      if ( ipr .eq. 1 ) then
+c         write (*,*) '          Initial residual norms'
+c         write (*,*)
+c         write (*,1007) ( rsdnm(m), m = 1, 5 )
+c         write (*,'(/a)') 'Iteration RMS-residual of 5th PDE'
+c      end if
+ 
+ 
+      do i = 1, t_last
+         call timer_clear(i)
+      end do
+      call timer_start(1)
+ 
+c---------------------------------------------------------------------
+c   the timestep loop
+c---------------------------------------------------------------------
+      do istep = 1, niter
+
+         
+c         if ( ( mod ( istep, inorm ) .eq. 0 ) .and.
+c     >          ipr .eq. 1 ) then
+c             write ( *, 1001 ) istep
+c         end if
+         if (mod ( istep, 20) .eq. 0 .or.
+     >         istep .eq. itmax .or.
+     >         istep .eq. 1) then
+            if (niter .gt. 1) write( *, 200) istep
+ 200        format(' Time step ', i4)
+         endif
+ 
+c---------------------------------------------------------------------
+c   perform SSOR iteration
+c---------------------------------------------------------------------
+         if (timeron) call timer_start(t_rhs)
+         do k = 2, nz - 1
+            do j = jst, jend
+               do i = ist, iend
+                  do m = 1, 5
+                     rsd(m,i,j,k) = dt * rsd(m,i,j,k)
+                  end do
+               end do
+            end do
+         end do
+         if (timeron) call timer_stop(t_rhs)
+ 
+         lst = ist + jst
+         lend = iend + jend
+
+         do k = 2, nz -1 
+c---------------------------------------------------------------------
+c   form the lower triangular part of the jacobian matrix
+c---------------------------------------------------------------------
+            if (timeron) call timer_start(t_jacld)
+            call jacld(k)
+            if (timeron) call timer_stop(t_jacld)
+ 
+c---------------------------------------------------------------------
+c   perform the lower triangular solution
+c---------------------------------------------------------------------
+            if (timeron) call timer_start(t_blts)
+            call blts( isiz1, isiz2, isiz3,
+     >                 nx, ny, nz, k,
+     >                 omega,
+     >                 rsd, 
+     >                 a, b, c, d,
+     >                 ist, iend, jst, jend, 
+     >                 lst, lend )
+            if (timeron) call timer_stop(t_blts)
+         end do
+ 
+         do k = nz - 1, 2, -1
+c---------------------------------------------------------------------
+c   form the strictly upper triangular part of the jacobian matrix
+c---------------------------------------------------------------------
+            if (timeron) call timer_start(t_jacu)
+            call jacu(k)
+            if (timeron) call timer_stop(t_jacu)
+
+c---------------------------------------------------------------------
+c   perform the upper triangular solution
+c---------------------------------------------------------------------
+            if (timeron) call timer_start(t_buts)
+            call buts( isiz1, isiz2, isiz3,
+     >                 nx, ny, nz, k,
+     >                 omega,
+     >                 rsd, tv,
+     >                 d, a, b, c,
+     >                 ist, iend, jst, jend,
+     >                 lst, lend )
+            if (timeron) call timer_stop(t_buts)
+         end do
+ 
+c---------------------------------------------------------------------
+c   update the variables
+c---------------------------------------------------------------------
+
+         if (timeron) call timer_start(t_add)
+         do k = 2, nz-1
+            do j = jst, jend
+               do i = ist, iend
+                  do m = 1, 5
+                     u( m, i, j, k ) = u( m, i, j, k )
+     >                    + tmp * rsd( m, i, j, k )
+                  end do
+               end do
+            end do
+         end do
+         if (timeron) call timer_stop(t_add)
+ 
+c---------------------------------------------------------------------
+c   compute the max-norms of newton iteration corrections
+c---------------------------------------------------------------------
+         if ( mod ( istep, inorm ) .eq. 0 ) then
+            if (timeron) call timer_start(t_l2norm)
+            call l2norm( isiz1, isiz2, isiz3, nx0, ny0, nz0,
+     >                   ist, iend, jst, jend,
+     >                   rsd, delunm )
+            if (timeron) call timer_stop(t_l2norm)
+c            if ( ipr .eq. 1 ) then
+c                write (*,1006) ( delunm(m), m = 1, 5 )
+c            else if ( ipr .eq. 2 ) then
+c                write (*,'(i5,f15.6)') istep,delunm(5)
+c            end if
+         end if
+ 
+c---------------------------------------------------------------------
+c   compute the steady-state residuals
+c---------------------------------------------------------------------
+         call rhs
+ 
+c---------------------------------------------------------------------
+c   compute the max-norms of newton iteration residuals
+c---------------------------------------------------------------------
+         if ( ( mod ( istep, inorm ) .eq. 0 ) .or.
+     >        ( istep .eq. itmax ) ) then
+            if (timeron) call timer_start(t_l2norm)
+            call l2norm( isiz1, isiz2, isiz3, nx0, ny0, nz0,
+     >                   ist, iend, jst, jend,
+     >                   rsd, rsdnm )
+            if (timeron) call timer_stop(t_l2norm)
+c            if ( ipr .eq. 1 ) then
+c                write (*,1007) ( rsdnm(m), m = 1, 5 )
+c            end if
+         end if
+
+c---------------------------------------------------------------------
+c   check the newton-iteration residuals against the tolerance levels
+c---------------------------------------------------------------------
+         if ( ( rsdnm(1) .lt. tolrsd(1) ) .and.
+     >        ( rsdnm(2) .lt. tolrsd(2) ) .and.
+     >        ( rsdnm(3) .lt. tolrsd(3) ) .and.
+     >        ( rsdnm(4) .lt. tolrsd(4) ) .and.
+     >        ( rsdnm(5) .lt. tolrsd(5) ) ) then
+c            if (ipr .eq. 1 ) then
+               write (*,1004) istep
+c            end if
+            go to 900
+         end if
+ 
+      end do
+  900 continue
+ 
+      call timer_stop(1)
+      maxtime= timer_read(1)
+ 
+
+
+      return
+      
+ 1001 format (1x/5x,'pseudo-time SSOR iteration no.=',i4/)
+ 1004 format (1x/1x,'convergence was achieved after ',i4,
+     >   ' pseudo-time steps' )
+ 1006 format (1x/1x,'RMS-norm of SSOR-iteration correction ',
+     > 'for first pde  = ',1pe12.5/,
+     > 1x,'RMS-norm of SSOR-iteration correction ',
+     > 'for second pde = ',1pe12.5/,
+     > 1x,'RMS-norm of SSOR-iteration correction ',
+     > 'for third pde  = ',1pe12.5/,
+     > 1x,'RMS-norm of SSOR-iteration correction ',
+     > 'for fourth pde = ',1pe12.5/,
+     > 1x,'RMS-norm of SSOR-iteration correction ',
+     > 'for fifth pde  = ',1pe12.5)
+ 1007 format (1x/1x,'RMS-norm of steady-state residual for ',
+     > 'first pde  = ',1pe12.5/,
+     > 1x,'RMS-norm of steady-state residual for ',
+     > 'second pde = ',1pe12.5/,
+     > 1x,'RMS-norm of steady-state residual for ',
+     > 'third pde  = ',1pe12.5/,
+     > 1x,'RMS-norm of steady-state residual for ',
+     > 'fourth pde = ',1pe12.5/,
+     > 1x,'RMS-norm of steady-state residual for ',
+     > 'fifth pde  = ',1pe12.5)
+ 
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/verify.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/verify.f
new file mode 100644
index 0000000..0628800
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/LU/verify.f
@@ -0,0 +1,408 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+        subroutine verify(xcr, xce, xci, class, verified)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c  verification routine                         
+c---------------------------------------------------------------------
+
+        implicit none
+        include 'applu.incl'
+
+        double precision xcr(5), xce(5), xci
+        double precision xcrref(5),xceref(5),xciref, 
+     >                   xcrdif(5),xcedif(5),xcidif,
+     >                   epsilon, dtref
+        integer m
+        character class
+        logical verified
+
+c---------------------------------------------------------------------
+c   tolerance level
+c---------------------------------------------------------------------
+        epsilon = 1.0d-08
+
+        class = 'U'
+        verified = .true.
+
+        do m = 1,5
+           xcrref(m) = 1.0
+           xceref(m) = 1.0
+        end do
+        xciref = 1.0
+
+        if ( (nx0  .eq. 12     ) .and. 
+     >       (ny0  .eq. 12     ) .and.
+     >       (nz0  .eq. 12     ) .and.
+     >       (itmax   .eq. 50    ))  then
+
+           class = 'S'
+           dtref = 5.0d-1
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of residual, for the (12X12X12) grid,
+c   after 50 time steps, with  DT = 5.0d-01
+c---------------------------------------------------------------------
+         xcrref(1) = 1.6196343210976702d-02
+         xcrref(2) = 2.1976745164821318d-03
+         xcrref(3) = 1.5179927653399185d-03
+         xcrref(4) = 1.5029584435994323d-03
+         xcrref(5) = 3.4264073155896461d-02
+
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of solution error, for the (12X12X12) grid,
+c   after 50 time steps, with  DT = 5.0d-01
+c---------------------------------------------------------------------
+         xceref(1) = 6.4223319957960924d-04
+         xceref(2) = 8.4144342047347926d-05
+         xceref(3) = 5.8588269616485186d-05
+         xceref(4) = 5.8474222595157350d-05
+         xceref(5) = 1.3103347914111294d-03
+
+c---------------------------------------------------------------------
+c   Reference value of surface integral, for the (12X12X12) grid,
+c   after 50 time steps, with DT = 5.0d-01
+c---------------------------------------------------------------------
+         xciref = 7.8418928865937083d+00
+
+
+        elseif ( (nx0 .eq. 33) .and. 
+     >           (ny0 .eq. 33) .and.
+     >           (nz0 .eq. 33) .and.
+     >           (itmax . eq. 300) ) then
+
+           class = 'W'   !SPEC95fp size
+           dtref = 1.5d-3
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of residual, for the (33x33x33) grid,
+c   after 300 time steps, with  DT = 1.5d-3
+c---------------------------------------------------------------------
+           xcrref(1) =   0.1236511638192d+02
+           xcrref(2) =   0.1317228477799d+01
+           xcrref(3) =   0.2550120713095d+01
+           xcrref(4) =   0.2326187750252d+01
+           xcrref(5) =   0.2826799444189d+02
+
+
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of solution error, for the (33X33X33) grid,
+c---------------------------------------------------------------------
+           xceref(1) =   0.4867877144216d+00
+           xceref(2) =   0.5064652880982d-01
+           xceref(3) =   0.9281818101960d-01
+           xceref(4) =   0.8570126542733d-01
+           xceref(5) =   0.1084277417792d+01
+
+
+c---------------------------------------------------------------------
+c   Reference value of surface integral, for the (33X33X33) grid,
+c   after 300 time steps, with  DT = 1.5d-3
+c---------------------------------------------------------------------
+           xciref    =   0.1161399311023d+02
+
+        elseif ( (nx0 .eq. 64) .and. 
+     >           (ny0 .eq. 64) .and.
+     >           (nz0 .eq. 64) .and.
+     >           (itmax . eq. 250) ) then
+
+           class = 'A'
+           dtref = 2.0d+0
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of residual, for the (64X64X64) grid,
+c   after 250 time steps, with  DT = 2.0d+00
+c---------------------------------------------------------------------
+         xcrref(1) = 7.7902107606689367d+02
+         xcrref(2) = 6.3402765259692870d+01
+         xcrref(3) = 1.9499249727292479d+02
+         xcrref(4) = 1.7845301160418537d+02
+         xcrref(5) = 1.8384760349464247d+03
+
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of solution error, for the (64X64X64) grid,
+c   after 250 time steps, with  DT = 2.0d+00
+c---------------------------------------------------------------------
+         xceref(1) = 2.9964085685471943d+01
+         xceref(2) = 2.8194576365003349d+00
+         xceref(3) = 7.3473412698774742d+00
+         xceref(4) = 6.7139225687777051d+00
+         xceref(5) = 7.0715315688392578d+01
+
+c---------------------------------------------------------------------
+c   Reference value of surface integral, for the (64X64X64) grid,
+c   after 250 time steps, with DT = 2.0d+00
+c---------------------------------------------------------------------
+         xciref = 2.6030925604886277d+01
+
+
+        elseif ( (nx0 .eq. 102) .and. 
+     >           (ny0 .eq. 102) .and.
+     >           (nz0 .eq. 102) .and.
+     >           (itmax . eq. 250) ) then
+
+           class = 'B'
+           dtref = 2.0d+0
+
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of residual, for the (102X102X102) grid,
+c   after 250 time steps, with  DT = 2.0d+00
+c---------------------------------------------------------------------
+         xcrref(1) = 3.5532672969982736d+03
+         xcrref(2) = 2.6214750795310692d+02
+         xcrref(3) = 8.8333721850952190d+02
+         xcrref(4) = 7.7812774739425265d+02
+         xcrref(5) = 7.3087969592545314d+03
+
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of solution error, for the (102X102X102) 
+c   grid, after 250 time steps, with  DT = 2.0d+00
+c---------------------------------------------------------------------
+         xceref(1) = 1.1401176380212709d+02
+         xceref(2) = 8.1098963655421574d+00
+         xceref(3) = 2.8480597317698308d+01
+         xceref(4) = 2.5905394567832939d+01
+         xceref(5) = 2.6054907504857413d+02
+
+c---------------------------------------------------------------------
+c   Reference value of surface integral, for the (102X102X102) grid,
+c   after 250 time steps, with DT = 2.0d+00
+c---------------------------------------------------------------------
+         xciref = 4.7887162703308227d+01
+
+        elseif ( (nx0 .eq. 162) .and. 
+     >           (ny0 .eq. 162) .and.
+     >           (nz0 .eq. 162) .and.
+     >           (itmax . eq. 250) ) then
+
+           class = 'C'
+           dtref = 2.0d+0
+
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of residual, for the (162X162X162) grid,
+c   after 250 time steps, with  DT = 2.0d+00
+c---------------------------------------------------------------------
+         xcrref(1) = 1.03766980323537846d+04
+         xcrref(2) = 8.92212458801008552d+02
+         xcrref(3) = 2.56238814582660871d+03
+         xcrref(4) = 2.19194343857831427d+03
+         xcrref(5) = 1.78078057261061185d+04
+
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of solution error, for the (162X162X162) 
+c   grid, after 250 time steps, with  DT = 2.0d+00
+c---------------------------------------------------------------------
+         xceref(1) = 2.15986399716949279d+02
+         xceref(2) = 1.55789559239863600d+01
+         xceref(3) = 5.41318863077207766d+01
+         xceref(4) = 4.82262643154045421d+01
+         xceref(5) = 4.55902910043250358d+02
+
+c---------------------------------------------------------------------
+c   Reference value of surface integral, for the (162X162X162) grid,
+c   after 250 time steps, with DT = 2.0d+00
+c---------------------------------------------------------------------
+         xciref = 6.66404553572181300d+01
+
+c---------------------------------------------------------------------
+c   Reference value of surface integral, for the (162X162X162) grid,
+c   after 250 time steps, with DT = 2.0d+00
+c---------------------------------------------------------------------
+         xciref = 6.66404553572181300d+01
+
+        elseif ( (nx0 .eq. 408) .and. 
+     >           (ny0 .eq. 408) .and.
+     >           (nz0 .eq. 408) .and.
+     >           (itmax . eq. 300) ) then
+
+           class = 'D'
+           dtref = 1.0d+0
+
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of residual, for the (408X408X408) grid,
+c   after 300 time steps, with  DT = 1.0d+00
+c---------------------------------------------------------------------
+         xcrref(1) = 0.4868417937025d+05
+         xcrref(2) = 0.4696371050071d+04
+         xcrref(3) = 0.1218114549776d+05 
+         xcrref(4) = 0.1033801493461d+05
+         xcrref(5) = 0.7142398413817d+05
+
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of solution error, for the (408X408X408) 
+c   grid, after 300 time steps, with  DT = 1.0d+00
+c---------------------------------------------------------------------
+         xceref(1) = 0.3752393004482d+03
+         xceref(2) = 0.3084128893659d+02
+         xceref(3) = 0.9434276905469d+02
+         xceref(4) = 0.8230686681928d+02
+         xceref(5) = 0.7002620636210d+03
+
+c---------------------------------------------------------------------
+c   Reference value of surface integral, for the (408X408X408) grid,
+c   after 300 time steps, with DT = 1.0d+00
+c---------------------------------------------------------------------
+         xciref =    0.8334101392503d+02
+
+        elseif ( (nx0 .eq. 1020) .and. 
+     >           (ny0 .eq. 1020) .and.
+     >           (nz0 .eq. 1020) .and.
+     >           (itmax . eq. 300) ) then
+
+           class = 'E'
+           dtref = 0.5d+0
+
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of residual, for the (1020X1020X1020) grid,
+c   after 300 time steps, with  DT = 0.5d+00
+c---------------------------------------------------------------------
+         xcrref(1) = 0.2099641687874d+06
+         xcrref(2) = 0.2130403143165d+05
+         xcrref(3) = 0.5319228789371d+05 
+         xcrref(4) = 0.4509761639833d+05
+         xcrref(5) = 0.2932360006590d+06
+
+c---------------------------------------------------------------------
+c   Reference values of RMS-norms of solution error, for the (1020X1020X1020) 
+c   grid, after 300 time steps, with  DT = 0.5d+00
+c---------------------------------------------------------------------
+         xceref(1) = 0.4800572578333d+03
+         xceref(2) = 0.4221993400184d+02
+         xceref(3) = 0.1210851906824d+03
+         xceref(4) = 0.1047888986770d+03
+         xceref(5) = 0.8363028257389d+03
+
+c---------------------------------------------------------------------
+c   Reference value of surface integral, for the (1020X1020X1020) grid,
+c   after 300 time steps, with DT = 0.5d+00
+c---------------------------------------------------------------------
+         xciref =    0.9512163272273d+02
+
+        else
+           verified = .FALSE.
+        endif
+
+c---------------------------------------------------------------------
+c    verification test for residuals if gridsize is one of 
+c    the defined grid sizes above (class .ne. 'U')
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c    Compute the difference of solution values and the known reference values.
+c---------------------------------------------------------------------
+        do m = 1, 5
+           
+           xcrdif(m) = dabs((xcr(m)-xcrref(m))/xcrref(m)) 
+           xcedif(m) = dabs((xce(m)-xceref(m))/xceref(m))
+           
+        enddo
+        xcidif = dabs((xci - xciref)/xciref)
+
+
+c---------------------------------------------------------------------
+c    Output the comparison of computed results to known cases.
+c---------------------------------------------------------------------
+
+        if (class .ne. 'U') then
+           write(*, 1990) class
+ 1990      format(/, ' Verification being performed for class ', a)
+           write (*,2000) epsilon
+ 2000      format(' Accuracy setting for epsilon = ', E20.13)
+           verified = (dabs(dt-dtref) .le. epsilon)
+           if (.not.verified) then
+              class = 'U'
+              write (*,1000) dtref
+ 1000         format(' DT does not match the reference value of ', 
+     >                 E15.8)
+           endif
+        else 
+           write(*, 1995)
+ 1995      format(' Unknown class')
+        endif
+
+
+        if (class .ne. 'U') then
+           write (*, 2001) 
+        else
+           write (*, 2005)
+        endif
+
+ 2001   format(' Comparison of RMS-norms of residual')
+ 2005   format(' RMS-norms of residual')
+        do m = 1, 5
+           if (class .eq. 'U') then
+              write(*, 2015) m, xcr(m)
+           else if (xcrdif(m) .le. epsilon) then
+              write (*,2011) m,xcr(m),xcrref(m),xcrdif(m)
+           else 
+              verified = .false.
+              write (*,2010) m,xcr(m),xcrref(m),xcrdif(m)
+           endif
+        enddo
+
+        if (class .ne. 'U') then
+           write (*,2002)
+        else
+           write (*,2006)
+        endif
+ 2002   format(' Comparison of RMS-norms of solution error')
+ 2006   format(' RMS-norms of solution error')
+        
+        do m = 1, 5
+           if (class .eq. 'U') then
+              write(*, 2015) m, xce(m)
+           else if (xcedif(m) .le. epsilon) then
+              write (*,2011) m,xce(m),xceref(m),xcedif(m)
+           else
+              verified = .false.
+              write (*,2010) m,xce(m),xceref(m),xcedif(m)
+           endif
+        enddo
+        
+ 2010   format(' FAILURE: ', i2, 2x, E20.13, E20.13, E20.13)
+ 2011   format('          ', i2, 2x, E20.13, E20.13, E20.13)
+ 2015   format('          ', i2, 2x, E20.13)
+        
+        if (class .ne. 'U') then
+           write (*,2025)
+        else
+           write (*,2026)
+        endif
+ 2025   format(' Comparison of surface integral')
+ 2026   format(' Surface integral')
+
+
+        if (class .eq. 'U') then
+           write(*, 2030) xci
+        else if (xcidif .le. epsilon) then
+           write(*, 2032) xci, xciref, xcidif
+        else
+           verified = .false.
+           write(*, 2031) xci, xciref, xcidif
+        endif
+
+ 2030   format('          ', 4x, E20.13)
+ 2031   format(' FAILURE: ', 4x, E20.13, E20.13, E20.13)
+ 2032   format('          ', 4x, E20.13, E20.13, E20.13)
+
+
+
+        if (class .eq. 'U') then
+           write(*, 2022)
+           write(*, 2023)
+ 2022      format(' No reference values provided')
+ 2023      format(' No verification performed')
+        else if (verified) then
+           write(*, 2020)
+ 2020      format(' Verification Successful')
+        else
+           write(*, 2021)
+ 2021      format(' Verification failed')
+        endif
+
+        return
+
+
+        end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/MG/Makefile b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/MG/Makefile
new file mode 100644
index 0000000..6a3013f
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/MG/Makefile
@@ -0,0 +1,23 @@
+SHELL=/bin/sh
+BENCHMARK=mg
+BENCHMARKU=MG
+
+include ../config/make.def
+
+OBJS = mg.o ${COMMON}/print_results.o  \
+       ${COMMON}/${RAND}.o ${COMMON}/timers.o ${COMMON}/wtime.o
+
+include ../sys/make.common
+
+${PROGRAM}: config ${OBJS}
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} ${F_LIB}
+
+mg.o:		mg.f globals.h npbparams.h
+	${FCOMPILE} mg.f
+
+clean:
+	- rm -f *.o *~ 
+	- rm -f npbparams.h core
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/MG/README b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/MG/README
new file mode 100644
index 0000000..566d71d
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/MG/README
@@ -0,0 +1,141 @@
+Some info about the MG benchmark
+(Note: this info applies to the parallel version and mostly concerns
+the processor decomposition.  Info not concerning the decomposition
+still applies to the serial version.)
+================================
+    
+'mg_demo' demonstrates the capabilities of a very simple multigrid
+solver in computing a three dimensional potential field.  This is
+a simplified multigrid solver in two important respects:
+
+  (1) it solves only a constant coefficient equation,
+  and that only on a uniform cubical grid,
+    
+  (2) it solves only a single equation, representing
+  a scalar field rather than a vector field.
+
+We chose it for its portability and simplicity, and expect that a
+supercomputer which can run it effectively will also be able to
+run more complex multigrid programs at least as well.
+     
+     Eric Barszcz                         Paul Frederickson
+     RIACS
+     NASA Ames Research Center            NASA Ames Research Center
+
+========================================================================
+Running the program:  (Note: also see parameter lm information in the
+                       two sections immediately below this section)
+
+The program may be run with or without an input deck (called "mg.input"). 
+The following describes a few things about the input deck if you want to 
+use one. 
+
+The four lines below are the "mg.input" file required to run a
+problem of total size 256x256x256, for 4 iterations (Class "A"),
+and presumes the use of 8 processors:
+
+   8 = top level
+   256 256 256 = nx ny nz
+   4 = nit
+   0 0 0 0 0 0 0 0 = debug_vec
+
+The first line of input indicates how many levels of multi-grid
+cycle will be applied to a particular subpartition.  Presuming that
+8 processors are solving this problem (recall that the number of 
+processors is specified to MPI as a run parameter, and MPI subsequently
+determines this for the code via an MPI subroutine call), a 2x2x2 
+processor grid is  formed, and thus each partition on a processor is 
+of size 128x128x128.  Therefore, a maximum of 8 multi-grid levels may 
+be used.  These are of size 128,64,32,16,8,4,2,1, with the coarsest 
+level being a single point on a given processor.
+
+
+Next, consider the same size problem but running on 1 processor.  The
+following "mg.input" file is appropriate:
+
+    9 = top level
+    256 256 256 = nx ny nz
+    4 = nit
+    0 0 0 0 0 0 0 0 = debug_vec
+
+Since this processor must solve the full 256x256x256 problem, this
+permits 9 multi-grid levels (256,128,64,32,16,8,4,2,1), resulting in 
+a coarsest multi-grid level of a single point on the processor
+
+
+Next, consider the same size problem but running on 2 processors.  The
+following "mg.input" file is required:
+
+    8 = top level
+    256 256 256 = nx ny nz
+    4 = nit
+    0 0 0 0 0 0 0 0 = debug_vec
+
+The algorithm for partitioning the full grid onto some power of 2 number 
+of processors is to start by splitting the last dimension of the grid
+(z dimension) in 2: the problem is now partitioned onto 2 processors.
+Next the middle dimension (y dimension) is split in 2: the problem is now
+partitioned onto 4 processors.  Next, the first dimension (x dimension) is
+split in 2: the problem is now partitioned onto 8 processors.  Next, the
+last dimension (z dimension) is split again in 2: the problem is now
+partitioned onto 16 processors.  This partitioning is repeated until all 
+of the power of 2 processors have been allocated.
+
+Thus to run the above problem on 2 processors, the grid partitioning 
+algorithm will allocate the two processors across the last dimension, 
+creating two partitions each of size 256x256x128. The coarsest level of 
+multi-grid must be a single point surrounded by a cubic number of grid 
+points.  Therefore, each of the two processor partitions will contain 4 
+coarsest multi-grid level points, each surrounded by a cube of grid points 
+of size 128x128x128, indicated by a top level of 8.
+
+
+Next, consider the same size problem but running on 4 processors.  The
+following "mg.input" file is required:
+
+    8 = top level
+    256 256 256 = nx ny nz
+    4 = nit
+    0 0 0 0 0 0 0 0 = debug_vec
+
+The partitioning algorithm will create 4 partitions, each of size
+256x128x128.  Each partition will contain 2 coarsest multi-grid level
+points each surrounded by a cube of grid points of size 128x128x128, 
+indicated by a top level of 8.
+
+
+Next, consider the same size problem but running on 16 processors.  The
+following "mg.input" file is required:
+
+    7 = top level
+    256 256 256 = nx ny nz
+    4 = nit
+    0 0 0 0 0 0 0 0 = debug_vec
+
+On each node a partition of size 128x128x64 will be created.  A maximum
+of 7 multi-grid levels (64,32,16,8,4,2,1) may be used, resulting in each 
+partition containing 4 coarsest multi-grid level points, each surrounded 
+by a cube of grid points of size 64x64x64, indicated by a top level of 7.
+
+
+
+
+Note that non-cubic problem sizes may also be considered:
+
+The four lines below are the "mg.input" file appropriate for running a
+problem of total size 256x512x512, for 20 iterations and presumes the 
+use of 32 processors (note: this is NOT a class C problem):
+
+    8 = top level
+    256 512 512 = nx ny nz
+    20 = nit
+    0 0 0 0 0 0 0 0 = debug_vec
+
+The first line of input indicates how many levels of multi-grid
+cycle will be applied to a particular subpartition.  Presuming that
+32 processors are solving this problem, a 2x4x4 processor grid is
+formed, and thus each partition on a processor is of size 128x128x128.
+Therefore, a maximum of 8 multi-grid levels may be used.  These are of
+size 128,64,32,16,8,4,2,1, with the coarsest level being a single 
+point on a given processor.
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/MG/globals.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/MG/globals.h
new file mode 100644
index 0000000..6179eaa
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/MG/globals.h
@@ -0,0 +1,51 @@
+c---------------------------------------------------------------------
+c  Parameter lm (declared and set in "npbparams.h") is the log-base2 of 
+c  the edge size max for the partition on a given node, so must be changed 
+c  either to save space (if running a small case) or made bigger for larger 
+c  cases, for example, 512^3. Thus lm=7 means that the largest dimension 
+c  of a partition that can be solved on a node is 2^7 = 128. lm is set 
+c  automatically in npbparams.h
+c  Parameters ndim1, ndim2, ndim3 are the local problem dimensions. 
+c---------------------------------------------------------------------
+
+      include 'npbparams.h'
+
+      integer nm      ! actual dimension including ghost cells for communications
+c  ***  type of nv, nr and ir is set in npbparams.h
+c     >      , nv      ! size of rhs array
+c     >      , nr      ! size of residual array
+     >      , maxlevel! maximum number of levels
+
+      parameter( nm=2+2**lm, maxlevel=(lt_default+1) )
+      parameter( nv=one*(2+2**ndim1)*(2+2**ndim2)*(2+2**ndim3) )
+      parameter( nr = ((nv+nm**2+5*nm+7*lm+6)/7)*8 )
+c---------------------------------------------------------------------
+      integer  nx(maxlevel),ny(maxlevel),nz(maxlevel)
+      common /mg3/ nx,ny,nz
+
+      character class
+      common /ClassType/class
+
+      integer debug_vec(0:7)
+      common /my_debug/ debug_vec
+
+      integer m1(maxlevel), m2(maxlevel), m3(maxlevel)
+      integer lt, lb
+      common /fap/ ir(maxlevel),m1,m2,m3,lt,lb
+
+c---------------------------------------------------------------------
+c  Set at m=1024, can handle cases up to 1024^3
+c---------------------------------------------------------------------
+      integer m
+c      parameter( m=1037 )
+      parameter( m=nm+1 )
+
+      logical timeron
+      common /timers/ timeron
+      integer T_init, T_bench, T_psinv, T_resid, T_rprj3, T_interp,
+     >        T_norm2, T_mg3P, T_resid2, T_comm3, T_last
+      parameter (T_init=1, T_bench=2, T_mg3P=3,
+     >        T_psinv=4, T_resid=5, T_resid2=6, T_rprj3=7,
+     >        T_interp=8, T_norm2=9, T_comm3=10, T_last=10)
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/MG/mg.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/MG/mg.f
new file mode 100644
index 0000000..61859c2
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/MG/mg.f
@@ -0,0 +1,1379 @@
+!-------------------------------------------------------------------------!
+!                                                                         !
+!        N  A  S     P A R A L L E L     B E N C H M A R K S  3.3         !
+!                                                                         !
+!                      S E R I A L     V E R S I O N                      !
+!                                                                         !
+!                                   M G                                   !
+!                                                                         !
+!-------------------------------------------------------------------------!
+!                                                                         !
+!    This benchmark is a serial version of the NPB MG code.               !
+!    Refer to NAS Technical Reports 95-020 for details.                   !
+!                                                                         !
+!    Permission to use, copy, distribute and modify this software         !
+!    for any purpose with or without fee is hereby granted.  We           !
+!    request, however, that all derived work reference the NAS            !
+!    Parallel Benchmarks 3.3. This software is provided "as is"           !
+!    without express or implied warranty.                                 !
+!                                                                         !
+!    Information on NPB 3.3, including the technical report, the          !
+!    original specifications, source code, results and information        !
+!    on how to submit new results, is available at:                       !
+!                                                                         !
+!           http://www.nas.nasa.gov/Software/NPB/                         !
+!                                                                         !
+!    Send comments or suggestions to  npb@nas.nasa.gov                    !
+!                                                                         !
+!          NAS Parallel Benchmarks Group                                  !
+!          NASA Ames Research Center                                      !
+!          Mail Stop: T27A-1                                              !
+!          Moffett Field, CA   94035-1000                                 !
+!                                                                         !
+!          E-mail:  npb@nas.nasa.gov                                      !
+!          Fax:     (650) 604-3957                                        !
+!                                                                         !
+!-------------------------------------------------------------------------!
+
+
+c---------------------------------------------------------------------
+c
+c Authors: E. Barszcz
+c          P. Frederickson
+c          A. Woo
+c          M. Yarrow
+c
+c---------------------------------------------------------------------
+
+
+c---------------------------------------------------------------------
+      program mg
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'globals.h'
+
+c---------------------------------------------------------------------------c
+c k is the current level. It is passed down through subroutine args
+c and is NOT global. The variable "it" is the current iteration.
+c---------------------------------------------------------------------------c
+
+      integer k, it
+      
+      external timer_read
+      double precision t, tinit, mflops, timer_read
+
+c---------------------------------------------------------------------------c
+c These arrays are in common because they are quite large
+c and probably shouldn't be allocated on the stack. They
+c are always passed as subroutine args. 
+c---------------------------------------------------------------------------c
+
+      double precision u(nr),v(nv),r(nr),a(0:3),c(0:3)
+      common /noautom/ u,v,r   
+
+      double precision rnm2, rnmu, old2, oldu, epsilon
+      integer n1, n2, n3, nit
+      double precision nn, verify_value, err
+      logical verified
+
+      integer i, fstatus
+      character t_names(t_last)*8
+      double precision tmax
+
+
+      do i = T_init, T_last
+         call timer_clear(i)
+      end do
+
+      call timer_start(T_init)
+
+c---------------------------------------------------------------------
+c Read in input data (serial version: nothing to broadcast)
+c---------------------------------------------------------------------
+
+      open(unit=7,file='timer.flag', status='old', iostat=fstatus)
+      if (fstatus .eq. 0) then
+         timeron = .true.
+         t_names(t_init) = 'init'
+         t_names(t_bench) = 'benchmk'
+         t_names(t_mg3P) = 'mg3P'
+         t_names(t_psinv) = 'psinv'
+         t_names(t_resid) = 'resid'
+         t_names(t_rprj3) = 'rprj3'
+         t_names(t_interp) = 'interp'
+         t_names(t_norm2) = 'norm2'
+         t_names(t_comm3) = 'comm3'
+         close(7)
+      else
+         timeron = .false.
+      endif
+
+      write (*, 1000) 
+
+      open(unit=7,file="mg.input", status="old", iostat=fstatus)
+      if (fstatus .eq. 0) then
+         write(*,50) 
+ 50      format(' Reading from input file mg.input')
+         read(7,*) lt
+         read(7,*) nx(lt), ny(lt), nz(lt)
+         read(7,*) nit
+         read(7,*) (debug_vec(i),i=0,7)
+      else
+         write(*,51) 
+ 51      format(' No input file. Using compiled defaults ')
+         lt = lt_default
+         nit = nit_default
+         nx(lt) = nx_default
+         ny(lt) = ny_default
+         nz(lt) = nz_default
+         do i = 0,7
+            debug_vec(i) = debug_default
+         end do
+      endif
+
+
+      if ( (nx(lt) .ne. ny(lt)) .or. (nx(lt) .ne. nz(lt)) ) then
+         Class = 'U' 
+      else if( nx(lt) .eq. 32 .and. nit .eq. 4 ) then
+         Class = 'S'
+      else if( nx(lt) .eq. 128 .and. nit .eq. 4 ) then
+         Class = 'W'
+      else if( nx(lt) .eq. 256 .and. nit .eq. 4 ) then  
+         Class = 'A'
+      else if( nx(lt) .eq. 256 .and. nit .eq. 20 ) then
+         Class = 'B'
+      else if( nx(lt) .eq. 512 .and. nit .eq. 20 ) then  
+         Class = 'C'
+      else if( nx(lt) .eq. 1024 .and. nit .eq. 50 ) then  
+         Class = 'D'
+      else if( nx(lt) .eq. 2048 .and. nit .eq. 50 ) then  
+         Class = 'E'
+      else
+         Class = 'U'
+      endif
+
+c---------------------------------------------------------------------
+c  Use these for debug info:
+c---------------------------------------------------------------------
+c     debug_vec(0) = 1 !=> report all norms
+c     debug_vec(1) = 1 !=> some setup information
+c     debug_vec(1) = 2 !=> more setup information
+c     debug_vec(2) = k => at level k or below, show result of resid
+c     debug_vec(3) = k => at level k or below, show result of psinv
+c     debug_vec(4) = k => at level k or below, show result of rprj
+c     debug_vec(5) = k => at level k or below, show result of interp
+c     debug_vec(6) = 1 => (unused)
+c     debug_vec(7) = 1 => (unused)
+c---------------------------------------------------------------------
+      a(0) = -8.0D0/3.0D0 
+      a(1) =  0.0D0 
+      a(2) =  1.0D0/6.0D0 
+      a(3) =  1.0D0/12.0D0
+      
+      if(Class .eq. 'A' .or. Class .eq. 'S'.or. Class .eq.'W') then
+c---------------------------------------------------------------------
+c     Coefficients for the S(a) smoother
+c---------------------------------------------------------------------
+         c(0) =  -3.0D0/8.0D0
+         c(1) =  +1.0D0/32.0D0
+         c(2) =  -1.0D0/64.0D0
+         c(3) =   0.0D0
+      else
+c---------------------------------------------------------------------
+c     Coefficients for the S(b) smoother
+c---------------------------------------------------------------------
+         c(0) =  -3.0D0/17.0D0
+         c(1) =  +1.0D0/33.0D0
+         c(2) =  -1.0D0/61.0D0
+         c(3) =   0.0D0
+      endif
+      lb = 1
+      k  = lt
+
+      call setup(n1,n2,n3,k)
+      call zero3(u,n1,n2,n3)
+      call zran3(v,n1,n2,n3,nx(lt),ny(lt),k)
+
+      call norm2u3(v,n1,n2,n3,rnm2,rnmu,nx(lt),ny(lt),nz(lt))
+c     write(*,*)
+c     write(*,*)' norms of random v are'
+c     write(*,600) 0, rnm2, rnmu
+c     write(*,*)' about to evaluate resid, k=',k
+
+      write (*, 1001) nx(lt),ny(lt),nz(lt), Class
+      write (*, 1002) nit
+      write (*, *)
+
+ 1000 format(//,' NAS Parallel Benchmarks (NPB3.3-SER)',
+     >          ' - MG Benchmark', /)
+ 1001 format(' Size: ', i4, 'x', i4, 'x', i4, '  (class ', A, ')' )
+ 1002 format(' Iterations: ', i3)
+
+
+      call resid(u,v,r,n1,n2,n3,a,k)
+      call norm2u3(r,n1,n2,n3,rnm2,rnmu,nx(lt),ny(lt),nz(lt))
+      old2 = rnm2
+      oldu = rnmu
+
+c---------------------------------------------------------------------
+c     One iteration for startup
+c---------------------------------------------------------------------
+      call mg3P(u,v,r,a,c,n1,n2,n3,k)
+      call resid(u,v,r,n1,n2,n3,a,k)
+      call setup(n1,n2,n3,k)
+      call zero3(u,n1,n2,n3)
+      call zran3(v,n1,n2,n3,nx(lt),ny(lt),k)
+
+      call timer_stop(T_init)
+      tinit = timer_read(T_init)
+
+      write( *,'(A,F15.3,A/)' ) 
+     >     ' Initialization time: ',tinit, ' seconds'
+
+      do i = T_bench, T_last
+         call timer_clear(i)
+      end do
+
+      call timer_start(T_bench)
+
+      if (timeron) call timer_start(T_resid2)
+      call resid(u,v,r,n1,n2,n3,a,k)
+      if (timeron) call timer_stop(T_resid2)
+      call norm2u3(r,n1,n2,n3,rnm2,rnmu,nx(lt),ny(lt),nz(lt))
+      old2 = rnm2
+      oldu = rnmu
+
+      do  it=1,nit
+         if (it.eq.1 .or. it.eq.nit .or. mod(it,5).eq.0) then
+            write(*,80) it
+   80       format('  iter ',i3)
+         endif
+         if (timeron) call timer_start(T_mg3P)
+         call mg3P(u,v,r,a,c,n1,n2,n3,k)
+         if (timeron) call timer_stop(T_mg3P)
+         if (timeron) call timer_start(T_resid2)
+         call resid(u,v,r,n1,n2,n3,a,k)
+         if (timeron) call timer_stop(T_resid2)
+      enddo
+
+
+      call norm2u3(r,n1,n2,n3,rnm2,rnmu,nx(lt),ny(lt),nz(lt))
+
+      call timer_stop(T_bench)
+
+      t = timer_read(T_bench)
+
+      verified = .FALSE.
+      verify_value = 0.0
+
+      write(*,100)
+ 100  format(/' Benchmark completed ')
+
+      epsilon = 1.d-8
+      if (Class .ne. 'U') then
+         if(Class.eq.'S') then
+            verify_value = 0.5307707005734d-04
+         elseif(Class.eq.'W') then
+            verify_value = 0.6467329375339d-05
+         elseif(Class.eq.'A') then
+            verify_value = 0.2433365309069d-05
+         elseif(Class.eq.'B') then
+            verify_value = 0.1800564401355d-05
+         elseif(Class.eq.'C') then
+            verify_value = 0.5706732285740d-06
+         elseif(Class.eq.'D') then
+            verify_value = 0.1583275060440d-09
+         elseif(Class.eq.'E') then
+            verify_value = 0.8157592357404d-10
+         endif
+
+         err = abs( rnm2 - verify_value ) / verify_value
+c         err = abs( rnm2 - verify_value )
+         if( err .le. epsilon ) then
+            verified = .TRUE.
+            write(*, 200)
+            write(*, 201) rnm2
+            write(*, 202) err
+ 200        format(' VERIFICATION SUCCESSFUL ')
+ 201        format(' L2 Norm is ', E20.13)
+ 202        format(' Error is   ', E20.13)
+         else
+            verified = .FALSE.
+            write(*, 300) 
+            write(*, 301) rnm2
+            write(*, 302) verify_value
+ 300        format(' VERIFICATION FAILED')
+ 301        format(' L2 Norm is             ', E20.13)
+ 302        format(' The correct L2 Norm is ', E20.13)
+         endif
+      else
+         verified = .FALSE.
+         write (*, 400)
+         write (*, 401)
+         write (*, 201) rnm2
+ 400     format(' Problem size unknown')
+ 401     format(' NO VERIFICATION PERFORMED')
+      endif
+
+      nn = 1.0d0*nx(lt)*ny(lt)*nz(lt)
+
+      if( t .ne. 0. ) then
+         mflops = 58.*nit*nn*1.0D-6 /t
+      else
+         mflops = 0.0
+      endif
+
+      call print_results('MG', class, nx(lt), ny(lt), nz(lt), 
+     >                   nit, t,
+     >                   mflops, '          floating point', 
+     >                   verified, npbversion, compiletime,
+     >                   cs1, cs2, cs3, cs4, cs5, cs6, cs7)
+
+
+ 600  format( i4, 2e19.12)
+
+c---------------------------------------------------------------------
+c      More timers
+c---------------------------------------------------------------------
+      if (.not.timeron) goto 999
+
+      tmax = timer_read(t_bench)
+      if (tmax .eq. 0.0) tmax = 1.0
+
+      write(*,800)
+ 800  format('  SECTION   Time (secs)')
+      do i=t_bench, t_last
+         t = timer_read(i)
+         if (i.eq.t_resid2) then
+            t = timer_read(T_resid) - t
+            write(*,820) 'mg-resid', t, t*100./tmax
+         else
+            write(*,810) t_names(i), t, t*100./tmax
+         endif
+ 810     format(2x,a8,':',f9.3,'  (',f6.2,'%)')
+ 820     format('    --> ',a8,':',f9.3,'  (',f6.2,'%)')
+      end do
+
+ 999  continue
+
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine setup(n1,n2,n3,k)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+      include 'globals.h'
+
+      integer  is1, is2, is3, ie1, ie2, ie3
+      common /grid/ is1,is2,is3,ie1,ie2,ie3
+
+      integer n1,n2,n3,k
+      integer j
+
+      integer ax, mi(3,maxlevel)
+      integer ng(3,maxlevel)
+
+
+      ng(1,lt) = nx(lt)
+      ng(2,lt) = ny(lt)
+      ng(3,lt) = nz(lt)
+      do  ax=1,3
+         do  k=lt-1,1,-1
+            ng(ax,k) = ng(ax,k+1)/2
+         enddo
+      enddo
+ 61   format(10i4)
+      do  k=lt,1,-1
+         nx(k) = ng(1,k)
+         ny(k) = ng(2,k)
+         nz(k) = ng(3,k)
+      enddo
+
+      do  k = lt,1,-1
+         do  ax = 1,3
+            mi(ax,k) = 2 + ng(ax,k) 
+         enddo
+
+         m1(k) = mi(1,k)
+         m2(k) = mi(2,k)
+         m3(k) = mi(3,k)
+
+      enddo
+
+      k = lt
+      is1 = 2 + ng(1,k) - ng(1,lt)
+      ie1 = 1 + ng(1,k)
+      n1 = 3 + ie1 - is1
+      is2 = 2 + ng(2,k) - ng(2,lt)
+      ie2 = 1 + ng(2,k) 
+      n2 = 3 + ie2 - is2
+      is3 = 2 + ng(3,k) - ng(3,lt)
+      ie3 = 1 + ng(3,k) 
+      n3 = 3 + ie3 - is3
+
+
+      ir(lt)=1
+      do  j = lt-1, 1, -1
+         ir(j)=ir(j+1)+one*m1(j+1)*m2(j+1)*m3(j+1)
+      enddo
+
+
+      if( debug_vec(1) .ge. 1 )then
+         write(*,*)' in setup, '
+         write(*,*)' k  lt  nx  ny  nz ',
+     >        ' n1  n2  n3 is1 is2 is3 ie1 ie2 ie3'
+         write(*,9) k,lt,ng(1,k),ng(2,k),ng(3,k),
+     >              n1,n2,n3,is1,is2,is3,ie1,ie2,ie3
+ 9       format(15i4)
+      endif
+
+      k = lt
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine mg3P(u,v,r,a,c,n1,n2,n3,k)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     multigrid V-cycle routine
+c---------------------------------------------------------------------
+      implicit none
+
+      include 'globals.h'
+
+      integer n1, n2, n3, k
+      double precision u(nr),v(nv),r(nr)
+      double precision a(0:3),c(0:3)
+
+      integer j
+
+c---------------------------------------------------------------------
+c     down cycle.
+c     restrict the residual from the fine grid to the coarse
+c---------------------------------------------------------------------
+
+      do  k= lt, lb+1 , -1
+         j = k-1
+         call rprj3(r(ir(k)),m1(k),m2(k),m3(k),
+     >        r(ir(j)),m1(j),m2(j),m3(j),k)
+      enddo
+
+      k = lb
+c---------------------------------------------------------------------
+c     compute an approximate solution on the coarsest grid
+c---------------------------------------------------------------------
+      call zero3(u(ir(k)),m1(k),m2(k),m3(k))
+      call psinv(r(ir(k)),u(ir(k)),m1(k),m2(k),m3(k),c,k)
+
+      do  k = lb+1, lt-1     
+          j = k-1
+c---------------------------------------------------------------------
+c        prolongate from level k-1  to k
+c---------------------------------------------------------------------
+         call zero3(u(ir(k)),m1(k),m2(k),m3(k))
+         call interp(u(ir(j)),m1(j),m2(j),m3(j),
+     >               u(ir(k)),m1(k),m2(k),m3(k),k)
+c---------------------------------------------------------------------
+c        compute residual for level k
+c---------------------------------------------------------------------
+         call resid(u(ir(k)),r(ir(k)),r(ir(k)),m1(k),m2(k),m3(k),a,k)
+c---------------------------------------------------------------------
+c        apply smoother
+c---------------------------------------------------------------------
+         call psinv(r(ir(k)),u(ir(k)),m1(k),m2(k),m3(k),c,k)
+      enddo
+ 200  continue
+      j = lt - 1
+      k = lt
+      call interp(u(ir(j)),m1(j),m2(j),m3(j),u,n1,n2,n3,k)
+      call resid(u,v,r,n1,n2,n3,a,k)
+      call psinv(r,u,n1,n2,n3,c,k)
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine psinv( r,u,n1,n2,n3,c,k)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     psinv applies an approximate inverse as smoother:  u = u + Cr
+c
+c     This  implementation costs  15A + 4M per result, where
+c     A and M denote the costs of Addition and Multiplication.  
+c     Presuming coefficient c(3) is zero (the NPB assumes this,
+c     so this is not the general case), 2A + 1M may be eliminated,
+c     resulting in 13A + 3M.
+c     Note that this vectorizes, and is also fine for cache 
+c     based machines.  
+c---------------------------------------------------------------------
+      implicit none
+
+      include 'globals.h'
+
+      integer n1,n2,n3,k
+      double precision u(n1,n2,n3),r(n1,n2,n3),c(0:3)
+      integer i3, i2, i1
+
+      double precision r1(m), r2(m)
+
+      if (timeron) call timer_start(T_psinv)
+      do i3=2,n3-1
+         do i2=2,n2-1
+            do i1=1,n1
+               r1(i1) = r(i1,i2-1,i3) + r(i1,i2+1,i3)
+     >                + r(i1,i2,i3-1) + r(i1,i2,i3+1)
+               r2(i1) = r(i1,i2-1,i3-1) + r(i1,i2+1,i3-1)
+     >                + r(i1,i2-1,i3+1) + r(i1,i2+1,i3+1)
+            enddo
+            do i1=2,n1-1
+               u(i1,i2,i3) = u(i1,i2,i3)
+     >                     + c(0) * r(i1,i2,i3)
+     >                     + c(1) * ( r(i1-1,i2,i3) + r(i1+1,i2,i3)
+     >                              + r1(i1) )
+     >                     + c(2) * ( r2(i1) + r1(i1-1) + r1(i1+1) )
+c---------------------------------------------------------------------
+c  Assume c(3) = 0    (Enable line below if c(3) not= 0)
+c---------------------------------------------------------------------
+c    >                     + c(3) * ( r2(i1-1) + r2(i1+1) )
+c---------------------------------------------------------------------
+            enddo
+         enddo
+      enddo
+      if (timeron) call timer_stop(T_psinv)
+
+c---------------------------------------------------------------------
+c     exchange boundary points
+c---------------------------------------------------------------------
+      call comm3(u,n1,n2,n3,k)
+
+      if( debug_vec(0) .ge. 1 )then
+         call rep_nrm(u,n1,n2,n3,'   psinv',k)
+      endif
+
+      if( debug_vec(3) .ge. k )then
+         call showall(u,n1,n2,n3)
+      endif
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine resid( u,v,r,n1,n2,n3,a,k )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     resid computes the residual:  r = v - Au
+c
+c     This  implementation costs  15A + 4M per result, where
+c     A and M denote the costs of Addition (or Subtraction) and 
+c     Multiplication, respectively. 
+c     Presuming coefficient a(1) is zero (the NPB assumes this,
+c     so this is not the general case), 3A + 1M may be eliminated,
+c     resulting in 12A + 3M.
+c     Note that this vectorizes, and is also fine for cache 
+c     based machines.  
+c---------------------------------------------------------------------
+      implicit none
+
+      include 'globals.h'
+
+      integer n1,n2,n3,k
+      double precision u(n1,n2,n3),v(n1,n2,n3),r(n1,n2,n3),a(0:3)
+      integer i3, i2, i1
+      double precision u1(m), u2(m)
+
+      if (timeron) call timer_start(T_resid)
+      do i3=2,n3-1
+         do i2=2,n2-1
+            do i1=1,n1
+               u1(i1) = u(i1,i2-1,i3) + u(i1,i2+1,i3)
+     >                + u(i1,i2,i3-1) + u(i1,i2,i3+1)
+               u2(i1) = u(i1,i2-1,i3-1) + u(i1,i2+1,i3-1)
+     >                + u(i1,i2-1,i3+1) + u(i1,i2+1,i3+1)
+            enddo
+            do i1=2,n1-1
+               r(i1,i2,i3) = v(i1,i2,i3)
+     >                     - a(0) * u(i1,i2,i3)
+c---------------------------------------------------------------------
+c  Assume a(1) = 0      (Enable 2 lines below if a(1) not= 0)
+c---------------------------------------------------------------------
+c    >                     - a(1) * ( u(i1-1,i2,i3) + u(i1+1,i2,i3)
+c    >                              + u1(i1) )
+c---------------------------------------------------------------------
+     >                     - a(2) * ( u2(i1) + u1(i1-1) + u1(i1+1) )
+     >                     - a(3) * ( u2(i1-1) + u2(i1+1) )
+            enddo
+         enddo
+      enddo
+      if (timeron) call timer_stop(T_resid)
+
+c---------------------------------------------------------------------
+c     exchange boundary data
+c---------------------------------------------------------------------
+      call comm3(r,n1,n2,n3,k)
+
+      if( debug_vec(0) .ge. 1 )then
+         call rep_nrm(r,n1,n2,n3,'   resid',k)
+      endif
+
+      if( debug_vec(2) .ge. k )then
+         call showall(r,n1,n2,n3)
+      endif
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine rprj3( r,m1k,m2k,m3k,s,m1j,m2j,m3j,k )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     rprj3 projects onto the next coarser grid, 
+c     using a trilinear Finite Element projection:  s = r' = P r
+c     
+c     This  implementation costs  20A + 4M per result, where
+c     A and M denote the costs of Addition and Multiplication.  
+c     Note that this vectorizes, and is also fine for cache 
+c     based machines.  
+c---------------------------------------------------------------------
+      implicit none
+
+      include 'globals.h'
+
+      integer m1k, m2k, m3k, m1j, m2j, m3j,k
+      double precision r(m1k,m2k,m3k), s(m1j,m2j,m3j)
+      integer j3, j2, j1, i3, i2, i1, d1, d2, d3, j
+
+      double precision x1(m), y1(m), x2,y2
+
+      if (timeron) call timer_start(T_rprj3)
+      if(m1k.eq.3)then
+        d1 = 2
+      else
+        d1 = 1
+      endif
+
+      if(m2k.eq.3)then
+        d2 = 2
+      else
+        d2 = 1
+      endif
+
+      if(m3k.eq.3)then
+        d3 = 2
+      else
+        d3 = 1
+      endif
+
+      do  j3=2,m3j-1
+         i3 = 2*j3-d3
+         do  j2=2,m2j-1
+            i2 = 2*j2-d2
+
+            do j1=2,m1j
+              i1 = 2*j1-d1
+              x1(i1-1) = r(i1-1,i2-1,i3  ) + r(i1-1,i2+1,i3  )
+     >                 + r(i1-1,i2,  i3-1) + r(i1-1,i2,  i3+1)
+              y1(i1-1) = r(i1-1,i2-1,i3-1) + r(i1-1,i2-1,i3+1)
+     >                 + r(i1-1,i2+1,i3-1) + r(i1-1,i2+1,i3+1)
+            enddo
+
+            do  j1=2,m1j-1
+              i1 = 2*j1-d1
+              y2 = r(i1,  i2-1,i3-1) + r(i1,  i2-1,i3+1)
+     >           + r(i1,  i2+1,i3-1) + r(i1,  i2+1,i3+1)
+              x2 = r(i1,  i2-1,i3  ) + r(i1,  i2+1,i3  )
+     >           + r(i1,  i2,  i3-1) + r(i1,  i2,  i3+1)
+              s(j1,j2,j3) =
+     >               0.5D0 * r(i1,i2,i3)
+     >             + 0.25D0 * ( r(i1-1,i2,i3) + r(i1+1,i2,i3) + x2)
+     >             + 0.125D0 * ( x1(i1-1) + x1(i1+1) + y2)
+     >             + 0.0625D0 * ( y1(i1-1) + y1(i1+1) )
+            enddo
+
+         enddo
+      enddo
+      if (timeron) call timer_stop(T_rprj3)
+
+
+      j = k-1
+      call comm3(s,m1j,m2j,m3j,j)
+
+      if( debug_vec(0) .ge. 1 )then
+         call rep_nrm(s,m1j,m2j,m3j,'   rprj3',k-1)
+      endif
+
+      if( debug_vec(4) .ge. k )then
+         call showall(s,m1j,m2j,m3j)
+      endif
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine interp( z,mm1,mm2,mm3,u,n1,n2,n3,k )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     interp adds the trilinear interpolation of the correction
+c     from the coarser grid to the current approximation:  u = u + Qu'
+c     
+c     Observe that this  implementation costs  16A + 4M, where
+c     A and M denote the costs of Addition and Multiplication.  
+c     Note that this vectorizes, and is also fine for cache 
+c     based machines.  Vector machines may get slightly better 
+c     performance however, with 8 separate "do i1" loops, rather than 4.
+c---------------------------------------------------------------------
+      implicit none
+
+      include 'globals.h'
+
+      integer mm1, mm2, mm3, n1, n2, n3,k
+      double precision z(mm1,mm2,mm3),u(n1,n2,n3)
+      integer i3, i2, i1, d1, d2, d3, t1, t2, t3
+
+c Note that m = 1037 in globals.h, but here it only needs to be
+c 535 to handle up to 1024^3.
+c      integer m
+c      parameter( m=535 )
+      double precision z1(m),z2(m),z3(m)
+
+      if (timeron) call timer_start(T_interp)
+      if( n1 .ne. 3 .and. n2 .ne. 3 .and. n3 .ne. 3 ) then
+
+         do  i3=1,mm3-1
+            do  i2=1,mm2-1
+
+               do i1=1,mm1
+                  z1(i1) = z(i1,i2+1,i3) + z(i1,i2,i3)
+                  z2(i1) = z(i1,i2,i3+1) + z(i1,i2,i3)
+                  z3(i1) = z(i1,i2+1,i3+1) + z(i1,i2,i3+1) + z1(i1)
+               enddo
+
+               do  i1=1,mm1-1
+                  u(2*i1-1,2*i2-1,2*i3-1)=u(2*i1-1,2*i2-1,2*i3-1)
+     >                 +z(i1,i2,i3)
+                  u(2*i1,2*i2-1,2*i3-1)=u(2*i1,2*i2-1,2*i3-1)
+     >                 +0.5d0*(z(i1+1,i2,i3)+z(i1,i2,i3))
+               enddo
+               do i1=1,mm1-1
+                  u(2*i1-1,2*i2,2*i3-1)=u(2*i1-1,2*i2,2*i3-1)
+     >                 +0.5d0 * z1(i1)
+                  u(2*i1,2*i2,2*i3-1)=u(2*i1,2*i2,2*i3-1)
+     >                 +0.25d0*( z1(i1) + z1(i1+1) )
+               enddo
+               do i1=1,mm1-1
+                  u(2*i1-1,2*i2-1,2*i3)=u(2*i1-1,2*i2-1,2*i3)
+     >                 +0.5d0 * z2(i1)
+                  u(2*i1,2*i2-1,2*i3)=u(2*i1,2*i2-1,2*i3)
+     >                 +0.25d0*( z2(i1) + z2(i1+1) )
+               enddo
+               do i1=1,mm1-1
+                  u(2*i1-1,2*i2,2*i3)=u(2*i1-1,2*i2,2*i3)
+     >                 +0.25d0* z3(i1)
+                  u(2*i1,2*i2,2*i3)=u(2*i1,2*i2,2*i3)
+     >                 +0.125d0*( z3(i1) + z3(i1+1) )
+               enddo
+            enddo
+         enddo
+
+      else
+
+         if(n1.eq.3)then
+            d1 = 2
+            t1 = 1
+         else
+            d1 = 1
+            t1 = 0
+         endif
+         
+         if(n2.eq.3)then
+            d2 = 2
+            t2 = 1
+         else
+            d2 = 1
+            t2 = 0
+         endif
+         
+         if(n3.eq.3)then
+            d3 = 2
+            t3 = 1
+         else
+            d3 = 1
+            t3 = 0
+         endif
+         
+         do  i3=d3,mm3-1
+            do  i2=d2,mm2-1
+               do  i1=d1,mm1-1
+                  u(2*i1-d1,2*i2-d2,2*i3-d3)=u(2*i1-d1,2*i2-d2,2*i3-d3)
+     >                 +z(i1,i2,i3)
+               enddo
+               do  i1=1,mm1-1
+                  u(2*i1-t1,2*i2-d2,2*i3-d3)=u(2*i1-t1,2*i2-d2,2*i3-d3)
+     >                 +0.5D0*(z(i1+1,i2,i3)+z(i1,i2,i3))
+               enddo
+            enddo
+            do  i2=1,mm2-1
+               do  i1=d1,mm1-1
+                  u(2*i1-d1,2*i2-t2,2*i3-d3)=u(2*i1-d1,2*i2-t2,2*i3-d3)
+     >                 +0.5D0*(z(i1,i2+1,i3)+z(i1,i2,i3))
+               enddo
+               do  i1=1,mm1-1
+                  u(2*i1-t1,2*i2-t2,2*i3-d3)=u(2*i1-t1,2*i2-t2,2*i3-d3)
+     >                 +0.25D0*(z(i1+1,i2+1,i3)+z(i1+1,i2,i3)
+     >                 +z(i1,  i2+1,i3)+z(i1,  i2,i3))
+               enddo
+            enddo
+         enddo
+
+         do  i3=1,mm3-1
+            do  i2=d2,mm2-1
+               do  i1=d1,mm1-1
+                  u(2*i1-d1,2*i2-d2,2*i3-t3)=u(2*i1-d1,2*i2-d2,2*i3-t3)
+     >                 +0.5D0*(z(i1,i2,i3+1)+z(i1,i2,i3))
+               enddo
+               do  i1=1,mm1-1
+                  u(2*i1-t1,2*i2-d2,2*i3-t3)=u(2*i1-t1,2*i2-d2,2*i3-t3)
+     >                 +0.25D0*(z(i1+1,i2,i3+1)+z(i1,i2,i3+1)
+     >                 +z(i1+1,i2,i3  )+z(i1,i2,i3  ))
+               enddo
+            enddo
+            do  i2=1,mm2-1
+               do  i1=d1,mm1-1
+                  u(2*i1-d1,2*i2-t2,2*i3-t3)=u(2*i1-d1,2*i2-t2,2*i3-t3)
+     >                 +0.25D0*(z(i1,i2+1,i3+1)+z(i1,i2,i3+1)
+     >                 +z(i1,i2+1,i3  )+z(i1,i2,i3  ))
+               enddo
+               do  i1=1,mm1-1
+                  u(2*i1-t1,2*i2-t2,2*i3-t3)=u(2*i1-t1,2*i2-t2,2*i3-t3)
+     >                 +0.125D0*(z(i1+1,i2+1,i3+1)+z(i1+1,i2,i3+1)
+     >                 +z(i1  ,i2+1,i3+1)+z(i1  ,i2,i3+1)
+     >                 +z(i1+1,i2+1,i3  )+z(i1+1,i2,i3  )
+     >                 +z(i1  ,i2+1,i3  )+z(i1  ,i2,i3  ))
+               enddo
+            enddo
+         enddo
+
+      endif
+      if (timeron) call timer_stop(T_interp)
+
+      if( debug_vec(0) .ge. 1 )then
+         call rep_nrm(z,mm1,mm2,mm3,'z: inter',k-1)
+         call rep_nrm(u,n1,n2,n3,'u: inter',k)
+      endif
+
+      if( debug_vec(5) .ge. k )then
+         call showall(z,mm1,mm2,mm3)
+         call showall(u,n1,n2,n3)
+      endif
+
+      return 
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine norm2u3(r,n1,n2,n3,rnm2,rnmu,nx,ny,nz)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     norm2u3 evaluates approximations to the L2 norm and the
+c     uniform (or L-infinity or Chebyshev) norm, under the
+c     assumption that the boundaries are periodic or zero.  Add the
+c     boundaries in with half weight (quarter weight on the edges
+c     and eighth weight at the corners) for inhomogeneous boundaries.
+c---------------------------------------------------------------------
+      implicit none
+
+
+      integer n1, n2, n3, nx, ny, nz
+      double precision rnm2, rnmu, r(n1,n2,n3)
+      double precision s, a
+      integer i3, i2, i1
+
+      double precision dn
+
+      logical timeron
+      common /timers/ timeron
+      integer T_norm2
+      parameter (T_norm2=9)
+
+      if (timeron) call timer_start(T_norm2)
+      dn = 1.0d0*nx*ny*nz
+
+      s=0.0D0
+      rnmu = 0.0D0
+      do  i3=2,n3-1
+         do  i2=2,n2-1
+            do  i1=2,n1-1
+               s=s+r(i1,i2,i3)**2
+               a=abs(r(i1,i2,i3))
+               if(a.gt.rnmu)rnmu=a
+            enddo
+         enddo
+      enddo
+
+      rnm2=sqrt( s / dn )
+      if (timeron) call timer_stop(T_norm2)
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine rep_nrm(u,n1,n2,n3,title,kk)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     report on norm
+c---------------------------------------------------------------------
+      implicit none
+
+      include 'globals.h'
+
+      integer n1, n2, n3, kk
+      double precision u(n1,n2,n3)
+      character*8 title
+
+      double precision rnm2, rnmu
+
+
+      call norm2u3(u,n1,n2,n3,rnm2,rnmu,nx(kk),ny(kk),nz(kk))
+      write(*,7)kk,title,rnm2,rnmu
+ 7    format(' Level',i2,' in ',a8,': norms =',D21.14,D21.14)
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine comm3(u,n1,n2,n3,kk)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     comm3 organizes the communication on all borders 
+c---------------------------------------------------------------------
+      implicit none
+
+      include 'globals.h'
+
+      integer n1, n2, n3, kk
+      double precision u(n1,n2,n3)
+      integer i1, i2, i3
+
+      if (timeron) call timer_start(T_comm3)
+      do  i3=2,n3-1
+         do  i2=2,n2-1
+            u( 1,i2,i3) = u(n1-1,i2,i3)
+            u(n1,i2,i3) = u(   2,i2,i3)
+         enddo
+      enddo
+
+      do  i3=2,n3-1
+         do  i1=1,n1
+            u(i1, 1,i3) = u(i1,n2-1,i3)
+            u(i1,n2,i3) = u(i1,   2,i3)
+         enddo
+      enddo
+
+      do  i2=1,n2
+         do  i1=1,n1
+            u(i1,i2, 1) = u(i1,i2,n3-1)
+            u(i1,i2,n3) = u(i1,i2,   2)
+         enddo
+      enddo
+      if (timeron) call timer_stop(T_comm3)
+
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine zran3(z,n1,n2,n3,nx,ny,k)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     zran3  loads +1 at ten randomly chosen points,
+c     loads -1 at a different ten random points,
+c     and zero elsewhere.
+c---------------------------------------------------------------------
+      implicit none
+
+
+      integer  is1, is2, is3, ie1, ie2, ie3
+      common /grid/ is1,is2,is3,ie1,ie2,ie3
+
+      integer n1, n2, n3, k, nx, ny, i0, m0, m1
+      double precision z(n1,n2,n3)
+
+      integer mm, i1, i2, i3, d1, e1, e2, e3
+      double precision x, a
+      double precision xx, x0, x1, a1, a2, ai, power
+      parameter( mm = 10,  a = 5.D0 ** 13, x = 314159265.D0)
+      double precision ten( mm, 0:1 ), best
+      integer i, j1( mm, 0:1 ), j2( mm, 0:1 ), j3( mm, 0:1 )
+      integer jg( 0:3, mm, 0:1 )
+
+      external randlc
+      double precision randlc, rdummy
+
+      a1 = power( a, nx )
+      a2 = power( a, nx*ny )
+
+      call zero3(z,n1,n2,n3)
+
+      i = is1-2+nx*(is2-2+ny*(is3-2))
+
+      ai = power( a, i )
+      d1 = ie1 - is1 + 1
+      e1 = ie1 - is1 + 2
+      e2 = ie2 - is2 + 2
+      e3 = ie3 - is3 + 2
+      x0 = x
+      rdummy = randlc( x0, ai )
+      do  i3 = 2, e3
+         x1 = x0
+         do  i2 = 2, e2
+            xx = x1
+            call vranlc( d1, xx, a, z( 2, i2, i3 ))
+            rdummy = randlc( x1, a1 )
+         enddo
+         rdummy = randlc( x0, a2 )
+      enddo
+
+c---------------------------------------------------------------------
+c       call comm3(z,n1,n2,n3)
+c       call showall(z,n1,n2,n3)
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     each processor looks for twenty candidates
+c---------------------------------------------------------------------
+      do  i=1,mm
+         ten( i, 1 ) = 0.0D0
+         j1( i, 1 ) = 0
+         j2( i, 1 ) = 0
+         j3( i, 1 ) = 0
+         ten( i, 0 ) = 1.0D0
+         j1( i, 0 ) = 0
+         j2( i, 0 ) = 0
+         j3( i, 0 ) = 0
+      enddo
+
+      do  i3=2,n3-1
+         do  i2=2,n2-1
+            do  i1=2,n1-1
+               if( z(i1,i2,i3) .gt. ten( 1, 1 ) )then
+                  ten(1,1) = z(i1,i2,i3) 
+                  j1(1,1) = i1
+                  j2(1,1) = i2
+                  j3(1,1) = i3
+                  call bubble( ten, j1, j2, j3, mm, 1 )
+               endif
+               if( z(i1,i2,i3) .lt. ten( 1, 0 ) )then
+                  ten(1,0) = z(i1,i2,i3) 
+                  j1(1,0) = i1
+                  j2(1,0) = i2
+                  j3(1,0) = i3
+                  call bubble( ten, j1, j2, j3, mm, 0 )
+               endif
+            enddo
+         enddo
+      enddo
+
+
+c---------------------------------------------------------------------
+c     Now which of these are globally best?
+c---------------------------------------------------------------------
+      i1 = mm
+      i0 = mm
+      do  i=mm,1,-1
+
+         best = 0.d0
+         if(best .lt. ten( i1, 1 ))then
+            jg( 0, i, 1) = 0
+            jg( 1, i, 1) = is1 - 2 + j1( i1, 1 ) 
+            jg( 2, i, 1) = is2 - 2 + j2( i1, 1 ) 
+            jg( 3, i, 1) = is3 - 2 + j3( i1, 1 ) 
+            i1 = i1-1
+         else
+            jg( 0, i, 1) = 0
+            jg( 1, i, 1) = 0
+            jg( 2, i, 1) = 0
+            jg( 3, i, 1) = 0
+         endif
+
+         best = 1.d0
+         if(best .gt. ten( i0, 0 ))then
+            jg( 0, i, 0) = 0
+            jg( 1, i, 0) = is1 - 2 + j1( i0, 0 ) 
+            jg( 2, i, 0) = is2 - 2 + j2( i0, 0 ) 
+            jg( 3, i, 0) = is3 - 2 + j3( i0, 0 ) 
+            i0 = i0-1
+         else
+            jg( 0, i, 0) = 0
+            jg( 1, i, 0) = 0
+            jg( 2, i, 0) = 0
+            jg( 3, i, 0) = 0
+         endif
+
+      enddo
+c      m1 = i1+1
+c      m0 = i0+1
+      m1 = 1
+      m0 = 1
+
+c     write(*,*)' '
+c     write(*,*)' negative charges at'
+c     write(*,9)(jg(1,i,0),jg(2,i,0),jg(3,i,0),i=1,mm)
+c     write(*,*)' positive charges at'
+c     write(*,9)(jg(1,i,1),jg(2,i,1),jg(3,i,1),i=1,mm)
+c     write(*,*)' small random numbers were'
+c     write(*,8)(ten( i,0),i=mm,1,-1)
+c     write(*,*)' and they were found on processor number'
+c     write(*,7)(jg(0,i,0),i=mm,1,-1)
+c     write(*,*)' large random numbers were'
+c     write(*,8)(ten( i,1),i=mm,1,-1)
+c     write(*,*)' and they were found on processor number'
+c     write(*,7)(jg(0,i,1),i=mm,1,-1)
+c 9    format(5(' (',i3,2(',',i3),')'))
+c 8    format(5D15.8)
+c 7    format(10i4)
+      do  i3=1,n3
+         do  i2=1,n2
+            do  i1=1,n1
+               z(i1,i2,i3) = 0.0D0
+            enddo
+         enddo
+      enddo
+      do  i=mm,m0,-1
+         z( jg(1,i,0), jg(2,i,0), jg(3,i,0) ) = -1.0D0
+      enddo
+      do  i=mm,m1,-1
+         z( jg(1,i,1), jg(2,i,1), jg(3,i,1) ) = +1.0D0
+      enddo
+      call comm3(z,n1,n2,n3,k)
+
+c---------------------------------------------------------------------
+c          call showall(z,n1,n2,n3)
+c---------------------------------------------------------------------
+
+      return 
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine showall(z,n1,n2,n3)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+
+      integer n1,n2,n3,i1,i2,i3
+      double precision z(n1,n2,n3)
+      integer m1, m2, m3
+
+      m1 = min(n1,18)
+      m2 = min(n2,14)
+      m3 = min(n3,18)
+
+      write(*,*)'  '
+      do  i3=1,m3
+         do  i1=1,m1
+            write(*,6)(z(i1,i2,i3),i2=1,m2)
+         enddo
+         write(*,*)' - - - - - - - '
+      enddo
+      write(*,*)'  '
+ 6    format(15f6.3)
+
+      return 
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      double precision function power( a, n )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     power  raises an integer, disguised as a double
+c     precision real, to an integer power
+c---------------------------------------------------------------------
+      implicit none
+
+      double precision a, aj
+      integer n, nj
+      external randlc
+      double precision randlc, rdummy
+
+      power = 1.0D0
+      nj = n
+      aj = a
+ 100  continue
+
+      if( nj .eq. 0 ) goto 200
+      if( mod(nj,2) .eq. 1 ) rdummy =  randlc( power, aj )
+      rdummy = randlc( aj, aj )
+      nj = nj/2
+      go to 100
+
+ 200  continue
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine bubble( ten, j1, j2, j3, m, ind )
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c     bubble        does a bubble sort in direction ind
+c---------------------------------------------------------------------
+      implicit none
+
+
+      integer m, ind, j1( m, 0:1 ), j2( m, 0:1 ), j3( m, 0:1 )
+      double precision ten( m, 0:1 )
+      double precision temp
+      integer i, j_temp
+
+      if( ind .eq. 1 )then
+
+         do  i=1,m-1
+            if( ten(i,ind) .gt. ten(i+1,ind) )then
+
+               temp = ten( i+1, ind )
+               ten( i+1, ind ) = ten( i, ind )
+               ten( i, ind ) = temp
+
+               j_temp           = j1( i+1, ind )
+               j1( i+1, ind ) = j1( i,   ind )
+               j1( i,   ind ) = j_temp
+
+               j_temp           = j2( i+1, ind )
+               j2( i+1, ind ) = j2( i,   ind )
+               j2( i,   ind ) = j_temp
+
+               j_temp           = j3( i+1, ind )
+               j3( i+1, ind ) = j3( i,   ind )
+               j3( i,   ind ) = j_temp
+
+            else 
+               return
+            endif
+         enddo
+
+      else
+
+         do  i=1,m-1
+            if( ten(i,ind) .lt. ten(i+1,ind) )then
+
+               temp = ten( i+1, ind )
+               ten( i+1, ind ) = ten( i, ind )
+               ten( i, ind ) = temp
+
+               j_temp           = j1( i+1, ind )
+               j1( i+1, ind ) = j1( i,   ind )
+               j1( i,   ind ) = j_temp
+
+               j_temp           = j2( i+1, ind )
+               j2( i+1, ind ) = j2( i,   ind )
+               j2( i,   ind ) = j_temp
+
+               j_temp           = j3( i+1, ind )
+               j3( i+1, ind ) = j3( i,   ind )
+               j3( i,   ind ) = j_temp
+
+            else 
+               return
+            endif
+         enddo
+
+      endif
+
+      return
+      end
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine zero3(z,n1,n2,n3)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+
+      integer n1, n2, n3
+      double precision z(n1,n2,n3)
+      integer i1, i2, i3
+
+      do  i3=1,n3
+         do  i2=1,n2
+            do  i1=1,n1
+               z(i1,i2,i3)=0.0D0
+            enddo
+         enddo
+      enddo
+
+      return
+      end
+
+
+c----- end of program ------------------------------------------------
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/MG/mg.input.sample b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/MG/mg.input.sample
new file mode 100644
index 0000000..a4dcf81
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/MG/mg.input.sample
@@ -0,0 +1,4 @@
+ 8 = top level
+ 256 256 256 = nx ny nz
+ 20 = nit
+ 0 0 0 0 0 0 0 0 = debug_vec
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/Makefile b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/Makefile
new file mode 100644
index 0000000..820fb43
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/Makefile
@@ -0,0 +1,72 @@
+SHELL=/bin/sh
+CLASS=W
+VERSION=
+SFILE=config/suite.def
+
+default: header
+	@ sys/print_instructions
+
+BT: bt
+bt: header
+	cd BT; $(MAKE) CLASS=$(CLASS) VERSION=$(VERSION)
+
+SP: sp
+sp: header
+	cd SP; $(MAKE) CLASS=$(CLASS)
+
+LU: lu
+lu: header
+	cd LU; $(MAKE) CLASS=$(CLASS) VERSION=$(VERSION)
+
+MG: mg
+mg: header
+	cd MG; $(MAKE) CLASS=$(CLASS)
+
+FT: ft
+ft: header
+	cd FT; $(MAKE) CLASS=$(CLASS)
+
+IS: is
+is: header
+	cd IS; $(MAKE) CLASS=$(CLASS)
+
+CG: cg
+cg: header
+	cd CG; $(MAKE) CLASS=$(CLASS)
+
+EP: ep
+ep: header
+	cd EP; $(MAKE) CLASS=$(CLASS)
+
+UA: ua
+ua: header	       
+	cd UA; $(MAKE) CLASS=$(CLASS)
+
+DC: dc
+dc: header	       
+	cd DC; $(MAKE) CLASS=$(CLASS)
+
+# Awk script courtesy cmg@cray.com, modified by Haoqiang Jin
+suite:
+	@ awk -f sys/suite.awk SMAKE=$(MAKE) $(SFILE) | $(SHELL)
+
+
+# It would be nice to make clean in each subdirectory (the targets
+# are defined) but on a really clean system this won't work
+# because those makefiles need config/make.def
+clean:
+	- rm -f core 
+	- rm -f *~ */core */*~ */*.o */npbparams.h */*.obj */*.exe
+	- rm -f sys/setparams sys/makesuite sys/setparams.h
+	- rm -f {DC/,}ADC.{logf,view,dat,viewsz,groupby,chunks}.* 
+
+veryclean: clean
+	- rm -f config/make.def config/suite.def 
+	- rm -f bin/sp.* bin/lu.* bin/mg.* bin/ft.* bin/bt.* bin/is.*
+	- rm -f bin/ep.* bin/cg.* bin/ua.* bin/dc.*
+
+header:
+	@ sys/print_header
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/README b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/README
new file mode 100644
index 0000000..255a2a2
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/README
@@ -0,0 +1,72 @@
+NAS Parallel Benchmarks 3.3 - Serial version (NPB3.3-SER)
+---------------------------------------------------------
+
+   NAS Parallel Benchmarks Team
+   npb@nas.nasa.gov
+
+
+This directory contains the serial implementation of the NAS
+Parallel Benchmarks, Version 3.3 (NPB3.3-SER).  A brief
+summary of the new features introduced in this version is
+given below.
+
+For changes from different versions, see the Changes.log file
+included in the upper directory of this distribution.
+
+For explanation of compilation and running of the benchmarks,
+please refer to README.install.
+
+
+This version (3.3) introduces a new problem size (class E) to seven 
+of the benchmarks (BT, SP, LU, CG, MG, FT, and EP). The version 
+also includes the class D problem size for the IS benchmark.
+
+The release is merged with the vector codes for the BT and
+LU benchmarks, which can be selected with the VERSION=VEC option 
+during compilation.  However, successful vectorization highly 
+depends on the compiler used.  Some changes to compiler directives 
+for vectorization in the current codes (see *_vec.f files)
+may be required.
+
+
+Main changes in NPB3.3-SER:
+
+   - Introduction of the Class E problem size (except for IS, UA, DC)
+
+   - Include the Class D problem size for the IS benchmark.
+     The "Bucket" option is now the default.
+
+   - Merged with the vector codes for the BT and LU benchmarks.
+
+   - LU-HP is no longer included in the distribution.
+
+Main changes in NPB3.2-SER:
+
+   - Convert C++ version of DC to plain C.
+
+Main changes in NPB3.1-SER:
+
+   - Include the Class D problem size in all benchmarks except for
+     the IS benchmark.
+
+   - Redefine the Class W problem size for MG to avoid too fast
+     convergence.  The new size is 128x128x128, 4 iterations.
+
+   - Use relative errors for verification in MG and CG, which is
+     consistent with other benchmarks.
+
+   - Include one SSOR iteration before the time step loop in both
+     LU and LU-HP to touch all pages.
+
+   - Include the UA benchmark
+
+   - Include the DC benchmark
+
+The serial version of NPB3.0 (NPB3.0-SER) is based on NPB2.3-serial
+with the following improvements:
+
+   - memory optimization for BT and SP
+
+   - two implementations included for LU
+
+   - restructured FT
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/README.install b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/README.install
new file mode 100644
index 0000000..a3d79df
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/README.install
@@ -0,0 +1,203 @@
+Some explanations on NAS Parallel Benchmarks 3.3 - Serial version
+-----------------------------------------------------------------
+
+The serial version of NPB3.x (NPB3.x-SER) is based on NPB2.3-serial
+with a number of improvements (see Section 3 below) and the addition
+of two new benchmarks (UA and DC).
+
+For problem reports and suggestions on the implementation, please contact
+
+   NAS Parallel Benchmarks Team
+   npb@nas.nasa.gov
+
+
+1. Compilation
+
+   NPB3.3-SER uses the same directory tree as NPB2.3.
+   Before compilation, one needs to check the configuration file
+   'make.def' in the config directory and modify the file as necessary.  
+   If it does not (yet) exist, copy 'make.def.template' or one of the
+   sample files in the NAS.samples subdirectory to 'make.def' and 
+   edit the content for site- and machine-specific data.  Then
+
+      make <benchmark> CLASS=<class> [VERSION=VEC]
+
+   <benchmark> is one of (BT, SP, LU, FT, CG, MG, EP, IS, UA, DC) and 
+   <class> is one of (S, W, A, B, C).  Although Classes D and E are also 
+   defined for a number of benchmarks, the memory requirement and 
+   execution time likely exceed what most of the single processor 
+   systems can support.
+
+   Classes C, D and E are not defined for DC.
+   Class E is not defined for IS and UA.
+
+   The "VERSION=VEC" option is used for selecting the vectorized 
+   versions of BT and LU.
+
+   Class D for IS (Integer Sort) requires a compiler/system that 
+   supports the "long" type in C to be 64-bit.  As examples, the SGI 
+   MIPS compiler for the SGI Origin using the "-64" compilation flag and
+   the Intel compiler for IA64 are known to work.
+
+   In order to build the class E version of CG, the integer type
+   needs to be promoted to 64-bit, which is usually done through a
+   compilation flag (such as "-i8" for FFLAGS in config/make.def).
+
+   To build a suite of benchmarks, one can create the file 
+   "config/suite.def", which contains a list of executables to build.
+   Each line in the file contains the name of a benchmark and the class,
+   separated by spaces or tabs (see suite.def.template for an example).
+   Then
+
+      make suite
+
+
+   ================================
+   
+   The "RAND" variable in make.def
+   --------------------------------
+   
+   Most of the NPBs use a random number generator. In two of the NPBs (FT
+   and EP) the computation of random numbers is included in the timed
+   part of the calculation, and it is important that the random number
+   generator be efficient.  The default random number generator package
+   provided is called "randi8" and should be used where possible. It has 
+   the following requirements:
+   
+   randi8:
+     1. Uses integer*8 arithmetic. Compiler must support integer*8
+     2. Uses the Fortran 90 IAND intrinsic. Compiler must support IAND.
+     3. Assumes overflow bits are discarded by the hardware. In particular, 
+        that the lowest 46 bits of a*b are always correct, even if the 
+        result a*b is larger than 2^64. 
+   
+   Since randi8 may not work on all machines, we supply the following
+   alternatives:
+   
+   randi8_safe
+     1. Uses integer*8 arithmetic
+     2. Uses the Fortran 90 IBITS intrinsic. 
+     3. Does not make any assumptions about overflow. Should always
+        work correctly if compiler supports integer*8 and IBITS. 
+   
+   randdp
+     1. Uses double precision arithmetic (to simulate integer*8 operations). 
+        Should work with any system with support for 64-bit floating
+        point arithmetic.      
+   
+   randdpvec
+     1. Similar to randdp but written to be easier to vectorize. 
+
+
+2. Execution
+
+   The executable is named <benchmark-name>.<class>.x and is placed
+   in the bin subdirectory (or in the directory BINDIR specified in
+   make.def, if you've defined it).  NPB3.3-SER can be run as regular 
+   executables without additional settings.  For example:
+
+      bin/bt.A.x > BT.A_out
+
+   This runs the BT Class A problem and stores the output in BT.A_out.
+
+   Each benchmark includes a set of additional timers for profiling purposes
+   (reporting timing for selected code blocks).  By default, these timers
+   are disabled.  To enable the timers, create a dummy file 'timer.flag' 
+   in the current working directory (not necessarily where the executable 
+   is located) before running a benchmark.
+
+
+3. Notes on the implementation (NPB3.0-SER)
+
+3.1 BT
+
+This version is optimized for memory performance.  It uses much less
+memory than the original version due to the size reduction of working 
+arrays.
+
+Serial performance in comparison with the original NPB2.3-serial.
+----------------------------------------------------------------------
+Machine	(Speed) 	Class	NPB2.3-serial	NPB3.0-SER
+Origin2000 (250MHz)	A	2162.4(77.82)	1075.2(156.51)	50.3%
+T3E (300MHz)    	W	218.1(35.39)	117.0(65.95)	46.4%
+                	A	~5285.5(31.84)	2836.5(59.33)
+SGI R5000 (150MHz)	W	549.8(14.04)	265.0(29.13)	51.8%
+PPro (200MHz)		W	316.8(24.36)	121.2(63.69)	61.7%
+----------------------------------------------------------------------
+-- memory usage (Class A):
+      	 NPB2.3 - 323MB, PBN - 46MB
+----------------------------------------------------------------------
+
+3.2 SP
+
+This version is optimized for memory performance.  The smaller dimension
+in U and RHS was moved to the inner-most, which gives better cache
+performance.  However, the code is not as friendly to vector machines as
+the original version.
+
+Serial performance in comparison with the original NPB2.3-serial.
+----------------------------------------------------------------------
+Machine	(Speed) 	Class	NPB2.3-serial	NPB3.0-SER
+Origin2000 (250MHz)	A	1478.3(57.51)	971.4(87.52)	34.3%
+T3E (300MHz)    	A	3194.3(26.61)	1708.3(49.76)	46.5%
+SGI R5000 (150MHz)	W	1324.2(10.70)	774.1(18.31)	41.5%
+PPro (200MHz)		W	758.9(18.68)	449.0(31.57)	40.8%
+----------------------------------------------------------------------
+-- memory usage (Class A):
+      	 NPB2.3 - 82MB, PBN - 48MB
+----------------------------------------------------------------------
+
+3.3 LU and LU-hp
+
+LU is essentially the same as the original NPB2.3-serial.
+It is a good starting point for a pipeline implementation.
+
+LU-hp contains a hyper-plane implementation of the SSOR algorithm.
+The default version is 3-D hyper-plane and has worse cache performance
+than LU.  Six relevant routines for a 2-D hyper-plane (wave-front)
+implementation are included in the subdirectory 'ver2'.
+
+Some of the timings on a single processor:
+----------------------------------------------------------------------
+Class A                	LU      	LU-hp-3D	LU-hp-2D
+Origin2000 (250MHz)	1389.4(85.87)	1605.1(74.32)	1325.1(90.03)
+----------------------------------------------------------------------
+
+3.4 FT
+
+Summary of changes from NPB2.3-serial
+
+- Reduce the use of memory for big arrays by 1/3
+- Random number generator is made parallelizable
+
+3.5 CG, MG
+
+Except for removal of some working buffers (used in the MPI
+program), the implementation has the same structure as the
+NPB2.3-serial.
+
+3.6 EP
+
+It has the same implementation as in the original NPB2.3-serial.
+
+3.7 IS
+
+An extra array copy in the iteration loop was eliminated in the new
+version.  This improved performance by about 35% on a CLASS A problem
+on Origin2000 (195MHz).
+
+Old version (NPB2.3-serial)-
+ Time in seconds =                     9.06
+ Mop/s total     =                     9.25
+
+New version (NPB3.0-SER)-
+ Time in seconds =                     5.89
+ Mop/s total     =                    14.23
+
+
+3.8 Timers
+
+NPB3.x-SER includes additional timers in the seven Fortran
+benchmarks.  To activate these timers, create a dummy file
+'timer.flag' in the directory where the program is to run.
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/Makefile b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/Makefile
new file mode 100644
index 0000000..9ecbf08
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/Makefile
@@ -0,0 +1,55 @@
+SHELL=/bin/sh
+BENCHMARK=sp
+BENCHMARKU=SP
+
+include ../config/make.def
+
+
+OBJS = sp.o initialize.o exact_solution.o exact_rhs.o \
+       set_constants.o adi.o rhs.o      \
+       x_solve.o ninvr.o y_solve.o pinvr.o    \
+       z_solve.o tzetar.o add.o txinvr.o error.o verify.o  \
+       ${COMMON}/print_results.o ${COMMON}/timers.o ${COMMON}/wtime.o
+
+include ../sys/make.common
+
+# npbparams.h is included by header.h
+# The following rule should do the trick but many make programs (not gmake)
+# will do the wrong thing and rebuild the world every time (because the
+# mod time on header.h is not changed). One solution would be to 
+# touch header.h but this might cause confusion if someone has
+# accidentally deleted it. Instead, make the dependency on npbparams.h
+# explicit in all the lines below (even though dependence is indirect). 
+
+# header.h: npbparams.h
+
+${PROGRAM}: config ${OBJS}
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} ${F_LIB}
+
+.f.o:
+	${FCOMPILE} $<
+
+sp.o:             sp.f  header.h npbparams.h
+initialize.o:     initialize.f  header.h npbparams.h
+exact_solution.o: exact_solution.f  header.h npbparams.h
+exact_rhs.o:      exact_rhs.f  header.h npbparams.h
+set_constants.o:  set_constants.f  header.h npbparams.h
+adi.o:            adi.f  header.h npbparams.h
+rhs.o:            rhs.f  header.h npbparams.h
+#lhsx.o:           lhsx.f  header.h npbparams.h
+#lhsy.o:           lhsy.f  header.h npbparams.h
+#lhsz.o:           lhsz.f  header.h npbparams.h
+x_solve.o:        x_solve.f  header.h npbparams.h
+ninvr.o:          ninvr.f  header.h npbparams.h
+y_solve.o:        y_solve.f  header.h npbparams.h
+pinvr.o:          pinvr.f  header.h npbparams.h
+z_solve.o:        z_solve.f  header.h npbparams.h
+tzetar.o:         tzetar.f  header.h npbparams.h
+add.o:            add.f  header.h npbparams.h
+txinvr.o:         txinvr.f  header.h npbparams.h
+error.o:          error.f  header.h npbparams.h
+verify.o:         verify.f  header.h npbparams.h
+
+clean:
+	- rm -f *.o *~ mputil*
+	- rm -f npbparams.h core
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/add.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/add.f
new file mode 100644
index 0000000..bc3ad25
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/add.f
@@ -0,0 +1,32 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine  add
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c addition of update to the vector u
+c---------------------------------------------------------------------
+
+       include 'header.h'
+
+       integer i,j,k,m
+
+       if (timeron) call timer_start(t_add)
+       do k = 1, nz2
+          do j = 1, ny2
+             do i = 1, nx2
+                do m = 1, 5
+                   u(m,i,j,k) = u(m,i,j,k) + rhs(m,i,j,k)
+                end do
+             end do
+          end do
+       end do
+       if (timeron) call timer_stop(t_add)
+
+       return
+       end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/adi.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/adi.f
new file mode 100644
index 0000000..6e46da9
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/adi.f
@@ -0,0 +1,24 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine  adi
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       call compute_rhs
+
+       call txinvr
+
+       call x_solve
+
+       call y_solve
+
+       call z_solve
+
+       call add
+
+       return
+       end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/error.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/error.f
new file mode 100644
index 0000000..db2de1f
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/error.f
@@ -0,0 +1,86 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine error_norm(rms)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c this function computes the norm of the difference between the
+c computed solution and the exact solution
+c---------------------------------------------------------------------
+
+       include 'header.h'
+
+       integer i, j, k, m, d
+       double precision xi, eta, zeta, u_exact(5), rms(5), add
+
+
+       do    m = 1, 5
+          rms(m) = 0.0d0
+       end do
+
+       do   k = 0, grid_points(3)-1
+          zeta = dble(k) * dnzm1
+          do   j = 0, grid_points(2)-1
+             eta = dble(j) * dnym1
+             do   i = 0, grid_points(1)-1
+                xi = dble(i) * dnxm1
+                call exact_solution(xi, eta, zeta, u_exact)
+
+                do   m = 1, 5
+                   add = u(m,i,j,k)-u_exact(m)
+                   rms(m) = rms(m) + add*add
+                end do
+             end do
+          end do
+       end do
+
+       do    m = 1, 5
+          do    d = 1, 3
+             rms(m) = rms(m) / dble(grid_points(d)-2)
+          end do
+          rms(m) = dsqrt(rms(m))
+       end do
+
+       return
+       end
+
+
+
+       subroutine rhs_norm(rms)
+
+       include 'header.h'
+
+       integer i, j, k, d, m
+       double precision rms(5), add
+
+
+       do   m = 1, 5
+          rms(m) = 0.0d0
+       end do
+
+       do k = 1, nz2
+          do j = 1, ny2
+             do i = 1, nx2
+               do m = 1, 5
+                  add = rhs(m,i,j,k)
+                  rms(m) = rms(m) + add*add
+               end do 
+             end do 
+          end do 
+       end do 
+
+       do   m = 1, 5
+          do   d = 1, 3
+             rms(m) = rms(m) / dble(grid_points(d)-2)
+          end do
+          rms(m) = dsqrt(rms(m))
+       end do
+
+       return
+       end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/exact_rhs.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/exact_rhs.f
new file mode 100644
index 0000000..4939942
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/exact_rhs.f
@@ -0,0 +1,344 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine exact_rhs
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c compute the right hand side based on exact solution
+c---------------------------------------------------------------------
+
+       include 'header.h'
+
+       double precision dtemp(5), xi, eta, zeta, dtpp
+       integer          m, i, j, k, ip1, im1, jp1, 
+     >                  jm1, km1, kp1
+
+c---------------------------------------------------------------------
+c      initialize                                  
+c---------------------------------------------------------------------
+       do   k= 0, grid_points(3)-1
+          do   j = 0, grid_points(2)-1
+             do   i = 0, grid_points(1)-1
+                do   m = 1, 5
+                   forcing(m,i,j,k) = 0.0d0
+                end do
+             end do
+          end do
+       end do
+
+c---------------------------------------------------------------------
+c      xi-direction flux differences                      
+c---------------------------------------------------------------------
+       do   k = 1, grid_points(3)-2
+          zeta = dble(k) * dnzm1
+          do   j = 1, grid_points(2)-2
+             eta = dble(j) * dnym1
+
+             do  i=0, grid_points(1)-1
+                xi = dble(i) * dnxm1
+
+                call exact_solution(xi, eta, zeta, dtemp)
+                do  m = 1, 5
+                   ue(i,m) = dtemp(m)
+                end do
+
+                dtpp = 1.0d0 / dtemp(1)
+
+                do  m = 2, 5
+                   buf(i,m) = dtpp * dtemp(m)
+                end do
+
+                cuf(i)   = buf(i,2) * buf(i,2)
+                buf(i,1) = cuf(i) + buf(i,3) * buf(i,3) + 
+     >                     buf(i,4) * buf(i,4) 
+                q(i) = 0.5d0*(buf(i,2)*ue(i,2) + buf(i,3)*ue(i,3) +
+     >                        buf(i,4)*ue(i,4))
+
+             end do
+ 
+             do  i = 1, grid_points(1)-2
+                im1 = i-1
+                ip1 = i+1
+
+                forcing(1,i,j,k) = forcing(1,i,j,k) -
+     >                 tx2*( ue(ip1,2)-ue(im1,2) )+
+     >                 dx1tx1*(ue(ip1,1)-2.0d0*ue(i,1)+ue(im1,1))
+
+                forcing(2,i,j,k) = forcing(2,i,j,k) - tx2 * (
+     >                (ue(ip1,2)*buf(ip1,2)+c2*(ue(ip1,5)-q(ip1)))-
+     >                (ue(im1,2)*buf(im1,2)+c2*(ue(im1,5)-q(im1))))+
+     >                 xxcon1*(buf(ip1,2)-2.0d0*buf(i,2)+buf(im1,2))+
+     >                 dx2tx1*( ue(ip1,2)-2.0d0* ue(i,2)+ue(im1,2))
+
+                forcing(3,i,j,k) = forcing(3,i,j,k) - tx2 * (
+     >                 ue(ip1,3)*buf(ip1,2)-ue(im1,3)*buf(im1,2))+
+     >                 xxcon2*(buf(ip1,3)-2.0d0*buf(i,3)+buf(im1,3))+
+     >                 dx3tx1*( ue(ip1,3)-2.0d0*ue(i,3) +ue(im1,3))
+                  
+                forcing(4,i,j,k) = forcing(4,i,j,k) - tx2*(
+     >                 ue(ip1,4)*buf(ip1,2)-ue(im1,4)*buf(im1,2))+
+     >                 xxcon2*(buf(ip1,4)-2.0d0*buf(i,4)+buf(im1,4))+
+     >                 dx4tx1*( ue(ip1,4)-2.0d0* ue(i,4)+ ue(im1,4))
+
+                forcing(5,i,j,k) = forcing(5,i,j,k) - tx2*(
+     >                 buf(ip1,2)*(c1*ue(ip1,5)-c2*q(ip1))-
+     >                 buf(im1,2)*(c1*ue(im1,5)-c2*q(im1)))+
+     >                 0.5d0*xxcon3*(buf(ip1,1)-2.0d0*buf(i,1)+
+     >                               buf(im1,1))+
+     >                 xxcon4*(cuf(ip1)-2.0d0*cuf(i)+cuf(im1))+
+     >                 xxcon5*(buf(ip1,5)-2.0d0*buf(i,5)+buf(im1,5))+
+     >                 dx5tx1*( ue(ip1,5)-2.0d0* ue(i,5)+ ue(im1,5))
+             end do
+
+c---------------------------------------------------------------------
+c            Fourth-order dissipation                         
+c---------------------------------------------------------------------
+             do   m = 1, 5
+                i = 1
+                forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                    (5.0d0*ue(i,m) - 4.0d0*ue(i+1,m) +ue(i+2,m))
+                i = 2
+                forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                   (-4.0d0*ue(i-1,m) + 6.0d0*ue(i,m) -
+     >                     4.0d0*ue(i+1,m) +       ue(i+2,m))
+             end do
+
+             do   m = 1, 5
+                do  i = 3, grid_points(1)-4
+                   forcing(m,i,j,k) = forcing(m,i,j,k) - dssp*
+     >                   (ue(i-2,m) - 4.0d0*ue(i-1,m) +
+     >                    6.0d0*ue(i,m) - 4.0d0*ue(i+1,m) + ue(i+2,m))
+                end do
+             end do
+
+             do   m = 1, 5
+                i = grid_points(1)-3
+                forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                   (ue(i-2,m) - 4.0d0*ue(i-1,m) +
+     >                    6.0d0*ue(i,m) - 4.0d0*ue(i+1,m))
+                i = grid_points(1)-2
+                forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                   (ue(i-2,m) - 4.0d0*ue(i-1,m) + 5.0d0*ue(i,m))
+             end do
+
+          end do
+       end do
+
+c---------------------------------------------------------------------
+c  eta-direction flux differences             
+c---------------------------------------------------------------------
+       do   k = 1, grid_points(3)-2          
+          zeta = dble(k) * dnzm1
+          do   i=1, grid_points(1)-2
+             xi = dble(i) * dnxm1
+
+             do  j=0, grid_points(2)-1
+                eta = dble(j) * dnym1
+
+                call exact_solution(xi, eta, zeta, dtemp)
+                do   m = 1, 5 
+                   ue(j,m) = dtemp(m)
+                end do
+                dtpp = 1.0d0/dtemp(1)
+
+                do  m = 2, 5
+                   buf(j,m) = dtpp * dtemp(m)
+                end do
+
+                cuf(j)   = buf(j,3) * buf(j,3)
+                buf(j,1) = cuf(j) + buf(j,2) * buf(j,2) + 
+     >                     buf(j,4) * buf(j,4)
+                q(j) = 0.5d0*(buf(j,2)*ue(j,2) + buf(j,3)*ue(j,3) +
+     >                        buf(j,4)*ue(j,4))
+             end do
+
+             do  j = 1, grid_points(2)-2
+                jm1 = j-1
+                jp1 = j+1
+                  
+                forcing(1,i,j,k) = forcing(1,i,j,k) -
+     >                ty2*( ue(jp1,3)-ue(jm1,3) )+
+     >                dy1ty1*(ue(jp1,1)-2.0d0*ue(j,1)+ue(jm1,1))
+
+                forcing(2,i,j,k) = forcing(2,i,j,k) - ty2*(
+     >                ue(jp1,2)*buf(jp1,3)-ue(jm1,2)*buf(jm1,3))+
+     >                yycon2*(buf(jp1,2)-2.0d0*buf(j,2)+buf(jm1,2))+
+     >                dy2ty1*( ue(jp1,2)-2.0d0*ue(j,2)+ ue(jm1,2))
+
+                forcing(3,i,j,k) = forcing(3,i,j,k) - ty2*(
+     >                (ue(jp1,3)*buf(jp1,3)+c2*(ue(jp1,5)-q(jp1)))-
+     >                (ue(jm1,3)*buf(jm1,3)+c2*(ue(jm1,5)-q(jm1))))+
+     >                yycon1*(buf(jp1,3)-2.0d0*buf(j,3)+buf(jm1,3))+
+     >                dy3ty1*( ue(jp1,3)-2.0d0*ue(j,3) +ue(jm1,3))
+
+                forcing(4,i,j,k) = forcing(4,i,j,k) - ty2*(
+     >                ue(jp1,4)*buf(jp1,3)-ue(jm1,4)*buf(jm1,3))+
+     >                yycon2*(buf(jp1,4)-2.0d0*buf(j,4)+buf(jm1,4))+
+     >                dy4ty1*( ue(jp1,4)-2.0d0*ue(j,4)+ ue(jm1,4))
+
+                forcing(5,i,j,k) = forcing(5,i,j,k) - ty2*(
+     >                buf(jp1,3)*(c1*ue(jp1,5)-c2*q(jp1))-
+     >                buf(jm1,3)*(c1*ue(jm1,5)-c2*q(jm1)))+
+     >                0.5d0*yycon3*(buf(jp1,1)-2.0d0*buf(j,1)+
+     >                              buf(jm1,1))+
+     >                yycon4*(cuf(jp1)-2.0d0*cuf(j)+cuf(jm1))+
+     >                yycon5*(buf(jp1,5)-2.0d0*buf(j,5)+buf(jm1,5))+
+     >                dy5ty1*(ue(jp1,5)-2.0d0*ue(j,5)+ue(jm1,5))
+             end do
+
+c---------------------------------------------------------------------
+c            Fourth-order dissipation                      
+c---------------------------------------------------------------------
+             do   m = 1, 5
+                j = 1
+                forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                    (5.0d0*ue(j,m) - 4.0d0*ue(j+1,m) +ue(j+2,m))
+                j = 2
+                forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                   (-4.0d0*ue(j-1,m) + 6.0d0*ue(j,m) -
+     >                     4.0d0*ue(j+1,m) +       ue(j+2,m))
+             end do
+
+             do   m = 1, 5
+                do  j = 3, grid_points(2)-4
+                   forcing(m,i,j,k) = forcing(m,i,j,k) - dssp*
+     >                   (ue(j-2,m) - 4.0d0*ue(j-1,m) +
+     >                    6.0d0*ue(j,m) - 4.0d0*ue(j+1,m) + ue(j+2,m))
+                end do
+             end do
+
+             do   m = 1, 5
+                j = grid_points(2)-3
+                forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                   (ue(j-2,m) - 4.0d0*ue(j-1,m) +
+     >                    6.0d0*ue(j,m) - 4.0d0*ue(j+1,m))
+                j = grid_points(2)-2
+                forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                   (ue(j-2,m) - 4.0d0*ue(j-1,m) + 5.0d0*ue(j,m))
+
+             end do
+
+          end do
+       end do
+
+c---------------------------------------------------------------------
+c      zeta-direction flux differences                      
+c---------------------------------------------------------------------
+       do  j=1, grid_points(2)-2
+          eta = dble(j) * dnym1
+          do   i = 1, grid_points(1)-2
+             xi = dble(i) * dnxm1
+
+             do k=0, grid_points(3)-1
+                zeta = dble(k) * dnzm1
+
+                call exact_solution(xi, eta, zeta, dtemp)
+                do   m = 1, 5
+                   ue(k,m) = dtemp(m)
+                end do
+
+                dtpp = 1.0d0/dtemp(1)
+
+                do   m = 2, 5
+                   buf(k,m) = dtpp * dtemp(m)
+                end do
+
+                cuf(k)   = buf(k,4) * buf(k,4)
+                buf(k,1) = cuf(k) + buf(k,2) * buf(k,2) +
+     >                     buf(k,3) * buf(k,3)
+                q(k) = 0.5d0*(buf(k,2)*ue(k,2) + buf(k,3)*ue(k,3) +
+     >                        buf(k,4)*ue(k,4))
+             end do
+
+             do    k=1, grid_points(3)-2
+                km1 = k-1
+                kp1 = k+1
+
+                forcing(1,i,j,k) = forcing(1,i,j,k) -
+     >                 tz2*( ue(kp1,4)-ue(km1,4) )+
+     >                 dz1tz1*(ue(kp1,1)-2.0d0*ue(k,1)+ue(km1,1))
+
+                forcing(2,i,j,k) = forcing(2,i,j,k) - tz2 * (
+     >                 ue(kp1,2)*buf(kp1,4)-ue(km1,2)*buf(km1,4))+
+     >                 zzcon2*(buf(kp1,2)-2.0d0*buf(k,2)+buf(km1,2))+
+     >                 dz2tz1*( ue(kp1,2)-2.0d0* ue(k,2)+ ue(km1,2))
+
+                forcing(3,i,j,k) = forcing(3,i,j,k) - tz2 * (
+     >                 ue(kp1,3)*buf(kp1,4)-ue(km1,3)*buf(km1,4))+
+     >                 zzcon2*(buf(kp1,3)-2.0d0*buf(k,3)+buf(km1,3))+
+     >                 dz3tz1*(ue(kp1,3)-2.0d0*ue(k,3)+ue(km1,3))
+
+                forcing(4,i,j,k) = forcing(4,i,j,k) - tz2 * (
+     >                (ue(kp1,4)*buf(kp1,4)+c2*(ue(kp1,5)-q(kp1)))-
+     >                (ue(km1,4)*buf(km1,4)+c2*(ue(km1,5)-q(km1))))+
+     >                zzcon1*(buf(kp1,4)-2.0d0*buf(k,4)+buf(km1,4))+
+     >                dz4tz1*( ue(kp1,4)-2.0d0*ue(k,4) +ue(km1,4))
+
+                forcing(5,i,j,k) = forcing(5,i,j,k) - tz2 * (
+     >                 buf(kp1,4)*(c1*ue(kp1,5)-c2*q(kp1))-
+     >                 buf(km1,4)*(c1*ue(km1,5)-c2*q(km1)))+
+     >                 0.5d0*zzcon3*(buf(kp1,1)-2.0d0*buf(k,1)
+     >                              +buf(km1,1))+
+     >                 zzcon4*(cuf(kp1)-2.0d0*cuf(k)+cuf(km1))+
+     >                 zzcon5*(buf(kp1,5)-2.0d0*buf(k,5)+buf(km1,5))+
+     >                 dz5tz1*( ue(kp1,5)-2.0d0*ue(k,5)+ ue(km1,5))
+             end do
+
+c---------------------------------------------------------------------
+c            Fourth-order dissipation
+c---------------------------------------------------------------------
+             do   m = 1, 5
+                k = 1
+                forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                    (5.0d0*ue(k,m) - 4.0d0*ue(k+1,m) +ue(k+2,m))
+                k = 2
+                forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                   (-4.0d0*ue(k-1,m) + 6.0d0*ue(k,m) -
+     >                     4.0d0*ue(k+1,m) +       ue(k+2,m))
+             end do
+
+             do   m = 1, 5
+                do  k = 3, grid_points(3)-4
+                   forcing(m,i,j,k) = forcing(m,i,j,k) - dssp*
+     >                   (ue(k-2,m) - 4.0d0*ue(k-1,m) +
+     >                    6.0d0*ue(k,m) - 4.0d0*ue(k+1,m) + ue(k+2,m))
+                end do
+             end do
+
+             do    m = 1, 5
+                k = grid_points(3)-3
+                forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                   (ue(k-2,m) - 4.0d0*ue(k-1,m) +
+     >                    6.0d0*ue(k,m) - 4.0d0*ue(k+1,m))
+                k = grid_points(3)-2
+                forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *
+     >                   (ue(k-2,m) - 4.0d0*ue(k-1,m) + 5.0d0*ue(k,m))
+             end do
+
+          end do
+       end do
+
+c---------------------------------------------------------------------
+c now change the sign of the forcing function
+c---------------------------------------------------------------------
+       do   k = 1, grid_points(3)-2
+          do   j = 1, grid_points(2)-2
+             do   i = 1, grid_points(1)-2
+                do   m = 1, 5
+                   forcing(m,i,j,k) = -1.d0 * forcing(m,i,j,k)
+                end do
+             end do
+          end do
+       end do
+
+       return
+       end
+
+
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/exact_solution.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/exact_solution.f
new file mode 100644
index 0000000..abd31bb
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/exact_solution.f
@@ -0,0 +1,30 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine exact_solution(xi,eta,zeta,dtemp)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c this subroutine returns the exact solution at point xi, eta, zeta
+c---------------------------------------------------------------------
+
+       include 'header.h'
+
+       double precision  xi, eta, zeta, dtemp(5)
+       integer m
+
+       do  m = 1, 5
+          dtemp(m) =  ce(m,1) +
+     >    xi*(ce(m,2) + xi*(ce(m,5) + xi*(ce(m,8) + xi*ce(m,11)))) +
+     >    eta*(ce(m,3) + eta*(ce(m,6) + eta*(ce(m,9) + eta*ce(m,12))))+
+     >    zeta*(ce(m,4) + zeta*(ce(m,7) + zeta*(ce(m,10) + 
+     >    zeta*ce(m,13))))
+       end do
+
+       return
+       end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/header.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/header.h
new file mode 100644
index 0000000..90b658f
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/header.h
@@ -0,0 +1,109 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+c---------------------------------------------------------------------
+c The following include file is generated automatically by the
+c "setparams" utility. It defines 
+c      problem_size:  12, 64, 102, 162 (for class T, A, B, C)
+c      dt_default:    default time step for this problem size if no
+c                     config file
+c      niter_default: default number of iterations for this problem size
+c---------------------------------------------------------------------
+
+      include 'npbparams.h'
+
+      integer           grid_points(3), nx2, ny2, nz2
+      logical           timeron
+      common /global/   grid_points, nx2, ny2, nz2, timeron
+
+      double precision  tx1, tx2, tx3, ty1, ty2, ty3, tz1, tz2, tz3, 
+     >                  dx1, dx2, dx3, dx4, dx5, dy1, dy2, dy3, dy4, 
+     >                  dy5, dz1, dz2, dz3, dz4, dz5, dssp, dt, 
+     >                  ce(5,13), dxmax, dymax, dzmax, xxcon1, xxcon2, 
+     >                  xxcon3, xxcon4, xxcon5, dx1tx1, dx2tx1, dx3tx1,
+     >                  dx4tx1, dx5tx1, yycon1, yycon2, yycon3, yycon4,
+     >                  yycon5, dy1ty1, dy2ty1, dy3ty1, dy4ty1, dy5ty1,
+     >                  zzcon1, zzcon2, zzcon3, zzcon4, zzcon5, dz1tz1, 
+     >                  dz2tz1, dz3tz1, dz4tz1, dz5tz1, dnxm1, dnym1, 
+     >                  dnzm1, c1c2, c1c5, c3c4, c1345, conz1, c1, c2, 
+     >                  c3, c4, c5, c4dssp, c5dssp, dtdssp, dttx1, bt,
+     >                  dttx2, dtty1, dtty2, dttz1, dttz2, c2dttx1, 
+     >                  c2dtty1, c2dttz1, comz1, comz4, comz5, comz6, 
+     >                  c3c4tx3, c3c4ty3, c3c4tz3, c2iv, con43, con16
+
+      common /constants/ tx1, tx2, tx3, ty1, ty2, ty3, tz1, tz2, tz3,
+     >                  dx1, dx2, dx3, dx4, dx5, dy1, dy2, dy3, dy4, 
+     >                  dy5, dz1, dz2, dz3, dz4, dz5, dssp, dt, 
+     >                  ce, dxmax, dymax, dzmax, xxcon1, xxcon2, 
+     >                  xxcon3, xxcon4, xxcon5, dx1tx1, dx2tx1, dx3tx1,
+     >                  dx4tx1, dx5tx1, yycon1, yycon2, yycon3, yycon4,
+     >                  yycon5, dy1ty1, dy2ty1, dy3ty1, dy4ty1, dy5ty1,
+     >                  zzcon1, zzcon2, zzcon3, zzcon4, zzcon5, dz1tz1, 
+     >                  dz2tz1, dz3tz1, dz4tz1, dz5tz1, dnxm1, dnym1, 
+     >                  dnzm1, c1c2, c1c5, c3c4, c1345, conz1, c1, c2, 
+     >                  c3, c4, c5, c4dssp, c5dssp, dtdssp, dttx1, bt,
+     >                  dttx2, dtty1, dtty2, dttz1, dttz2, c2dttx1, 
+     >                  c2dtty1, c2dttz1, comz1, comz4, comz5, comz6, 
+     >                  c3c4tx3, c3c4ty3, c3c4tz3, c2iv, con43, con16
+
+      integer IMAX, JMAX, KMAX, IMAXP, JMAXP
+
+      parameter (IMAX=problem_size,JMAX=problem_size,KMAX=problem_size)
+      parameter (IMAXP=IMAX/2*2,JMAXP=JMAX/2*2)
+
+c---------------------------------------------------------------------
+c   To improve cache performance, grid dimensions are padded by 1
+c   for even sizes only
+c---------------------------------------------------------------------
+      double precision 
+     >   u       (5, 0:IMAXP, 0:JMAXP, 0:KMAX-1),
+     >   us      (   0:IMAXP, 0:JMAXP, 0:KMAX-1),
+     >   vs      (   0:IMAXP, 0:JMAXP, 0:KMAX-1),
+     >   ws      (   0:IMAXP, 0:JMAXP, 0:KMAX-1),
+     >   qs      (   0:IMAXP, 0:JMAXP, 0:KMAX-1),
+     >   rho_i   (   0:IMAXP, 0:JMAXP, 0:KMAX-1),
+     >   speed   (   0:IMAXP, 0:JMAXP, 0:KMAX-1),
+     >   square  (   0:IMAXP, 0:JMAXP, 0:KMAX-1),
+     >   rhs     (5, 0:IMAXP, 0:JMAXP, 0:KMAX-1),
+     >   forcing (5, 0:IMAXP, 0:JMAXP, 0:KMAX-1)
+
+      common /fields/  u, us, vs, ws, qs, rho_i, speed, square, 
+     >                 rhs, forcing
+
+      double precision cv(0:problem_size-1),   rhon(0:problem_size-1),
+     >                 rhos(0:problem_size-1), rhoq(0:problem_size-1),
+     >                 cuf(0:problem_size-1),  q(0:problem_size-1),
+     >                 ue(0:problem_size-1,5), buf(0:problem_size-1,5)
+      common /work_1d/ cv, rhon, rhos, rhoq, cuf, q, ue, buf
+
+      double precision lhs (5,0:IMAXP,0:IMAXP),
+     >                 lhsp(5,0:IMAXP,0:IMAXP),
+     >                 lhsm(5,0:IMAXP,0:IMAXP)
+      common /work_lhs/ lhs, lhsp, lhsm
+
+c-----------------------------------------------------------------------
+c   Timer constants
+c-----------------------------------------------------------------------
+      integer t_rhsx,t_rhsy,t_rhsz,t_xsolve,t_ysolve,t_zsolve,
+     >        t_rdis1,t_rdis2,t_tzetar,t_ninvr,t_pinvr,t_add,
+     >        t_rhs,t_txinvr,t_last,t_total
+      parameter (t_total = 1)
+      parameter (t_rhsx = 2)
+      parameter (t_rhsy = 3)
+      parameter (t_rhsz = 4)
+      parameter (t_rhs = 5)
+      parameter (t_xsolve = 6)
+      parameter (t_ysolve = 7)
+      parameter (t_zsolve = 8)
+      parameter (t_rdis1 = 9)
+      parameter (t_rdis2 = 10)
+      parameter (t_txinvr = 11)
+      parameter (t_pinvr = 12)
+      parameter (t_ninvr = 13)
+      parameter (t_tzetar = 14)
+      parameter (t_add = 15)
+      parameter (t_last = 15)
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/initialize.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/initialize.f
new file mode 100644
index 0000000..669693a
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/initialize.f
@@ -0,0 +1,261 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine  initialize
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c This subroutine initializes the field variable u using 
+c tri-linear transfinite interpolation of the boundary values     
+c---------------------------------------------------------------------
+
+       include 'header.h'
+  
+       integer i, j, k, m, ix, iy, iz
+       double precision  xi, eta, zeta, Pface(5,3,2), Pxi, Peta, 
+     >                   Pzeta, temp(5)
+    
+c---------------------------------------------------------------------
+c  Later (in compute_rhs) we compute 1/u for every element. A few of 
+c  the corner elements are not used, but it is convenient (and faster)
+c  to compute the whole thing with a simple loop. Make sure those 
+c  values are nonzero by initializing the whole thing here. 
+c---------------------------------------------------------------------
+      do k = 0, grid_points(3)-1
+         do j = 0, grid_points(2)-1
+            do i = 0, grid_points(1)-1
+               u(1,i,j,k) = 1.0d0
+               u(2,i,j,k) = 0.0d0
+               u(3,i,j,k) = 0.0d0
+               u(4,i,j,k) = 0.0d0
+               u(5,i,j,k) = 1.0d0
+            end do
+         end do
+      end do
+
+c---------------------------------------------------------------------
+c first store the "interpolated" values everywhere on the grid    
+c---------------------------------------------------------------------
+          do  k = 0, grid_points(3)-1
+             zeta = dble(k) * dnzm1
+             do  j = 0, grid_points(2)-1
+                eta = dble(j) * dnym1
+                do   i = 0, grid_points(1)-1
+                   xi = dble(i) * dnxm1
+                  
+                   do ix = 1, 2
+                      Pxi = dble(ix-1)
+                      call exact_solution(Pxi, eta, zeta, 
+     >                                    Pface(1,1,ix))
+                   end do
+
+                   do    iy = 1, 2
+                      Peta = dble(iy-1)
+                      call exact_solution(xi, Peta, zeta, 
+     >                                    Pface(1,2,iy))
+                   end do
+
+                   do    iz = 1, 2
+                      Pzeta = dble(iz-1)
+                      call exact_solution(xi, eta, Pzeta,   
+     >                                    Pface(1,3,iz))
+                   end do
+
+                   do   m = 1, 5
+                      Pxi   = xi   * Pface(m,1,2) + 
+     >                        (1.0d0-xi)   * Pface(m,1,1)
+                      Peta  = eta  * Pface(m,2,2) + 
+     >                        (1.0d0-eta)  * Pface(m,2,1)
+                      Pzeta = zeta * Pface(m,3,2) + 
+     >                        (1.0d0-zeta) * Pface(m,3,1)
+ 
+                      u(m,i,j,k) = Pxi + Peta + Pzeta - 
+     >                          Pxi*Peta - Pxi*Pzeta - Peta*Pzeta + 
+     >                          Pxi*Peta*Pzeta
+
+                   end do
+                end do
+             end do
+          end do
+
+
+c---------------------------------------------------------------------
+c now store the exact values on the boundaries        
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c west face                                                  
+c---------------------------------------------------------------------
+
+       xi = 0.0d0
+       i  = 0
+       do  k = 0, grid_points(3)-1
+          zeta = dble(k) * dnzm1
+          do   j = 0, grid_points(2)-1
+             eta = dble(j) * dnym1
+             call exact_solution(xi, eta, zeta, temp)
+             do   m = 1, 5
+                u(m,i,j,k) = temp(m)
+             end do
+          end do
+       end do
+
+c---------------------------------------------------------------------
+c east face                                                      
+c---------------------------------------------------------------------
+
+       xi = 1.0d0
+       i  = grid_points(1)-1
+       do   k = 0, grid_points(3)-1
+          zeta = dble(k) * dnzm1
+          do   j = 0, grid_points(2)-1
+             eta = dble(j) * dnym1
+             call exact_solution(xi, eta, zeta, temp)
+             do   m = 1, 5
+                u(m,i,j,k) = temp(m)
+             end do
+          end do
+       end do
+
+c---------------------------------------------------------------------
+c south face                                                 
+c---------------------------------------------------------------------
+
+       eta = 0.0d0
+       j   = 0
+       do  k = 0, grid_points(3)-1
+          zeta = dble(k) * dnzm1
+          do   i = 0, grid_points(1)-1
+             xi = dble(i) * dnxm1
+             call exact_solution(xi, eta, zeta, temp)
+             do   m = 1, 5
+                u(m,i,j,k) = temp(m)
+             end do
+          end do
+       end do
+
+
+c---------------------------------------------------------------------
+c north face                                    
+c---------------------------------------------------------------------
+
+       eta = 1.0d0
+       j   = grid_points(2)-1
+       do   k = 0, grid_points(3)-1
+          zeta = dble(k) * dnzm1
+          do   i = 0, grid_points(1)-1
+             xi = dble(i) * dnxm1
+             call exact_solution(xi, eta, zeta, temp)
+             do   m = 1, 5
+                u(m,i,j,k) = temp(m)
+             end do
+          end do
+       end do
+
+c---------------------------------------------------------------------
+c bottom face                                       
+c---------------------------------------------------------------------
+
+       zeta = 0.0d0
+       k    = 0
+       do   j = 0, grid_points(2)-1
+          eta = dble(j) * dnym1
+          do   i =0, grid_points(1)-1
+             xi = dble(i) *dnxm1
+             call exact_solution(xi, eta, zeta, temp)
+             do   m = 1, 5
+                u(m,i,j,k) = temp(m)
+             end do
+          end do
+       end do
+
+c---------------------------------------------------------------------
+c top face     
+c---------------------------------------------------------------------
+
+       zeta = 1.0d0
+       k    = grid_points(3)-1
+       do   j = 0, grid_points(2)-1
+          eta = dble(j) * dnym1
+          do   i =0, grid_points(1)-1
+             xi = dble(i) * dnxm1
+             call exact_solution(xi, eta, zeta, temp)
+             do   m = 1, 5
+                u(m,i,j,k) = temp(m)
+             end do
+          end do
+       end do
+
+       return
+       end
+
+
+       subroutine lhsinit(ni, nj)
+
+       include 'header.h'
+
+       integer ni, nj
+
+       integer j, m
+
+c---------------------------------------------------------------------
+c     zap the whole left hand side for starters
+c     set all diagonal values to 1. This is overkill, but convenient
+c---------------------------------------------------------------------
+       do j = 1, nj
+          do   m = 1, 5
+             lhs (m,0,j) = 0.0d0
+             lhsp(m,0,j) = 0.0d0
+             lhsm(m,0,j) = 0.0d0
+             lhs (m,ni,j) = 0.0d0
+             lhsp(m,ni,j) = 0.0d0
+             lhsm(m,ni,j) = 0.0d0
+          end do
+          lhs (3,0,j) = 1.0d0
+          lhsp(3,0,j) = 1.0d0
+          lhsm(3,0,j) = 1.0d0
+          lhs (3,ni,j) = 1.0d0
+          lhsp(3,ni,j) = 1.0d0
+          lhsm(3,ni,j) = 1.0d0
+       end do
+ 
+       return
+       end
+
+
+       subroutine lhsinitj(nj, ni)
+
+       include 'header.h'
+
+       integer nj, ni
+
+       integer i, m
+
+c---------------------------------------------------------------------
+c     zap the whole left hand side for starters
+c     set all diagonal values to 1. This is overkill, but convenient
+c---------------------------------------------------------------------
+       do i = 1, ni
+          do   m = 1, 5
+             lhs (m,i,0) = 0.0d0
+             lhsp(m,i,0) = 0.0d0
+             lhsm(m,i,0) = 0.0d0
+             lhs (m,i,nj) = 0.0d0
+             lhsp(m,i,nj) = 0.0d0
+             lhsm(m,i,nj) = 0.0d0
+          end do
+          lhs (3,i,0) = 1.0d0
+          lhsp(3,i,0) = 1.0d0
+          lhsm(3,i,0) = 1.0d0
+          lhs (3,i,nj) = 1.0d0
+          lhsp(3,i,nj) = 1.0d0
+          lhsm(3,i,nj) = 1.0d0
+       end do
+ 
+       return
+       end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/inputsp.data.sample b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/inputsp.data.sample
new file mode 100644
index 0000000..ae3801f
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/inputsp.data.sample
@@ -0,0 +1,3 @@
+400       number of time steps
+0.0015d0  dt for class A = 0.0015d0. class B = 0.001d0  class C = 0.00067d0
+64 64 64
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/ninvr.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/ninvr.f
new file mode 100644
index 0000000..1967bf5
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/ninvr.f
@@ -0,0 +1,44 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine  ninvr
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   block-diagonal matrix-vector multiplication              
+c---------------------------------------------------------------------
+
+       include 'header.h'
+
+       integer  i, j, k
+       double precision r1, r2, r3, r4, r5, t1, t2
+
+       if (timeron) call timer_start(t_ninvr)
+       do k = 1, nz2
+          do j = 1, ny2
+             do i = 1, nx2
+
+                r1 = rhs(1,i,j,k)
+                r2 = rhs(2,i,j,k)
+                r3 = rhs(3,i,j,k)
+                r4 = rhs(4,i,j,k)
+                r5 = rhs(5,i,j,k)
+               
+                t1 = bt * r3
+                t2 = 0.5d0 * ( r4 + r5 )
+
+                rhs(1,i,j,k) = -r2
+                rhs(2,i,j,k) =  r1
+                rhs(3,i,j,k) = bt * ( r4 - r5 )
+                rhs(4,i,j,k) = -t1 + t2
+                rhs(5,i,j,k) =  t1 + t2
+             enddo    
+          enddo
+       enddo
+       if (timeron) call timer_stop(t_ninvr)
+
+       return
+       end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/pinvr.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/pinvr.f
new file mode 100644
index 0000000..56862e8
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/pinvr.f
@@ -0,0 +1,47 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine pinvr
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   block-diagonal matrix-vector multiplication                       
+c---------------------------------------------------------------------
+
+       include 'header.h'
+
+       integer i, j, k
+       double precision r1, r2, r3, r4, r5, t1, t2
+
+       if (timeron) call timer_start(t_pinvr)
+       do   k = 1, nz2
+          do   j = 1, ny2
+             do   i = 1, nx2
+
+                r1 = rhs(1,i,j,k)
+                r2 = rhs(2,i,j,k)
+                r3 = rhs(3,i,j,k)
+                r4 = rhs(4,i,j,k)
+                r5 = rhs(5,i,j,k)
+
+                t1 = bt * r1
+                t2 = 0.5d0 * ( r4 + r5 )
+
+                rhs(1,i,j,k) =  bt * ( r4 - r5 )
+                rhs(2,i,j,k) = -r3
+                rhs(3,i,j,k) =  r2
+                rhs(4,i,j,k) = -t1 + t2
+                rhs(5,i,j,k) =  t1 + t2
+             end do
+          end do
+       end do
+       if (timeron) call timer_stop(t_pinvr)
+
+       return
+       end
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/rhs.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/rhs.f
new file mode 100644
index 0000000..e3ea77e
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/rhs.f
@@ -0,0 +1,409 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine compute_rhs
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       include 'header.h'
+
+       integer i, j, k, m
+       double precision aux, rho_inv, uijk, up1, um1, vijk, vp1, vm1,
+     >                  wijk, wp1, wm1
+
+
+       if (timeron) call timer_start(t_rhs)
+c---------------------------------------------------------------------
+c      compute the reciprocal of density, and the kinetic energy, 
+c      and the speed of sound. 
+c---------------------------------------------------------------------
+
+       do    k = 0, grid_points(3)-1
+          do    j = 0, grid_points(2)-1
+             do    i = 0, grid_points(1)-1
+                rho_inv = 1.0d0/u(1,i,j,k)
+                rho_i(i,j,k) = rho_inv
+                us(i,j,k) = u(2,i,j,k) * rho_inv
+                vs(i,j,k) = u(3,i,j,k) * rho_inv
+                ws(i,j,k) = u(4,i,j,k) * rho_inv
+                square(i,j,k)     = 0.5d0* (
+     >                        u(2,i,j,k)*u(2,i,j,k) + 
+     >                        u(3,i,j,k)*u(3,i,j,k) +
+     >                        u(4,i,j,k)*u(4,i,j,k) ) * rho_inv
+                qs(i,j,k) = square(i,j,k) * rho_inv
+c---------------------------------------------------------------------
+c               (don't need speed and ainx until the lhs computation)
+c---------------------------------------------------------------------
+                aux = c1c2*rho_inv* (u(5,i,j,k) - square(i,j,k))
+                speed(i,j,k) = dsqrt(aux)
+             end do
+          end do
+       end do
+
+c---------------------------------------------------------------------
+c copy the exact forcing term to the right hand side;  because 
+c this forcing term is known, we can store it on the whole grid
+c including the boundary                   
+c---------------------------------------------------------------------
+
+       do    k = 0, grid_points(3)-1
+          do    j = 0, grid_points(2)-1
+             do    i = 0, grid_points(1)-1
+                do    m = 1, 5
+                   rhs(m,i,j,k) = forcing(m,i,j,k)
+                end do
+             end do
+          end do
+       end do
+
+c---------------------------------------------------------------------
+c      compute xi-direction fluxes 
+c---------------------------------------------------------------------
+       if (timeron) call timer_start(t_rhsx)
+       do    k = 1, nz2
+          do    j = 1, ny2
+             do    i = 1, nx2
+                uijk = us(i,j,k)
+                up1  = us(i+1,j,k)
+                um1  = us(i-1,j,k)
+
+                rhs(1,i,j,k) = rhs(1,i,j,k) + dx1tx1 * 
+     >                    (u(1,i+1,j,k) - 2.0d0*u(1,i,j,k) + 
+     >                     u(1,i-1,j,k)) -
+     >                    tx2 * (u(2,i+1,j,k) - u(2,i-1,j,k))
+
+                rhs(2,i,j,k) = rhs(2,i,j,k) + dx2tx1 * 
+     >                    (u(2,i+1,j,k) - 2.0d0*u(2,i,j,k) + 
+     >                     u(2,i-1,j,k)) +
+     >                    xxcon2*con43 * (up1 - 2.0d0*uijk + um1) -
+     >                    tx2 * (u(2,i+1,j,k)*up1 - 
+     >                           u(2,i-1,j,k)*um1 +
+     >                           (u(5,i+1,j,k)- square(i+1,j,k)-
+     >                            u(5,i-1,j,k)+ square(i-1,j,k))*
+     >                            c2)
+
+                rhs(3,i,j,k) = rhs(3,i,j,k) + dx3tx1 * 
+     >                    (u(3,i+1,j,k) - 2.0d0*u(3,i,j,k) +
+     >                     u(3,i-1,j,k)) +
+     >                    xxcon2 * (vs(i+1,j,k) - 2.0d0*vs(i,j,k) +
+     >                              vs(i-1,j,k)) -
+     >                    tx2 * (u(3,i+1,j,k)*up1 - 
+     >                           u(3,i-1,j,k)*um1)
+
+                rhs(4,i,j,k) = rhs(4,i,j,k) + dx4tx1 * 
+     >                    (u(4,i+1,j,k) - 2.0d0*u(4,i,j,k) +
+     >                     u(4,i-1,j,k)) +
+     >                    xxcon2 * (ws(i+1,j,k) - 2.0d0*ws(i,j,k) +
+     >                              ws(i-1,j,k)) -
+     >                    tx2 * (u(4,i+1,j,k)*up1 - 
+     >                           u(4,i-1,j,k)*um1)
+
+                rhs(5,i,j,k) = rhs(5,i,j,k) + dx5tx1 * 
+     >                    (u(5,i+1,j,k) - 2.0d0*u(5,i,j,k) +
+     >                     u(5,i-1,j,k)) +
+     >                    xxcon3 * (qs(i+1,j,k) - 2.0d0*qs(i,j,k) +
+     >                              qs(i-1,j,k)) +
+     >                    xxcon4 * (up1*up1 -       2.0d0*uijk*uijk + 
+     >                              um1*um1) +
+     >                    xxcon5 * (u(5,i+1,j,k)*rho_i(i+1,j,k) - 
+     >                              2.0d0*u(5,i,j,k)*rho_i(i,j,k) +
+     >                              u(5,i-1,j,k)*rho_i(i-1,j,k)) -
+     >                    tx2 * ( (c1*u(5,i+1,j,k) - 
+     >                             c2*square(i+1,j,k))*up1 -
+     >                            (c1*u(5,i-1,j,k) - 
+     >                             c2*square(i-1,j,k))*um1 )
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c      add fourth order xi-direction dissipation               
+c---------------------------------------------------------------------
+
+          do    j = 1, ny2
+             i = 1
+             do    m = 1, 5
+                rhs(m,i,j,k) = rhs(m,i,j,k)- dssp * 
+     >                    ( 5.0d0*u(m,i,j,k) - 4.0d0*u(m,i+1,j,k) +
+     >                            u(m,i+2,j,k))
+             end do
+
+             i = 2
+             do    m = 1, 5
+                rhs(m,i,j,k) = rhs(m,i,j,k) - dssp * 
+     >                    (-4.0d0*u(m,i-1,j,k) + 6.0d0*u(m,i,j,k) -
+     >                      4.0d0*u(m,i+1,j,k) + u(m,i+2,j,k))
+             end do
+          end do
+
+          do    j = 1, ny2
+             do  i = 3, nx2-2
+                do     m = 1, 5
+                   rhs(m,i,j,k) = rhs(m,i,j,k) - dssp * 
+     >                    (  u(m,i-2,j,k) - 4.0d0*u(m,i-1,j,k) + 
+     >                     6.0d0*u(m,i,j,k) - 4.0d0*u(m,i+1,j,k) + 
+     >                         u(m,i+2,j,k) )
+                end do
+             end do
+          end do
+
+          do    j = 1, ny2
+             i = nx2-1
+             do     m = 1, 5
+                rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *
+     >                    ( u(m,i-2,j,k) - 4.0d0*u(m,i-1,j,k) + 
+     >                      6.0d0*u(m,i,j,k) - 4.0d0*u(m,i+1,j,k) )
+             end do
+
+             i = nx2
+             do     m = 1, 5
+                rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *
+     >                    ( u(m,i-2,j,k) - 4.d0*u(m,i-1,j,k) +
+     >                      5.d0*u(m,i,j,k) )
+             end do
+          end do
+       end do
+       if (timeron) call timer_stop(t_rhsx)
+
+c---------------------------------------------------------------------
+c      compute eta-direction fluxes 
+c---------------------------------------------------------------------
+       if (timeron) call timer_start(t_rhsy)
+       do     k = 1, nz2
+          do     j = 1, ny2
+             do     i = 1, nx2
+                vijk = vs(i,j,k)
+                vp1  = vs(i,j+1,k)
+                vm1  = vs(i,j-1,k)
+                rhs(1,i,j,k) = rhs(1,i,j,k) + dy1ty1 * 
+     >                   (u(1,i,j+1,k) - 2.0d0*u(1,i,j,k) + 
+     >                    u(1,i,j-1,k)) -
+     >                   ty2 * (u(3,i,j+1,k) - u(3,i,j-1,k))
+                rhs(2,i,j,k) = rhs(2,i,j,k) + dy2ty1 * 
+     >                   (u(2,i,j+1,k) - 2.0d0*u(2,i,j,k) + 
+     >                    u(2,i,j-1,k)) +
+     >                   yycon2 * (us(i,j+1,k) - 2.0d0*us(i,j,k) + 
+     >                             us(i,j-1,k)) -
+     >                   ty2 * (u(2,i,j+1,k)*vp1 - 
+     >                          u(2,i,j-1,k)*vm1)
+                rhs(3,i,j,k) = rhs(3,i,j,k) + dy3ty1 * 
+     >                   (u(3,i,j+1,k) - 2.0d0*u(3,i,j,k) + 
+     >                    u(3,i,j-1,k)) +
+     >                   yycon2*con43 * (vp1 - 2.0d0*vijk + vm1) -
+     >                   ty2 * (u(3,i,j+1,k)*vp1 - 
+     >                          u(3,i,j-1,k)*vm1 +
+     >                          (u(5,i,j+1,k) - square(i,j+1,k) - 
+     >                           u(5,i,j-1,k) + square(i,j-1,k))
+     >                          *c2)
+                rhs(4,i,j,k) = rhs(4,i,j,k) + dy4ty1 * 
+     >                   (u(4,i,j+1,k) - 2.0d0*u(4,i,j,k) + 
+     >                    u(4,i,j-1,k)) +
+     >                   yycon2 * (ws(i,j+1,k) - 2.0d0*ws(i,j,k) + 
+     >                             ws(i,j-1,k)) -
+     >                   ty2 * (u(4,i,j+1,k)*vp1 - 
+     >                          u(4,i,j-1,k)*vm1)
+                rhs(5,i,j,k) = rhs(5,i,j,k) + dy5ty1 * 
+     >                   (u(5,i,j+1,k) - 2.0d0*u(5,i,j,k) + 
+     >                    u(5,i,j-1,k)) +
+     >                   yycon3 * (qs(i,j+1,k) - 2.0d0*qs(i,j,k) + 
+     >                             qs(i,j-1,k)) +
+     >                   yycon4 * (vp1*vp1       - 2.0d0*vijk*vijk + 
+     >                             vm1*vm1) +
+     >                   yycon5 * (u(5,i,j+1,k)*rho_i(i,j+1,k) - 
+     >                             2.0d0*u(5,i,j,k)*rho_i(i,j,k) +
+     >                             u(5,i,j-1,k)*rho_i(i,j-1,k)) -
+     >                   ty2 * ((c1*u(5,i,j+1,k) - 
+     >                           c2*square(i,j+1,k)) * vp1 -
+     >                          (c1*u(5,i,j-1,k) - 
+     >                           c2*square(i,j-1,k)) * vm1)
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c      add fourth order eta-direction dissipation         
+c---------------------------------------------------------------------
+
+          j = 1
+          do     i = 1, nx2
+             do     m = 1, 5
+                rhs(m,i,j,k) = rhs(m,i,j,k)- dssp * 
+     >                    ( 5.0d0*u(m,i,j,k) - 4.0d0*u(m,i,j+1,k) +
+     >                            u(m,i,j+2,k))
+             end do
+          end do
+
+          j = 2
+          do     i = 1, nx2
+             do     m = 1, 5
+                rhs(m,i,j,k) = rhs(m,i,j,k) - dssp * 
+     >                    (-4.0d0*u(m,i,j-1,k) + 6.0d0*u(m,i,j,k) -
+     >                      4.0d0*u(m,i,j+1,k) + u(m,i,j+2,k))
+             end do
+          end do
+
+          do    j = 3, ny2-2
+             do  i = 1,nx2
+                do     m = 1, 5
+                   rhs(m,i,j,k) = rhs(m,i,j,k) - dssp * 
+     >                    (  u(m,i,j-2,k) - 4.0d0*u(m,i,j-1,k) + 
+     >                     6.0d0*u(m,i,j,k) - 4.0d0*u(m,i,j+1,k) + 
+     >                         u(m,i,j+2,k) )
+                end do
+             end do
+          end do
+ 
+          j = ny2-1
+          do     i = 1, nx2
+             do     m = 1, 5
+                rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *
+     >                    ( u(m,i,j-2,k) - 4.0d0*u(m,i,j-1,k) + 
+     >                      6.0d0*u(m,i,j,k) - 4.0d0*u(m,i,j+1,k) )
+             end do
+          end do
+
+          j = ny2
+          do     i = 1, nx2
+             do     m = 1, 5
+                rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *
+     >                    ( u(m,i,j-2,k) - 4.d0*u(m,i,j-1,k) +
+     >                      5.d0*u(m,i,j,k) )
+             end do
+          end do
+       end do
+       if (timeron) call timer_stop(t_rhsy)
+
+c---------------------------------------------------------------------
+c      compute zeta-direction fluxes 
+c---------------------------------------------------------------------
+       if (timeron) call timer_start(t_rhsz)
+       do    k = 1, nz2
+          do     j = 1, ny2
+             do     i = 1, nx2
+                wijk = ws(i,j,k)
+                wp1  = ws(i,j,k+1)
+                wm1  = ws(i,j,k-1)
+
+                rhs(1,i,j,k) = rhs(1,i,j,k) + dz1tz1 * 
+     >                   (u(1,i,j,k+1) - 2.0d0*u(1,i,j,k) + 
+     >                    u(1,i,j,k-1)) -
+     >                   tz2 * (u(4,i,j,k+1) - u(4,i,j,k-1))
+                rhs(2,i,j,k) = rhs(2,i,j,k) + dz2tz1 * 
+     >                   (u(2,i,j,k+1) - 2.0d0*u(2,i,j,k) + 
+     >                    u(2,i,j,k-1)) +
+     >                   zzcon2 * (us(i,j,k+1) - 2.0d0*us(i,j,k) + 
+     >                             us(i,j,k-1)) -
+     >                   tz2 * (u(2,i,j,k+1)*wp1 - 
+     >                          u(2,i,j,k-1)*wm1)
+                rhs(3,i,j,k) = rhs(3,i,j,k) + dz3tz1 * 
+     >                   (u(3,i,j,k+1) - 2.0d0*u(3,i,j,k) + 
+     >                    u(3,i,j,k-1)) +
+     >                   zzcon2 * (vs(i,j,k+1) - 2.0d0*vs(i,j,k) + 
+     >                             vs(i,j,k-1)) -
+     >                   tz2 * (u(3,i,j,k+1)*wp1 - 
+     >                          u(3,i,j,k-1)*wm1)
+                rhs(4,i,j,k) = rhs(4,i,j,k) + dz4tz1 * 
+     >                   (u(4,i,j,k+1) - 2.0d0*u(4,i,j,k) + 
+     >                    u(4,i,j,k-1)) +
+     >                   zzcon2*con43 * (wp1 - 2.0d0*wijk + wm1) -
+     >                   tz2 * (u(4,i,j,k+1)*wp1 - 
+     >                          u(4,i,j,k-1)*wm1 +
+     >                          (u(5,i,j,k+1) - square(i,j,k+1) - 
+     >                           u(5,i,j,k-1) + square(i,j,k-1))
+     >                          *c2)
+                rhs(5,i,j,k) = rhs(5,i,j,k) + dz5tz1 * 
+     >                   (u(5,i,j,k+1) - 2.0d0*u(5,i,j,k) + 
+     >                    u(5,i,j,k-1)) +
+     >                   zzcon3 * (qs(i,j,k+1) - 2.0d0*qs(i,j,k) + 
+     >                             qs(i,j,k-1)) +
+     >                   zzcon4 * (wp1*wp1 - 2.0d0*wijk*wijk + 
+     >                             wm1*wm1) +
+     >                   zzcon5 * (u(5,i,j,k+1)*rho_i(i,j,k+1) - 
+     >                             2.0d0*u(5,i,j,k)*rho_i(i,j,k) +
+     >                             u(5,i,j,k-1)*rho_i(i,j,k-1)) -
+     >                   tz2 * ( (c1*u(5,i,j,k+1) - 
+     >                            c2*square(i,j,k+1))*wp1 -
+     >                           (c1*u(5,i,j,k-1) - 
+     >                            c2*square(i,j,k-1))*wm1)
+             end do
+          end do
+       end do
+
+c---------------------------------------------------------------------
+c      add fourth order zeta-direction dissipation                
+c---------------------------------------------------------------------
+
+       k = 1
+       do     j = 1, ny2
+          do     i = 1, nx2
+             do     m = 1, 5
+                rhs(m,i,j,k) = rhs(m,i,j,k)- dssp * 
+     >                    ( 5.0d0*u(m,i,j,k) - 4.0d0*u(m,i,j,k+1) +
+     >                            u(m,i,j,k+2))
+             end do
+          end do
+       end do
+
+       k = 2
+       do     j = 1, ny2
+          do     i = 1, nx2
+             do     m = 1, 5
+                rhs(m,i,j,k) = rhs(m,i,j,k) - dssp * 
+     >                    (-4.0d0*u(m,i,j,k-1) + 6.0d0*u(m,i,j,k) -
+     >                      4.0d0*u(m,i,j,k+1) + u(m,i,j,k+2))
+             end do
+          end do
+       end do
+
+       do     k = 3, nz2-2
+          do     j = 1, ny2
+             do     i = 1,nx2
+                do     m = 1, 5
+                   rhs(m,i,j,k) = rhs(m,i,j,k) - dssp * 
+     >                    (  u(m,i,j,k-2) - 4.0d0*u(m,i,j,k-1) + 
+     >                     6.0d0*u(m,i,j,k) - 4.0d0*u(m,i,j,k+1) + 
+     >                         u(m,i,j,k+2) )
+                end do
+             end do
+          end do
+       end do
+ 
+       k = nz2-1
+       do     j = 1, ny2
+          do     i = 1, nx2
+             do     m = 1, 5
+                rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *
+     >                    ( u(m,i,j,k-2) - 4.0d0*u(m,i,j,k-1) + 
+     >                      6.0d0*u(m,i,j,k) - 4.0d0*u(m,i,j,k+1) )
+             end do
+          end do
+       end do
+
+       k = nz2
+       do     j = 1, ny2
+          do     i = 1, nx2
+             do     m = 1, 5
+                rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *
+     >                    ( u(m,i,j,k-2) - 4.d0*u(m,i,j,k-1) +
+     >                      5.d0*u(m,i,j,k) )
+             end do
+          end do
+       end do
+       if (timeron) call timer_stop(t_rhsz)
+
+       do    k = 1, nz2
+          do    j = 1, ny2
+             do    i = 1, nx2
+                do    m = 1, 5
+                   rhs(m,i,j,k) = rhs(m,i,j,k) * dt
+                end do
+             end do
+          end do
+       end do
+       if (timeron) call timer_stop(t_rhs)
+    
+       return
+       end
+
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/set_constants.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/set_constants.f
new file mode 100644
index 0000000..63ce72b
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/set_constants.f
@@ -0,0 +1,203 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine  set_constants
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       include 'header.h'
+  
+       ce(1,1)  = 2.0d0
+       ce(1,2)  = 0.0d0
+       ce(1,3)  = 0.0d0
+       ce(1,4)  = 4.0d0
+       ce(1,5)  = 5.0d0
+       ce(1,6)  = 3.0d0
+       ce(1,7)  = 0.5d0
+       ce(1,8)  = 0.02d0
+       ce(1,9)  = 0.01d0
+       ce(1,10) = 0.03d0
+       ce(1,11) = 0.5d0
+       ce(1,12) = 0.4d0
+       ce(1,13) = 0.3d0
+ 
+       ce(2,1)  = 1.0d0
+       ce(2,2)  = 0.0d0
+       ce(2,3)  = 0.0d0
+       ce(2,4)  = 0.0d0
+       ce(2,5)  = 1.0d0
+       ce(2,6)  = 2.0d0
+       ce(2,7)  = 3.0d0
+       ce(2,8)  = 0.01d0
+       ce(2,9)  = 0.03d0
+       ce(2,10) = 0.02d0
+       ce(2,11) = 0.4d0
+       ce(2,12) = 0.3d0
+       ce(2,13) = 0.5d0
+
+       ce(3,1)  = 2.0d0
+       ce(3,2)  = 2.0d0
+       ce(3,3)  = 0.0d0
+       ce(3,4)  = 0.0d0
+       ce(3,5)  = 0.0d0
+       ce(3,6)  = 2.0d0
+       ce(3,7)  = 3.0d0
+       ce(3,8)  = 0.04d0
+       ce(3,9)  = 0.03d0
+       ce(3,10) = 0.05d0
+       ce(3,11) = 0.3d0
+       ce(3,12) = 0.5d0
+       ce(3,13) = 0.4d0
+
+       ce(4,1)  = 2.0d0
+       ce(4,2)  = 2.0d0
+       ce(4,3)  = 0.0d0
+       ce(4,4)  = 0.0d0
+       ce(4,5)  = 0.0d0
+       ce(4,6)  = 2.0d0
+       ce(4,7)  = 3.0d0
+       ce(4,8)  = 0.03d0
+       ce(4,9)  = 0.05d0
+       ce(4,10) = 0.04d0
+       ce(4,11) = 0.2d0
+       ce(4,12) = 0.1d0
+       ce(4,13) = 0.3d0
+
+       ce(5,1)  = 5.0d0
+       ce(5,2)  = 4.0d0
+       ce(5,3)  = 3.0d0
+       ce(5,4)  = 2.0d0
+       ce(5,5)  = 0.1d0
+       ce(5,6)  = 0.4d0
+       ce(5,7)  = 0.3d0
+       ce(5,8)  = 0.05d0
+       ce(5,9)  = 0.04d0
+       ce(5,10) = 0.03d0
+       ce(5,11) = 0.1d0
+       ce(5,12) = 0.3d0
+       ce(5,13) = 0.2d0
+
+       c1 = 1.4d0
+       c2 = 0.4d0
+       c3 = 0.1d0
+       c4 = 1.0d0
+       c5 = 1.4d0
+
+       bt = dsqrt(0.5d0)
+
+       dnxm1 = 1.0d0 / dble(grid_points(1)-1)
+       dnym1 = 1.0d0 / dble(grid_points(2)-1)
+       dnzm1 = 1.0d0 / dble(grid_points(3)-1)
+
+       c1c2 = c1 * c2
+       c1c5 = c1 * c5
+       c3c4 = c3 * c4
+       c1345 = c1c5 * c3c4
+
+       conz1 = (1.0d0-c1c5)
+
+       tx1 = 1.0d0 / (dnxm1 * dnxm1)
+       tx2 = 1.0d0 / (2.0d0 * dnxm1)
+       tx3 = 1.0d0 / dnxm1
+
+       ty1 = 1.0d0 / (dnym1 * dnym1)
+       ty2 = 1.0d0 / (2.0d0 * dnym1)
+       ty3 = 1.0d0 / dnym1
+ 
+       tz1 = 1.0d0 / (dnzm1 * dnzm1)
+       tz2 = 1.0d0 / (2.0d0 * dnzm1)
+       tz3 = 1.0d0 / dnzm1
+
+       dx1 = 0.75d0
+       dx2 = 0.75d0
+       dx3 = 0.75d0
+       dx4 = 0.75d0
+       dx5 = 0.75d0
+
+       dy1 = 0.75d0
+       dy2 = 0.75d0
+       dy3 = 0.75d0
+       dy4 = 0.75d0
+       dy5 = 0.75d0
+
+       dz1 = 1.0d0
+       dz2 = 1.0d0
+       dz3 = 1.0d0
+       dz4 = 1.0d0
+       dz5 = 1.0d0
+
+       dxmax = dmax1(dx3, dx4)
+       dymax = dmax1(dy2, dy4)
+       dzmax = dmax1(dz2, dz3)
+
+       dssp = 0.25d0 * dmax1(dx1, dmax1(dy1, dz1) )
+
+       c4dssp = 4.0d0 * dssp
+       c5dssp = 5.0d0 * dssp
+
+       dttx1 = dt*tx1
+       dttx2 = dt*tx2
+       dtty1 = dt*ty1
+       dtty2 = dt*ty2
+       dttz1 = dt*tz1
+       dttz2 = dt*tz2
+
+       c2dttx1 = 2.0d0*dttx1
+       c2dtty1 = 2.0d0*dtty1
+       c2dttz1 = 2.0d0*dttz1
+
+       dtdssp = dt*dssp
+
+       comz1  = dtdssp
+       comz4  = 4.0d0*dtdssp
+       comz5  = 5.0d0*dtdssp
+       comz6  = 6.0d0*dtdssp
+
+       c3c4tx3 = c3c4*tx3
+       c3c4ty3 = c3c4*ty3
+       c3c4tz3 = c3c4*tz3
+
+       dx1tx1 = dx1*tx1
+       dx2tx1 = dx2*tx1
+       dx3tx1 = dx3*tx1
+       dx4tx1 = dx4*tx1
+       dx5tx1 = dx5*tx1
+        
+       dy1ty1 = dy1*ty1
+       dy2ty1 = dy2*ty1
+       dy3ty1 = dy3*ty1
+       dy4ty1 = dy4*ty1
+       dy5ty1 = dy5*ty1
+        
+       dz1tz1 = dz1*tz1
+       dz2tz1 = dz2*tz1
+       dz3tz1 = dz3*tz1
+       dz4tz1 = dz4*tz1
+       dz5tz1 = dz5*tz1
+
+       c2iv  = 2.5d0
+       con43 = 4.0d0/3.0d0
+       con16 = 1.0d0/6.0d0
+        
+       xxcon1 = c3c4tx3*con43*tx3
+       xxcon2 = c3c4tx3*tx3
+       xxcon3 = c3c4tx3*conz1*tx3
+       xxcon4 = c3c4tx3*con16*tx3
+       xxcon5 = c3c4tx3*c1c5*tx3
+
+       yycon1 = c3c4ty3*con43*ty3
+       yycon2 = c3c4ty3*ty3
+       yycon3 = c3c4ty3*conz1*ty3
+       yycon4 = c3c4ty3*con16*ty3
+       yycon5 = c3c4ty3*c1c5*ty3
+
+       zzcon1 = c3c4tz3*con43*tz3
+       zzcon2 = c3c4tz3*tz3
+       zzcon3 = c3c4tz3*conz1*tz3
+       zzcon4 = c3c4tz3*con16*tz3
+       zzcon5 = c3c4tz3*c1c5*tz3
+
+       return
+       end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/sp.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/sp.f
new file mode 100644
index 0000000..6dd0591
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/sp.f
@@ -0,0 +1,213 @@
+!-------------------------------------------------------------------------!
+!                                                                         !
+!        N  A  S     P A R A L L E L     B E N C H M A R K S  3.3         !
+!                                                                         !
+!                      S E R I A L     V E R S I O N                      !
+!                                                                         !
+!                                   S P                                   !
+!                                                                         !
+!-------------------------------------------------------------------------!
+!                                                                         !
+!    This benchmark is a serial version of the NPB SP code.               !
+!    Refer to NAS Technical Reports 95-020 and 99-011 for details.        !
+!                                                                         !
+!    Permission to use, copy, distribute and modify this software         !
+!    for any purpose with or without fee is hereby granted.  We           !
+!    request, however, that all derived work reference the NAS            !
+!    Parallel Benchmarks 3.3. This software is provided "as is"           !
+!    without express or implied warranty.                                 !
+!                                                                         !
+!    Information on NPB 3.3, including the technical report, the          !
+!    original specifications, source code, results and information        !
+!    on how to submit new results, is available at:                       !
+!                                                                         !
+!           http://www.nas.nasa.gov/Software/NPB/                         !
+!                                                                         !
+!    Send comments or suggestions to  npb@nas.nasa.gov                    !
+!                                                                         !
+!          NAS Parallel Benchmarks Group                                  !
+!          NASA Ames Research Center                                      !
+!          Mail Stop: T27A-1                                              !
+!          Moffett Field, CA   94035-1000                                 !
+!                                                                         !
+!          E-mail:  npb@nas.nasa.gov                                      !
+!          Fax:     (650) 604-3957                                        !
+!                                                                         !
+!-------------------------------------------------------------------------!
+
+
+c---------------------------------------------------------------------
+c
+c Author: R. Van der Wijngaart
+c         W. Saphir
+c         H. Jin
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+       program SP
+c---------------------------------------------------------------------
+
+       include  'header.h'
+      
+       integer          i, niter, step, fstatus, n3
+       external         timer_read
+       double precision mflops, t, tmax, timer_read, trecs(t_last)
+       logical          verified
+       character        class
+       character        t_names(t_last)*8
+
+c---------------------------------------------------------------------
+c      Read input file (if it exists), else take
+c      defaults from parameters
+c---------------------------------------------------------------------
+          
+       open (unit=2,file='timer.flag',status='old', iostat=fstatus)
+       if (fstatus .eq. 0) then
+         timeron = .true.
+         t_names(t_total) = 'total'
+         t_names(t_rhsx) = 'rhsx'
+         t_names(t_rhsy) = 'rhsy'
+         t_names(t_rhsz) = 'rhsz'
+         t_names(t_rhs) = 'rhs'
+         t_names(t_xsolve) = 'xsolve'
+         t_names(t_ysolve) = 'ysolve'
+         t_names(t_zsolve) = 'zsolve'
+         t_names(t_rdis1) = 'redist1'
+         t_names(t_rdis2) = 'redist2'
+         t_names(t_tzetar) = 'tzetar'
+         t_names(t_ninvr) = 'ninvr'
+         t_names(t_pinvr) = 'pinvr'
+         t_names(t_txinvr) = 'txinvr'
+         t_names(t_add) = 'add'
+         close(2)
+       else
+         timeron = .false.
+       endif
+
+       write(*, 1000)
+       open (unit=2,file='inputsp.data',status='old', iostat=fstatus)
+
+       if (fstatus .eq. 0) then
+         write(*,233) 
+ 233     format(' Reading from input file inputsp.data')
+         read (2,*) niter
+         read (2,*) dt
+         read (2,*) grid_points(1), grid_points(2), grid_points(3)
+         close(2)
+       else
+         write(*,234) 
+         niter = niter_default
+         dt    = dt_default
+         grid_points(1) = problem_size
+         grid_points(2) = problem_size
+         grid_points(3) = problem_size
+       endif
+ 234   format(' No input file inputsp.data. Using compiled defaults')
+
+       write(*, 1001) grid_points(1), grid_points(2), grid_points(3)
+       write(*, 1002) niter, dt
+       write(*, *)
+
+ 1000 format(//,' NAS Parallel Benchmarks (NPB3.3-SER)',
+     >          ' - SP Benchmark', /)
+ 1001     format(' Size: ', i4, 'x', i4, 'x', i4)
+ 1002     format(' Iterations: ', i4, '    dt: ', F10.6)
+
+       if ( (grid_points(1) .gt. IMAX) .or.
+     >      (grid_points(2) .gt. JMAX) .or.
+     >      (grid_points(3) .gt. KMAX) ) then
+             print *, (grid_points(i),i=1,3)
+             print *,' Problem size too big for compiled array sizes'
+             goto 999
+       endif
+       nx2 = grid_points(1) - 2
+       ny2 = grid_points(2) - 2
+       nz2 = grid_points(3) - 2
+
+       call set_constants
+
+       do i = 1, t_last
+          call timer_clear(i)
+       end do
+
+       call exact_rhs
+
+       call initialize
+
+c---------------------------------------------------------------------
+c      do one time step to touch all code, and reinitialize
+c---------------------------------------------------------------------
+       call adi
+       call initialize
+
+       do i = 1, t_last
+          call timer_clear(i)
+       end do
+       call timer_start(1)
+
+       do  step = 1, niter
+
+          if (mod(step, 20) .eq. 0 .or. step .eq. 1) then
+             write(*, 200) step
+ 200         format(' Time step ', i4)
+          endif
+
+          call adi
+
+       end do
+
+       call timer_stop(1)
+       tmax = timer_read(1)
+       
+       call verify(niter, class, verified)
+
+       if( tmax .ne. 0. ) then
+          n3 = grid_points(1)*grid_points(2)*grid_points(3)
+          t = (grid_points(1)+grid_points(2)+grid_points(3))/3.0
+          mflops = (881.174 * float( n3 )
+     >             -4683.91 * t**2
+     >             +11484.5 * t
+     >             -19272.4) * float( niter ) / (tmax*1000000.0d0)
+       else
+          mflops = 0.0
+       endif
+
+      call print_results('SP', class, grid_points(1), 
+     >     grid_points(2), grid_points(3), niter, 
+     >     tmax, mflops, '          floating point', 
+     >     verified, npbversion,compiletime, cs1, cs2, cs3, cs4, cs5, 
+     >     cs6, '(none)')
+
+c---------------------------------------------------------------------
+c      More timers
+c---------------------------------------------------------------------
+       if (.not.timeron) goto 999
+
+       do i=1, t_last
+          trecs(i) = timer_read(i)
+       end do
+       if (tmax .eq. 0.0) tmax = 1.0
+
+       write(*,800)
+ 800   format('  SECTION   Time (secs)')
+       do i=1, t_last
+          write(*,810) t_names(i), trecs(i), trecs(i)*100./tmax
+          if (i.eq.t_rhs) then
+             t = trecs(t_rhsx) + trecs(t_rhsy) + trecs(t_rhsz)
+             write(*,820) 'sub-rhs', t, t*100./tmax
+             t = trecs(t_rhs) - t
+             write(*,820) 'rest-rhs', t, t*100./tmax
+          elseif (i.eq.t_zsolve) then
+             t = trecs(t_zsolve) - trecs(t_rdis1) - trecs(t_rdis2)
+             write(*,820) 'sub-zsol', t, t*100./tmax
+          elseif (i.eq.t_rdis2) then
+             t = trecs(t_rdis1) + trecs(t_rdis2)
+             write(*,820) 'redist', t, t*100./tmax
+          endif
+ 810      format(2x,a8,':',f9.3,'  (',f6.2,'%)')
+ 820      format('    --> ',a8,':',f9.3,'  (',f6.2,'%)')
+       end do
+
+ 999   continue
+
+       end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/txinvr.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/txinvr.f
new file mode 100644
index 0000000..c288979
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/txinvr.f
@@ -0,0 +1,58 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine  txinvr
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c block-diagonal matrix-vector multiplication                  
+c---------------------------------------------------------------------
+
+       include 'header.h'
+
+       integer i, j, k
+       double precision t1, t2, t3, ac, ru1, uu, vv, ww, r1, r2, r3, 
+     >                  r4, r5, ac2inv
+
+
+       if (timeron) call timer_start(t_txinvr)
+       do    k = 1, nz2
+          do    j = 1, ny2
+             do    i = 1, nx2
+
+                ru1 = rho_i(i,j,k)
+                uu = us(i,j,k)
+                vv = vs(i,j,k)
+                ww = ws(i,j,k)
+                ac = speed(i,j,k)
+                ac2inv = ac*ac
+
+                r1 = rhs(1,i,j,k)
+                r2 = rhs(2,i,j,k)
+                r3 = rhs(3,i,j,k)
+                r4 = rhs(4,i,j,k)
+                r5 = rhs(5,i,j,k)
+
+                t1 = c2 / ac2inv * ( qs(i,j,k)*r1 - uu*r2  - 
+     >                  vv*r3 - ww*r4 + r5 )
+                t2 = bt * ru1 * ( uu * r1 - r2 )
+                t3 = ( bt * ru1 * ac ) * t1
+
+                rhs(1,i,j,k) = r1 - t1
+                rhs(2,i,j,k) = - ru1 * ( ww*r1 - r4 )
+                rhs(3,i,j,k) =   ru1 * ( vv*r1 - r3 )
+                rhs(4,i,j,k) = - t2 + t3
+                rhs(5,i,j,k) =   t2 + t3
+
+             end do
+          end do
+       end do
+       if (timeron) call timer_stop(t_txinvr)
+
+       return
+       end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/tzetar.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/tzetar.f
new file mode 100644
index 0000000..5905248
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/tzetar.f
@@ -0,0 +1,59 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine  tzetar
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   block-diagonal matrix-vector multiplication                       
+c---------------------------------------------------------------------
+
+       include 'header.h'
+
+       integer i, j, k
+       double precision  t1, t2, t3, ac, xvel, yvel, zvel, r1, r2, r3, 
+     >                   r4, r5, btuz, ac2u, uzik1
+
+
+       if (timeron) call timer_start(t_tzetar)
+       do    k = 1, nz2
+          do    j = 1, ny2
+             do    i = 1, nx2
+
+                xvel = us(i,j,k)
+                yvel = vs(i,j,k)
+                zvel = ws(i,j,k)
+                ac   = speed(i,j,k)
+
+                ac2u = ac*ac
+
+                r1 = rhs(1,i,j,k)
+                r2 = rhs(2,i,j,k)
+                r3 = rhs(3,i,j,k)
+                r4 = rhs(4,i,j,k)
+                r5 = rhs(5,i,j,k)      
+
+                uzik1 = u(1,i,j,k)
+                btuz  = bt * uzik1
+
+                t1 = btuz/ac * (r4 + r5)
+                t2 = r3 + t1
+                t3 = btuz * (r4 - r5)
+
+                rhs(1,i,j,k) = t2
+                rhs(2,i,j,k) = -uzik1*r2 + xvel*t2
+                rhs(3,i,j,k) =  uzik1*r1 + yvel*t2
+                rhs(4,i,j,k) =  zvel*t2  + t3
+                rhs(5,i,j,k) =  uzik1*(-xvel*r2 + yvel*r1) + 
+     >                    qs(i,j,k)*t2 + c2iv*ac2u*t1 + zvel*t3
+
+             end do
+          end do
+       end do
+       if (timeron) call timer_stop(t_tzetar)
+
+       return
+       end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/verify.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/verify.f
new file mode 100644
index 0000000..44e11c2
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/verify.f
@@ -0,0 +1,356 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+        subroutine verify(no_time_steps, class, verified)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c  verification routine                         
+c---------------------------------------------------------------------
+
+        include 'header.h'
+
+        double precision xcrref(5),xceref(5),xcrdif(5),xcedif(5), 
+     >                   epsilon, xce(5), xcr(5), dtref
+        integer m, no_time_steps
+        character class
+        logical verified
+
+c---------------------------------------------------------------------
+c   tolerance level
+c---------------------------------------------------------------------
+        epsilon = 1.0d-08
+
+
+c---------------------------------------------------------------------
+c   compute the error norm and the residual norm (the latter scaled by 1/dt)
+c---------------------------------------------------------------------
+        call error_norm(xce)
+        call compute_rhs
+
+        call rhs_norm(xcr)
+
+        do m = 1, 5
+           xcr(m) = xcr(m) / dt
+        enddo
+
+        class = 'U'
+        verified = .true.
+
+        do m = 1,5
+           xcrref(m) = 1.0
+           xceref(m) = 1.0
+        end do
+
+c---------------------------------------------------------------------
+c    reference data for 12X12X12 grids after 100 time steps, with DT = 1.50d-02
+c---------------------------------------------------------------------
+        if ( (grid_points(1)  .eq. 12     ) .and. 
+     >       (grid_points(2)  .eq. 12     ) .and.
+     >       (grid_points(3)  .eq. 12     ) .and.
+     >       (no_time_steps   .eq. 100    ))  then
+
+           class = 'S'
+           dtref = 1.5d-2
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+           xcrref(1) = 2.7470315451339479d-02
+           xcrref(2) = 1.0360746705285417d-02
+           xcrref(3) = 1.6235745065095532d-02
+           xcrref(4) = 1.5840557224455615d-02
+           xcrref(5) = 3.4849040609362460d-02
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+           xceref(1) = 2.7289258557377227d-05
+           xceref(2) = 1.0364446640837285d-05
+           xceref(3) = 1.6154798287166471d-05
+           xceref(4) = 1.5750704994480102d-05
+           xceref(5) = 3.4177666183390531d-05
+
+
+c---------------------------------------------------------------------
+c    reference data for 36X36X36 grids after 400 time steps, with DT = 1.5d-03
+c---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 36) .and. 
+     >           (grid_points(2) .eq. 36) .and.
+     >           (grid_points(3) .eq. 36) .and.
+     >           (no_time_steps .eq. 400) ) then
+
+           class = 'W'
+           dtref = 1.5d-3
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+           xcrref(1) = 0.1893253733584d-02
+           xcrref(2) = 0.1717075447775d-03
+           xcrref(3) = 0.2778153350936d-03
+           xcrref(4) = 0.2887475409984d-03
+           xcrref(5) = 0.3143611161242d-02
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+           xceref(1) = 0.7542088599534d-04
+           xceref(2) = 0.6512852253086d-05
+           xceref(3) = 0.1049092285688d-04
+           xceref(4) = 0.1128838671535d-04
+           xceref(5) = 0.1212845639773d-03
+
+c---------------------------------------------------------------------
+c    reference data for 64X64X64 grids after 400 time steps, with DT = 1.5d-03
+c---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 64) .and. 
+     >           (grid_points(2) .eq. 64) .and.
+     >           (grid_points(3) .eq. 64) .and.
+     >           (no_time_steps .eq. 400) ) then
+
+           class = 'A'
+           dtref = 1.5d-3
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+           xcrref(1) = 2.4799822399300195d0
+           xcrref(2) = 1.1276337964368832d0
+           xcrref(3) = 1.5028977888770491d0
+           xcrref(4) = 1.4217816211695179d0
+           xcrref(5) = 2.1292113035138280d0
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+           xceref(1) = 1.0900140297820550d-04
+           xceref(2) = 3.7343951769282091d-05
+           xceref(3) = 5.0092785406541633d-05
+           xceref(4) = 4.7671093939528255d-05
+           xceref(5) = 1.3621613399213001d-04
+
+c---------------------------------------------------------------------
+c    reference data for 102X102X102 grids after 400 time steps,
+c    with DT = 1.0d-03
+c---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 102) .and. 
+     >           (grid_points(2) .eq. 102) .and.
+     >           (grid_points(3) .eq. 102) .and.
+     >           (no_time_steps .eq. 400) ) then
+
+           class = 'B'
+           dtref = 1.0d-3
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+           xcrref(1) = 0.6903293579998d+02
+           xcrref(2) = 0.3095134488084d+02
+           xcrref(3) = 0.4103336647017d+02
+           xcrref(4) = 0.3864769009604d+02
+           xcrref(5) = 0.5643482272596d+02
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+           xceref(1) = 0.9810006190188d-02
+           xceref(2) = 0.1022827905670d-02
+           xceref(3) = 0.1720597911692d-02
+           xceref(4) = 0.1694479428231d-02
+           xceref(5) = 0.1847456263981d-01
+
+c---------------------------------------------------------------------
+c    reference data for 162X162X162 grids after 400 time steps,
+c    with DT = 0.67d-03
+c---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 162) .and. 
+     >           (grid_points(2) .eq. 162) .and.
+     >           (grid_points(3) .eq. 162) .and.
+     >           (no_time_steps .eq. 400) ) then
+
+           class = 'C'
+           dtref = 0.67d-3
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+           xcrref(1) = 0.5881691581829d+03
+           xcrref(2) = 0.2454417603569d+03
+           xcrref(3) = 0.3293829191851d+03
+           xcrref(4) = 0.3081924971891d+03
+           xcrref(5) = 0.4597223799176d+03
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+           xceref(1) = 0.2598120500183d+00
+           xceref(2) = 0.2590888922315d-01
+           xceref(3) = 0.5132886416320d-01
+           xceref(4) = 0.4806073419454d-01
+           xceref(5) = 0.5483377491301d+00
+
+c---------------------------------------------------------------------
+c    reference data for 408X408X408 grids after 500 time steps,
+c    with DT = 0.3d-03
+c---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 408) .and. 
+     >           (grid_points(2) .eq. 408) .and.
+     >           (grid_points(3) .eq. 408) .and.
+     >           (no_time_steps .eq. 500) ) then
+
+           class = 'D'
+           dtref = 0.30d-3
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+           xcrref(1) = 0.1044696216887d+05
+           xcrref(2) = 0.3204427762578d+04
+           xcrref(3) = 0.4648680733032d+04
+           xcrref(4) = 0.4238923283697d+04
+           xcrref(5) = 0.7588412036136d+04
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+           xceref(1) = 0.5089471423669d+01
+           xceref(2) = 0.5323514855894d+00
+           xceref(3) = 0.1187051008971d+01
+           xceref(4) = 0.1083734951938d+01
+           xceref(5) = 0.1164108338568d+02
+
+c---------------------------------------------------------------------
+c    reference data for 1020X1020X1020 grids after 500 time steps,
+c    with DT = 0.1d-03
+c---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 1020) .and. 
+     >           (grid_points(2) .eq. 1020) .and.
+     >           (grid_points(3) .eq. 1020) .and.
+     >           (no_time_steps .eq. 500) ) then
+
+           class = 'E'
+           dtref = 0.10d-3
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of residual.
+c---------------------------------------------------------------------
+           xcrref(1) = 0.6255387422609d+05
+           xcrref(2) = 0.1495317020012d+05
+           xcrref(3) = 0.2347595750586d+05
+           xcrref(4) = 0.2091099783534d+05
+           xcrref(5) = 0.4770412841218d+05
+
+c---------------------------------------------------------------------
+c    Reference values of RMS-norms of solution error.
+c---------------------------------------------------------------------
+           xceref(1) = 0.6742735164909d+02
+           xceref(2) = 0.5390656036938d+01
+           xceref(3) = 0.1680647196477d+02
+           xceref(4) = 0.1536963126457d+02
+           xceref(5) = 0.1575330146156d+03
+
+
+        else
+           verified = .false.
+        endif
+
+c---------------------------------------------------------------------
+c    verification test for residuals if gridsize is one of 
+c    the defined grid sizes above (class .ne. 'U')
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c    Compute the difference of solution values and the known reference values.
+c---------------------------------------------------------------------
+        do m = 1, 5
+           
+           xcrdif(m) = dabs((xcr(m)-xcrref(m))/xcrref(m)) 
+           xcedif(m) = dabs((xce(m)-xceref(m))/xceref(m))
+           
+        enddo
+
+c---------------------------------------------------------------------
+c    Output the comparison of computed results to known cases.
+c---------------------------------------------------------------------
+
+        if (class .ne. 'U') then
+           write(*, 1990) class
+ 1990      format(' Verification being performed for class ', a)
+           write (*,2000) epsilon
+ 2000      format(' accuracy setting for epsilon = ', E20.13)
+           verified = (dabs(dt-dtref) .le. epsilon)
+           if (.not.verified) then  
+              class = 'U'
+              write (*,1000) dtref
+ 1000         format(' DT does not match the reference value of ', 
+     >                 E15.8)
+           endif
+        else 
+           write(*, 1995)
+ 1995      format(' Unknown class')
+        endif
+
+
+        if (class .ne. 'U') then
+           write (*, 2001) 
+        else
+           write (*, 2005)
+        endif
+
+ 2001   format(' Comparison of RMS-norms of residual')
+ 2005   format(' RMS-norms of residual')
+        do m = 1, 5
+           if (class .eq. 'U') then
+              write(*, 2015) m, xcr(m)
+           else if (xcrdif(m) .le. epsilon) then
+              write (*,2011) m,xcr(m),xcrref(m),xcrdif(m)
+           else 
+              verified = .false.
+              write (*,2010) m,xcr(m),xcrref(m),xcrdif(m)
+           endif
+        enddo
+
+        if (class .ne. 'U') then
+           write (*,2002)
+        else
+           write (*,2006)
+        endif
+ 2002   format(' Comparison of RMS-norms of solution error')
+ 2006   format(' RMS-norms of solution error')
+        
+        do m = 1, 5
+           if (class .eq. 'U') then
+              write(*, 2015) m, xce(m)
+           else if (xcedif(m) .le. epsilon) then
+              write (*,2011) m,xce(m),xceref(m),xcedif(m)
+           else
+              verified = .false.
+              write (*,2010) m,xce(m),xceref(m),xcedif(m)
+           endif
+        enddo
+        
+ 2010   format(' FAILURE: ', i2, E20.13, E20.13, E20.13)
+ 2011   format('          ', i2, E20.13, E20.13, E20.13)
+ 2015   format('          ', i2, E20.13)
+        
+        if (class .eq. 'U') then
+           write(*, 2022)
+           write(*, 2023)
+ 2022      format(' No reference values provided')
+ 2023      format(' No verification performed')
+        else if (verified) then
+           write(*, 2020)
+ 2020      format(' Verification Successful')
+        else
+           write(*, 2021)
+ 2021      format(' Verification failed')
+        endif
+
+        return
+
+
+        end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/x_solve.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/x_solve.f
new file mode 100644
index 0000000..0a01982
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/x_solve.f
@@ -0,0 +1,327 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine x_solve
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c this subroutine performs the solution of the approximate factorization
+c step in the x-direction for all five matrix components
+c simultaneously. The Thomas algorithm is employed to solve the
+c systems for the x-lines. Boundary conditions are non-periodic
+c---------------------------------------------------------------------
+
+       include 'header.h'
+
+       integer i, j, k, i1, i2, m
+       double precision  ru1, fac1, fac2
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       if (timeron) call timer_start(t_xsolve)
+       do  k = 1, nz2
+
+          call lhsinit(nx2+1, ny2)
+
+c---------------------------------------------------------------------
+c Computes the left hand side for the three x-factors  
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c      first fill the lhs for the u-eigenvalue                   
+c---------------------------------------------------------------------
+          do  j = 1, ny2
+             do  i = 0, grid_points(1)-1
+                ru1 = c3c4*rho_i(i,j,k)
+                cv(i) = us(i,j,k)
+                rhon(i) = dmax1(dx2+con43*ru1, 
+     >                          dx5+c1c5*ru1,
+     >                          dxmax+ru1,
+     >                          dx1)
+             end do
+
+             do  i = 1, nx2
+                lhs(1,i,j) =   0.0d0
+                lhs(2,i,j) = - dttx2 * cv(i-1) - dttx1 * rhon(i-1)
+                lhs(3,i,j) =   1.0d0 + c2dttx1 * rhon(i)
+                lhs(4,i,j) =   dttx2 * cv(i+1) - dttx1 * rhon(i+1)
+                lhs(5,i,j) =   0.0d0
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c      add fourth order dissipation                             
+c---------------------------------------------------------------------
+
+          do  j = 1, ny2
+             i = 1
+             lhs(3,i,j) = lhs(3,i,j) + comz5
+             lhs(4,i,j) = lhs(4,i,j) - comz4
+             lhs(5,i,j) = lhs(5,i,j) + comz1
+  
+             lhs(2,i+1,j) = lhs(2,i+1,j) - comz4
+             lhs(3,i+1,j) = lhs(3,i+1,j) + comz6
+             lhs(4,i+1,j) = lhs(4,i+1,j) - comz4
+             lhs(5,i+1,j) = lhs(5,i+1,j) + comz1
+          end do
+
+          do  j = 1, ny2
+             do   i=3, grid_points(1)-4
+                lhs(1,i,j) = lhs(1,i,j) + comz1
+                lhs(2,i,j) = lhs(2,i,j) - comz4
+                lhs(3,i,j) = lhs(3,i,j) + comz6
+                lhs(4,i,j) = lhs(4,i,j) - comz4
+                lhs(5,i,j) = lhs(5,i,j) + comz1
+             end do
+          end do
+
+
+          do  j = 1, ny2
+             i = grid_points(1)-3
+             lhs(1,i,j) = lhs(1,i,j) + comz1
+             lhs(2,i,j) = lhs(2,i,j) - comz4
+             lhs(3,i,j) = lhs(3,i,j) + comz6
+             lhs(4,i,j) = lhs(4,i,j) - comz4
+
+             lhs(1,i+1,j) = lhs(1,i+1,j) + comz1
+             lhs(2,i+1,j) = lhs(2,i+1,j) - comz4
+             lhs(3,i+1,j) = lhs(3,i+1,j) + comz5
+          end do
+
+c---------------------------------------------------------------------
+c      subsequently, fill the other factors (u+c), (u-c) by adding to 
+c      the first  
+c---------------------------------------------------------------------
+          do  j = 1, ny2
+             do   i = 1, nx2
+                lhsp(1,i,j) = lhs(1,i,j)
+                lhsp(2,i,j) = lhs(2,i,j) - 
+     >                            dttx2 * speed(i-1,j,k)
+                lhsp(3,i,j) = lhs(3,i,j)
+                lhsp(4,i,j) = lhs(4,i,j) + 
+     >                            dttx2 * speed(i+1,j,k)
+                lhsp(5,i,j) = lhs(5,i,j)
+                lhsm(1,i,j) = lhs(1,i,j)
+                lhsm(2,i,j) = lhs(2,i,j) + 
+     >                            dttx2 * speed(i-1,j,k)
+                lhsm(3,i,j) = lhs(3,i,j)
+                lhsm(4,i,j) = lhs(4,i,j) - 
+     >                            dttx2 * speed(i+1,j,k)
+                lhsm(5,i,j) = lhs(5,i,j)
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c                          FORWARD ELIMINATION  
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c      perform the Thomas algorithm; first, FORWARD ELIMINATION     
+c---------------------------------------------------------------------
+
+          do  j = 1, ny2
+             do    i = 0, grid_points(1)-3
+                i1 = i  + 1
+                i2 = i  + 2
+                fac1      = 1.d0/lhs(3,i,j)
+                lhs(4,i,j)  = fac1*lhs(4,i,j)
+                lhs(5,i,j)  = fac1*lhs(5,i,j)
+                do    m = 1, 3
+                   rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+                end do
+                lhs(3,i1,j) = lhs(3,i1,j) -
+     >                         lhs(2,i1,j)*lhs(4,i,j)
+                lhs(4,i1,j) = lhs(4,i1,j) -
+     >                         lhs(2,i1,j)*lhs(5,i,j)
+                do    m = 1, 3
+                   rhs(m,i1,j,k) = rhs(m,i1,j,k) -
+     >                         lhs(2,i1,j)*rhs(m,i,j,k)
+                end do
+                lhs(2,i2,j) = lhs(2,i2,j) -
+     >                         lhs(1,i2,j)*lhs(4,i,j)
+                lhs(3,i2,j) = lhs(3,i2,j) -
+     >                         lhs(1,i2,j)*lhs(5,i,j)
+                do    m = 1, 3
+                   rhs(m,i2,j,k) = rhs(m,i2,j,k) -
+     >                         lhs(1,i2,j)*rhs(m,i,j,k)
+                end do
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c      The last two rows in this grid block are a bit different, 
+c      since they do not have two more rows available for the
+c      elimination of off-diagonal entries
+c---------------------------------------------------------------------
+
+          do  j = 1, ny2
+             i  = grid_points(1)-2
+             i1 = grid_points(1)-1
+             fac1      = 1.d0/lhs(3,i,j)
+             lhs(4,i,j)  = fac1*lhs(4,i,j)
+             lhs(5,i,j)  = fac1*lhs(5,i,j)
+             do    m = 1, 3
+                rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+             end do
+             lhs(3,i1,j) = lhs(3,i1,j) -
+     >                      lhs(2,i1,j)*lhs(4,i,j)
+             lhs(4,i1,j) = lhs(4,i1,j) -
+     >                      lhs(2,i1,j)*lhs(5,i,j)
+             do    m = 1, 3
+                rhs(m,i1,j,k) = rhs(m,i1,j,k) -
+     >                      lhs(2,i1,j)*rhs(m,i,j,k)
+             end do
+c---------------------------------------------------------------------
+c            scale the last row immediately 
+c---------------------------------------------------------------------
+             fac2             = 1.d0/lhs(3,i1,j)
+             do    m = 1, 3
+                rhs(m,i1,j,k) = fac2*rhs(m,i1,j,k)
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c      do the u+c and the u-c factors                 
+c---------------------------------------------------------------------
+
+          do  j = 1, ny2
+             do    i = 0, grid_points(1)-3
+                i1 = i  + 1
+                i2 = i  + 2
+                m = 4
+                fac1       = 1.d0/lhsp(3,i,j)
+                lhsp(4,i,j)  = fac1*lhsp(4,i,j)
+                lhsp(5,i,j)  = fac1*lhsp(5,i,j)
+                rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+                lhsp(3,i1,j) = lhsp(3,i1,j) -
+     >                        lhsp(2,i1,j)*lhsp(4,i,j)
+                lhsp(4,i1,j) = lhsp(4,i1,j) -
+     >                        lhsp(2,i1,j)*lhsp(5,i,j)
+                rhs(m,i1,j,k) = rhs(m,i1,j,k) -
+     >                        lhsp(2,i1,j)*rhs(m,i,j,k)
+                lhsp(2,i2,j) = lhsp(2,i2,j) -
+     >                        lhsp(1,i2,j)*lhsp(4,i,j)
+                lhsp(3,i2,j) = lhsp(3,i2,j) -
+     >                        lhsp(1,i2,j)*lhsp(5,i,j)
+                rhs(m,i2,j,k) = rhs(m,i2,j,k) -
+     >                        lhsp(1,i2,j)*rhs(m,i,j,k)
+                m = 5
+                fac1       = 1.d0/lhsm(3,i,j)
+                lhsm(4,i,j)  = fac1*lhsm(4,i,j)
+                lhsm(5,i,j)  = fac1*lhsm(5,i,j)
+                rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+                lhsm(3,i1,j) = lhsm(3,i1,j) -
+     >                        lhsm(2,i1,j)*lhsm(4,i,j)
+                lhsm(4,i1,j) = lhsm(4,i1,j) -
+     >                        lhsm(2,i1,j)*lhsm(5,i,j)
+                rhs(m,i1,j,k) = rhs(m,i1,j,k) -
+     >                        lhsm(2,i1,j)*rhs(m,i,j,k)
+                lhsm(2,i2,j) = lhsm(2,i2,j) -
+     >                        lhsm(1,i2,j)*lhsm(4,i,j)
+                lhsm(3,i2,j) = lhsm(3,i2,j) -
+     >                        lhsm(1,i2,j)*lhsm(5,i,j)
+                rhs(m,i2,j,k) = rhs(m,i2,j,k) -
+     >                        lhsm(1,i2,j)*rhs(m,i,j,k)
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c         And again the last two rows separately
+c---------------------------------------------------------------------
+          do  j = 1, ny2
+             i  = grid_points(1)-2
+             i1 = grid_points(1)-1
+             m = 4
+             fac1       = 1.d0/lhsp(3,i,j)
+             lhsp(4,i,j)  = fac1*lhsp(4,i,j)
+             lhsp(5,i,j)  = fac1*lhsp(5,i,j)
+             rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+             lhsp(3,i1,j) = lhsp(3,i1,j) -
+     >                      lhsp(2,i1,j)*lhsp(4,i,j)
+             lhsp(4,i1,j) = lhsp(4,i1,j) -
+     >                      lhsp(2,i1,j)*lhsp(5,i,j)
+             rhs(m,i1,j,k) = rhs(m,i1,j,k) -
+     >                      lhsp(2,i1,j)*rhs(m,i,j,k)
+             m = 5
+             fac1       = 1.d0/lhsm(3,i,j)
+             lhsm(4,i,j)  = fac1*lhsm(4,i,j)
+             lhsm(5,i,j)  = fac1*lhsm(5,i,j)
+             rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+             lhsm(3,i1,j) = lhsm(3,i1,j) -
+     >                      lhsm(2,i1,j)*lhsm(4,i,j)
+             lhsm(4,i1,j) = lhsm(4,i1,j) -
+     >                      lhsm(2,i1,j)*lhsm(5,i,j)
+             rhs(m,i1,j,k) = rhs(m,i1,j,k) -
+     >                      lhsm(2,i1,j)*rhs(m,i,j,k)
+c---------------------------------------------------------------------
+c               Scale the last row immediately
+c---------------------------------------------------------------------
+             rhs(4,i1,j,k) = rhs(4,i1,j,k)/lhsp(3,i1,j)
+             rhs(5,i1,j,k) = rhs(5,i1,j,k)/lhsm(3,i1,j)
+          end do
+
+
+c---------------------------------------------------------------------
+c                         BACKSUBSTITUTION 
+c---------------------------------------------------------------------
+
+
+          do  j = 1, ny2
+             i  = grid_points(1)-2
+             i1 = grid_points(1)-1
+             do   m = 1, 3
+                rhs(m,i,j,k) = rhs(m,i,j,k) -
+     >                             lhs(4,i,j)*rhs(m,i1,j,k)
+             end do
+
+             rhs(4,i,j,k) = rhs(4,i,j,k) -
+     >                          lhsp(4,i,j)*rhs(4,i1,j,k)
+             rhs(5,i,j,k) = rhs(5,i,j,k) -
+     >                          lhsm(4,i,j)*rhs(5,i1,j,k)
+          end do
+
+c---------------------------------------------------------------------
+c      The first three factors
+c---------------------------------------------------------------------
+          do  j = 1, ny2
+             do    i = grid_points(1)-3, 0, -1
+                i1 = i  + 1
+                i2 = i  + 2
+                do   m = 1, 3
+                   rhs(m,i,j,k) = rhs(m,i,j,k) - 
+     >                          lhs(4,i,j)*rhs(m,i1,j,k) -
+     >                          lhs(5,i,j)*rhs(m,i2,j,k)
+                end do
+
+c---------------------------------------------------------------------
+c      And the remaining two
+c---------------------------------------------------------------------
+                rhs(4,i,j,k) = rhs(4,i,j,k) - 
+     >                          lhsp(4,i,j)*rhs(4,i1,j,k) -
+     >                          lhsp(5,i,j)*rhs(4,i2,j,k)
+                rhs(5,i,j,k) = rhs(5,i,j,k) - 
+     >                          lhsm(4,i,j)*rhs(5,i1,j,k) -
+     >                          lhsm(5,i,j)*rhs(5,i2,j,k)
+             end do
+          end do
+       end do
+       if (timeron) call timer_stop(t_xsolve)
+
+c---------------------------------------------------------------------
+c      Do the block-diagonal inversion          
+c---------------------------------------------------------------------
+       call ninvr
+
+       return
+       end
+    
+
+
+
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/y_solve.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/y_solve.f
new file mode 100644
index 0000000..e78eaab
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/y_solve.f
@@ -0,0 +1,320 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine y_solve
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c this function performs the solution of the approximate factorization
+c step in the y-direction for all five matrix components
+c simultaneously. The Thomas algorithm is employed to solve the
+c systems for the y-lines. Boundary conditions are non-periodic
+c---------------------------------------------------------------------
+
+       include 'header.h'
+
+       integer i, j, k, j1, j2, m
+       double precision ru1, fac1, fac2
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       if (timeron) call timer_start(t_ysolve)
+       do  k = 1, grid_points(3)-2
+
+          call lhsinitj(ny2+1, nx2)
+
+c---------------------------------------------------------------------
+c Computes the left hand side for the three y-factors   
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c      first fill the lhs for the u-eigenvalue         
+c---------------------------------------------------------------------
+
+          do  i = 1, grid_points(1)-2
+             do  j = 0, grid_points(2)-1
+                ru1 = c3c4*rho_i(i,j,k)
+                cv(j) = vs(i,j,k)
+                rhoq(j) = dmax1( dy3 + con43 * ru1,
+     >                           dy5 + c1c5*ru1,
+     >                           dymax + ru1,
+     >                           dy1)
+             end do
+            
+             do  j = 1, grid_points(2)-2
+                lhs(1,i,j) =  0.0d0
+                lhs(2,i,j) = -dtty2 * cv(j-1) - dtty1 * rhoq(j-1)
+                lhs(3,i,j) =  1.0 + c2dtty1 * rhoq(j)
+                lhs(4,i,j) =  dtty2 * cv(j+1) - dtty1 * rhoq(j+1)
+                lhs(5,i,j) =  0.0d0
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c      add fourth order dissipation                             
+c---------------------------------------------------------------------
+
+          do  i = 1, grid_points(1)-2
+             j = 1
+             lhs(3,i,j) = lhs(3,i,j) + comz5
+             lhs(4,i,j) = lhs(4,i,j) - comz4
+             lhs(5,i,j) = lhs(5,i,j) + comz1
+       
+             lhs(2,i,j+1) = lhs(2,i,j+1) - comz4
+             lhs(3,i,j+1) = lhs(3,i,j+1) + comz6
+             lhs(4,i,j+1) = lhs(4,i,j+1) - comz4
+             lhs(5,i,j+1) = lhs(5,i,j+1) + comz1
+          end do
+
+          do   j=3, grid_points(2)-4
+             do  i = 1, grid_points(1)-2
+
+                lhs(1,i,j) = lhs(1,i,j) + comz1
+                lhs(2,i,j) = lhs(2,i,j) - comz4
+                lhs(3,i,j) = lhs(3,i,j) + comz6
+                lhs(4,i,j) = lhs(4,i,j) - comz4
+                lhs(5,i,j) = lhs(5,i,j) + comz1
+             end do
+          end do
+
+          do  i = 1, grid_points(1)-2
+             j = grid_points(2)-3
+             lhs(1,i,j) = lhs(1,i,j) + comz1
+             lhs(2,i,j) = lhs(2,i,j) - comz4
+             lhs(3,i,j) = lhs(3,i,j) + comz6
+             lhs(4,i,j) = lhs(4,i,j) - comz4
+
+             lhs(1,i,j+1) = lhs(1,i,j+1) + comz1
+             lhs(2,i,j+1) = lhs(2,i,j+1) - comz4
+             lhs(3,i,j+1) = lhs(3,i,j+1) + comz5
+          end do
+
+c---------------------------------------------------------------------
+c      subsequently, do the other two factors                    
+c---------------------------------------------------------------------
+          do    j = 1, grid_points(2)-2
+             do  i = 1, grid_points(1)-2
+                lhsp(1,i,j) = lhs(1,i,j)
+                lhsp(2,i,j) = lhs(2,i,j) - 
+     >                            dtty2 * speed(i,j-1,k)
+                lhsp(3,i,j) = lhs(3,i,j)
+                lhsp(4,i,j) = lhs(4,i,j) + 
+     >                            dtty2 * speed(i,j+1,k)
+                lhsp(5,i,j) = lhs(5,i,j)
+                lhsm(1,i,j) = lhs(1,i,j)
+                lhsm(2,i,j) = lhs(2,i,j) + 
+     >                            dtty2 * speed(i,j-1,k)
+                lhsm(3,i,j) = lhs(3,i,j)
+                lhsm(4,i,j) = lhs(4,i,j) - 
+     >                            dtty2 * speed(i,j+1,k)
+                lhsm(5,i,j) = lhs(5,i,j)
+             end do
+          end do
+
+
+c---------------------------------------------------------------------
+c                          FORWARD ELIMINATION  
+c---------------------------------------------------------------------
+
+          do    j = 0, grid_points(2)-3
+             j1 = j  + 1
+             j2 = j  + 2
+             do  i = 1, grid_points(1)-2
+                fac1      = 1.d0/lhs(3,i,j)
+                lhs(4,i,j)  = fac1*lhs(4,i,j)
+                lhs(5,i,j)  = fac1*lhs(5,i,j)
+                do    m = 1, 3
+                   rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+                end do
+                lhs(3,i,j1) = lhs(3,i,j1) -
+     >                         lhs(2,i,j1)*lhs(4,i,j)
+                lhs(4,i,j1) = lhs(4,i,j1) -
+     >                         lhs(2,i,j1)*lhs(5,i,j)
+                do    m = 1, 3
+                   rhs(m,i,j1,k) = rhs(m,i,j1,k) -
+     >                         lhs(2,i,j1)*rhs(m,i,j,k)
+                end do
+                lhs(2,i,j2) = lhs(2,i,j2) -
+     >                         lhs(1,i,j2)*lhs(4,i,j)
+                lhs(3,i,j2) = lhs(3,i,j2) -
+     >                         lhs(1,i,j2)*lhs(5,i,j)
+                do    m = 1, 3
+                   rhs(m,i,j2,k) = rhs(m,i,j2,k) -
+     >                         lhs(1,i,j2)*rhs(m,i,j,k)
+                end do
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c      The last two rows in this grid block are a bit different, 
+c      since they do not have two more rows available for the
+c      elimination of off-diagonal entries
+c---------------------------------------------------------------------
+
+          j  = grid_points(2)-2
+          j1 = grid_points(2)-1
+          do  i = 1, grid_points(1)-2
+             fac1      = 1.d0/lhs(3,i,j)
+             lhs(4,i,j)  = fac1*lhs(4,i,j)
+             lhs(5,i,j)  = fac1*lhs(5,i,j)
+             do    m = 1, 3
+                rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+             end do
+             lhs(3,i,j1) = lhs(3,i,j1) -
+     >                      lhs(2,i,j1)*lhs(4,i,j)
+             lhs(4,i,j1) = lhs(4,i,j1) -
+     >                      lhs(2,i,j1)*lhs(5,i,j)
+             do    m = 1, 3
+                rhs(m,i,j1,k) = rhs(m,i,j1,k) -
+     >                      lhs(2,i,j1)*rhs(m,i,j,k)
+             end do
+c---------------------------------------------------------------------
+c            scale the last row immediately 
+c---------------------------------------------------------------------
+             fac2      = 1.d0/lhs(3,i,j1)
+             do    m = 1, 3
+                rhs(m,i,j1,k) = fac2*rhs(m,i,j1,k)
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c      do the u+c and the u-c factors                 
+c---------------------------------------------------------------------
+          do    j = 0, grid_points(2)-3
+             j1 = j  + 1
+             j2 = j  + 2
+             do  i = 1, grid_points(1)-2
+                m = 4
+                fac1       = 1.d0/lhsp(3,i,j)
+                lhsp(4,i,j)  = fac1*lhsp(4,i,j)
+                lhsp(5,i,j)  = fac1*lhsp(5,i,j)
+                rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+                lhsp(3,i,j1) = lhsp(3,i,j1) -
+     >                       lhsp(2,i,j1)*lhsp(4,i,j)
+                lhsp(4,i,j1) = lhsp(4,i,j1) -
+     >                       lhsp(2,i,j1)*lhsp(5,i,j)
+                rhs(m,i,j1,k) = rhs(m,i,j1,k) -
+     >                       lhsp(2,i,j1)*rhs(m,i,j,k)
+                lhsp(2,i,j2) = lhsp(2,i,j2) -
+     >                       lhsp(1,i,j2)*lhsp(4,i,j)
+                lhsp(3,i,j2) = lhsp(3,i,j2) -
+     >                       lhsp(1,i,j2)*lhsp(5,i,j)
+                rhs(m,i,j2,k) = rhs(m,i,j2,k) -
+     >                       lhsp(1,i,j2)*rhs(m,i,j,k)
+                m = 5
+                fac1       = 1.d0/lhsm(3,i,j)
+                lhsm(4,i,j)  = fac1*lhsm(4,i,j)
+                lhsm(5,i,j)  = fac1*lhsm(5,i,j)
+                rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+                lhsm(3,i,j1) = lhsm(3,i,j1) -
+     >                       lhsm(2,i,j1)*lhsm(4,i,j)
+                lhsm(4,i,j1) = lhsm(4,i,j1) -
+     >                       lhsm(2,i,j1)*lhsm(5,i,j)
+                rhs(m,i,j1,k) = rhs(m,i,j1,k) -
+     >                       lhsm(2,i,j1)*rhs(m,i,j,k)
+                lhsm(2,i,j2) = lhsm(2,i,j2) -
+     >                       lhsm(1,i,j2)*lhsm(4,i,j)
+                lhsm(3,i,j2) = lhsm(3,i,j2) -
+     >                       lhsm(1,i,j2)*lhsm(5,i,j)
+                rhs(m,i,j2,k) = rhs(m,i,j2,k) -
+     >                       lhsm(1,i,j2)*rhs(m,i,j,k)
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c         And again the last two rows separately
+c---------------------------------------------------------------------
+          j  = grid_points(2)-2
+          j1 = grid_points(2)-1
+          do  i = 1, grid_points(1)-2
+             m = 4
+             fac1       = 1.d0/lhsp(3,i,j)
+             lhsp(4,i,j)  = fac1*lhsp(4,i,j)
+             lhsp(5,i,j)  = fac1*lhsp(5,i,j)
+             rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+             lhsp(3,i,j1) = lhsp(3,i,j1) -
+     >                    lhsp(2,i,j1)*lhsp(4,i,j)
+             lhsp(4,i,j1) = lhsp(4,i,j1) -
+     >                    lhsp(2,i,j1)*lhsp(5,i,j)
+             rhs(m,i,j1,k)   = rhs(m,i,j1,k) -
+     >                    lhsp(2,i,j1)*rhs(m,i,j,k)
+             m = 5
+             fac1       = 1.d0/lhsm(3,i,j)
+             lhsm(4,i,j)  = fac1*lhsm(4,i,j)
+             lhsm(5,i,j)  = fac1*lhsm(5,i,j)
+             rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+             lhsm(3,i,j1) = lhsm(3,i,j1) -
+     >                    lhsm(2,i,j1)*lhsm(4,i,j)
+             lhsm(4,i,j1) = lhsm(4,i,j1) -
+     >                    lhsm(2,i,j1)*lhsm(5,i,j)
+             rhs(m,i,j1,k)   = rhs(m,i,j1,k) -
+     >                    lhsm(2,i,j1)*rhs(m,i,j,k)
+c---------------------------------------------------------------------
+c               Scale the last row immediately 
+c---------------------------------------------------------------------
+             rhs(4,i,j1,k)   = rhs(4,i,j1,k)/lhsp(3,i,j1)
+             rhs(5,i,j1,k)   = rhs(5,i,j1,k)/lhsm(3,i,j1)
+          end do
+
+
+c---------------------------------------------------------------------
+c                         BACKSUBSTITUTION 
+c---------------------------------------------------------------------
+
+          j  = grid_points(2)-2
+          j1 = grid_points(2)-1
+          do  i = 1, grid_points(1)-2
+             do   m = 1, 3
+                rhs(m,i,j,k) = rhs(m,i,j,k) -
+     >                           lhs(4,i,j)*rhs(m,i,j1,k)
+             end do
+
+             rhs(4,i,j,k) = rhs(4,i,j,k) -
+     >                           lhsp(4,i,j)*rhs(4,i,j1,k)
+             rhs(5,i,j,k) = rhs(5,i,j,k) -
+     >                           lhsm(4,i,j)*rhs(5,i,j1,k)
+          end do
+
+c---------------------------------------------------------------------
+c      The first three factors
+c---------------------------------------------------------------------
+          do   j = grid_points(2)-3, 0, -1
+             j1 = j  + 1
+             j2 = j  + 2
+             do  i = 1, grid_points(1)-2
+                do   m = 1, 3
+                   rhs(m,i,j,k) = rhs(m,i,j,k) - 
+     >                          lhs(4,i,j)*rhs(m,i,j1,k) -
+     >                          lhs(5,i,j)*rhs(m,i,j2,k)
+                end do
+
+c---------------------------------------------------------------------
+c      And the remaining two
+c---------------------------------------------------------------------
+                rhs(4,i,j,k) = rhs(4,i,j,k) - 
+     >                          lhsp(4,i,j)*rhs(4,i,j1,k) -
+     >                          lhsp(5,i,j)*rhs(4,i,j2,k)
+                rhs(5,i,j,k) = rhs(5,i,j,k) - 
+     >                          lhsm(4,i,j)*rhs(5,i,j1,k) -
+     >                          lhsm(5,i,j)*rhs(5,i,j2,k)
+             end do
+          end do
+       end do
+       if (timeron) call timer_stop(t_ysolve)
+
+       call pinvr
+
+       return
+       end
+    
+
+
+
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/z_solve.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/z_solve.f
new file mode 100644
index 0000000..9b7634e
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/SP/z_solve.f
@@ -0,0 +1,328 @@
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       subroutine z_solve
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c this function performs the solution of the approximate factorization
+c step in the z-direction for all five matrix components
+c simultaneously. The Thomas algorithm is employed to solve the
+c systems for the z-lines. Boundary conditions are non-periodic
+c---------------------------------------------------------------------
+
+       include 'header.h'
+
+       integer i, j, k, k1, k2, m
+       double precision ru1, fac1, fac2
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+       if (timeron) call timer_start(t_zsolve)
+       do   j = 1, ny2
+
+          call lhsinitj(nz2+1, nx2)
+
+c---------------------------------------------------------------------
+c Computes the left hand side for the three z-factors   
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c first fill the lhs for the u-eigenvalue                          
+c---------------------------------------------------------------------
+
+          do   i = 1, nx2
+             do   k = 0, nz2+1
+                ru1 = c3c4*rho_i(i,j,k)
+                cv(k) = ws(i,j,k)
+                rhos(k) = dmax1(dz4 + con43 * ru1,
+     >                          dz5 + c1c5 * ru1,
+     >                          dzmax + ru1,
+     >                          dz1)
+             end do
+
+             do   k =  1, nz2
+                lhs(1,i,k) =  0.0d0
+                lhs(2,i,k) = -dttz2 * cv(k-1) - dttz1 * rhos(k-1)
+                lhs(3,i,k) =  1.0 + c2dttz1 * rhos(k)
+                lhs(4,i,k) =  dttz2 * cv(k+1) - dttz1 * rhos(k+1)
+                lhs(5,i,k) =  0.0d0
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c      add fourth order dissipation                                  
+c---------------------------------------------------------------------
+
+          do   i = 1, nx2
+             k = 1
+             lhs(3,i,k) = lhs(3,i,k) + comz5
+             lhs(4,i,k) = lhs(4,i,k) - comz4
+             lhs(5,i,k) = lhs(5,i,k) + comz1
+
+             k = 2
+             lhs(2,i,k) = lhs(2,i,k) - comz4
+             lhs(3,i,k) = lhs(3,i,k) + comz6
+             lhs(4,i,k) = lhs(4,i,k) - comz4
+             lhs(5,i,k) = lhs(5,i,k) + comz1
+          end do
+
+          do    k = 3, nz2-2
+             do   i = 1, nx2
+                lhs(1,i,k) = lhs(1,i,k) + comz1
+                lhs(2,i,k) = lhs(2,i,k) - comz4
+                lhs(3,i,k) = lhs(3,i,k) + comz6
+                lhs(4,i,k) = lhs(4,i,k) - comz4
+                lhs(5,i,k) = lhs(5,i,k) + comz1
+             end do
+          end do
+
+          do   i = 1, nx2
+             k = nz2-1
+             lhs(1,i,k) = lhs(1,i,k) + comz1
+             lhs(2,i,k) = lhs(2,i,k) - comz4
+             lhs(3,i,k) = lhs(3,i,k) + comz6
+             lhs(4,i,k) = lhs(4,i,k) - comz4
+
+             k = nz2
+             lhs(1,i,k) = lhs(1,i,k) + comz1
+             lhs(2,i,k) = lhs(2,i,k) - comz4
+             lhs(3,i,k) = lhs(3,i,k) + comz5
+          end do
+
+
+c---------------------------------------------------------------------
+c      subsequently, fill the other factors (u+c), (u-c) 
+c---------------------------------------------------------------------
+          do    k = 1, nz2
+             do   i = 1, nx2
+                lhsp(1,i,k) = lhs(1,i,k)
+                lhsp(2,i,k) = lhs(2,i,k) - 
+     >                            dttz2 * speed(i,j,k-1)
+                lhsp(3,i,k) = lhs(3,i,k)
+                lhsp(4,i,k) = lhs(4,i,k) + 
+     >                            dttz2 * speed(i,j,k+1)
+                lhsp(5,i,k) = lhs(5,i,k)
+                lhsm(1,i,k) = lhs(1,i,k)
+                lhsm(2,i,k) = lhs(2,i,k) + 
+     >                            dttz2 * speed(i,j,k-1)
+                lhsm(3,i,k) = lhs(3,i,k)
+                lhsm(4,i,k) = lhs(4,i,k) - 
+     >                            dttz2 * speed(i,j,k+1)
+                lhsm(5,i,k) = lhs(5,i,k)
+             end do
+          end do
+
+
+c---------------------------------------------------------------------
+c                          FORWARD ELIMINATION  
+c---------------------------------------------------------------------
+
+          do    k = 0, grid_points(3)-3
+             k1 = k  + 1
+             k2 = k  + 2
+             do   i = 1, nx2
+                fac1      = 1.d0/lhs(3,i,k)
+                lhs(4,i,k)  = fac1*lhs(4,i,k)
+                lhs(5,i,k)  = fac1*lhs(5,i,k)
+                do    m = 1, 3
+                   rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+                end do
+                lhs(3,i,k1) = lhs(3,i,k1) -
+     >                         lhs(2,i,k1)*lhs(4,i,k)
+                lhs(4,i,k1) = lhs(4,i,k1) -
+     >                         lhs(2,i,k1)*lhs(5,i,k)
+                do    m = 1, 3
+                   rhs(m,i,j,k1) = rhs(m,i,j,k1) -
+     >                         lhs(2,i,k1)*rhs(m,i,j,k)
+                end do
+                lhs(2,i,k2) = lhs(2,i,k2) -
+     >                         lhs(1,i,k2)*lhs(4,i,k)
+                lhs(3,i,k2) = lhs(3,i,k2) -
+     >                         lhs(1,i,k2)*lhs(5,i,k)
+                do    m = 1, 3
+                   rhs(m,i,j,k2) = rhs(m,i,j,k2) -
+     >                         lhs(1,i,k2)*rhs(m,i,j,k)
+                end do
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c      The last two rows in this grid block are a bit different, 
+c      since they do not have two more rows available for the
+c      elimination of off-diagonal entries
+c---------------------------------------------------------------------
+          k  = grid_points(3)-2
+          k1 = grid_points(3)-1
+          do   i = 1, nx2
+             fac1      = 1.d0/lhs(3,i,k)
+             lhs(4,i,k)  = fac1*lhs(4,i,k)
+             lhs(5,i,k)  = fac1*lhs(5,i,k)
+             do    m = 1, 3
+                rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+             end do
+             lhs(3,i,k1) = lhs(3,i,k1) -
+     >                      lhs(2,i,k1)*lhs(4,i,k)
+             lhs(4,i,k1) = lhs(4,i,k1) -
+     >                      lhs(2,i,k1)*lhs(5,i,k)
+             do    m = 1, 3
+                rhs(m,i,j,k1) = rhs(m,i,j,k1) -
+     >                      lhs(2,i,k1)*rhs(m,i,j,k)
+             end do
+c---------------------------------------------------------------------
+c               scale the last row immediately
+c---------------------------------------------------------------------
+             fac2      = 1.d0/lhs(3,i,k1)
+             do    m = 1, 3
+                rhs(m,i,j,k1) = fac2*rhs(m,i,j,k1)
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c      do the u+c and the u-c factors               
+c---------------------------------------------------------------------
+          do    k = 0, grid_points(3)-3
+             k1 = k  + 1
+             k2 = k  + 2
+             do   i = 1, nx2
+                m = 4
+                fac1       = 1.d0/lhsp(3,i,k)
+                lhsp(4,i,k)  = fac1*lhsp(4,i,k)
+                lhsp(5,i,k)  = fac1*lhsp(5,i,k)
+                rhs(m,i,j,k)  = fac1*rhs(m,i,j,k)
+                lhsp(3,i,k1) = lhsp(3,i,k1) -
+     >                       lhsp(2,i,k1)*lhsp(4,i,k)
+                lhsp(4,i,k1) = lhsp(4,i,k1) -
+     >                       lhsp(2,i,k1)*lhsp(5,i,k)
+                rhs(m,i,j,k1) = rhs(m,i,j,k1) -
+     >                       lhsp(2,i,k1)*rhs(m,i,j,k)
+                lhsp(2,i,k2) = lhsp(2,i,k2) -
+     >                       lhsp(1,i,k2)*lhsp(4,i,k)
+                lhsp(3,i,k2) = lhsp(3,i,k2) -
+     >                       lhsp(1,i,k2)*lhsp(5,i,k)
+                rhs(m,i,j,k2) = rhs(m,i,j,k2) -
+     >                       lhsp(1,i,k2)*rhs(m,i,j,k)
+                m = 5
+                fac1       = 1.d0/lhsm(3,i,k)
+                lhsm(4,i,k)  = fac1*lhsm(4,i,k)
+                lhsm(5,i,k)  = fac1*lhsm(5,i,k)
+                rhs(m,i,j,k)  = fac1*rhs(m,i,j,k)
+                lhsm(3,i,k1) = lhsm(3,i,k1) -
+     >                       lhsm(2,i,k1)*lhsm(4,i,k)
+                lhsm(4,i,k1) = lhsm(4,i,k1) -
+     >                       lhsm(2,i,k1)*lhsm(5,i,k)
+                rhs(m,i,j,k1) = rhs(m,i,j,k1) -
+     >                       lhsm(2,i,k1)*rhs(m,i,j,k)
+                lhsm(2,i,k2) = lhsm(2,i,k2) -
+     >                       lhsm(1,i,k2)*lhsm(4,i,k)
+                lhsm(3,i,k2) = lhsm(3,i,k2) -
+     >                       lhsm(1,i,k2)*lhsm(5,i,k)
+                rhs(m,i,j,k2) = rhs(m,i,j,k2) -
+     >                       lhsm(1,i,k2)*rhs(m,i,j,k)
+             end do
+          end do
+
+c---------------------------------------------------------------------
+c         And again the last two rows separately
+c---------------------------------------------------------------------
+          k  = grid_points(3)-2
+          k1 = grid_points(3)-1
+          do   i = 1, nx2
+             m = 4
+             fac1       = 1.d0/lhsp(3,i,k)
+             lhsp(4,i,k)  = fac1*lhsp(4,i,k)
+             lhsp(5,i,k)  = fac1*lhsp(5,i,k)
+             rhs(m,i,j,k)  = fac1*rhs(m,i,j,k)
+             lhsp(3,i,k1) = lhsp(3,i,k1) -
+     >                    lhsp(2,i,k1)*lhsp(4,i,k)
+             lhsp(4,i,k1) = lhsp(4,i,k1) -
+     >                    lhsp(2,i,k1)*lhsp(5,i,k)
+             rhs(m,i,j,k1) = rhs(m,i,j,k1) -
+     >                    lhsp(2,i,k1)*rhs(m,i,j,k)
+             m = 5
+             fac1       = 1.d0/lhsm(3,i,k)
+             lhsm(4,i,k)  = fac1*lhsm(4,i,k)
+             lhsm(5,i,k)  = fac1*lhsm(5,i,k)
+             rhs(m,i,j,k)  = fac1*rhs(m,i,j,k)
+             lhsm(3,i,k1) = lhsm(3,i,k1) -
+     >                    lhsm(2,i,k1)*lhsm(4,i,k)
+             lhsm(4,i,k1) = lhsm(4,i,k1) -
+     >                    lhsm(2,i,k1)*lhsm(5,i,k)
+             rhs(m,i,j,k1) = rhs(m,i,j,k1) -
+     >                    lhsm(2,i,k1)*rhs(m,i,j,k)
+c---------------------------------------------------------------------
+c               Scale the last row immediately (some of this is overkill
+c               if this is the last cell)
+c---------------------------------------------------------------------
+             rhs(4,i,j,k1) = rhs(4,i,j,k1)/lhsp(3,i,k1)
+             rhs(5,i,j,k1) = rhs(5,i,j,k1)/lhsm(3,i,k1)
+          end do
+
+
+c---------------------------------------------------------------------
+c                         BACKSUBSTITUTION 
+c---------------------------------------------------------------------
+
+          k  = grid_points(3)-2
+          k1 = grid_points(3)-1
+          do   i = 1, nx2
+             do   m = 1, 3
+                rhs(m,i,j,k) = rhs(m,i,j,k) -
+     >                             lhs(4,i,k)*rhs(m,i,j,k1)
+             end do
+
+             rhs(4,i,j,k) = rhs(4,i,j,k) -
+     >                             lhsp(4,i,k)*rhs(4,i,j,k1)
+             rhs(5,i,j,k) = rhs(5,i,j,k) -
+     >                             lhsm(4,i,k)*rhs(5,i,j,k1)
+          end do
+
+c---------------------------------------------------------------------
+c      Whether or not this is the last processor, we always have
+c      to complete the back-substitution 
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c      The first three factors
+c---------------------------------------------------------------------
+          do   k = grid_points(3)-3, 0, -1
+             k1 = k  + 1
+             k2 = k  + 2
+             do   i = 1, nx2
+                do   m = 1, 3
+                   rhs(m,i,j,k) = rhs(m,i,j,k) - 
+     >                          lhs(4,i,k)*rhs(m,i,j,k1) -
+     >                          lhs(5,i,k)*rhs(m,i,j,k2)
+                end do
+
+c---------------------------------------------------------------------
+c      And the remaining two
+c---------------------------------------------------------------------
+                rhs(4,i,j,k) = rhs(4,i,j,k) - 
+     >                          lhsp(4,i,k)*rhs(4,i,j,k1) -
+     >                          lhsp(5,i,k)*rhs(4,i,j,k2)
+                rhs(5,i,j,k) = rhs(5,i,j,k) - 
+     >                          lhsm(4,i,k)*rhs(5,i,j,k1) -
+     >                          lhsm(5,i,k)*rhs(5,i,j,k2)
+             end do
+
+          end do
+       end do
+       if (timeron) call timer_stop(t_zsolve)
+
+       call tzetar
+
+       return
+       end
+    
+
+
+
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/Makefile b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/Makefile
new file mode 100644
index 0000000..54d4096
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/Makefile
@@ -0,0 +1,44 @@
+SHELL=/bin/sh
+BENCHMARK=ua
+BENCHMARKU=UA
+
+include ../config/make.def
+
+
+OBJS = ua.o convect.o diffuse.o adapt.o move.o mason.o \
+       precond.o utils.o transfer.o verify.o  setup.o\
+       ${COMMON}/print_results.o ${COMMON}/timers.o ${COMMON}/wtime.o
+
+include ../sys/make.common
+
+# npbparams.h is included by header.h
+# The following rule should do the trick but many make programs (not gmake)
+# will do the wrong thing and rebuild the world every time (because the
+# mod time on header.h is not changed). One solution would be to
+# touch header.h but this might cause confusion if someone has
+# accidentally deleted it. Instead, make the dependency on npbparams.h
+# explicit in all the lines below (even though dependence is indirect). 
+
+# header.h: npbparams.h
+
+${PROGRAM}: config ${OBJS}
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} ${F_LIB}
+
+.f.o:
+	${FCOMPILE} $<
+
+ua.o:        ua.f       header.h npbparams.h
+setup.o:     setup.f    header.h npbparams.h
+convect.o:   convect.f  header.h npbparams.h
+adapt.o:     adapt.f    header.h npbparams.h
+move.o:      move.f     header.h npbparams.h
+diffuse.o:   diffuse.f  header.h npbparams.h
+mason.o:     mason.f    header.h npbparams.h
+precond.o:   precond.f  header.h npbparams.h
+transfer.o:  transfer.f header.h npbparams.h
+utils.o:     utils.f    header.h npbparams.h
+verify.o:    verify.f   header.h npbparams.h
+
+clean:
+	- rm -f *.o *~ mputil*
+	- rm -f npbparams.h core
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/README b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/README
new file mode 100644
index 0000000..8b3196b
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/README
@@ -0,0 +1,122 @@
+This file explains some terms used in the comments of the code.
+
+1. Face Index:
+defined on an element.
+range: 1,2,...,6
+order: (when facing an element) its right, left, back, front, top and bottom.
+
+2. Local Edge Index:  
+defined on a face. 
+range: 1,2,3,4 
+order: bottom, right, top and left edge (of a face).
+
+3. Global Edge Index: 
+defined on an element. 
+range: 1,2,3,...,11,12.
+Order: 
+1,2,3,4:  local edge 1,2,3,4 on face 1
+5,6,7,8:  local edge 1,2,3,4 on face 2
+9,10   :  local edge 1,3 on face 3
+11,12  :  local edge 1,3 on face 4
+
+4. Local Corner Index
+defined on a face
+range: 1,2,3,4
+order: left_bottom, right_bottom, left_top, right_top
+
+5. Vertex Index
+defined on an element
+range: 1,2,...,7,8
+order: 
+1,2,3,4: local corner 1~4 on face 6
+5,6,7,8: local corner 1~4 on face 5
+
+6. Face type
+defined on an element
+range: 0,1,2,3
+type 0: domain boundary (no neighbor element)
+type 1: neighbor element is larger in size
+type 2: neighbor element is of the same size
+type 3: neighbor element is smaller in size
+
+7. Nonconforming face
+defined on an element
+face of type 3
+
+8. Conforming face
+defined on an element
+face of type 0, 1, or 2
+
+9. Nonconforming edge
+defined on an element
+The whole edge is shared by a face (of any element) of type 3. 
+
+10. Conforming edge
+defined on an element
+All elements sharing this edge are of the same or larger size. 
+ 
+11. Mortar location
+defined on each face
+range (1,1) (1,2) (2,1) (2,2)
+order:
+Each face has either 1 or 4 pieces (2 by 2 in each direction) of mortar. 
+If it has one piece of mortar, the mortar location is (1,1).
+If a face has four pieces of mortar, (1,1) refers to the left_bottom,
+(1,2) refers to right_bottom, (2,1) refers to left_top, (2,2) refers to
+right_top.
+
+
+13. When nonconforming edges exist on a conforming face, the ii and jj
+used in idmo(i,j,ii,jj,face,iel) for the nonconforming edge are the same
+as if the face were nonconforming. E.g., if local edge 2 is nonconforming,
+the mortar indices on this edge are idmo(i,j,1,2,face,iel) (1<j<5, i=lx1) and
+idmo(i,j,2,2,face,iel) (1<j<5, i=lx1), although for i <> lx1, idmo(i,j,1,2,face,iel) and idmo(i,j,2,2,face,iel) = 0 (do not exist).
+
+14. edge mortar index for nonconforming edge.
+defined on edge
+range 1,2
+order:
+local edge 1 and 3: left, right
+local edge 2 and 4: bottom, top
+Note: for a conforming edge it is always 1
+
+
+Some important arrays:
+1. mortar index
+idmo(i,j,ii,jj,face,iel) gives the mortar index number. iel is the element index,
+face is the face index, (ii,jj) is the mortar location, and (i,j) is the collocation point location on that piece of mortar. There are 5 by 5 collocation points on each piece of
+mortar.
+
+Note: the indices that refer to local corners on each face are fixed no matter how many
+pieces of mortar there are on each face.
+for local corner 1, its mortar number is idmo(1  ,  1,1,1,face,iel)
+for local corner 2, its mortar number is idmo(LX1,  1,1,2,face,iel)
+for local corner 3, its mortar number is idmo(1  ,LX1,2,1,face,iel)
+for local corner 4, its mortar number is idmo(LX1,LX1,2,2,face,iel)
+
+2. sje(ii,jj,face,iel)
+records the neighbor element index of iel, neighbored by face "face"
+ii,jj are defined the same as in  idmo()
+sje()=0 refers to no neighbor
+for conforming face, only when ii=1 and jj=1, sje() <>0
+
+3. ijel(n,face,iel)
+if iel's neighbor on face "face" is jel, and the corresponding face on jel is jface,
+then sje(ii,jj,jface,jel)=iel,
+and ijel(1,face,iel)=ii and ijel(2,face,iel)=jj
+
+4. tree(iel) records the refinement history to get iel. When iel gets refined,
+its eight children have tree(children 1)=tree()+000, 
+tree(children 2)=tree()+001, 
+tree(children 3)=tree()+010, 
+tree(children 4)=tree()+011, 
+tree(children 5)=tree()+100, 
+tree(children 6)=tree()+101, 
+tree(children 7)=tree()+110, 
+tree(children 8)=tree()+111
+
+5. xc(i,iel) records the x coordinates of the i'th vertex of element iel
+   yc(i,iel) records the y coordinates of the i'th vertex of element iel
+   zc(i,iel) records the z coordinates of the i'th vertex of element iel
+
+6. cbc(i,iel) records the type of face i of element iel
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/adapt.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/adapt.f
new file mode 100644
index 0000000..a894ea2
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/adapt.f
@@ -0,0 +1,1160 @@
+c-----------------------------------------------------------
+      subroutine adaptation (ifmortar,step)
+c-----------------------------------------------------------
+c     For 3-D mesh adaptation (refinement + coarsening)
+c-----------------------------------------------------------
+      include 'header.h'
+      
+      logical if_coarsen,if_refine,ifmortar,ifrepeat
+      integer iel,miel,irefine,icoarsen,neltold,step
+
+      if (timeron) call timer_start(t_adaptation)
+      ifmortar=.false.
+c.....compute heat source center(x0,y0,z0)
+      x0=x00+velx*time
+      y0=y00+vely*time
+      z0=z00+velz*time
+
+c.....Search elements to be refined. Check with restrictions. Perform
+c     refinement repeatedly until all desired refinements are done.
+
+c.....ich(iel)=0 no grid change on element iel
+c.....ich(iel)=2 iel is marked to be coarsened
+c.....ich(iel)=4 iel is marked to be refined
+
+c.....irefine records how many elements got refined
+      irefine=0
+
+c.....check whether elements need to be refined because they overlap
+c     with the heat source
+4     call find_refine(if_refine)
+
+      if(if_refine) then
+        ifrepeat=.true.
+2       if(ifrepeat) then
+c.........Check with restriction, unmark elements that cannot be refined.
+c         Elements preventing desired refinement will be marked to be refined.
+          call check_refine(ifrepeat) 
+          go to 2
+        end if
+c.......perform refinement
+        call do_refine(ifmortar,irefine)
+        goto 4
+      endif
+
+c.....Search for elements to be coarsened. Check with restrictions,
+c     Perform coarsening repeatedly until all possible coarsening
+c     is done.
+
+c.....icoarsen records how many elements got coarsened 
+      icoarsen=0
+
+c.....skip(iel)=.true. indicates an element no longer exists (because it
+c     got merged)
+      call l_init(skip,nelt,.false.)
+
+      neltold=nelt
+
+c.....Check whether elements need to be coarsened because they don't have
+c     overlap with the heat source. Only elements that don't have a larger 
+c     size neighbor can be marked to be coarsened
+
+5     call find_coarsen(if_coarsen,neltold)
+
+      if(if_coarsen) then
+c.......Perform coarsening, subject to restrictions. Only possible
+c       coarsening will be performed. if_coarsen=.true. indicates that
+c       actual coarsening happened
+        call do_coarsen(if_coarsen,icoarsen,neltold)
+        if(if_coarsen) then
+c.........ifmortar=.true. indicates the grid changed, i.e. the mortar point
+c         indices need to be regenerated on the new grid.
+          ifmortar=.true.
+          go to 5
+        end if 
+      end if
+
+      write(*,1000) step, irefine, icoarsen, nelt
+ 1000 format('Step ',i4, ': elements refined, merged, total:',
+     &       i6, 1X , i6, 1X, i6)
+
+c.....mt_to_id(miel) takes as argument the morton index and returns the actual
+c                    element index
+c.....id_to_mt(iel)  takes as argument the actual element index and returns the
+c                    morton index
+      do miel=1,nelt
+        iel=mt_to_id(miel)
+        id_to_mt(iel)=miel
+      end do 
+
+c.....Reorder the elements in the order of the morton curve. After the move
+c     subroutine the element indices are the same as the morton indices
+      call move
+
+c.....if the grid changed, regenerate mortar indices and update variables
+c     associated with the grid.
+      if (ifmortar) then
+        call mortar
+        call prepwork
+      endif
+      if (timeron) call timer_stop(t_adaptation)
+
+      return
+      end 
+
+
+c-----------------------------------------------------------
+      subroutine do_coarsen(if_coarsen,icoarsen,neltold)
+c---------------------------------------------------------------
+c     Coarsening procedure: 
+c     1) check with restrictions
+c     2) perform coarsening
+c---------------------------------------------------------------
+
+      include 'header.h'
+
+      logical if_coarsen, icheck,test,test1,test2,test3
+      integer iel, ntp(8), ntempmin, ic, parent, mielnew, miel,
+     &        icoarsen, ix, i, index, num_coarsen, ntemp, ii, ntemp1, 
+     &        neltold
+      
+      if_coarsen=.false.
+
+c.....If an element has been merged, it will be skipped afterwards
+c     skip(iel)=.true. for elements that will be skipped.
+c     ifcoa_id(iel)=.true. indicates that element iel will be coarsened
+c     ifcoa(miel)=.true. indicates that element miel (morton index)
+c                        will be coarsened
+
+      call ncopy(mt_to_id_old,mt_to_id,nelt)
+      call nr_init(mt_to_id,nelt,0)
+      call l_init(ifcoa_id,neltold,.false.)
+
+c.....Check whether the potential coarsening would make a neighbor,
+c     or a neighbor's neighbor, etc., break the grid restriction
+
+      do miel=1,nelt
+        ifcoa(miel)=.false.
+        front(miel)=0
+        iel=mt_to_id_old(miel)
+c.......if an element is marked to be coarsened
+        if(ich(iel).eq.2) then
+
+c.........If the current element is the "first" child (front-left-
+c         bottom) of its parent (tree(iel) mod 8 equals 0), then
+c         find all its neighbors. Check whether they are from the same 
+c         parent.
+
+          ic=tree(iel)
+          if(.not.btest(ic,0).and..not.btest(ic,1).and.
+     &       .not.btest(ic,2)) then
+            ntp(1)=iel
+            ntp(2)=sje(1,1,1,iel)
+            ntp(3)=sje(1,1,3,iel)
+            ntp(4)=sje(1,1,1,ntp(3))
+            ntp(5)=sje(1,1,5,iel)
+            ntp(6)=sje(1,1,1,ntp(5))
+            ntp(7)=sje(1,1,3,ntp(5))
+            ntp(8)=sje(1,1,1,ntp(7))
+ 
+            parent=ishft(tree(iel),-3)
+            test=.false.
+
+            test1=.true.
+            do i=1,8
+              if(ishft(tree(ntp(i)),-3).ne.parent)test1=.false.
+            end do
+
+c...........check whether all child elements are marked to be coarsened
+            if(test1)then
+              test2=.true.
+              do i=1,8
+                if(ich(ntp(i)).ne.2)test2=.false.
+              end do
+
+c.............check whether all child elements can be coarsened or not.
+              if(test2)then
+                test3=.true.
+                do i=1,8
+                  if(.not.icheck(ntp(i),i))test3=.false.
+                end do
+                if(test3)test=.true.
+              end if
+            end if
+c...........if the eight child elements are eligible to be coarsened
+c           mark the first child ifcoa(miel)=.true.
+c           mark them all ifcoa_id()=.true.
+c           front(miel) will be used to calculate (potentially in parallel)
+c                       how many elements with sequence numbers less than
+c                       miel will be coarsened.
+c           skip()      marks that an element will no longer exist after merge.
+
+            if(test)then
+
+              ifcoa(miel)=.true.
+              do i=1,8
+                ifcoa_id(ntp(i))=.true.
+              end do
+              front(miel)=1
+              do i=1,7
+                 skip(ntp(i+1))=.true.
+              end do
+              if_coarsen=.true.
+            end if
+          end if 
+        end if 
+      end do 
+
+c.....compute front(iel), how many elements will be coarsened before iel
+c     (including iel)
+      call parallel_add(front)
+
+c.....num_coarsen is the total number of elements that will be coarsened
+      num_coarsen=front(nelt)
+
+c.....action(i) records the morton index of the i'th element (if it is an
+c     element's front-left-bottom-child) to be coarsened.
+
+c.....create array mt_to_id to convert morton index to actual element index
+      do miel=1,nelt
+        iel=mt_to_id_old(miel)
+        if(.not.skip(iel))then
+          if(ifcoa(miel))then
+            action(front(miel))=miel
+            mielnew=miel-(front(miel)-1)*7
+          else 
+            mielnew=miel-front(miel)*7
+          end if
+          mt_to_id(mielnew)=iel
+        end if
+      end do
+
+c.....perform the coarsening procedure (potentially in parallel)
+      do index=1,num_coarsen
+        miel=action(index)
+        iel=mt_to_id_old(miel)
+c.......find eight child elements to be coarsened
+        ntp(1)=iel
+        ntp(2)=sje(1,1,1,iel)
+        ntp(3)=sje(1,1,3,iel)
+        ntp(4)=sje(1,1,1,ntp(3))
+        ntp(5)=sje(1,1,5,iel)
+        ntp(6)=sje(1,1,1,ntp(5))
+        ntp(7)=sje(1,1,3,ntp(5))
+        ntp(8)=sje(1,1,1,ntp(7))
+c.......merge them to be the parent
+        call merging(ntp)
+      end do
+      nelt=nelt-num_coarsen*7
+      icoarsen=icoarsen+num_coarsen*8
+
+      return
+      end
+
+c-------------------------------------------------------
+      subroutine do_refine(ifmortar,irefine)
+c-------------------------------------------------------
+c     Refinement procedure
+c--------------------------------------------------------
+
+      include 'header.h'
+
+      logical ifmortar
+      double precision xctemp(8), yctemp(8), zctemp(8), xleft, xright,
+     &       yleft, yright, zleft, zright, ta1temp(lx1,lx1,lx1),
+     &       xhalf, yhalf, zhalf
+      integer iel, i, ii, jj, j, jface, 
+     &        ntemp, ndir, facedir, k, le(4), ne(4), mielnew,
+     &        miel, irefine,ntemp1, num_refine, index, treetemp,
+     &        ijeltemp(2,6), sjetemp(2,2,6), n1, n2, nelttemp,
+     &        cb, cbctemp(6)
+
+c.....initialize
+
+      call ncopy(mt_to_id_old,mt_to_id,nelt)
+      call nr_init(mt_to_id,nelt,0)
+      call nr_init(action,nelt,0)
+      do miel=1,nelt
+        if(ich(mt_to_id_old(miel)).ne.4)then
+          front(miel)=0
+        else
+          front(miel)=1
+        end if
+      end do
+
+c.....front(iel) records how many elements with sequence numbers less than
+c     or equal to iel will be refined
+      call parallel_add(front)
+
+c.....num_refine is the total number of elements that will be refined
+      num_refine=front(nelt)
+
+c.....action(i) records the morton index of the i'th element to be refined
+      do miel=1,nelt
+        iel=mt_to_id_old(miel)
+        if(ich(iel).eq.4)then
+          action(front(miel))=miel
+        end if
+      end do
+
+c.....Compute array mt_to_id to convert the morton index to element index.
+c     ref_front_id(iel) records how many elements with index less than
+c     iel (actual element index, not morton index), will be refined.
+      do miel=1,nelt
+        iel=mt_to_id_old(miel)
+        if(ich(iel).eq.4)then
+          ntemp=(front(miel)-1)*7
+          mielnew=miel+ntemp
+        else
+          ntemp=front(miel)*7
+          mielnew=miel+ntemp
+        end if
+
+        mt_to_id(mielnew)=iel
+        ref_front_id(iel)=nelt+ntemp
+      end do
+
+
+c.....Perform refinement (potentially in parallel): 
+c       - Cut an element into eight children.
+c       - Assign them element indices iel, nelt+1,...,nelt+7.
+c       - Update neighboring information.
+
+      nelttemp=nelt
+
+      if (num_refine .gt. 0) then
+        ifmortar=.true.
+      endif
+
+      do index=1, num_refine  
+c.......miel is old morton index and mielnew is new morton index after refinement.
+        miel=action(index)
+        mielnew=miel+(front(miel)-1)*7
+        iel=mt_to_id_old(miel) 
+        nelt=nelttemp+(front(miel)-1)*7 
+c.......save iel's information in a temporary array
+        treetemp=tree(iel)
+        call copy(xctemp,xc(1,iel),8)
+        call copy(yctemp,yc(1,iel),8)
+        call copy(zctemp,zc(1,iel),8)
+        call ncopy(cbctemp,cbc(1,iel),6)
+        call ncopy(ijeltemp,ijel(1,1,iel),12)
+        call ncopy(sjetemp,sje(1,1,1,iel),24)
+        call copy(ta1temp,ta1(1,1,1,iel),nxyz)
+
+c.......zero out iel here
+        
+        tree(iel)=0
+        call nr_init(cbc(1,iel),6,0)
+        call nr_init(sje(1,1,1,iel),24,0)
+        call nr_init(ijel(1,1,iel),12,0)
+        call r_init(ta1(1,1,1,iel),nxyz,0.d0)
+
+c.......initialize new child elements:iel and nelt+1~nelt+7
+        do j=1,7
+          mt_to_id(mielnew+j)=nelt+j
+          tree(nelt+j)=0
+          call nr_init(cbc(1,nelt+j),6,0)
+          call nr_init(sje(1,1,1,nelt+j),24,0)
+          call nr_init(ijel(1,1,nelt+j),12,0)
+          call r_init(ta1(1,1,1,nelt+j),nxyz,0.d0)
+        end do
+          
+c.......update the tree()
+        ntemp=ishft(treetemp,3)
+        tree(iel)=ntemp
+        do i=1,7
+          tree(nelt+i)=ntemp+mod(i,8)
+        end do   
+c.......update the children's vertices' coordinates
+        xhalf=xctemp(1)+(xctemp(2)-xctemp(1))/2.d0
+        xleft=xctemp(1)
+        xright=xctemp(2)
+        yhalf=yctemp(1)+(yctemp(3)-yctemp(1))/2.d0
+        yleft=yctemp(1)
+        yright=yctemp(3)
+        zhalf=zctemp(1)+(zctemp(5)-zctemp(1))/2.d0
+        zleft=zctemp(1)
+        zright=zctemp(5)
+       
+        do j=1,7,2
+          do i=1,7,2
+            xc(i,nelt+j)     = xhalf
+            xc(i+1,nelt+j)   = xright 
+          end do
+        end do
+
+        do j=2,6,2
+          do i=1,7,2
+            xc(i,nelt+j)   = xleft
+            xc(i+1,nelt+j) = xhalf
+          end do
+        end do
+         
+        do i=1,7,2
+          xc(i,iel)=xleft
+          xc(i+1,iel)=xhalf
+        end do
+
+        do i=1,2
+          yc(i,nelt+1)=yleft
+          yc(i,nelt+4)=yleft
+          yc(i,nelt+5)=yleft
+          yc(i+4,nelt+1)=yleft
+          yc(i+4,nelt+4)=yleft
+          yc(i+4,nelt+5)=yleft
+        enddo
+        do i=3,4
+          yc(i,nelt+1)=yhalf
+          yc(i,nelt+4)=yhalf
+          yc(i,nelt+5)=yhalf
+          yc(i+4,nelt+1)=yhalf
+          yc(i+4,nelt+4)=yhalf
+          yc(i+4,nelt+5)=yhalf
+        end do
+        do j=2,3
+          do i=1,2
+            yc(i,nelt+j)=yhalf
+            yc(i,nelt+j+4)=yhalf
+            yc(i+4,nelt+j)=yhalf
+            yc(i+4,nelt+j+4)=yhalf
+          end do
+          do i=3,4
+            yc(i,nelt+j)=yright
+            yc(i,nelt+j+4)=yright
+            yc(i+4,nelt+j)=yright
+            yc(i+4,nelt+j+4)=yright
+          end do
+        end do
+          
+        do i=1,2
+          yc(i,iel)=yleft
+          yc(i+4,iel)=yleft
+        end do
+        do i=3,4
+          yc(i,iel)=yhalf
+          yc(i+4,iel)=yhalf
+        end do
+
+        do j=1,3
+          do i=1,4
+            zc(i,nelt+j)=zleft
+            zc(i+4,nelt+j)=zhalf
+          end do
+        end do
+        do j=4,7
+          do i=1,4
+            zc(i,nelt+j)=zhalf
+            zc(i+4,nelt+j)=zright
+          end do
+        end do
+        do i=1,4
+          zc(i,iel)=zleft
+          zc(i+4,iel)=zhalf
+        end do
+
+c.......update the children's neighbor information
+
+c.......ndir refers to the x,y,z directions, respectively.
+c       facedir refers to the orientation of the face in each direction, 
+c       e.g. ndir=1, facedir=0 refers to face 1,
+c       and ndir =1, facedir=1 refers to face 2.
+
+        do ndir = 1, 3
+          do facedir = 0, 1
+            i=2*ndir-1+facedir
+            jface=jjface(i)
+            cb=cbctemp(i)
+
+c...........find the new element indices of the four children on each
+c           face of the parent element
+            do k = 1, 4
+              le(k) = le_arr(k,facedir,ndir)+nelt
+              ne(k) = le_arr(k,1-facedir,ndir)+nelt
+            end do
+            if(facedir.eq.0)then
+              le(1)=iel
+            else
+              ne(1)=iel
+            end if
+c...........update neighbor information of the four child elements on each 
+c           face of the parent element
+            do k=1,4
+              cbc(i,le(k))=2
+              sje(1,1,i,le(k))=ne(k)
+              ijel(1,i,le(k))=1
+              ijel(2,i,le(k))=1
+            end do
+
+c...........if the face type of the parent element is type 2
+            if(cb.eq.2) then
+              ntemp=sjetemp(1,1,i)
+
+c.............if the neighbor ntemp is not marked to be refined
+              if(ich(ntemp).ne.4)then
+                cbc(jface,ntemp)=3
+                ijel(1,jface,ntemp)=1
+                ijel(2,jface,ntemp)=1
+  
+                do k=1,4
+                  cbc(i,ne(k))=1
+                  sje(1,1,i,ne(k))=ntemp
+                  if(k.eq.1) then
+                    ijel(1,i,ne(k))=1
+                    ijel(2,i,ne(k))=1
+                    sje(1,1,jface,ntemp)=ne(k)
+                  elseif(k.eq.2) then
+                    ijel(1,i,ne(k))=1
+                    ijel(2,i,ne(k))=2
+                    sje(1,2,jface,ntemp)=ne(k)
+                  elseif(k.eq.3) then
+                    ijel(1,i,ne(k))=2
+                    ijel(2,i,ne(k))=1
+                    sje(2,1,jface,ntemp)=ne(k)
+                  elseif(k.eq.4) then
+                    ijel(1,i,ne(k))=2
+                    ijel(2,i,ne(k))=2
+                    sje(2,2,jface,ntemp)=ne(k)
+                  end if
+                end do
+
+c.............if the neighbor ntemp is also marked to be refined
+              else
+                n1=ref_front_id(ntemp)
+                 
+                do k=1,4
+                  cbc(i,ne(k))=2
+                  n2=n1+le_arr(k,facedir,ndir)
+                  if(n2.eq.n1+8)n2=ntemp
+                  sje(1,1,i,ne(k))=n2
+                  ijel(1,i,ne(k))=1
+                end do
+
+              endif
+c...........if the face type of the parent element is type 3
+            elseif(cb.eq.3) then
+              do k=1,4
+                cbc(i,ne(k))=2
+                if(k.eq.1) then
+                  ntemp=sjetemp(1,1,i)
+                elseif (k.eq.2) then
+                  ntemp=sjetemp(1,2,i)
+                elseif(k.eq.3) then
+                  ntemp=sjetemp(2,1,i)
+                elseif(k.eq.4) then
+                  ntemp=sjetemp(2,2,i)
+                end if
+                ijel(1,i,ne(k))=1
+                ijel(2,i,ne(k))=1
+                sje(1,1,i,ne(k))=ntemp
+                cbc(jface,ntemp)=2
+                sje(1,1,jface,ntemp)=ne(k)
+                ijel(1,jface,ntemp)=1
+                ijel(2,jface,ntemp)=1
+              end do
+
+c...........if the face type of the parent element is type 0
+            elseif(cb.eq.0) then
+              do k=1,4
+                cbc(i,ne(k))=cb
+              end do
+            end if
+
+          end do 
+        end do 
+
+c.......map solution from parent element to children
+        call remap(ta1(1,1,1,iel),ta1(1,1,1,ref_front_id(iel)+1),
+     &             ta1temp(1,1,1))
+      end do
+
+      nelt=nelttemp+num_refine*7
+      irefine=irefine+num_refine
+      ntot=nelt*lx1*lx1*lx1
+      return
+      end
+
+c-----------------------------------------------------------
+       logical function ifcor(n1,n2,i,iface)
+c-----------------------------------------------------------
+c      returns whether element n1's face i and element n2's 
+c      jjface(iface) have intersections, i.e. whether n1 and 
+c      n2 are neighbored by an edge.
+c-----------------------------------------------------------
+
+       include 'header.h'
+
+       integer n1,n2,i,iface
+       logical ifsame
+
+       ifcor=.false.
+
+       if(ifsame(n1,e1v1(iface,i),n2,e2v1(iface,i)).or.
+     &    ifsame(n1,e1v2(iface,i),n2,e2v2(iface,i))) then
+          ifcor=.true.
+       end if
+
+       return
+       end
+
+c-----------------------------------------------------------
+      logical function icheck(ie,n)
+c-----------------------------------------------------------
+c     Check whether element ie's three faces (sharing vertex n)
+c     are nonconforming. This will prevent it from being coarsened.
+c     Also check ie's neighbors on those three faces, whether ie's
+c     neighbors by only an edge have a size smaller than ie's,
+c     which also prevents ie from being coarsened.
+c-----------------------------------------------------------
+      include 'header.h'
+
+      integer ie, n, iside, ntemp1, ntemp2, ntemp3, n1, n2, n3,
+     &        cb2_1,cb3_1,cb1_2,cb3_2,cb1_3,cb2_3
+
+      icheck=.true.
+      cb2_1=0
+      cb3_1=0
+      cb1_2=0
+      cb3_2=0
+      cb1_3=0
+      cb2_3=0
+
+      n1=f_c(1,n)
+      n2=f_c(2,n)
+      n3=f_c(3,n)
+      if((cbc(n1,ie).eq.3) .or. (cbc(n2,ie).eq.3) .or.
+     &   (cbc(n3,ie).eq.3)) then
+         icheck=.false.
+      else
+        ntemp1=sje(1,1,n1,ie)
+        ntemp2=sje(1,1,n2,ie)
+        ntemp3=sje(1,1,n3,ie)
+        if(ntemp1.ne.0)then
+           cb2_1=cbc(n2,ntemp1)
+           cb3_1=cbc(n3,ntemp1)
+        end if
+        if(ntemp2.ne.0)then
+           cb3_2=cbc(n3,ntemp2)
+           cb1_2=cbc(n1,ntemp2)
+        end if
+        if(ntemp3.ne.0)then
+           cb1_3=cbc(n1,ntemp3)
+           cb2_3=cbc(n2,ntemp3)
+        end if
+        if((cbc(n1,ie).eq.2.and.(cb2_1.eq.3.or.
+     &                               cb3_1.eq.3)).or.
+     &     (cbc(n2,ie).eq.2.and.(cb3_2.eq.3.or.
+     &                               cb1_2.eq.3)).or.
+     &     (cbc(n3,ie).eq.2.and.(cb1_3.eq.3.or.
+     &                              cb2_3.eq.3)))then
+          icheck=.false.
+        end if
+      end if
+
+      return
+      end 
+
+c-----------------------------------------------------------
+      subroutine find_coarsen(if_coarsen,neltold)
+c-----------------------------------------------------------
+c     Search elements to be coarsened. Check with restrictions.
+c     This subroutine only checks the element itself, not its
+c     neighbors.
+c-----------------------------------------------------------
+      
+      include 'header.h'
+
+      logical if_coarsen, iftemp, iftouch
+      integer iel,i,neltold
+
+      if_coarsen=.false.
+
+      do iel=1,neltold
+        if(.not.skip(iel))then
+          ich(iel)=0
+          if(.not.iftouch(iel)) then
+            iftemp=.false.
+            do i=1,nsides
+c.............if iel has a larger size than its face neighbors, it
+c             can not be coarsened
+              if(cbc(i,iel).eq.3) then
+                iftemp=.true.
+              endif
+            enddo
+            if(.not.iftemp) then
+              if_coarsen=.true.
+              ich(iel)=2
+            end if
+          end if
+        endif
+      enddo
+
+      return
+      end
+
+c-----------------------------------------------------------
+      subroutine find_refine(if_refine)
+c-----------------------------------------------------------
+c     search elements to be refined based on whether they
+c     have overlap with the heat source
+c-----------------------------------------------------------
+
+      include 'header.h'
+
+      logical if_refine, iftouch
+      integer iel
+
+      if_refine=.false.
+
+      do iel=1,nelt
+        ich(iel)=0
+        if(iftouch(iel)) then
+          if((xc(2,iel)-xc(1,iel)).gt.dlmin) then
+            if_refine=.true.
+            ich(iel)=4
+          end if
+        end if
+      enddo
+
+      return
+      end
+
+c-----------------------------------------------------------------
+      subroutine check_refine(ifrepeat)
+c-----------------------------------------------------------------
+c     Check whether the potential refinement will violate the
+c     restriction. If so, mark the neighbor and unmark the
+c     original element, and set ifrepeat true. i.e. this procedure
+c     needs to be repeated until no further check is needed
+c-----------------------------------------------------------------
+
+      include 'header.h'
+ 
+      logical ifrepeat,ifcor
+      integer iel,iface,ntemp,nntemp,i,jface
+
+      ifrepeat=.false.
+
+      do iel=1,nelt
+c.......if iel is marked to be refined
+        if(ich(iel).eq.4) then
+c.........check its six faces
+          do i=1,nsides
+            jface=jjface(i)
+            ntemp=sje(1,1,i,iel)
+c...........if one face neighbor is larger in size than iel
+            if(cbc(i,iel).eq.1) then
+c.............unmark iel
+              ich(iel)=0
+c.............the large size neighbor ntemp is marked to be refined
+              if(ich(ntemp).ne.4) then
+                ifrepeat=.true.
+                ich(ntemp)=4
+              end if
+c.............check iel's neighbor, neighbored by an edge on face i, which
+c             must be a face neighbor of ntemp
+              do iface=1,nsides
+                if(iface.ne.i.and.iface.ne.jface) then
+c................if edge neighbors are larger than iel, mark them to be refined
+                  if(cbc(iface,ntemp).eq.2) then
+                    nntemp=sje(1,1,iface,ntemp)
+c..................ifcor is to make sure the edge neighbor exists
+                    if(ich(nntemp).ne.4.and.
+     &                 ifcor(iel,nntemp,i,iface))then
+                      ich(nntemp)=4
+                    end if
+                  end if
+                end if
+              end do
+c...........if the face neighbor is the same size as iel, check edge neighbors
+            elseif(cbc(i,iel).eq.2)then
+              do iface=1,nsides
+                if(iface.ne.i.and.iface.ne.jface) then
+                  if(cbc(iface,ntemp).eq.1)then
+                    nntemp=sje(1,1,iface,ntemp)
+                    ich(nntemp)=4
+                    ich(iel)=0
+                    ifrepeat=.true.
+                  end if
+                end if
+              end do
+            end if
+          enddo
+        end if
+      end do
+
+      return
+      end
+
+c-----------------------------------------------------------------
+      logical function iftouch(iel)
+c-----------------------------------------------------------------
+c     check whether element iel has overlap with the heat source
+c-----------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision dis, dis1, dis2, dis3, alpha2
+      integer iel
+
+      alpha2 = alpha*alpha
+
+      if     (x0 .lt. xc(1,iel)) then
+        dis1 = xc(1,iel) - x0
+      elseif (x0 .gt. xc(2,iel)) then
+        dis1 = x0 - xc(2,iel)
+      else
+        dis1 = 0.d0
+      endif
+
+      if     (y0 .lt. yc(1,iel)) then
+        dis2 = yc(1,iel) - y0
+      elseif (y0 .gt. yc(3,iel)) then
+        dis2 = y0 - yc(3,iel)
+      else
+        dis2 = 0.d0
+      endif
+
+      if     (z0 .lt. zc(1,iel)) then
+        dis3 = zc(1,iel) - z0
+      elseif (z0 .gt. zc(5,iel)) then
+        dis3 = z0 - zc(5,iel)
+      else
+       dis3 = 0.d0
+      endif
+
+      dis = dis1**2+dis2**2+dis3**2
+
+      if (dis .lt. alpha2) then
+       iftouch=.true.
+      else
+       iftouch=.false.
+      end if
+
+      return
+      end
+
+
+c-----------------------------------------------------------------
+      subroutine remap (y,y1,x) 
+c-----------------------------------------------------------------
+c     After a refinement, map the solution from the parent (x) to
+c     the eight children. y is the solution on the first child
+c     (front-bottom-left) and y1 is the solution on the next 7 
+c     children.
+c-----------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision x(lx1,lx1,lx1),y(lx1,lx1,lx1),y1(lx1,lx1,lx1,7),
+     &       yone(lx1,lx1,lx1,2), ytwo(lx1,lx1,lx1,4)
+      integer i, iz, ii, jj, kk
+
+      call r_init(y,lx1*lx1*lx1,0.d0)
+      call r_init(y1,lx1*lx1*lx1*7,0.d0)
+      call r_init(yone,lx1*lx1*lx1*2,0.d0)
+      call r_init(ytwo,lx1*lx1*lx1*4,0.d0)
+
+      do  i=1,lx1
+        do kk = 1, lx1
+          do jj = 1, lx1
+            do ii = 1, lx1
+              yone(ii,jj,i,1) = yone(ii,jj,i,1) +ixmc1(ii,kk)*x(kk,jj,i)
+              yone(ii,jj,i,2) = yone(ii,jj,i,2) +ixmc2(ii,kk)*x(kk,jj,i)
+            end do
+          end do
+        end do
+
+        do kk = 1, lx1
+          do jj = 1, lx1
+            do ii = 1, lx1
+              ytwo(ii,i,jj,1) = ytwo(ii,i,jj,1) + 
+     &                          yone(ii,kk,i,1)*ixtmc1(kk,jj)
+              ytwo(ii,i,jj,2) = ytwo(ii,i,jj,2) + 
+     &                          yone(ii,kk,i,1)*ixtmc2(kk,jj)
+              ytwo(ii,i,jj,3) = ytwo(ii,i,jj,3) + 
+     &                          yone(ii,kk,i,2)*ixtmc1(kk,jj)
+              ytwo(ii,i,jj,4) = ytwo(ii,i,jj,4) + 
+     &                          yone(ii,kk,i,2)*ixtmc2(kk,jj)
+            end do
+          end do
+        end do
+      end do
+
+      do  iz=1,lx1
+        do kk = 1, lx1
+          do jj = 1, lx1
+            do ii = 1, lx1
+              y(ii,iz,jj) = y(ii,iz,jj) +
+     &                        ytwo(ii,kk,iz,1)*ixtmc1(kk,jj)
+              y1(ii,iz,jj,1) = y1(ii,iz,jj,1) +
+     &                        ytwo(ii,kk,iz,3)*ixtmc1(kk,jj)
+              y1(ii,iz,jj,2) = y1(ii,iz,jj,2) +
+     &                        ytwo(ii,kk,iz,2)*ixtmc1(kk,jj)
+              y1(ii,iz,jj,3) = y1(ii,iz,jj,3) +
+     &                        ytwo(ii,kk,iz,4)*ixtmc1(kk,jj)
+              y1(ii,iz,jj,4) = y1(ii,iz,jj,4) +
+     &                        ytwo(ii,kk,iz,1)*ixtmc2(kk,jj)
+              y1(ii,iz,jj,5) = y1(ii,iz,jj,5) +
+     &                        ytwo(ii,kk,iz,3)*ixtmc2(kk,jj)
+              y1(ii,iz,jj,6) = y1(ii,iz,jj,6) +
+     &                        ytwo(ii,kk,iz,2)*ixtmc2(kk,jj)
+              y1(ii,iz,jj,7) = y1(ii,iz,jj,7) +
+     &                        ytwo(ii,kk,iz,4)*ixtmc2(kk,jj)            
+            end do
+          end do
+        end do
+      end do
+
+      return
+      end
+
+
+c=======================================================================
+      subroutine merging(iela)
+c-----------------------------------------------------------------------
+c     This subroutine merges the eight child elements and maps
+c     the solution from the eight children to the merged element.
+c     iela array records the eight elements to be merged.
+c-----------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision x1,x2,y1,y2,z1,z2
+      integer ielnew,i,ntemp,jface,ii,cb,ntempa(4),iela(8),ielold,
+     &        ntema(4)
+
+      ielnew=iela(1)
+
+      tree(ielnew)=ishft(tree(ielnew),-3)   
+
+c.....element vertices 
+      x1=xc(1,iela(1))
+      x2=xc(2,iela(2))
+      y1=yc(1,iela(1))
+      y2=yc(3,iela(3))
+      z1=zc(1,iela(1))
+      z2=zc(5,iela(5))
+
+      do i=1,7,2
+        xc(i,ielnew)=x1
+      end do
+      do i=2,8,2
+        xc(i,ielnew)=x2
+      end do
+      do i=1,2
+        yc(i,ielnew)=y1
+        yc(i+4,ielnew)=y1
+      end do
+      do i=3,4
+        yc(i,ielnew)=y2
+        yc(i+4,ielnew)=y2
+      end do
+      do i=1,4
+        zc(i,ielnew)=z1
+      end do
+      do i=5,8
+        zc(i,ielnew)=z2
+      end do
+
+c.....update neighboring information
+      do i=1,nsides
+        jface=jjface(i)
+        ielold=iela(children(1,i))
+        do ii=1,4
+          ntempa(ii)=iela(children(ii,i))
+        end do
+
+        cb=cbc(i,ielold)
+       
+        if (cb.eq.2) then
+c.........if the neighbor elements also will be coarsened
+          if(ifcoa_id(sje(1,1,i,ielold)))then
+            if (i.eq.2 .or. i.eq. 4 .or. i.eq.6) then
+              ntemp=sje(1,1,i,sje(1,1,i,ntempa(1)))
+            else
+              ntemp=sje(1,1,i,ntempa(1))
+            end if 
+            sje(1,1,i,ielnew)=ntemp
+            ijel(1,i,ielnew)=1
+            ijel(2,i,ielnew)=1
+            cbc(i,ielnew)=2
+
+c.........if the neighbor elements will not be coarsened
+          else
+            do ii=1,4
+              ntema(ii)=sje(1,1,i,ntempa(ii)) 
+              cbc(jface,ntema(ii))=1
+              sje(1,1,jface,ntema(ii))=ielnew
+              ijel(1,jface,ntema(ii))=iijj(1,ii)
+              ijel(2,jface,ntema(ii))=iijj(2,ii)
+              sje(iijj(1,ii),iijj(2,ii),i,ielnew)=ntema(ii)
+              ijel(1,i,ielnew)=1
+              ijel(2,i,ielnew)=1
+            end do
+            cbc(i,ielnew)=3
+          end if       
+
+        else if(cb.eq.1)then
+
+          ntemp=sje(1,1,i,ielold)
+          cbc(jface,ntemp)=2
+          ijel(1,jface,ntemp)=1
+          ijel(2,jface,ntemp)=1
+          sje(1,1,jface,ntemp)=ielnew
+          sje(1,2,jface,ntemp)=0
+          sje(2,1,jface,ntemp)=0
+          sje(2,2,jface,ntemp)=0
+           
+          cbc(i,ielnew)=2
+          ijel(1,i,ielnew)=1
+          ijel(2,i,ielnew)=1
+          sje(1,1,i,ielnew)=ntemp
+         
+        else if(cb.eq.0)then
+          cbc(i,ielnew)=0
+          sje(1,1,i,ielnew)=0
+          sje(1,2,i,ielnew)=0
+          sje(2,1,i,ielnew)=0
+          sje(2,2,i,ielnew)=0
+        endif
+
+      end do
+
+c.....map solution from children to the merged element
+      call remap2(iela, ielnew)
+      
+      return
+      end
+
+c-----------------------------------------------------------------
+      subroutine remap2(iela, ielnew)
+c-----------------------------------------------------------------
+c     Map the solution from the children to the parent.
+c     iela array records the eight elements to be merged.
+c     ielnew is the element index of the merged element.
+c-----------------------------------------------------------------
+
+      include 'header.h'
+      integer iela(8), ielnew
+
+      double precision temp1(lx1,lx1,lx1),
+     &       temp2(lx1,lx1,lx1),temp3(lx1,lx1,lx1),temp4(lx1,lx1,lx1),
+     &       temp5(lx1,lx1,lx1),temp6(lx1,lx1,lx1)
+
+      call remapx(ta1(1,1,1,iela(1)),ta1(1,1,1,iela(2)),temp1)
+      call remapx(ta1(1,1,1,iela(3)),ta1(1,1,1,iela(4)),temp2)
+      call remapx(ta1(1,1,1,iela(5)),ta1(1,1,1,iela(6)),temp3)
+      call remapx(ta1(1,1,1,iela(7)),ta1(1,1,1,iela(8)),temp4)
+      call remapy(temp1,temp2,temp5)
+      call remapy(temp3,temp4,temp6)
+      call remapz(temp5,temp6,ta1(1,1,1,ielnew))
+
+      return
+      end       
+
+c-----------------------------------------------------------------
+      subroutine remapz(x1,x2,y)
+c-----------------------------------------------------------------
+c     z direction mapping after the merge.
+c     Map solution from x1 & x2 to y.
+c-----------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision x1(lx1,lx1,lx1),x2(lx1,lx1,lx1),y(lx1,lx1,lx1)
+      integer ix, iy, ip
+
+      do iy=1,lx1
+        do ix=1,lx1
+          y(ix,iy,1)=x1(ix,iy,1)
+
+          y(ix,iy,2)=0.d0
+          do ip=1,lx1
+            y(ix,iy,2)=y(ix,iy,2)+map2(ip)*x1(ix,iy,ip)
+          end do
+
+          y(ix,iy,3)=x1(ix,iy,lx1)
+
+          y(ix,iy,4)=0.d0
+          do ip=1,lx1
+            y(ix,iy,4)=y(ix,iy,4)+map4(ip)*x2(ix,iy,ip)
+          end do
+
+          y(ix,iy,lx1)=x2(ix,iy,lx1)
+        end do
+      end do
+
+      return
+      end      
+
+c-----------------------------------------------------------------
+      subroutine remapy(x1,x2,y)
+c-----------------------------------------------------------------
+c     y direction mapping after the merge.
+c     Map solution from x1 & x2 to y.
+c-----------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision x1(lx1,lx1,lx1),x2(lx1,lx1,lx1),y(lx1,lx1,lx1)
+      integer ix, iz, ip
+
+      do iz=1,lx1
+        do ix=1,lx1
+          y(ix,1,iz)=x1(ix,1,iz)
+
+          y(ix,2,iz)=0.d0
+          do ip=1,lx1
+            y(ix,2,iz)=y(ix,2,iz)+map2(ip)*x1(ix,ip,iz)
+          end do
+
+          y(ix,3,iz)=x1(ix,lx1,iz)
+
+          y(ix,4,iz)=0.d0
+          do ip=1,lx1
+            y(ix,4,iz)=y(ix,4,iz)+map4(ip)*x2(ix,ip,iz)
+          end do
+
+          y(ix,lx1,iz)=x2(ix,lx1,iz)
+        end do
+      end do
+
+      return
+      end      
+
+c-----------------------------------------------------------------
+      subroutine remapx(x1,x2,y)
+c-----------------------------------------------------------------
+c     x direction mapping after the merge.
+c     Map solution from x1 & x2 to y.
+c-----------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision x1(lx1,lx1,lx1),x2(lx1,lx1,lx1),y(lx1,lx1,lx1)
+      integer iy, iz, ip
+
+      do iz=1,lx1
+        do iy=1,lx1
+          y(1,iy,iz)=x1(1,iy,iz)
+
+          y(2,iy,iz)=0.d0
+          do ip=1,lx1
+            y(2,iy,iz)=y(2,iy,iz)+map2(ip)*x1(ip,iy,iz)
+          end do
+
+          y(3,iy,iz)=x1(lx1,iy,iz)
+
+          y(4,iy,iz)=0.d0
+          do ip=1,lx1
+            y(4,iy,iz)=y(4,iy,iz)+map4(ip)*x2(ip,iy,iz)
+          end do
+
+          y(lx1,iy,iz)=x2(lx1,iy,iz)
+        end do
+      end do
+
+      return
+      end      
+       
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/convect.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/convect.f
new file mode 100644
index 0000000..b062223
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/convect.f
@@ -0,0 +1,219 @@
+c---------------------------------------------------------
+      subroutine convect(ifmortar)  
+c---------------------------------------------------------
+c     Advance the convection term using 4th order RK
+c     1.ta1 is solution from last time step 
+c     2.the heat source is considered part of d/dx
+c     3.trhs is right hand side for the diffusion equation
+c     4.tmor is solution on mortar points, which will be used
+c       as the initial guess when advancing the diffusion term 
+c---------------------------------------------------------
+
+      include 'header.h'
+
+      double precision alpha2, tempa(lx1,lx1,lx1), rdtime, pidivalpha, 
+     &       sixth, dtx1, dtx2, dtx3, src, rk1(lx1,lx1,lx1), 
+     &       rk2(lx1,lx1,lx1), rk3(lx1,lx1,lx1), rk4(lx1,lx1,lx1), 
+     &       temp(lx1,lx1,lx1), subtime(3), xx0(3), yy0(3), zz0(3), 
+     &       dtime2, r2, sum, xloc(lx1), yloc(lx1), zloc(lx1)
+      integer k,iel,i,j,iside,isize, substep, ip
+      logical ifmortar
+      parameter (sixth=1.d0/6.d0)
+
+      if (timeron) call timer_start(t_convect)
+      pidivalpha = dacos(-1.d0)/alpha
+      alpha2     = alpha*alpha
+      dtime2     = dtime/2.d0 
+      rdtime     = 1.d0/dtime
+      subtime(1) = time
+      subtime(2) = time+dtime2
+      subtime(3) = time+dtime
+      do substep = 1, 3
+        xx0(substep) = x00+velx*subtime(substep)
+        yy0(substep) = y00+vely*subtime(substep)
+        zz0(substep) = z00+velz*subtime(substep)
+      end do
+
+
+      do iel = 1, nelt
+        isize=size_e(iel)
+c.......xloc(i) is the location of the i'th collocation point in x direction in an element.
+c       yloc(j) is the location of the j'th collocation point in y direction in an element.
+c       zloc(k) is the location of the k'th collocation point in z direction in an element.
+        do i = 1, lx1
+          xloc(i) = xfrac(i)*(xc(2,iel)-xc(1,iel))+xc(1,iel)
+        end do
+        do j = 1, lx1
+          yloc(j) = xfrac(j)*(yc(4,iel)-yc(1,iel))+yc(1,iel)
+        end do
+        do k = 1, lx1
+          zloc(k) = xfrac(k)*(zc(5,iel)-zc(1,iel))+zc(1,iel)
+        end do
+
+        do k = 1, lx1
+          do j = 1, lx1
+            do i = 1, lx1
+              r2 = (xloc(i)-xx0(1))**2+(yloc(j)-yy0(1))**2+
+     &             (zloc(k)-zz0(1))**2
+              if (r2.le.alpha2) then
+                src = dcos(dsqrt(r2)*pidivalpha)+1.d0
+              else
+                src = 0.d0
+              endif
+              sum = 0.d0
+              do ip = 1, lx1
+                sum = sum + dxm1(i,ip) * ta1(ip,j,k,iel)
+              end do
+              dtx1 = -velx*sum*xrm1_s(i,j,k,isize)
+              sum  = 0.d0
+              do ip = 1, lx1
+                sum = sum + dxm1(j,ip) * ta1(i,ip,k,iel)
+              end do
+              dtx2=-vely*sum*xrm1_s(i,j,k,isize)
+              sum = 0.d0
+              do ip = 1, lx1
+                sum = sum + dxm1(k,ip) * ta1(i,j,ip,iel)
+              end do
+              dtx3=-velz*sum*xrm1_s(i,j,k,isize)
+
+              rk1(i,j,k)= dtx1 + dtx2 + dtx3 + src
+              temp(i,j,k)=ta1(i,j,k,iel)+dtime2*rk1(i,j,k)
+
+            end do
+          end do
+        end do        
+
+        do k = 1, lx1
+          do j = 1, lx1
+            do i = 1, lx1
+              r2 = (xloc(i)-xx0(2))**2 + (yloc(j)-yy0(2))**2 +
+     &             (zloc(k)-zz0(2))**2
+              if (r2.le.alpha2) then
+                src = dcos(dsqrt(r2)*pidivalpha)+1.d0
+              else
+                src = 0.d0
+              endif
+              sum = 0.d0
+              do ip = 1, lx1
+                sum = sum + dxm1(i,ip) * temp(ip,j,k)
+              end do
+              dtx1 =-velx*sum*xrm1_s(i,j,k,isize)
+              sum = 0.d0
+              do ip = 1, lx1
+                sum = sum + dxm1(j,ip) * temp(i,ip,k)
+              end do
+              dtx2 =-vely*sum*xrm1_s(i,j,k,isize)
+              sum = 0.d0
+              do ip = 1, lx1
+                sum = sum + dxm1(k,ip) * temp(i,j,ip)
+              end do
+              dtx3 =-velz*sum*xrm1_s(i,j,k,isize)
+
+              rk2(i,j,k)= dtx1 + dtx2 + dtx3 + src
+              tempa(i,j,k)=ta1(i,j,k,iel)+dtime2*rk2(i,j,k)
+            end do
+          end do
+        end do        
+
+        do k = 1, lx1
+          do j = 1, lx1
+            do i = 1, lx1
+              r2 = (xloc(i)-xx0(2))**2 + (yloc(j)-yy0(2))**2 +
+     &             (zloc(k)-zz0(2))**2
+              if (r2.le.alpha2) then
+                src = dcos(dsqrt(r2)*pidivalpha)+1.d0
+              else
+                src = 0.d0
+              endif
+              sum = 0.d0
+              do ip = 1, lx1
+                sum = sum + dxm1(i,ip) * tempa(ip,j,k)
+              end do
+              dtx1 =-velx*sum*xrm1_s(i,j,k,isize)
+              sum = 0.d0
+              do ip = 1, lx1
+                sum = sum + dxm1(j,ip) * tempa(i,ip,k)
+              end do
+              dtx2 =-vely*sum*xrm1_s(i,j,k,isize)
+              sum = 0.d0
+              do ip = 1, lx1
+                sum = sum + dxm1(k,ip) * tempa(i,j,ip)
+              end do
+              dtx3 =-velz*sum*xrm1_s(i,j,k,isize)
+
+              rk3(i,j,k)= dtx1 + dtx2 + dtx3 + src
+              temp(i,j,k)=ta1(i,j,k,iel)+dtime*rk3(i,j,k)
+            end do
+          end do
+        end do        
+
+        do k = 1, lx1
+          do j = 1, lx1
+            do i = 1, lx1
+              r2 = (xloc(i)-xx0(3))**2 + (yloc(j)-yy0(3))**2 +
+     &             (zloc(k)-zz0(3))**2
+              if (r2.le.alpha2) then
+                src = dcos(dsqrt(r2)*pidivalpha)+1.d0
+              else
+                src = 0.d0
+              endif
+              sum = 0.d0
+              do ip = 1, lx1
+                sum = sum + dxm1(i,ip) * temp(ip,j,k)
+              end do
+              dtx1 =-velx*sum*xrm1_s(i,j,k,isize)
+              sum = 0.d0
+              do ip = 1, lx1
+                sum = sum + dxm1(j,ip) * temp(i,ip,k)
+              end do
+              dtx2 =-vely*sum*xrm1_s(i,j,k,isize)
+              sum = 0.d0
+              do ip = 1, lx1
+                sum = sum + dxm1(k,ip) * temp(i,j,ip)
+              end do
+              dtx3 =-velz*sum*xrm1_s(i,j,k,isize)
+
+              rk4(i,j,k)= dtx1 + dtx2 + dtx3 + src
+              tempa(i,j,k)=sixth*(rk1(i,j,k)+2.d0*
+     &                   rk2(i,j,k)+2.d0*rk3(i,j,k)+rk4(i,j,k))
+            end do
+          end do
+        end do        
+
+c.......apply boundary condition
+        do iside=1,nsides
+          if(cbc(iside,iel).eq.0)then
+            call facev(tempa,iside,0.0d0)
+          end if
+        end do
+          
+        do k=1,lx1
+          do j=1,lx1
+            do i=1,lx1
+              trhs(i,j,k,iel)=bm1_s(i,j,k,isize)*(ta1(i,j,k,iel)*rdtime+
+     &                        tempa(i,j,k))
+              ta1(i,j,k,iel)=ta1(i,j,k,iel)+tempa(i,j,k)*dtime
+            end do
+          end do
+        end do
+
+      end do 
+
+c.....get mortar for initial guess for CG
+
+      if (timeron) call timer_start(t_transfb_c)
+      if(ifmortar)then
+        call transfb_c_2(ta1)
+      else
+        call transfb_c(ta1)
+      end if
+      if (timeron) call timer_stop(t_transfb_c)
+
+      do i=1,nmor
+       tmort(i)=tmort(i)/mormult(i)
+      end do
+      if (timeron) call timer_stop(t_convect)
+
+      return
+      end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/diffuse.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/diffuse.f
new file mode 100644
index 0000000..3cab86c
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/diffuse.f
@@ -0,0 +1,214 @@
+c---------------------------------------------------------------------
+      subroutine diffusion(ifmortar)      
+c---------------------------------------------------------------------
+c     advance the diffusion term using CG iterations
+c---------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision  rho_aux, rho1, rho2, beta, cona
+      logical ifmortar
+      integer iter,ie, im,iside,i,j,k
+
+      if (timeron) call timer_start(t_diffusion)
+c.....set up diagonal preconditioner
+      if (ifmortar) then
+        call setuppc
+        call setpcmo
+      end if
+
+c.....arrays t and umor are accumulators of (am pm) in the CG algorithm
+c     (see the specification)
+
+      call r_init(t,ntot,0.d0)
+      call r_init(umor,nmor,0.d0)
+
+c.....calculate initial am (see specification) in CG algorithm
+
+c.....trhs and rmor are combined to generate r0 in CG algorithm.
+c     pdiff and pmorx are combined to generate q0 in the CG algorithm.
+c     rho1 is  (qm,rm) in the CG algorithm.
+
+      rho1 = 0.d0
+      do ie=1,nelt
+        do k=1,lx1
+          do j=1,lx1
+            do i=1,lx1
+              pdiff(i,j,k,ie) = dpcelm(i,j,k,ie)*trhs(i,j,k,ie)
+              rho1            = rho1 + trhs(i,j,k,ie)*pdiff(i,j,k,ie)*
+     &                                          tmult(i,j,k,ie)
+            end do
+          end do
+        end do
+      end do
+
+      do im = 1, nmor
+        pmorx(im) = dpcmor(im)*rmor(im)
+        rho1      = rho1 + rmor(im)*pmorx(im)
+      end do
+
+c.................................................................
+c     commence conjugate gradient iteration
+c.................................................................
+
+      do iter=1, nmxh
+        if(iter.gt.1) then 
+          rho_aux = 0.d0
+c.........pdiffp and ppmor are combined to generate q_m+1 in the specification
+c         rho_aux is (q_m+1,r_m+1)
+          do ie = 1, nelt
+            do k=1,lx1
+              do j=1,lx1
+                do i=1,lx1
+                  pdiffp(i,j,k,ie) = dpcelm(i,j,k,ie)*trhs(i,j,k,ie)
+                  rho_aux =rho_aux+trhs(i,j,k,ie)*pdiffp(i,j,k,ie)*
+     &                                            tmult(i,j,k,ie)
+                end do
+              end do
+            end do
+          end do
+
+          do im = 1, nmor
+            ppmor(im) = dpcmor(im)*rmor(im)
+            rho_aux = rho_aux + rmor(im)*ppmor(im)
+          end do
+
+c.........compute bm (beta) in the specification
+          rho2 = rho1
+          rho1 = rho_aux
+          beta = rho1/rho2
+c.........update p_m+1 in the specification
+          call adds1m1(pdiff, pdiffp, beta,ntot)
+          call adds1m1(pmorx, ppmor,  beta, nmor)  
+        end if
+ 
+c.......compute matrix vector product: (theta pm) in the specification
+
+        if (timeron) call timer_start(t_transf)
+        call transf(pmorx,pdiff) 
+        if (timeron) call timer_stop(t_transf)
+
+c.......compute pdiffp which is (A theta pm) in the specification
+        do ie=1, nelt
+          call laplacian(pdiffp(1,1,1,ie),pdiff(1,1,1,ie),size_e(ie))
+        end do
+
+c.......compute ppmor which will be used to compute (thetaT A theta pm) 
+c       in the specification
+        if (timeron) call timer_start(t_transfb)
+        call transfb(ppmor,pdiffp) 
+        if (timeron) call timer_stop(t_transfb)
+ 
+c.......apply boundary condition
+        do ie=1,nelt
+          do iside=1,nsides
+            if(cbc(iside,ie).eq.0)then
+              call facev(pdiffp(1,1,1,ie),iside,0.d0)
+            end if
+          end do
+        end do
+
+c.......compute cona which is (pm,theta T A theta pm)
+        cona = 0.d0
+        do ie = 1, nelt
+          do k=1,lx1
+            do j=1,lx1
+              do i=1,lx1
+                cona = cona + pdiff(i,j,k,ie)*
+     &                 pdiffp(i,j,k,ie)*tmult(i,j,k,ie)
+              end do 
+             end do 
+          end do 
+        end do 
+
+        do im = 1, nmor
+          ppmor(im) = ppmor(im)*tmmor(im)
+          cona = cona + pmorx(im)*ppmor(im)
+        end do
+
+c.......compute am
+        cona = rho1/cona
+c.......compute (am pm)
+        call adds2m1(t,    pdiff,   cona, ntot)
+        call adds2m1(umor, pmorx,   cona, nmor) 
+c.......compute r_m+1
+        call adds2m1(trhs, pdiffp, -cona, ntot)
+        call adds2m1(rmor, ppmor,  -cona, nmor) 
+ 
+      end do
+
+      if (timeron) call timer_start(t_transf)
+      call transf(umor,t)  
+      if (timeron) call timer_stop(t_transf)
+      if (timeron) call timer_stop(t_diffusion)
+
+      return
+      end
+
+
+c------------------------------------------------------------------
+      subroutine laplacian(r,u,sizei)
+c------------------------------------------------------------------
+c     compute  r = visc*[A]x +[B]x on a given element.
+c------------------------------------------------------------------
+      include 'header.h'
+
+      double precision r(lx1,lx1,lx1), u(lx1,lx1,lx1), rdtime
+      integer i,j,k, ix,iz, sizei
+
+      double precision tm1(lx1,lx1,lx1),tm2(lx1,lx1,lx1)                     
+
+      rdtime = 1.d0/dtime
+
+      call r_init(tm1,nxyz,0.d0)
+
+      do iz=1,lx1                     
+        do k = 1, lx1
+          do j = 1, lx1
+            do i = 1, lx1
+              tm1(i,j,iz) = tm1(i,j,iz)+wdtdr(i,k)*u(k,j,iz)
+            end do
+          end do
+        end do                           
+      end do
+              
+      call r_init(tm2,nxyz,0.d0)                                                   
+      do iz=1,lx1                                            
+        do k = 1, lx1
+          do j = 1, lx1
+            do i = 1, lx1
+              tm2(i,j,iz) = tm2(i,j,iz)+u(i,k,iz)*wdtdr(k,j)
+            end do
+          end do
+        end do
+      end do
+                                                            
+      call r_init(r,nxyz,0.d0)   
+      do k = 1, lx1
+        do iz=1, lx1    
+          do j = 1, lx1
+            do i = 1, lx1
+              r(i,j,iz) = r(i,j,iz)+u(i,j,k)*wdtdr(k,iz)
+            end do
+          end do
+        end do
+      end do
+
+c.....collocate with remaining weights and sum to complete factorization.                   
+                                                      
+      do k=1,lx1
+        do j=1,lx1
+          do i=1,lx1
+            r(i,j,k)=visc*(tm1(i,j,k)*g4m1_s(i,j,k,sizei)+
+     &                   tm2(i,j,k)*g5m1_s(i,j,k,sizei)+
+     &                    r(i,j,k)*g6m1_s(i,j,k,sizei))+
+     &               bm1_s(i,j,k,sizei)*rdtime*u(i,j,k)             
+          end do
+        end do
+      end do
+
+      return                                                  
+      end                                                    
+
+
+ 
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/header.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/header.h
new file mode 100644
index 0000000..df3f26b
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/header.h
@@ -0,0 +1,182 @@
+      implicit none
+
+      include 'npbparams.h'
+
+c.....Array dimensions     
+      integer lx1, lnje, nsides, nxyz
+      parameter(lx1=5, lnje=2, nsides=6,  nxyz=lx1*lx1*lx1)
+
+      integer fre, niter, nmxh
+      double precision alpha, dlmin, dtime
+      common /usrdati/ fre, niter, nmxh
+      common /usrdatr/ alpha, dlmin, dtime
+
+      integer nelt, ntot, nmor, nvertex
+      common /dimn/ nelt,ntot, nmor, nvertex
+
+      double precision x0, y0, z0, time
+      common /bench1/time, x0, y0, z0
+
+      double precision velx, vely, velz, visc, x00, y00, z00
+      parameter(velx=3.d0, vely=3.d0, velz=3.d0)
+      parameter(visc=0.005d0)
+      parameter(x00=3.d0/7.d0, y00=2.d0/7.d0, z00=2.d0/7.d0)
+
+c.....double precision arrays associated with collocation points
+      double precision
+     &       ta1  (lx1,lx1,lx1,lelt), ta2   (lx1,lx1,lx1,lelt),
+     &       trhs (lx1,lx1,lx1,lelt), t     (lx1,lx1,lx1,lelt), 
+     &       tmult(lx1,lx1,lx1,lelt), dpcelm(lx1,lx1,lx1,lelt), 
+     &       pdiff(lx1,lx1,lx1,lelt), pdiffp(lx1,lx1,lx1,lelt)
+      common /colldp/ ta1, ta2, trhs, t, tmult, dpcelm, pdiff, pdiffp
+
+c.....double precision arrays associated with mortar points
+      double precision
+     &       umor(lmor), mormult(lmor), tmort(lmor), tmmor(lmor), 
+     &       rmor(lmor), dpcmor (lmor), pmorx(lmor), ppmor(lmor) 
+      common /mortdp/ umor, mormult, tmort,tmmor, rmor, dpcmor, 
+     &                pmorx, ppmor
+
+c.... integer arrays associated with element faces
+      integer idmo    (lx1,lx1,lnje,lnje,nsides,lelt), 
+     &        idel    (lx1,lx1,          nsides,lelt), 
+     &        sje     (2,2,              nsides,lelt), 
+     &        sje_new (2,2,              nsides,lelt),
+     &        ijel    (2,                nsides,lelt), 
+     &        ijel_new(2,                nsides,lelt),
+     &        cbc     (                  nsides,lelt), 
+     &        cbc_new (                  nsides,lelt) 
+      common/facein/ idmo, ijel, idel, ijel_new, sje, sje_new, cbc,
+     &               cbc_new
+
+c.....integer array associated with vertices
+      integer vassign  (8,lelt),      emo(2,8,8*lelt),   
+     &        nemo (8*    lelt)
+      common /vin/vassign, emo, nemo
+
+c.....integer array associated with element edges
+      integer diagn  (2,12,lelt) 
+      common /edgein/diagn 
+
+c.... integer arrays associated with elements
+      integer tree (      lelt), mt_to_id    (     lelt),                   
+     &        newc (      lelt), mt_to_id_old(     lelt),
+     &        newi (      lelt), id_to_mt    (     lelt), 
+     &        newe (      lelt), ref_front_id(     lelt),
+     &        front(      lelt), action      (     lelt), 
+     &        ich  (      lelt), size_e      (     lelt),
+     &        treenew     (     lelt)
+      common /eltin/ tree, treenew,mt_to_id,mt_to_id_old,
+     &               id_to_mt, newc, newi, newe, ref_front_id, 
+     &               ich, size_e, front, action
+
+c.....logical arrays associated with vertices
+      logical ifpcmor  (8* lelt)
+      common /vlg/ ifpcmor
+
+c.....logical arrays associated with edge
+      logical eassign  (12,lelt),  if_1_edge(12,lelt), 
+     &        ncon_edge(12,lelt)
+      common /edgelg/ eassign,  ncon_edge, if_1_edge
+
+c.....logical arrays associated with elements
+      logical skip (lelt), ifcoa   (lelt), ifcoa_id(lelt)
+      common /facelg/ skip, ifcoa, ifcoa_id
+
+c.....logical arrays associated with element faces
+      logical fassign(nsides,lelt), edgevis(4,nsides,lelt)      
+      common /masonl/ fassign, edgevis
+
+c.....small arrays
+      double precision qbnew(lx1-2,lx1,2), bqnew(lx1-2,lx1-2,2)
+      common /transr/ qbnew,bqnew
+
+      double precision pcmor_nc1(lx1,lx1,2,2,refine_max),
+     $       pcmor_nc2(lx1,lx1,2,2,refine_max),
+     $       pcmor_nc0(lx1,lx1,2,2,refine_max),
+     $       pcmor_c(lx1,lx1,refine_max), tcpre(lx1,lx1),
+     $       pcmor_cor(8,refine_max)
+      common /pcr/ pcmor_nc1,pcmor_c,pcmor_nc0,pcmor_nc2,tcpre, 
+     $             pcmor_cor
+
+c.....Gauss-Lobatto and Gauss points
+      double precision zgm1(lx1)
+      common /gauss/ zgm1
+
+c.....weights
+      double precision wxm1(lx1), w3m1(lx1,lx1,lx1)
+      common /wxyz/ wxm1,w3m1
+
+c.....coordinate of element vertices
+      double precision xc(8,lelt),yc(8,lelt),zc(8,lelt),
+     $       xc_new(8,lelt),yc_new(8,lelt),zc_new(8,lelt)
+      common /coord/ xc,yc,zc,xc_new,yc_new,zc_new
+
+c.....dr/dx, dx/dr  and Jacobian
+      double precision jacm1_s(lx1,lx1,lx1,refine_max), 
+     $       rxm1_s(lx1,lx1,lx1,refine_max),
+     $       xrm1_s(lx1,lx1,lx1,refine_max)
+      common /giso/ jacm1_s,xrm1_s, rxm1_s 
+
+c.....mass matrices (diagonal)
+      double precision bm1_s(lx1,lx1,lx1,refine_max)
+      common /mass/ bm1_s
+
+c.....derivative matrices d/dr
+      double precision dxm1(lx1,lx1), dxtm1(lx1,lx1), wdtdr(lx1,lx1)
+      common /dxyz/ dxm1,dxtm1,wdtdr
+
+c.....interpolation operators
+      double precision
+     $       ixm31(lx1,lx1*2-1), ixtm31(lx1*2-1,lx1), ixmc1(lx1,lx1),  
+     $       ixtmc1(lx1,lx1), ixmc2(lx1,lx1),  ixtmc2(lx1,lx1),
+     $       map2(lx1),map4(lx1)
+      common /ixyz/ ixmc1,ixtmc1,ixmc2,ixtmc2,ixm31,ixtm31,map2,map4
+
+c.....collocation location within an element
+      double precision xfrac(lx1)
+      common /xfracs/xfrac
+
+c.....used in laplacian operator
+      double precision g1m1_s(lx1,lx1,lx1,refine_max), 
+     $       g4m1_s(lx1,lx1,lx1,refine_max),
+     $       g5m1_s(lx1,lx1,lx1,refine_max),
+     $       g6m1_s(lx1,lx1,lx1,refine_max)
+      common /gmfact/ g1m1_s,g4m1_s,g5m1_s, g6m1_s
+      
+c.....We store some tables of useful topological constants
+c     These constants are initialized in a block data 'top_constants'
+      integer f_e_ef(4,6)
+      integer e_c(3,8)
+      integer local_corner(8,6)
+      integer cal_nnb(3,8)
+      integer oplc(4)
+      integer cal_iijj(2,4)
+      integer cal_intempx(4,6)
+      integer c_f(4,6)
+      integer le_arr(4,0:1,3)
+      integer jjface(6)
+      integer e_face2(4,6)
+      integer op(4)
+      integer localedgenumber(6,12)
+      integer edgenumber(4,6)
+      integer f_c(3,8)
+      integer e1v1(6,6),e2v1(6,6),e1v2(6,6),e2v2(6,6)
+      integer children(4,6)
+      integer iijj(2,4)
+      integer v_end(2)
+      integer face_l1(3),face_l2(3),face_ld(3)
+      common /top_consts/ f_e_ef,e_c,local_corner,cal_nnb,oplc,
+     $       cal_iijj,cal_intempx,c_f,le_arr,jjface,e_face2,op,
+     $       localedgenumber,edgenumber,f_c,e1v1,e2v1,e1v2,e2v2,
+     $       children,iijj,v_end,face_l1,face_l2,face_ld
+
+c ... Timer parameters
+      integer t_total,t_init,t_convect,t_transfb_c,
+     &        t_diffusion,t_transf,t_transfb,t_adaptation,
+     &        t_transf2,t_add2,t_last
+      parameter (t_total=1,t_init=2,t_convect=3,t_transfb_c=4,
+     &        t_diffusion=5,t_transf=6,t_transfb=7,t_adaptation=8,
+     &        t_transf2=9,t_add2=10,t_last=10)
+      logical timeron
+      common /timing/timeron
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/mason.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/mason.f
new file mode 100644
index 0000000..15fc53c
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/mason.f
@@ -0,0 +1,2255 @@
+c-----------------------------------------------------------------
+      subroutine mortar
+c-----------------------------------------------------------------
+c     generate mortar point index number 
+c-----------------------------------------------------------------
+      include 'header.h'
+
+      integer count, iel, jface, ntemp, i, ii, jj, ntemp1,
+     &        iii, jjj, face2, ne, ie, edge_g, ie2,
+     &        mor_v(3), cb, cb1, cb2, cb3, cb4, cb5, cb6,
+     &        space, sumcb, ij1, ij2, n1, n2, n3, n4, n5
+
+
+      n1=lx1*lx1*6*4*nelt
+      call nr_init(idmo,n1,0)
+
+      n2=8*nelt
+      call nr_init(nemo,n2,0)
+      call nr_init(vassign,n2,0)
+
+      n3=2*64*nelt
+      call nr_init(emo,n3,0)
+
+      n4=12*nelt
+      call l_init(if_1_edge,n4,.false.)
+  
+      n5=2*12*nelt
+      call nr_init(diagn,n5,0) 
+
+c.....Mortar point indices are generated in two steps: first generate
+c     them for all element vertices (corner points), then for conforming 
+c     edge and conforming face interiors. Each time a new mortar index 
+c     is generated for a mortar point, it is broadcast to all elements 
+c     sharing this mortar point. 
+
+c.....VERTICES
+      count=0
+
+c.....assign mortar point indices to element vertices
+
+      do iel=1,nelt
+
+c.......first calculate how many new mortar indices will be generated for 
+c       each element.
+
+c.......For each element, at least one vertex (vertex 8) will be a new mortar
+c       point. All possible new mortar points will be on face 2,4 or 6. By
+c       checking the type of these three faces, we are able to tell
+c       how many new mortar vertex points will be generated in each element.
+
+        cb=cbc(6,iel)
+        cb1=cbc(4,iel)
+        cb2=cbc(2,iel)
+
+c.......For different combinations of the type of these three faces,
+c       we group them into 27 configurations.
+c       For different face types we assign the following integers:
+c              1 for type 2 or 3
+c              2 for type 0
+c              5 for type 1
+c       By summing these integers for faces 2,4 and 6, sumcb will have 
+c       10 different numbers indicating 10 different combinations. 
+
+        sumcb=0
+        if(cb.eq.2.or.cb.eq.3)then
+          sumcb=sumcb+1
+        elseif(cb.eq.0)then
+          sumcb=sumcb+2
+        elseif(cb.eq.1)then
+          sumcb=sumcb+5
+        end if
+        if(cb1.eq.2.or.cb1.eq.3)then
+          sumcb=sumcb+1
+        elseif(cb1.eq.0)then
+          sumcb=sumcb+2
+        elseif(cb1.eq.1)then
+          sumcb=sumcb+5
+        end if
+        if(cb2.eq.2.or.cb2.eq.3)then
+          sumcb=sumcb+1
+        elseif(cb2.eq.0)then
+          sumcb=sumcb+2
+        elseif(cb2.eq.1)then
+          sumcb=sumcb+5
+        end if
+
+c.......compute newc(iel)
+c       newc(iel) records how many new mortar indices will be generated
+c                 for element iel
+c       vassign(i,iel) records the element vertex of the i'th new mortar 
+c                 vertex point for element iel. e.g. vassign(2,iel)=8 means
+c                 the 2nd new mortar vertex point generated on element
+c                 iel is iel's 8th vertex.
+ 
+        if(sumcb.eq.3)then
+c.......the three face types for face 2,4, and 6 are 2 2 2
+          newc(iel)=1
+          vassign(1,iel)=8
+          
+        elseif(sumcb.eq.4)then
+c.......the three face types for face 2,4 and 6 are 2 2 0 (not 
+c       necessarily in this order)
+          newc(iel)=2
+          if(cb.eq.0)then
+            vassign(1,iel)=4
+          elseif(cb1.eq.0)then
+            vassign(1,iel)=6
+          elseif(cb2.eq.0)then
+            vassign(1,iel)=7
+          end if
+          vassign(2,iel)=8
+
+        elseif(sumcb.eq.7)then
+c.......the three face types for face 2,4 and 6 are 2 2 1 (not 
+c       necessarily in this order)
+          if(cb.eq.1)then
+            ij1=ijel(1,6,iel)
+            ij2=ijel(2,6,iel)
+            if(ij1.eq.1.and.ij2.eq.1)then
+              newc(iel)=2
+              vassign(1,iel)=4
+              vassign(2,iel)=8
+            elseif(ij1.eq.1.and.ij2.eq.2)then
+              ntemp=sje(1,1,6,iel)
+              if(cbc(1,ntemp).eq.3.and.sje(1,1,1,ntemp).lt.iel)then
+                newc(iel)=1
+                vassign(1,iel)=8
+              else
+                newc(iel)=2
+                vassign(1,iel)=4
+                vassign(2,iel)=8
+              end if
+            elseif(ij1.eq.2.and.ij2.eq.1)then
+              ntemp=sje(1,1,6,iel)
+              if(cbc(3,ntemp).eq.3.and.sje(1,1,3,ntemp).lt.iel)then
+                newc(iel)=1
+                vassign(1,iel)=8
+              else
+                newc(iel)=2
+                vassign(1,iel)=4
+                vassign(2,iel)=8
+              endif
+            else
+              newc(iel)=1
+              vassign(1,iel)=8
+            end if
+          elseif(cb1.eq.1)then
+            ij1=ijel(1,4,iel)
+            ij2=ijel(2,4,iel)
+            if(ij1.eq.1.and.ij2.eq.1)then
+              newc(iel)=2
+              vassign(1,iel)=6
+              vassign(2,iel)=8
+            elseif(ij1.eq.1.and.ij2.eq.2)then
+              ntemp=sje(1,1,4,iel)
+              if(cbc(1,ntemp).eq.3.and.sje(1,1,1,ntemp).lt.iel)then
+                newc(iel)=1
+                vassign(1,iel)=8
+              else
+                newc(iel)=2
+                vassign(1,iel)=6
+                vassign(2,iel)=8
+              endif
+            elseif(ij1.eq.2.and.ij2.eq.1)then
+              ntemp=sje(1,1,4,iel)
+              if(cbc(5,ntemp).eq.3.and.sje(1,1,5,ntemp).lt.iel)then
+                newc(iel)=1
+                vassign(1,iel)=8
+              else
+                newc(iel)=2
+                vassign(1,iel)=6
+                vassign(2,iel)=8
+              endif
+            else
+              newc(iel)=1
+              vassign(1,iel)=8
+            end if
+
+          elseif(cb2.eq.1)then
+            ij1=ijel(1,2,iel)
+            ij2=ijel(2,2,iel)
+            if(ij1.eq.1.and.ij2.eq.1)then
+              newc(iel)=2
+              vassign(1,iel)=7
+              vassign(2,iel)=8
+            elseif(ij1.eq.1.and.ij2.eq.2)then
+              ntemp=sje(1,1,2,iel)
+              if(cbc(3,ntemp).eq.3.and.sje(1,1,3,ntemp).lt.iel)then
+                newc(iel)=1
+                vassign(1,iel)=8
+              else
+                newc(iel)=2
+                vassign(1,iel)=7
+                vassign(2,iel)=8
+              end if
+
+            elseif(ij1.eq.2.and.ij2.eq.1)then
+              ntemp=sje(1,1,2,iel)
+              if(cbc(5,ntemp).eq.3.and.sje(1,1,5,ntemp).lt.iel)then
+                newc(iel)=1
+                vassign(1,iel)=8
+              else
+                newc(iel)=2
+                vassign(1,iel)=7
+                vassign(2,iel)=8
+              end if
+            else
+              newc(iel)=1
+              vassign(1,iel)=8
+            end if
+          end if
+
+        elseif(sumcb.eq.5)then
+c.......the three face types for face 2,4 and 6 are 2/3 0 0 (not 
+c       necessarily in this order)
+          newc(iel)=4
+          if(cb.eq.2.or.cb.eq.3)then
+            vassign(1,iel)=5
+            vassign(2,iel)=6
+            vassign(3,iel)=7
+            vassign(4,iel)=8
+          elseif(cb1.eq.2.or.cb1.eq.3)then
+            vassign(1,iel)=3
+            vassign(2,iel)=4
+            vassign(3,iel)=7
+            vassign(4,iel)=8
+          elseif(cb2.eq.2.or.cb2.eq.3)then
+            vassign(1,iel)=2
+            vassign(2,iel)=4
+            vassign(3,iel)=6
+            vassign(4,iel)=8
+          end if
+
+        elseif(sumcb.eq.8)then
+c.......the three face types for face 2,4 and 6 are 2 0 1 (not 
+c       necessarily in this order)
+
+c.........if face 6 is of type 1 (cb=cbc(6,iel))
+          if(cb.eq.1)then
+            if(cb1.eq.2.or.cb1.eq.3)then
+              ij1=ijel(1,6,iel)
+              if(ij1.eq.1)then
+                newc(iel)=4
+                vassign(1,iel)=3
+                vassign(2,iel)=4
+                vassign(3,iel)=7
+                vassign(4,iel)=8
+              else 
+                ntemp=sje(1,1,6,iel)
+                if(cbc(3,ntemp).eq.3.and.sje(1,1,3,ntemp).lt.iel)then
+                  newc(iel)=2
+                  vassign(1,iel)=7
+                  vassign(2,iel)=8
+                else
+                  newc(iel)=3
+                  vassign(1,iel)=4
+                  vassign(2,iel)=7
+                  vassign(3,iel)=8
+                end if
+              end if
+
+            elseif(cb2.eq.2.or.cb2.eq.3)then
+              if(ijel(2,6,iel).eq.1)then
+                newc(iel)=4
+                vassign(1,iel)=2
+                vassign(2,iel)=4
+                vassign(3,iel)=6
+                vassign(4,iel)=8
+              else
+                ntemp=sje(1,1,6,iel)
+                if(cbc(1,ntemp).eq.3.and.sje(1,1,1,ntemp).lt.iel)then
+                  newc(iel)=2
+                  vassign(1,iel)=6
+                  vassign(2,iel)=8
+                else
+                  newc(iel)=3
+                  vassign(1,iel)=4
+                  vassign(2,iel)=6
+                  vassign(3,iel)=8
+                end if
+              end if
+            end if
+
+c.........if face 4 is of type 1 (cb1=cbc(4,iel))
+          elseif(cb1.eq.1)then
+            if(cb.eq.2.or.cb.eq.3)then
+              ij1=ijel(1,4,iel)
+              ij2=ijel(2,4,iel)
+
+              if(ij1.eq.1.and.ij2.eq.1)then
+                ntemp=sje(1,1,4,iel)
+                if(cbc(2,ntemp).eq.3.and.sje(1,1,2,ntemp).lt.iel)then
+                  newc(iel)=3
+                  vassign(1,iel)=6
+                  vassign(2,iel)=7
+                  vassign(3,iel)=8
+                else
+                  newc(iel)=4
+                  vassign(1,iel)=5
+                  vassign(2,iel)=6
+                  vassign(3,iel)=7
+                  vassign(4,iel)=8
+                end if
+              elseif(ij1.eq.1.and.ij2.eq.2)then
+                ntemp=sje(1,1,4,iel)
+                if(cbc(1,ntemp).eq.3.and.sje(1,1,1,ntemp).lt.iel)then
+                  newc(iel)=3
+                  vassign(1,iel)=5
+                  vassign(2,iel)=7
+                  vassign(3,iel)=8
+                else
+                  newc(iel)=4
+                  vassign(1,iel)=5
+                  vassign(2,iel)=6
+                  vassign(3,iel)=7
+                  vassign(4,iel)=8
+                end if
+              elseif(ij1.eq.2.and.ij2.eq.1)then
+                ntemp=sje(1,1,4,iel)
+                if(cbc(5,ntemp).eq.3.and.sje(1,1,5,ntemp).lt.iel)then
+                  newc(iel)=2
+                  vassign(1,iel)=7
+                  vassign(2,iel)=8
+                else
+                  newc(iel)=3
+                  vassign(1,iel)=6
+                  vassign(2,iel)=7
+                  vassign(3,iel)=8
+                end if
+              elseif(ij1.eq.2.and.ij2.eq.2)then
+                ntemp=sje(1,1,4,iel)
+                if(cbc(5,ntemp).eq.3.and.sje(1,1,5,ntemp).lt.iel)then
+                  newc(iel)=2
+                  vassign(1,iel)=7
+                  vassign(2,iel)=8
+                else
+                  newc(iel)=3
+                  vassign(1,iel)=5
+                  vassign(2,iel)=7
+                  vassign(3,iel)=8
+                end if
+              end if
+            else 
+              if(ijel(2,4,iel).eq.1)then
+                newc(iel)=4
+                vassign(1,iel)=2
+                vassign(2,iel)=4
+                vassign(3,iel)=6
+                vassign(4,iel)=8
+              else
+                ntemp=sje(1,1,4,iel)
+                if(cbc(1,ntemp).eq.3.and.sje(1,1,1,ntemp).lt.iel)then
+                  newc(iel)=2
+                  vassign(1,iel)=4
+                  vassign(2,iel)=8
+                else
+                  newc(iel)=3
+                  vassign(1,iel)=4
+                  vassign(2,iel)=6
+                  vassign(3,iel)=8
+                end if
+              end if
+            endif
+c.........if face 2 is of type 1 (cb2=cbc(2,iel))
+          elseif(cb2.eq.1)then
+            if(cb.eq.2.or.cb.eq.3)then
+              if(ijel(1,2,iel).eq.1)then
+                newc(iel)=4
+                vassign(1,iel)=5
+                vassign(2,iel)=6
+                vassign(3,iel)=7
+                vassign(4,iel)=8
+              else
+                ntemp=sje(1,1,2,iel)
+                if(cbc(5,ntemp).eq.3.and.sje(1,1,5,ntemp).lt.iel)then
+                  newc(iel)=2
+                  vassign(1,iel)=6
+                  vassign(2,iel)=8
+                else
+                  newc(iel)=3
+                  vassign(1,iel)=6
+                  vassign(2,iel)=7
+                  vassign(3,iel)=8
+                end if
+              end if
+            else 
+              if(ijel(2,2,iel).eq.1)then
+                newc(iel)=4
+                vassign(1,iel)=3
+                vassign(2,iel)=4
+                vassign(3,iel)=7
+                vassign(4,iel)=8
+              else
+                ntemp=sje(1,1,2,iel)
+                if(cbc(3,ntemp).eq.3.and.sje(1,1,3,ntemp).lt.iel)then
+                  newc(iel)=2
+                  vassign(1,iel)=4
+                  vassign(2,iel)=8
+                else
+                  newc(iel)=3
+                  vassign(1,iel)=4
+                  vassign(2,iel)=7
+                  vassign(3,iel)=8
+                end if
+              end if
+            end if
+          end if
+
+        elseif(sumcb.eq.11)then
+c.......the three face types for faces 2, 4 and 6 are 2 1 1 (not 
+c       necessarily in this order)
+          if(cb.eq.2.or.cb.eq.3)then
+            if(ijel(1,4,iel).eq.1)then
+              ntemp=sje(1,1,4,iel)
+              if(cbc(2,ntemp).eq.3.and.sje(1,1,2,ntemp).lt.iel)then
+                newc(iel)=3
+                vassign(1,iel)=6
+                vassign(2,iel)=7
+                vassign(3,iel)=8
+              else
+                newc(iel)=4
+                vassign(1,iel)=5
+                vassign(2,iel)=6
+                vassign(3,iel)=7
+                vassign(4,iel)=8
+              end if
+
+c...........if ijel(1,4,iel)=2
+            else
+              ntemp=sje(1,1,2,iel)
+              if(cbc(5,ntemp).eq.3.and.sje(1,1,5,ntemp).lt.iel)then
+                ntemp1=sje(1,1,4,iel)
+                if(cbc(5,ntemp1).eq.3.and.
+     &             sje(1,1,5,ntemp1).lt.iel)then
+                  newc(iel)=1
+                  vassign(1,iel)=8
+                else
+                  newc(iel)=2
+                  vassign(1,iel)=6
+                  vassign(2,iel)=8
+                end if
+              else
+                ntemp1=sje(1,1,4,iel)
+                if(cbc(5,ntemp1).eq.3.and.
+     &             sje(1,1,5,ntemp1).lt.iel)then
+                  newc(iel)=2
+                  vassign(1,iel)=7
+                  vassign(2,iel)=8
+                else
+                  newc(iel)=3
+                  vassign(1,iel)=6
+                  vassign(2,iel)=7
+                  vassign(3,iel)=8
+                end if
+              end if
+            end if
+          elseif(cb1.eq.2.or.cb1.eq.3)then
+            if(ijel(2,2,iel).eq.1)then
+              ntemp=sje(1,1,2,iel)
+              if(cbc(6,ntemp).eq.3.and.sje(1,1,6,ntemp).lt.iel)then
+                newc(iel)=3
+                vassign(1,iel)=4
+                vassign(2,iel)=7
+                vassign(3,iel)=8
+              else
+                newc(iel)=4
+                vassign(1,iel)=3
+                vassign(2,iel)=4
+                vassign(3,iel)=7
+                vassign(4,iel)=8
+              end if
+c...........if ijel(2,2,iel)=2
+            else
+              ntemp=sje(1,1,2,iel)
+              if(cbc(3,ntemp).eq.3.and.sje(1,1,3,ntemp).lt.iel)then
+                ntemp1=sje(1,1,6,iel)
+                if(cbc(3,ntemp1).eq.3.and.
+     &            sje(1,1,3,ntemp1).lt.iel)then
+                  newc(iel)=1
+                  vassign(1,iel)=8
+                else
+                  newc(iel)=2
+                  vassign(1,iel)=4
+                  vassign(2,iel)=8
+                end if
+              else
+                ntemp1=sje(1,1,6,iel)
+                if(cbc(3,ntemp1).eq.3.and.
+     &            sje(1,1,3,ntemp1).lt.iel)then
+                  newc(iel)=2
+                  vassign(1,iel)=7
+                  vassign(2,iel)=8
+                else
+                  newc(iel)=3
+                  vassign(1,iel)=4
+                  vassign(2,iel)=7
+                  vassign(3,iel)=8
+                end if
+              end if
+            end if
+          elseif(cb2.eq.2.or.cb2.eq.3)then
+            if(ijel(2,6,iel).eq.1)then
+              ntemp=sje(1,1,4,iel)
+              if(cbc(6,ntemp).eq.3.and.sje(1,1,6,ntemp).lt.iel)then
+                newc(iel)=3
+                vassign(1,iel)=4
+                vassign(2,iel)=6
+                vassign(3,iel)=8
+              else
+                newc(iel)=4
+                vassign(1,iel)=2
+                vassign(2,iel)=4
+                vassign(3,iel)=6
+                vassign(4,iel)=8
+              end if
+c...........if ijel(2,6,iel)=2
+            else
+              ntemp=sje(1,1,4,iel)
+              if(cbc(1,ntemp).eq.3.and.sje(1,1,1,ntemp).lt.iel)then
+                ntemp1=sje(1,1,6,iel)
+                if(cbc(1,ntemp1).eq.3.and.
+     &            sje(1,1,1,ntemp1).lt.iel)then
+                  newc(iel)=1
+                  vassign(1,iel)=8
+                else
+                  newc(iel)=2
+                  vassign(1,iel)=4
+                  vassign(2,iel)=8
+                end if
+              else
+                ntemp1=sje(1,1,6,iel)
+                if(cbc(1,ntemp1).eq.3.and.
+     &              sje(1,1,1,ntemp1).lt.iel)then
+                  newc(iel)=2
+                  vassign(1,iel)=6
+                  vassign(2,iel)=8
+                else
+                  newc(iel)=3
+                  vassign(1,iel)=4
+                  vassign(2,iel)=6
+                  vassign(3,iel)=8
+                end if
+              end if
+            end if
+
+          end if
+          
+        elseif(sumcb.eq.6)then
+c.......the three face types for faces 2, 4 and 6 are 0 0 0 (not 
+c       necessarily in this order)
+          newc(iel)=8
+          vassign(1,iel)=1
+          vassign(2,iel)=2
+          vassign(3,iel)=3
+          vassign(4,iel)=4
+          vassign(5,iel)=5
+          vassign(6,iel)=6
+          vassign(7,iel)=7
+          vassign(8,iel)=8
+
+        elseif(sumcb.eq.9)then
+c.......the three face types for faces 2, 4 and 6 are 0 0 1 (not 
+c       necessarily in this order)
+          newc(iel)=7
+          vassign(1,iel)=2
+          vassign(2,iel)=3
+          vassign(3,iel)=4
+          vassign(4,iel)=5
+          vassign(5,iel)=6
+          vassign(6,iel)=7
+          vassign(7,iel)=8
+
+        elseif(sumcb.eq.12)then
+c.......the three face types for faces 2, 4 and 6 are 0 1 1 (not 
+c       necessarily in this order)
+          if(cb.eq.0)then
+            ntemp=sje(1,1,2,iel)
+            if(cbc(4,ntemp).eq.3.and.sje(1,1,4,ntemp).lt.iel)then
+              newc(iel)=6
+              vassign(1,iel)=2
+              vassign(2,iel)=3
+              vassign(3,iel)=4
+              vassign(4,iel)=6
+              vassign(5,iel)=7
+              vassign(6,iel)=8
+            else
+              newc(iel)=7
+              vassign(1,iel)=2
+              vassign(2,iel)=3
+              vassign(3,iel)=4
+              vassign(4,iel)=5
+              vassign(5,iel)=6
+              vassign(6,iel)=7
+              vassign(7,iel)=8
+            end if
+          elseif(cb1.eq.0)then
+            newc(iel)=7
+            vassign(1,iel)=2
+            vassign(2,iel)=3
+            vassign(3,iel)=4
+            vassign(4,iel)=5
+            vassign(5,iel)=6
+            vassign(6,iel)=7
+            vassign(7,iel)=8
+          elseif(cb2.eq.0)then
+            ntemp=sje(1,1,4,iel)
+            if(cbc(6,ntemp).eq.3.and.sje(1,1,6,ntemp).lt.iel)then
+              newc(iel)=6
+              vassign(1,iel)=3
+              vassign(2,iel)=4
+              vassign(3,iel)=5
+              vassign(4,iel)=6
+              vassign(5,iel)=7
+              vassign(6,iel)=8
+            else
+              newc(iel)=7
+              vassign(1,iel)=2
+              vassign(2,iel)=3
+              vassign(3,iel)=4
+              vassign(4,iel)=5
+              vassign(5,iel)=6
+              vassign(6,iel)=7
+              vassign(7,iel)=8
+            end if
+          end if
+        
+        elseif(sumcb.eq.15)then
+c.......the three face types for faces 2, 4 and 6 are 1 1 1 (not 
+c       necessarily in this order)
+          ntemp=sje(1,1,4,iel)
+          ntemp1=sje(1,1,2,iel)
+          if(cbc(6,ntemp).eq.3.and.sje(1,1,6,ntemp).lt.iel)then
+            if(cbc(2,ntemp).eq.3.and.sje(1,1,2,ntemp).lt.iel)then
+              if(cbc(6,ntemp1).eq.3.and.sje(1,1,6,ntemp1).lt.iel)then
+                newc(iel)=4
+                vassign(1,iel)=4
+                vassign(2,iel)=6
+                vassign(3,iel)=7
+                vassign(4,iel)=8
+              else
+                newc(iel)=5
+                vassign(1,iel)=3
+                vassign(2,iel)=4
+                vassign(3,iel)=6
+                vassign(4,iel)=7
+                vassign(5,iel)=8
+              end if
+            else
+              if(cbc(6,ntemp1).eq.3.and.sje(1,1,6,ntemp1).lt.iel)then
+                newc(iel)=5
+                vassign(1,iel)=4
+                vassign(2,iel)=5
+                vassign(3,iel)=6
+                vassign(4,iel)=7
+                vassign(5,iel)=8
+              else
+                newc(iel)=6
+                vassign(1,iel)=3
+                vassign(2,iel)=4
+                vassign(3,iel)=5
+                vassign(4,iel)=6
+                vassign(5,iel)=7
+                vassign(6,iel)=8
+              end if
+            end if
+          else
+            if(cbc(2,ntemp).eq.3.and.sje(1,1,2,ntemp).lt.iel)then
+              if(cbc(6,ntemp1).eq.3.and.sje(1,1,6,ntemp1).lt.iel)then
+                newc(iel)=5
+                vassign(1,iel)=2
+                vassign(2,iel)=4
+                vassign(3,iel)=6
+                vassign(4,iel)=7
+                vassign(5,iel)=8
+              else
+                newc(iel)=6
+                vassign(1,iel)=2
+                vassign(2,iel)=3
+                vassign(3,iel)=4
+                vassign(4,iel)=6
+                vassign(5,iel)=7
+                vassign(6,iel)=8
+              end if
+            else
+              if(cbc(6,ntemp1).eq.3.and.sje(1,1,6,ntemp1).lt.iel)then
+                newc(iel)=6
+                vassign(1,iel)=2
+                vassign(2,iel)=4
+                vassign(3,iel)=5
+                vassign(4,iel)=6
+                vassign(5,iel)=7
+                vassign(6,iel)=8
+
+              else
+                newc(iel)=7
+                vassign(1,iel)=2 
+                vassign(2,iel)=3 
+                vassign(3,iel)=4 
+                vassign(4,iel)=5
+                vassign(5,iel)=6
+                vassign(6,iel)=7
+                vassign(7,iel)=8
+              end if
+            end if
+          end if
+        end if
+      end do
+
+c.....end computing how many new mortar vertex points will be generated
+c     on each element.
+
+c.....Compute (potentially in parallel) front(iel), which records how many 
+c     new mortar point indices are to be generated from element 1 to iel.
+c     front(iel)=newc(1)+newc(2)+...+newc(iel)
+
+      call ncopy(front,newc,nelt)
+
+      call parallel_add(front)
+
+c.....On each element, generate new mortar point indices and assign them
+c     to all elements sharing this mortar point. Note that if a mortar
+c     point is shared by several elements, its index is generated only
+c     on the element with the lowest element index.
+
+      do iel=1,nelt
+
+c.......compute the starting vertex mortar point index in element iel
+        front(iel)=front(iel)-newc(iel)
+
+        do i=1,newc(iel)
+c.........count is the new mortar index number, which will be assigned
+c         to a vertex of iel and broadcast to all other elements sharing
+c         this vertex point.
+          count=front(iel)+i
+          call mortar_vertex(vassign(i,iel),iel,count) 
+        end do
+      end do
+
+c.....nvertex records how many mortar indices are for element vertices.
+c     It is used in the computation of the preconditioner.
+      nvertex=count
+
+c.....CONFORMING EDGE AND FACE INTERIOR
+
+c.....find out how many new mortar point indices will be assigned to all
+c.....conforming edges and all conforming face interiors on each element
+
+
+c.....eassign(i,iel)=.true.   indicates that the i'th edge on iel will 
+c                             generate new mortar points. 
+c     ncon_edge(i,iel)=.true. indicates that the i'th edge on iel is 
+c                             nonconforming
+
+      n1=12*nelt
+      call l_init(ncon_edge,n1,.false.)
+      call l_init(eassign,n1,.false.)
+
+c.....fassign(i,iel)=.true. indicates that the i'th face of iel will 
+c                           generate new mortar points
+      n2=6*nelt
+      call l_init(fassign,n2,.false.)
+
+c.....newe records how many new edges are to be assigned
+c     diagn(1,n,iel) records the index of the neighbor element that
+c                    shares edge n of iel
+c     diagn(2,n,iel) records which part of edge n of iel is shared with
+c                    neighbor element diagn(1,n,iel): 1 refers to the left
+c                    or bottom half of edge n, 2 refers to the right or
+c                    top half of edge n.
+c     if_1_edge(n,iel)=.true. indicates that iel is smaller than its
+c                    neighbor element, which it touches along edge n only
+
+
+      do iel=1,nelt
+        newc(iel)=0
+        newe(iel)=0
+        newi(iel)=0
+        cb1=cbc(1,iel)
+        cb2=cbc(2,iel)
+        cb3=cbc(3,iel)
+        cb4=cbc(4,iel)
+        cb5=cbc(5,iel)
+        cb6=cbc(6,iel)
+
+c.......on face 6
+
+        if(cb6.eq.0)then
+          if(cb4.eq.0.or.cb4.eq.1)then
+c...........if face 6 is of type 0 and face 4 is of type 0 or type 1, the edge
+c           shared by face 4 and 6 (edge 11) will generate new mortar point
+c           indices.
+            newe(iel)=newe(iel)+1
+            eassign(11,iel)=.true.
+          end if
+          if(cb1.ne.3)then
+c...........if face 1 is not of type 3, the edge shared by face 6 and 1
+c           (edge 1) will generate new mortar point indices.
+            newe(iel)=newe(iel)+1
+            eassign(1,iel)=.true.
+          end if
+          if(cb3.ne.3)then
+            newe(iel)=newe(iel)+1
+            eassign(9,iel)=.true.
+          end if
+          if(cb2.eq.0.or.cb2.eq.1)then
+            newe(iel)=newe(iel)+1
+            eassign(5,iel)=.true.
+          end if
+        elseif(cb6.eq.1)then
+          if(cb4.eq.0)then
+            newe(iel)=newe(iel)+1
+            eassign(11,iel)=.true.
+          elseif(cb4.eq.1)then
+
+c...........If face 6 and face 4 both are of type 1, ntemp is the neighbor
+c           element on face 4.
+            ntemp=sje(1,1,4,iel)
+
+c...........if ntemp's face 6 is not nonconforming or the neighbor element
+c           of ntemp on face 6 has an element index larger than iel, the 
+c           edge shared by face 6 and 4 (edge 11) will generate new mortar
+c           point indices.
+            if(cbc(6,ntemp).ne.3.or.sje(1,1,6,ntemp).gt.iel)then
+
+              newe(iel)=newe(iel)+1
+              eassign(11,iel)=.true.
+c.............if the face 6 of ntemp is of type 2
+              if(cbc(6,ntemp).eq.2)then
+c...............The neighbor element of iel, neighbored by edge 11, is 
+c               sje(1,1,6,ntemp) (the neighbor element of ntemp on ntemp's
+c               face 6).
+                diagn(1,11,iel)=sje(1,1,6,ntemp)
+c...............The neighbor element of iel, neighbored by edge 11 shares
+c               the ijel(2,6,iel) part of edge 11 of iel
+                diagn(2,11,iel)=ijel(2,6,iel)
+c...............edge 10 of element sje(1,1,6,ntemp) (the neighbor element of 
+c               ntemp on ntemp's face 6) is a nonconforming edge
+                ncon_edge(10,sje(1,1,6,ntemp))=.true.
+c...............if_1_edge(n,iel)=.true. indicates that iel is of a smaller
+c               size than its neighbor element, neighbored by edge n of iel only.
+                if_1_edge(11,iel)=.true.
+              endif
+              if(cbc(6,ntemp).eq.3.and.
+     &          sje(1,1,6,ntemp).gt.iel)then
+                diagn(1,11,iel)=sje(2,ijel(2,6,iel),6,ntemp)
+              endif
+            end if
+          endif
+
+          if(cb1.eq.0)then
+            newe(iel)=newe(iel)+1
+            eassign(1,iel)=.true.
+          elseif(cb1.eq.1)then
+            ntemp=sje(1,1,1,iel)
+            if(cbc(6,ntemp).ne.3.or.sje(1,1,6,ntemp).gt.iel)then
+              newe(iel)=newe(iel)+1
+              eassign(1,iel)=.true.
+              if(cbc(6,ntemp).eq.2)then
+                diagn(1,1,iel)=sje(1,1,6,ntemp)
+                diagn(2,1,iel)=ijel(1,6,iel)
+                ncon_edge(7,sje(1,1,6,ntemp))=.true.
+                if_1_edge(1,iel)=.true.
+              endif
+              if(cbc(6,ntemp).eq.3.and.
+     &          sje(1,1,6,ntemp).gt.iel)then
+                diagn(1,1,iel)=sje(ijel(1,6,iel),1,6,ntemp)
+              endif
+            end if
+          elseif(cb1.eq.2)then
+            if(ijel(2,6,iel).eq.2)then
+              ntemp=sje(1,1,1,iel)
+              if(cbc(6,ntemp).eq.1)then
+                newe(iel)=newe(iel)+1
+                eassign(1,iel)=.true.
+c.............if cbc(6,ntemp)=2
+              else
+                if(sje(1,1,6,ntemp).gt.iel)then
+                  newe(iel)=newe(iel)+1
+                  eassign(1,iel)=.true.
+                  diagn(1,1,iel)=sje(1,1,6,ntemp)
+                end if
+              end if
+            else
+              newe(iel)=newe(iel)+1
+              eassign(1,iel)=.true.
+            end if
+          end if
+
+          if(cb3.eq.0)then
+            newe(iel)=newe(iel)+1
+            eassign(9,iel)=.true.
+          elseif(cb3.eq.1)then
+            ntemp=sje(1,1,3,iel)
+            if(cbc(6,ntemp).ne.3.or.sje(1,1,6,ntemp).gt.iel)then
+              newe(iel)=newe(iel)+1
+              eassign(9,iel)=.true.
+              if(cbc(6,ntemp).eq.2)then
+                diagn(1,9,iel)=sje(1,1,6,ntemp)
+                diagn(2,9,iel)=ijel(2,6,iel)
+                ncon_edge(12,sje(1,1,6,ntemp))=.true.
+                if_1_edge(9,iel)=.true.
+              endif
+              if(cbc(6,ntemp).eq.3.and.
+     &           sje(1,1,6,ntemp).gt.iel)then
+                diagn(1,9,iel)=sje(2,ijel(2,6,iel),6,ntemp)
+              endif
+            end if
+          elseif(cb3.eq.2)then
+            if(ijel(1,6,iel).eq.2)then
+              ntemp=sje(1,1,3,iel)
+              if(cbc(6,ntemp).eq.1)then
+                newe(iel)=newe(iel)+1
+                eassign(9,iel)=.true.
+c.............if cbc(6,ntemp)=2
+              else
+                if(sje(1,1,6,ntemp).gt.iel)then
+                  newe(iel)=newe(iel)+1
+                  eassign(9,iel)=.true.
+                  diagn(1,9,iel)=sje(1,1,6,ntemp)
+                end if
+              end if
+            else
+              newe(iel)=newe(iel)+1
+              eassign(9,iel)=.true.
+            end if
+          end if
+
+          if(cb2.eq.0)then
+            newe(iel)=newe(iel)+1
+            eassign(5,iel)=.true.
+          elseif(cb2.eq.1)then
+            ntemp=sje(1,1,2,iel)
+            if(cbc(6,ntemp).ne.3.or.sje(1,1,6,ntemp).gt.iel)then
+              newe(iel)=newe(iel)+1
+              eassign(5,iel)=.true.
+              if(cbc(6,ntemp).eq.2)then
+                diagn(1,5,iel)=sje(1,1,6,ntemp)
+                diagn(2,5,iel)=ijel(1,6,iel)
+                ncon_edge(3,sje(1,1,6,ntemp))=.true.
+                if_1_edge(5,iel)=.true.
+              endif
+              if(cbc(6,ntemp).eq.3.and.
+     &          sje(1,1,6,ntemp).gt.iel)then
+                diagn(1,9,iel)=sje(2,ijel(2,6,iel),6,ntemp)
+              endif
+            endif
+          end if
+        end if
+
+c.......on face 4
+        if(cb4.eq.0)then
+          if(cb1.ne.3)then
+            newe(iel)=newe(iel)+1
+            eassign(4,iel)=.true.
+          endif
+          if(cb5.ne.3)then
+            newe(iel)=newe(iel)+1
+            eassign(12,iel)=.true.
+          endif
+          if(cb2.eq.0.or.cb2.eq.1)then
+            newe(iel)=newe(iel)+1
+            eassign(8,iel)=.true.
+          end if 
+           
+        elseif(cb4.eq.1)then
+          if(cb1.eq.2)then
+            if(ijel(2,4,iel).eq.1)then
+              newe(iel)=newe(iel)+1
+              eassign(4,iel)=.true.
+            else
+              ntemp=sje(1,1,4,iel)
+              if(cbc(1,ntemp).ne.3.or.sje(1,1,1,ntemp).gt.iel)then
+                newe(iel)=newe(iel)+1
+                eassign(4,iel)=.true.
+                if(cbc(1,ntemp).eq.3.and.
+     &            sje(1,1,1,ntemp).gt.iel)then
+                  diagn(1,4,iel)=sje(ijel(1,4,iel),2,1,ntemp) 
+                endif
+              endif
+            end if
+          elseif(cb1.eq.0)then
+            newe(iel)=newe(iel)+1
+            eassign(4,iel)=.true.
+          elseif(cb1.eq.1)then
+            ntemp=sje(1,1,4,iel)
+            if(cbc(1,ntemp).ne.3.or.sje(1,1,1,ntemp).gt.iel)then
+              newe(iel)=newe(iel)+1
+              eassign(4,iel)=.true.
+              if(cbc(1,ntemp).eq.2)then
+                diagn(1,4,iel)=sje(1,1,1,ntemp)
+                diagn(2,4,iel)=ijel(1,4,iel)
+                ncon_edge(6,sje(1,1,1,ntemp))=.true.
+                if_1_edge(4,iel)=.true.
+              endif
+              if(cbc(1,ntemp).eq.3.and.
+     &          sje(1,1,1,ntemp).gt.iel)then
+                diagn(1,4,iel)=sje(ijel(1,4,iel),2,1,ntemp)
+              endif
+            end if
+          end if
+          if(cb5.eq.2)then
+            if(ijel(1,4,iel).eq.1)then
+              newe(iel)=newe(iel)+1
+              eassign(12,iel)=.true.
+            else
+              ntemp=sje(1,1,4,iel)
+              if(cbc(5,ntemp).ne.3.or.sje(1,1,5,ntemp).gt.iel)then
+                newe(iel)=newe(iel)+1
+                eassign(12,iel)=.true.
+                if(cbc(5,ntemp).eq.3.and.
+     &            sje(1,1,5,ntemp).gt.iel)then
+                  diagn(1,12,iel)=sje(2,ijel(2,4,iel),5,ntemp)
+                endif
+              endif
+            end if
+          elseif(cb5.eq.0)then
+            newe(iel)=newe(iel)+1
+            eassign(12,iel)=.true.
+          elseif(cb5.eq.1)then
+            ntemp=sje(1,1,4,iel)
+            if(cbc(5,ntemp).ne.3.or.sje(1,1,5,ntemp).gt.iel)then
+              newe(iel)=newe(iel)+1
+              eassign(12,iel)=.true.
+              if(cbc(5,ntemp).eq.2)then
+                diagn(1,12,iel)=sje(1,1,5,ntemp)
+                diagn(2,12,iel)=ijel(2,4,iel)
+                ncon_edge(9,sje(1,1,5,ntemp))=.true.
+                if_1_edge(12,iel)=.true.
+              endif
+              if(cbc(5,ntemp).eq.3.and.
+     &          sje(1,1,5,ntemp).gt.iel)then
+                diagn(1,12,iel)=sje(2,ijel(2,4,iel),5,ntemp)
+              endif
+            end if
+          end if
+          if(cb2.eq.0)then
+            newe(iel)=newe(iel)+1
+            eassign(8,iel)=.true.
+          elseif(cb2.eq.1)then
+            ntemp=sje(1,1,4,iel)
+            if(cbc(2,ntemp).ne.3.or.sje(1,1,2,ntemp).gt.iel)then
+              newe(iel)=newe(iel)+1
+              eassign(8,iel)=.true.
+              if(cbc(2,ntemp).eq.2)then
+                diagn(1,8,iel)=sje(1,1,2,ntemp)
+                diagn(2,8,iel)=ijel(1,4,iel)
+                ncon_edge(2,sje(1,1,2,ntemp))=.true.
+                if_1_edge(8,iel)=.true.
+              endif
+              if(cbc(2,ntemp).eq.3.and.
+     &          sje(1,1,2,ntemp).gt.iel)then
+                diagn(1,8,iel)=sje(ijel(1,4,iel),2,3,ntemp)
+              endif
+            endif
+          end if
+        end if
+
+c.......on face 2
+        if(cb2.eq.0)then
+          if(cb3.ne.3)then
+            newe(iel)=newe(iel)+1
+            eassign(6,iel)=.true.
+          endif
+          if(cb5.ne.3)then
+            newe(iel)=newe(iel)+1
+            eassign(7,iel)=.true.
+          endif
+        elseif(cb2.eq.1)then
+          if(cb3.eq.2)then
+            if(ijel(2,2,iel).eq.1)then
+              newe(iel)=newe(iel)+1
+              eassign(6,iel)=.true.
+            else
+              ntemp=sje(1,1,2,iel)
+              if(cbc(3,ntemp).ne.3.or.
+     &          sje(1,1,3,ntemp).gt.iel)then
+                newe(iel)=newe(iel)+1
+                eassign(6,iel)=.true.
+                if(cbc(3,ntemp).eq.3.and.
+     &            sje(1,1,3,ntemp).gt.iel)then
+                  diagn(1,6,iel)=sje(ijel(1,2,iel),2,3,ntemp)
+                endif
+              endif
+            endif
+          elseif(cb3.eq.0)then
+            newe(iel)=newe(iel)+1
+            eassign(6,iel)=.true.
+          elseif(cb3.eq.1)then
+            ntemp=sje(1,1,2,iel)
+            if(cbc(3,ntemp).ne.3.or.sje(1,1,3,ntemp).gt.iel)then
+              newe(iel)=newe(iel)+1
+              eassign(6,iel)=.true.
+              if(cbc(3,ntemp).eq.2)then
+                diagn(1,6,iel)=sje(1,1,3,ntemp)
+                diagn(2,6,iel)=ijel(1,2,iel)
+                ncon_edge(4,sje(1,1,3,ntemp))=.true.
+                if_1_edge(6,iel)=.true.
+              endif
+              if(cbc(3,ntemp).eq.3.and.
+     &          sje(1,1,3,ntemp).gt.iel)then
+                diagn(1,6,iel)=sje(ijel(1,4,iel),2,3,ntemp)
+              endif
+            endif
+          endif
+          if(cb5.eq.2)then
+            if(ijel(1,2,iel).eq.1)then
+              newe(iel)=newe(iel)+1
+              eassign(7,iel)=.true.
+            else
+              ntemp=sje(1,1,2,iel)
+              if(cbc(5,ntemp).ne.3.or.sje(1,1,5,ntemp).gt.iel)then
+                newe(iel)=newe(iel)+1
+                eassign(7,iel)=.true.
+                if(cbc(5,ntemp).eq.3.and.
+     &            sje(1,1,5,ntemp).gt.iel)then
+                  diagn(1,7,iel)=sje(ijel(2,2,iel),2,5,ntemp)
+                endif
+              endif
+            endif
+          elseif(cb5.eq.0)then
+            newe(iel)=newe(iel)+1
+            eassign(7,iel)=.true.
+          elseif(cb5.eq.1)then
+            ntemp=sje(1,1,2,iel)
+            if(cbc(5,ntemp).ne.3.or.sje(1,1,5,ntemp).gt.iel)then
+              newe(iel)=newe(iel)+1
+              eassign(7,iel)=.true.
+              if(cbc(5,ntemp).eq.2)then
+                diagn(1,7,iel)=sje(1,1,5,ntemp)
+                diagn(2,7,iel)=ijel(2,2,iel)
+                ncon_edge(1,sje(1,1,5,ntemp))=.true.
+                if_1_edge(7,iel)=.true.
+              endif
+              if(cbc(5,ntemp).eq.3.and.
+     &          sje(1,1,5,ntemp).gt.iel)then
+                diagn(1,7,iel)=sje(2,ijel(2,4,iel),5,ntemp)
+              endif
+            endif
+          endif
+        end if
+
+c.......on face 1
+        if(cb1.eq.1)then
+          newe(iel)=newe(iel)+2
+          eassign(2,iel)=.true.
+          if(cb3.eq.1)then
+            ntemp=sje(1,1,1,iel)
+            if(cbc(3,ntemp).eq.2)then
+              diagn(1,2,iel)=sje(1,1,3,ntemp)
+              diagn(2,2,iel)=ijel(1,1,iel)
+              ncon_edge(8,sje(1,1,3,ntemp))=.true.
+              if_1_edge(2,iel)=.true.
+            elseif(cbc(3,ntemp).eq.3)then
+              diagn(1,2,iel)=sje(ijel(1,1,iel),1,3,ntemp)
+            endif
+          elseif(cb3.eq.2)then
+            ntemp=sje(1,1,3,iel)
+            if(ijel(2,1,iel).eq.2)then
+              if(cbc(1,ntemp).eq.2)then
+                diagn(1,2,iel)=sje(1,1,1,ntemp)
+              end if
+            endif
+          end if
+
+          eassign(3,iel)=.true.
+          if(cb5.eq.1)then
+            ntemp=sje(1,1,1,iel)
+            if(cbc(5,ntemp).eq.2)then
+              diagn(1,3,iel)=sje(1,1,5,ntemp)
+              diagn(2,3,iel)=ijel(2,1,iel)
+              ncon_edge(5,sje(1,1,5,ntemp))=.true.
+              if_1_edge(3,iel)=.true.
+            elseif(cbc(5,ntemp).eq.3)then
+              diagn(1,3,iel)=sje(ijel(2,1,iel),1,5,ntemp)
+            endif
+          elseif(cb5.eq.2)then
+            ntemp=sje(1,1,5,iel)
+            if(ijel(1,1,iel).eq.2)then
+              if(cbc(1,ntemp).eq.2)then
+                diagn(1,3,iel)=sje(1,1,1,ntemp)
+              end if
+            endif
+            
+          end if
+        elseif(cb1.eq.2)then
+          if(cb3.eq.2)then
+            ntemp=sje(1,1,1,iel)
+            if(cbc(3,ntemp).ne.3)then
+              newe(iel)=newe(iel)+1
+              eassign(2,iel)=.true.
+              if(cbc(3,ntemp).eq.2)then
+                diagn(1,2,iel)=sje(1,1,3,ntemp)
+              endif 
+            endif
+          elseif(cb3.eq.0.or.cb3.eq.1)then
+            newe(iel)=newe(iel)+1
+            eassign(2,iel)=.true.
+            if(cb3.eq.1)then
+              ntemp=sje(1,1,1,iel)
+              if(cbc(3,ntemp).eq.2)then
+                diagn(1,2,iel)=sje(1,1,3,ntemp)
+              endif
+            endif
+          end if
+          if(cb5.eq.2)then
+            ntemp=sje(1,1,1,iel)
+            if(cbc(5,ntemp).ne.3)then
+              newe(iel)=newe(iel)+1
+              eassign(3,iel)=.true.
+              if(cbc(5,ntemp).eq.2)then
+                diagn(1,3,iel)=sje(1,1,5,ntemp)
+              endif
+            endif
+          elseif(cb5.eq.0.or.cb5.eq.1)then
+            newe(iel)=newe(iel)+1
+            eassign(3,iel)=.true.
+            if(cb5.eq.1)then
+              ntemp=sje(1,1,1,iel)
+              if(cbc(5,ntemp).eq.2)then
+                diagn(1,3,iel)=sje(1,1,5,ntemp)
+              endif
+            endif
+          end if
+        elseif(cb1.eq.0)then
+          if(cb3.ne.3)then
+            newe(iel)=newe(iel)+1
+            eassign(2,iel)=.true.
+          endif
+          if(cb5.ne.3)then
+            newe(iel)=newe(iel)+1
+            eassign(3,iel)=.true.
+          endif
+        endif
+
+c.......on face 3
+        if(cb3.eq.1)then
+          newe(iel)=newe(iel)+1
+          eassign(10,iel)=.true.
+          if(cb5.eq.1)then
+            ntemp=sje(1,1,3,iel)
+            if(cbc(5,ntemp).eq.2)then
+              diagn(1,10,iel)=sje(1,1,5,ntemp)
+              diagn(2,10,iel)=ijel(2,3,iel)
+              ncon_edge(11,sje(1,1,5,ntemp))=.true.
+              if_1_edge(10,iel)=.true.
+            endif
+          endif
+          if(ijel(1,3,iel).eq.2)then
+            ntemp=sje(1,1,3,iel)
+            if(cbc(5,ntemp).eq.3)then
+              diagn(1,10,iel)=sje(1,ijel(2,3,iel),5,ntemp)
+            endif
+          endif
+        elseif(cb3.eq.2)then
+          if(cb5.eq.2)then
+            ntemp=sje(1,1,3,iel)
+            if(cbc(5,ntemp).ne.3)then
+              newe(iel)=newe(iel)+1
+              eassign(10,iel)=.true.
+              if(cbc(5,ntemp).eq.2)then
+                diagn(1,10,iel)=sje(1,1,5,ntemp)
+              endif
+            endif
+          elseif(cb5.eq.0.or.cb5.eq.1)then
+            newe(iel)=newe(iel)+1
+            eassign(10,iel)=.true.
+            if(cb5.eq.1)then
+              ntemp=sje(1,1,3,iel)
+              if(cbc(5,ntemp).eq.2)then
+                diagn(1,10,iel)=sje(1,1,5,ntemp)
+              endif 
+            endif
+          end if
+        elseif(cb3.eq.0)then
+          if(cb5.ne.3)then
+            newe(iel)=newe(iel)+1
+            eassign(10,iel)=.true.
+          endif
+        endif
+
+c       CONFORMING FACE INTERIOR
+
+c.......find how many new mortar point indices will be assigned
+c       to face interiors on all faces on each element
+
+c.......newi records how many new face interior points will be assigned
+
+c.......on face 6
+        if(cb6.eq.1.or.cb6.eq.0)then
+          newi(iel)=newi(iel)+9
+          fassign(6,iel)=.true.
+        end if
+c.......on face 4
+        if(cb4.eq.1.or.cb4.eq.0)then
+          newi(iel)=newi(iel)+9
+          fassign(4,iel)=.true.
+        end if
+c.......on face 2
+        if(cb2.eq.1.or.cb2.eq.0)then
+          newi(iel)=newi(iel)+9
+          fassign(2,iel)=.true.
+        end if
+c.......on face 1
+        if(cb1.ne.3)then
+          newi(iel)=newi(iel)+9
+          fassign(1,iel)=.true.
+        end if
+c.......on face 3
+        if(cb3.ne.3)then
+          newi(iel)=newi(iel)+9
+          fassign(3,iel)=.true.
+        endif
+c.......on face 5
+        if(cb5.ne.3)then
+          newi(iel)=newi(iel)+9
+          fassign(5,iel)=.true.
+        endif
+
+c.......newc is the total number of new mortar point indices
+c       to be assigned to each element.
+        newc(iel)=newe(iel)*3+newi(iel)
+      end do
+
+c.....Compute (potentially in parallel) front(iel), which records how 
+c     many new mortar point indices are to be assigned (to conforming 
+c     edges and conforming face interiors) from element 1 to iel.
+c     front(iel)=newc(1)+newc(2)+...+newc(iel)
+
+      call ncopy(front,newc,nelt)
+
+      call parallel_add(front)
+
+c.....nmor is the total number of mortar points
+      nmor=nvertex+front(nelt)
+
+c.....Generate (potentially in parallel) new mortar point indices on 
+c     each conforming element face. On each face, first visit all 
+c     conforming edges, and then the face interior.
+
+      do iel=1,nelt
+        front(iel)=front(iel)-newc(iel)
+        count=nvertex+front(iel)
+        do i=1,6
+          cb1=cbc(i,iel)
+          if (i.le.2) then
+            ne=4
+            space=1
+          elseif (i.le.4)then
+            ne=3
+            space=2
+
+c.........i loops over faces. Only 4 faces need to be examined for edge visits.
+c         On face 1, edge 1,2,3 and 4 will be visited. On face 2, edge 5,6,7
+c         and 8 will be visited. On face 3, edge 9 and 10 will be visited and
+c         on face 4, edge 11 and 12 will be visited. The 12 edges can be 
+c         covered by four faces, there is no need to visit edges on face
+c         5 and 6.  So ne is set to be 0. 
+c         However, i still needs to loop over 5 and 6, since the interiors
+c         of face 5 and 6 still need to be visited.
+
+          else
+            ne=0
+            space=1
+          end if
+
+          do ie=1,ne,space
+            edge_g=edgenumber(ie,i)
+            if(eassign(edge_g,iel))then
+c.............generate the new mortar points index, mor_v
+              call mor_assign(mor_v,count)
+c.............assign mor_v to local edge ie of face i on element iel
+              call mor_edge(ie,i,iel,mor_v)
+
+c.............Since this edge is shared by another face of element 
+c             iel, assign mor_v to the corresponding edge on the other 
+c             face also.
+
+c.............find the other face
+              face2=f_e_ef(ie,i)
+c.............find the local edge index of this edge on the other face
+              ie2=localedgenumber(face2,edge_g)
+c.............assign mor_v to local edge ie2 of face face2 on element iel
+              call mor_edge(ie2,face2,iel,mor_v)
+
+c.............There are some neighbor elements also sharing this edge. Assign
+c             mor_v to neighbor element, neighbored by face i.
+              if (cbc(i,iel).eq.2)then
+                ntemp=sje(1,1,i,iel)
+                call mor_edge(ie,jjface(i),ntemp,mor_v)
+                call mor_edge(op(ie2),face2,ntemp,mor_v)
+              end if
+
+c.............assign mor_v  to neighbor element neighbored by face face2
+              if (cbc(face2,iel).eq.2)then
+                ntemp=sje(1,1,face2,iel)
+                call mor_edge(ie2,jjface(face2),ntemp,mor_v)
+                call mor_edge(op(ie),i,ntemp,mor_v)
+              end if
+
+c.............assign mor_v to neighbor element sharing this edge
+
+c.............if the neighbor is of the same size as iel
+              if(.not.if_1_edge(edgenumber(ie,i),iel))then
+                if(diagn(1,edgenumber(ie,i),iel).ne.0)then
+                  ntemp=diagn(1,edgenumber(ie,i),iel)
+                  call mor_edge(op(ie2),jjface(face2),ntemp,mor_v)
+                  call mor_edge(op(ie),jjface(i),ntemp,mor_v)
+                endif
+
+c.............if the neighbor has a size larger than iel's
+              else
+                if(diagn(1,edgenumber(ie,i),iel).ne.0)then
+                  ntemp=diagn(1,edgenumber(ie,i),iel)
+                  call mor_ne(mor_v,diagn(2,edgenumber(ie,i),iel),
+     &            ie,i,ie2,face2,iel,ntemp)
+                end if
+              endif
+ 
+            endif
+          end do 
+
+          if(fassign(i,iel))then
+c...........generate new mortar point indices in face interior.
+c           if face i is of type 1 or iel doesn't have a neighbor element,
+c           assign new mortar point indices to interior mortar points
+c           of face i of iel.
+            cb=cbc(i,iel)
+            if (cb.eq.1.or.cb.eq.0) then
+              do jj =2,lx1-1
+                do ii=2,lx1-1
+                  count=count+1
+                  idmo(ii,jj,1,1,i,iel)=count
+                end do
+              end do
+
+c...........if face i is of type 2, assign new mortar point indices
+c           to iel as well as to the neighboring element on face i
+            elseif (cb.eq.2) then
+              if (idmo(2,2,1,1,i,iel).eq.0) then
+                ntemp=sje(1,1,i,iel)
+                jface = jjface(i)
+                do jj =2,lx1-1
+                  do ii=2,lx1-1
+                    count=count+1
+                    idmo(ii,jj,1,1,i,iel)=count
+                    idmo(ii,jj,1,1,jface,ntemp)=count
+                  end do
+                end do
+              end if 
+            end if
+          end if
+        end do
+      end do 
+
+ 
+c.....for edges on nonconforming faces, copy the mortar point indices
+c     from neighbors.
+      do iel=1,nelt
+        do i=1,6
+          cb=cbc(i,iel)
+          if (cb.eq.3) then
+c...........edges 
+            call edgecopy_s(i,iel)
+          end if 
+
+c.........face interior 
+
+          jface = jjface(i)
+          if (cb.eq.3) then
+            do iii=1,2
+              do jjj=1,2
+                ntemp=sje(iii,jjj,i,iel) 
+                do jj =1,lx1
+                  do ii=1,lx1
+                    idmo(ii,jj,iii,jjj,i,iel)=
+     &                         idmo(ii,jj,1,1,jface,ntemp)
+                  end do
+                end do
+                idmo(1,1,iii,jjj,i,iel)=idmo(1,1,1,1,jface,ntemp)
+                idmo(lx1,1,iii,jjj,i,iel)=idmo(lx1,1,1,2,jface,ntemp)
+                idmo(1,lx1,iii,jjj,i,iel)=idmo(1,lx1,2,1,jface,ntemp)
+                idmo(lx1,lx1,iii,jjj,i,iel)=
+     &                         idmo(lx1,lx1,2,2,jface,ntemp)
+              end do
+            end do
+          end if
+        end do
+      end do
+      return
+      end
+       
+c-----------------------------------------------------------------
+       subroutine get_emo(ie,n,ng)
+c-----------------------------------------------------------------
+c      This subroutine fills array emo.
+c      emo  records all elements sharing the same mortar point 
+c                 (only applies to element vertices).
+c      emo(1,i,n) gives the element ID of the i'th element sharing
+c                 mortar point n. (emo(1,i,n)=ie), ie is element
+c                 index.
+c      emo(2,i,n) gives the vertex index of mortar point n on this
+c                 element (emo(2,i,n)=ng), ng is the vertex index.
+c      nemo(n) records the total number of elements sharing mortar 
+c                 point n.
+c-----------------------------------------------------------------
+ 
+       include 'header.h'
+
+       integer ie, n, ntemp, i,ng
+       logical L1
+
+       L1=.false.
+       do i=1,nemo(n)
+         if (emo(1,i,n).eq.ie) L1=.true.
+       end do
+       if (.not.L1) then
+         ntemp=nemo(n)+1
+         nemo(n)=ntemp
+         emo(1,ntemp,n)=ie
+         emo(2,ntemp,n)=ng
+       end if
+
+       return
+       end 
+
+c-----------------------------------------------------------------
+      logical function ifsame(iel,i,ntemp,j)
+c-----------------------------------------------------------------
+c     Check whether the i'th vertex of element iel is at the same
+c     location as the j'th vertex of element ntemp.
+c-----------------------------------------------------------------
+
+      include 'header.h'
+      integer iel, i, ntemp, j
+
+      ifsame=.false.
+      if (ntemp.eq.0 .or. iel.eq.0) return
+      if (xc(i,iel).eq.xc(j,ntemp).and.
+     &    yc(i,iel).eq.yc(j,ntemp).and.
+     &    zc(i,iel).eq.zc(j,ntemp)) then
+        ifsame=.true.
+      end if
+
+      return
+      end
+
+c-----------------------------------------------------------------
+      subroutine mor_assign(mor_v,count)
+c-----------------------------------------------------------------
+c     Assign three consecutive numbers for mor_v, which will
+c     be assigned to the three interior points of an edge as the 
+c     mortar point indices.
+c-----------------------------------------------------------------
+      
+      implicit none
+      integer mor_v(3),count,i
+   
+      do i=1,3 
+        count=count+1
+        mor_v(i)=count
+      end do
+
+      return
+      end  
+     
+c-----------------------------------------------------------------
+      subroutine mor_edge(ie,face,iel,mor_v)
+c-----------------------------------------------------------------
+c     Copy the mortar points index from mor_v to local 
+c     edge ie of the face'th face on element iel.
+c     The edge is conforming.
+c-----------------------------------------------------------------
+
+      include 'header.h'
+
+      integer ie,i,iel,mor_v(3),j,nn,face
+
+      if (ie.eq.1) then
+        j=1
+        do nn=2,lx1-1
+          idmo(nn,j,1,1,face,iel)=mor_v(nn-1)
+        end do
+      elseif (ie.eq.2) then 
+        i=lx1
+        do nn=2,lx1-1
+          idmo(i,nn,1,1,face,iel)=mor_v(nn-1)
+        end do
+      elseif (ie.eq.3) then 
+        j=lx1
+        do nn=2,lx1-1
+          idmo(nn,j,1,1,face,iel)=mor_v(nn-1)
+        end do
+      elseif (ie.eq.4) then 
+        i=1
+        do nn=2,lx1-1
+          idmo(i,nn,1,1,face,iel)=mor_v(nn-1)
+        end do
+      end if
+
+      return
+      end 
+
+c------------------------------------------------------------
+      subroutine edgecopy_s(face,iel)
+c------------------------------------------------------------
+c     Copy mortar points index on edges from neighbor elements 
+c     to an element face of the 3rd type.
+c------------------------------------------------------------
+
+       include 'header.h'
+
+       integer face, iel, ntemp1, ntemp2, ntemp3, ntemp4, 
+     &         edge_g, edge_l, face2, mor_s_v(4,2), i
+
+c......find four neighbors on this face (3rd type)
+       ntemp1=sje(1,1,face,iel)
+       ntemp2=sje(1,2,face,iel)
+       ntemp3=sje(2,1,face,iel)
+       ntemp4=sje(2,2,face,iel)
+
+c......local edge 1
+
+c......mor_s_v is the array of mortar indices to be copied.
+       call nrzero(mor_s_v,4*2)
+       do i=2,lx1-1
+          mor_s_v(i-1,1)=idmo(i,1,1,1,jjface(face),ntemp1)
+       end do
+       mor_s_v(lx1-1,1)=idmo(lx1,1,1,2,jjface(face),ntemp1)
+       do i=1,lx1-1
+          mor_s_v(i,2)=idmo(i,1,1,1,jjface(face),ntemp2)
+       end do
+
+c......copy mor_s_v to local edge 1 on this face
+       call mor_s_e(1,face,iel,mor_s_v)
+
+c......copy mor_s_v to the corresponding edge on the other face sharing
+c      local edge 1
+       face2=f_e_ef(1,face)
+       edge_g=edgenumber(1,face)
+       edge_l=localedgenumber(face2,edge_g)
+       call mor_s_e(edge_l,face2,iel,mor_s_v)
+
+c......local edge 2
+       do i=2,lx1-1
+          mor_s_v(i-1,1)=idmo(lx1,i,1,1,jjface(face),ntemp2)
+       end do
+       mor_s_v(lx1-1,1)=idmo(lx1,lx1,2,2,jjface(face),ntemp2)
+
+       mor_s_v(1,2)=idmo(lx1,1,1,2,jjface(face),ntemp4)
+       do i=2,lx1-1
+          mor_s_v(i,2)=idmo(lx1,i,1,1,jjface(face),ntemp4)
+       end do
+
+       call mor_s_e(2,face,iel,mor_s_v)
+       face2=f_e_ef(2,face)
+       edge_g=edgenumber(2,face)
+       edge_l=localedgenumber(face2,edge_g)
+       call mor_s_e(edge_l,face2,iel,mor_s_v)
+
+c......local edge 3
+       do i=2,lx1-1
+          mor_s_v(i-1,1)=idmo(i,lx1,1,1,jjface(face),ntemp3)
+       end do
+       mor_s_v(lx1-1,1)=idmo(lx1,lx1,2,2,jjface(face),ntemp3)
+
+       mor_s_v(1,2)=idmo(1,lx1,2,1,jjface(face),ntemp4)
+       do i=2,lx1-1
+          mor_s_v(i,2)=idmo(i,lx1,1,1,jjface(face),ntemp4)
+       end do
+
+       call mor_s_e(3,face,iel,mor_s_v)
+       face2=f_e_ef(3,face)
+       edge_g=edgenumber(3,face)
+       edge_l=localedgenumber(face2,edge_g)
+       call mor_s_e(edge_l,face2,iel,mor_s_v)
+
+c......local edge 4
+       do i=2,lx1-1
+          mor_s_v(i-1,1)=idmo(1,i,1,1,jjface(face),ntemp1)
+       end do
+       mor_s_v(lx1-1,1)=idmo(1,lx1,2,1,jjface(face),ntemp1)
+
+       do i=1,lx1-1
+          mor_s_v(i,2)=idmo(1,i,1,1,jjface(face),ntemp3)
+       end do
+
+       call mor_s_e(4,face,iel,mor_s_v)
+       face2=f_e_ef(4,face)
+       edge_g=edgenumber(4,face)
+       edge_l=localedgenumber(face2,edge_g)
+       call mor_s_e(edge_l,face2,iel,mor_s_v)
+
+       return
+       end
+
+c------------------------------------------------------------
+       subroutine mor_s_e(n,face,iel,mor_s_v)
+c------------------------------------------------------------
+c      Copy mortar points index from mor_s_v to local edge n
+c      on face "face" of element iel. The edge is nonconforming. 
+c------------------------------------------------------------
+
+       include 'header.h'
+
+       integer n,face,iel,mor_s_v(4,2), i
+
+       if (n.eq.1) then
+         do i=2,lx1
+           idmo(i,1,1,1,face,iel)=mor_s_v(i-1,1)
+         end do
+         do i=1,lx1-1
+           idmo(i,1,1,2,face,iel)=mor_s_v(i,2)
+         end do
+       else if (n.eq.2) then
+         do i=2,lx1
+          idmo(lx1,i,1,2,face,iel)=mor_s_v(i-1,1)
+         end do
+         do i=1,lx1-1
+          idmo(lx1,i,2,2,face,iel)=mor_s_v(i,2)
+         end do
+       else if (n.eq.3) then
+         do i=2,lx1
+           idmo(i,lx1,2,1,face,iel)=mor_s_v(i-1,1)
+         end do
+         do i=1,lx1-1
+           idmo(i,lx1,2,2,face,iel)=mor_s_v(i,2)
+         end do
+       else if (n.eq.4) then
+         do i=2,lx1
+           idmo(1,i,1,1,face,iel)=mor_s_v(i-1,1)
+         end do
+         do i=1,lx1-1
+           idmo(1,i,2,1,face,iel)=mor_s_v(i,2)
+         end do
+       end if
+       return
+       end
+
+c------------------------------------------------------------
+       subroutine mor_s_e_nn(n,face,iel,mor_s_v,nn)
+c------------------------------------------------------------
+c      Copy mortar point indices from mor_s_v to local edge n
+c      on face "face" of element iel. nn is the edge mortar index,
+c      which indicates that mor_s_v  corresponds to left/bottom or 
+c      right/top part of the edge.
+c------------------------------------------------------------
+
+       include 'header.h'
+
+       integer n,face,iel,mor_s_v(4), i,nn
+
+       if (n.eq.1) then
+         if(nn.eq.1)then
+            do i=2,lx1
+              idmo(i,1,1,1,face,iel)=mor_s_v(i-1)
+            end do
+         else
+           do i=1,lx1-1
+             idmo(i,1,1,2,face,iel)=mor_s_v(i)
+           end do
+         endif
+       else if (n.eq.2) then
+         if(nn.eq.1)then
+           do i=2,lx1
+            idmo(lx1,i,1,2,face,iel)=mor_s_v(i-1)
+           end do
+         else
+           do i=1,lx1-1
+            idmo(lx1,i,2,2,face,iel)=mor_s_v(i)
+           end do
+         endif
+       else if (n.eq.3) then
+         if(nn.eq.1)then
+           do i=2,lx1
+             idmo(i,lx1,2,1,face,iel)=mor_s_v(i-1)
+           end do
+         else
+           do i=1,lx1-1
+            idmo(i,lx1,2,2,face,iel)=mor_s_v(i)
+           end do
+         endif
+       else if (n.eq.4) then
+         if(nn.eq.1)then
+           do i=2,lx1
+            idmo(1,i,1,1,face,iel)=mor_s_v(i-1)
+           end do
+         else
+           do i=1,lx1-1
+            idmo(1,i,2,1,face,iel)=mor_s_v(i)
+           end do
+         endif
+       end if
+       return
+       end
+
+
+c---------------------------------------------------------------
+      subroutine mortar_vertex(i,iel,count)
+c---------------------------------------------------------------
+c     Assign mortar point index "count" to iel's i'th vertex
+c     and also to all elements sharing this vertex.
+c---------------------------------------------------------------
+
+      include 'header.h'
+
+      integer i,iel,count,ntempx(8),ifntempx(8),lc_a(3),nnb(3),
+     &        face_a(3),itemp,ntemp,ii, jj,j(3),
+     &        iintempx(3),l,nbe, lc, temp
+      logical ifsame,if_temp
+
+      do l= 1,8
+        ntempx(l)=0
+        ifntempx(l)=0
+      end do
+
+c.....face_a records the three faces sharing this vertex on iel.
+c     lc_a gives the local corner number of this vertex on each 
+c     face in face_a.
+
+      do l=1,3
+        face_a(l)=f_c(l,i)
+        lc_a(l)=local_corner(i,face_a(l))
+      end do
+
+c.....each vertex is shared by at most 8 elements. 
+c     ntempx(j) gives the element index of a POSSIBLE element whose
+c               j'th vertex is iel's i'th vertex
+c     ifntempx(i)=ntempx(i) means ntempx(i) exists
+c     ifntempx(i)=0 means ntempx(i) does not exist.
+
+      ntempx(9-i)=iel
+      ifntempx(9-i)=iel
+
+c.....first find all elements sharing this vertex, ifntempx
+
+c.....find the three possible neighbors of iel, neighbored by faces 
+c     listed in array face_a
+
+      do itemp= 1, 3
+
+c.......j(itemp) is the local corner number of this vertex on the 
+c       neighbor element on the corresponding face.
+        j(itemp)=c_f(lc_a(itemp),jjface(face_a(itemp)))
+
+c.......iintempx(itemp) records the vertex index of i on the
+c       neighbor element, neighbored by face_a(itemp)
+        iintempx(itemp)=cal_intempx(lc_a(itemp),face_a(itemp))
+
+c.......ntemp refers to the neighbor element
+        ntemp=0
+
+c.......if the face is nonconforming, find out in which piece of the 
+c       mortar the vertex is located
+        ii=cal_iijj(1,lc_a(itemp))
+        jj=cal_iijj(2,lc_a(itemp))
+        ntemp=sje(ii,jj,face_a(itemp),iel)
+
+c.......if the face is conforming
+        if(ntemp.eq.0)then
+          ntemp=sje(1,1,face_a(itemp),iel)
+c.........find the possible neighbor        
+          ntempx(iintempx(itemp))=ntemp
+c.........check whether this possible neighbor is a real neighbor or not
+          if(ntemp.ne.0)then
+            if(ifsame(ntemp,j(itemp),iel,i))then
+              ifntempx(iintempx(itemp))=ntemp
+            end if
+          end if
+
+c.......if the face is nonconforming
+        else
+          if(ntemp.ne.0)then
+            if(ifsame(ntemp,j(itemp),iel,i))then
+              ifntempx(iintempx(itemp))=ntemp
+              ntempx(iintempx(itemp))=ntemp
+            end if
+          end if
+        end if 
+      end do 
+
+c.....find the possible three neighbors, neighbored by an edge only
+      do l=1,3
+
+c.....find first existing neighbor of any of the faces in array face_a
+        if_temp=.false.
+        if(l.eq.1)then
+          if_temp=.true.
+        elseif(l.eq.2)then
+          if(ifntempx(iintempx(l-1)).eq.0)then
+            if_temp=.true.
+          end if
+        elseif(l.eq.3)then
+          if(ifntempx(iintempx(l-1)).eq.0
+     &       .and.ifntempx(iintempx(l-2)).eq.0) then
+            if_temp=.true.
+          end if
+        end if
+
+        if(if_temp)then
+          if (ifntempx(iintempx(l)).ne.0) then
+            nbe=ifntempx(iintempx(l))
+c...........if 1st neighbor exists, check the neighbor's two neighbors in
+c           the other two directions. 
+c           e.g. if l=1, check directions 2 and 3,i.e. itemp=2,3,1
+c           if l=2, itemp=3,1,-2
+c           if l=3, itemp=1,2,1
+c
+            do itemp=face_l1(l),face_l2(l),face_ld(l)
+c.............lc is the local corner number of this vertex on face face_a(itemp)
+c             on the neighbor element of iel, neighbored by a face face_a(l)
+              lc=local_corner(j(l),face_a(itemp))
+c.............temp is the vertex index of this vertex on the neighbor element
+c             neighbored by an edge
+              temp=cal_intempx(lc,face_a(itemp))
+              ii=cal_iijj(1,lc)
+              jj=cal_iijj(2,lc)
+              ntemp=sje(ii,jj,face_a(itemp),nbe)
+
+c.............if the face face_a(itemp) is conforming
+              if(ntemp.eq.0)then
+                ntemp=sje(1,1,face_a(itemp),nbe)
+                if(ntemp.ne.0)then
+                  if(ifsame(ntemp,c_f(lc,jjface(face_a(itemp))),
+     &               nbe,j(l)))then
+                    ntempx(temp)=ntemp
+                    ifntempx(temp)=ntemp
+c...................nnb(itemp) records the neighbor element neighbored by an
+c                   edge only
+                    nnb(itemp)=ntemp
+                  end if
+                end if
+
+c.............if the face face_a(itemp) is nonconforming
+              else
+                if(ntemp.ne.0)then
+                  if(ifsame(ntemp,c_f(lc,jjface(face_a(itemp))),
+     &               nbe,j(l)))then
+                    ntempx(temp)=ntemp
+                    ifntempx(temp)=ntemp
+                    nnb(itemp)=ntemp
+                  end if
+                end if
+              end if
+            end do
+
+c...........check the last neighbor element, neighbored by an edge
+
+c...........ifntempx(iintempx(l)) has been visited in the above, now 
+c           check another neighbor element(nbe) neighbored by a face 
+
+c...........if the neighbor element neighbored by face
+c           face_a(face_l1(l)) exists
+            if(ifntempx(iintempx(face_l1(l))).ne.0)then
+              nbe=ifntempx(iintempx(face_l1(l)))
+c.............itemp is the last direction other than l and face_l1(l)
+              itemp=face_l2(l)
+              lc=local_corner(j(face_l1(l)),face_a(itemp))
+              temp=cal_intempx(lc,face_a(itemp))
+              ii=cal_iijj(1,lc)
+              jj=cal_iijj(2,lc)
+
+c.............ntemp records the last neighbor element neighbored by an edge
+c             with element iel
+              ntemp=sje(ii,jj,face_a(itemp),nbe)
+c.............if conforming
+              if(ntemp.eq.0)then
+                ntemp=sje(1,1,face_a(itemp),nbe)
+                if(ntemp.ne.0)then
+                  if(ifsame(ntemp,c_f(lc,jjface(face_a(itemp))),
+     &               nbe,j(face_l1(l))))then
+                    ntempx(temp)=ntemp
+                    ifntempx(temp)=ntemp
+                    nnb(l)=ntemp
+                  end if
+                end if
+c.............if nonconforming
+              else
+                if(ntemp.ne.0)then
+                  if(ifsame(ntemp,c_f(lc,jjface(face_a(itemp))),
+     &               nbe,j(face_l1(l))))then
+                    ntempx(temp)=ntemp
+                    ifntempx(temp)=ntemp
+                    nnb(l)=ntemp
+                  end if
+                end if
+              end if
+
+c...........otherwise, if the neighbor element neighbored by face
+c           face_a(face_l2(l)) exists
+            elseif(ifntempx(iintempx(face_l2(l))).ne.0)then
+              nbe=ifntempx(iintempx(face_l2(l)))
+              itemp=face_l1(l)
+              lc=local_corner(j(face_l2(l)),face_a(itemp))
+              temp=cal_intempx(lc,face_a(itemp))
+              ii=cal_iijj(1,lc)
+              jj=cal_iijj(2,lc)
+              ntemp=sje(ii,jj,face_a(itemp),nbe)
+              if(ntemp.eq.0)then
+                ntemp=sje(1,1,face_a(itemp),nbe)
+                if(ntemp.ne.0)then
+                  if(ifsame(ntemp,c_f(lc,jjface(face_a(itemp))),
+     &               nbe,j(face_l2(l))))then
+                    ntempx(temp)=ntemp
+                    ifntempx(temp)=ntemp
+                    nnb(l)=ntemp
+                  end if
+                end if
+              else
+                if(ntemp.ne.0)then
+                  if(ifsame(ntemp,c_f(lc,jjface(face_a(itemp))),
+     &               nbe,j(face_l2(l))))then
+                    ntempx(temp)=ntemp
+                    ifntempx(temp)=ntemp
+                    nnb(l)=ntemp
+                  end if
+                end if
+              end if
+            endif
+          endif
+        end if
+      end do
+
+c.....check the neighbor element, neighbored by a vertex only
+
+c.....nnb are the three possible neighbor elements neighbored by an edge
+
+      nnb(1)=ifntempx(cal_nnb(1,i))
+      nnb(2)=ifntempx(cal_nnb(2,i))
+      nnb(3)=ifntempx(cal_nnb(3,i))
+      ntemp=0
+
+c.....the neighbor element neighbored by a vertex must be a neighbor of
+c     a valid(nonzero) nnb(i), neighbored by a face 
+
+      if(nnb(1).ne.0)then
+        lc=oplc(local_corner(i,face_a(3)))
+        ii=cal_iijj(1,lc)
+        jj=cal_iijj(2,lc)
+c.......ntemp records the neighbor of iel, neighbored by vertex i 
+        ntemp=sje(ii,jj,face_a(3),nnb(1))
+c.......temp is the vertex index of i on ntemp
+        temp=cal_intempx(lc,face_a(3))
+        if(ntemp.eq.0)then
+          ntemp=sje(1,1,face_a(3),nnb(1))
+          if(ntemp.ne.0)then
+            if(ifsame(ntemp,c_f(lc,jjface(face_a(3))),
+     &         iel,i))then
+              ntempx(temp)=ntemp
+              ifntempx(temp)=ntemp
+            end if
+          end if
+        else
+          if(ntemp.ne.0)then
+            if(ifsame(ntemp,c_f(lc,jjface(face_a(3))),
+     &         iel,i))then
+              ntempx(temp)=ntemp
+              ifntempx(temp)=ntemp
+            end if
+          end if
+        end if
+      elseif(nnb(2).ne.0)then
+        lc=oplc(local_corner(i,face_a(1)))
+        ii=cal_iijj(1,lc)
+        jj=cal_iijj(2,lc)
+        ntemp=sje(ii,jj,face_a(1),nnb(2))
+        temp=cal_intempx(lc,face_a(1))
+        if(ntemp.eq.0)then
+          ntemp=sje(1,1,face_a(1),nnb(2))
+          if(ntemp.ne.0)then
+            if(ifsame(ntemp,
+     &         c_f(lc,jjface(face_a(1))),iel,i))then
+              ntempx(temp)=ntemp
+              ifntempx(temp)=ntemp
+            end if
+          end if
+        else
+          if(ntemp.ne.0)then
+            if(ifsame(ntemp,
+     &         c_f(lc,jjface(face_a(1))),iel,i))then
+              ntempx(temp)=ntemp
+              ifntempx(temp)=ntemp
+            end if
+          end if
+        end if
+      elseif(nnb(3).ne.0)then
+        lc=oplc(local_corner(i,face_a(2)))
+        ii=cal_iijj(1,lc)
+        jj=cal_iijj(2,lc)
+        ntemp=sje(ii,jj,face_a(2),nnb(3))
+        temp=cal_intempx(lc, face_a(2))
+        if(ntemp.eq.0)then
+          ntemp=sje(1,1,face_a(2),nnb(3))
+          if(ntemp.ne.0)then
+            if(ifsame(ntemp,
+     &         c_f(lc,jjface(face_a(2))),iel,i))then
+              ifntempx(temp)=ntemp
+              ntempx(temp)=ntemp
+            end if
+          end if
+        else
+          if(ntemp.ne.0)then
+            if(ifsame(ntemp,
+     &         c_f(lc,jjface(face_a(2))),iel,i))then
+              ifntempx(temp)=ntemp
+              ntempx(temp)=ntemp
+            end if
+          end if
+        end if
+      end if
+
+c.....ifntempx records all elements sharing this vertex, assign count
+c     to all these elements.
+
+      if (ifntempx(1).ne.0) then
+        idmo(lx1,lx1,2,2,1,ntempx(1))=count
+        idmo(lx1,lx1,2,2,3,ntempx(1))=count
+        idmo(lx1,lx1,2,2,5,ntempx(1))=count
+        call get_emo(ntempx(1),count,8)
+      end if
+
+      if (ifntempx(2).ne.0) then
+        idmo(lx1,lx1,2,2,2,ntempx(2))=count
+        idmo(1,lx1,2,1,3,ntempx(2))=count
+        idmo(1,lx1,2,1,5,ntempx(2))=count
+        call get_emo(ntempx(2),count,7)
+      end if
+
+      if (ifntempx(3).ne.0) then
+        idmo(1,lx1,2,1,1,ntempx(3))=count
+        idmo(lx1,lx1,2,2,4,ntempx(3))=count
+        idmo(lx1,1,1,2,5,ntempx(3))=count
+        call get_emo(ntempx(3),count,6)
+      end if
+      if (ifntempx(4).ne.0) then
+        idmo(1,lx1,2,1,2,ntempx(4))=count
+        idmo(1,lx1,2,1,4,ntempx(4))=count
+        idmo(1,1,1,1,5,ntempx(4))=count
+        call get_emo(ntempx(4),count,5)
+      end if
+
+      if (ifntempx(5).ne.0) then
+        idmo(lx1,1,1,2,1,ntempx(5))=count
+        idmo(lx1,1,1,2,3,ntempx(5))=count
+        idmo(lx1,lx1,2,2,6,ntempx(5))=count
+        call get_emo(ntempx(5),count,4)
+      end if
+
+
+      if (ifntempx(6).ne.0) then
+        idmo(lx1,1,1,2,2,ntempx(6))=count
+        idmo(1,1,1,1,3,ntempx(6))=count
+        idmo(1,lx1,2,1,6,ntempx(6))=count
+        call get_emo(ntempx(6),count,3)
+      end if
+
+      if (ifntempx(7).ne.0) then
+        idmo(1,1,1,1,1,ntempx(7))=count
+        idmo(lx1,1,1,2,4,ntempx(7))=count
+        idmo(lx1,1,1,2,6,ntempx(7))=count
+        call get_emo(ntempx(7),count,2)
+      end if
+
+      if (ifntempx(8).ne.0) then
+        idmo(1,1,1,1,2,ntempx(8))=count
+        idmo(1,1,1,1,4,ntempx(8))=count
+        idmo(1,1,1,1,6,ntempx(8))=count
+        call get_emo(ntempx(8),count,1)
+      end if
+
+      return
+      end
+
+     
+c---------------------------------------------------------------
+      subroutine mor_ne(mor_v,nn,edge,face,edge2,face2,ntemp,iel)
+c---------------------------------------------------------------
+c     Copy the mortar points index  (mor_v + vertex mortar point) from
+c     edge'th local edge on face'th face on element ntemp to iel.
+c     ntemp is iel's neighbor, neighbored by this edge only. 
+c     This subroutine is for the situation that iel is of larger
+c     size than ntemp.  
+c     face, face2 are face indices
+c     edge and edge2 are local edge numbers of this edge on face and face2
+c     nn is edge mortar index, which indicates whether this edge
+c     corresponds to the left/bottom or right/top part of the edge
+c     on iel.
+c---------------------------------------------------------------
+      include 'header.h'
+
+      integer mor_v(3),nn,edge,face,edge2,face2,ntemp,iel, i, 
+     &mor_s_v(4)
+
+c.....get mor_s_v which is the mor_v + vertex mortar
+      if (edge.eq.3) then
+        if(nn.eq.1)then
+          do i=2,lx1-1
+            mor_s_v(i-1)=mor_v(i-1)
+          end do
+          mor_s_v(4)=idmo(lx1,lx1,2,2,face,ntemp)
+        else
+          mor_s_v(1)=idmo(1,lx1,2,1,face,ntemp)
+          do i=2,lx1-1
+            mor_s_v(i)=mor_v(i-1)
+          end do
+        endif
+      
+      elseif (edge.eq.4) then
+        if(nn.eq.1)then
+          do i=2,lx1-1
+            mor_s_v(i-1)=mor_v(i-1)
+          end do
+          mor_s_v(4)=idmo(1,lx1,2,1,face,ntemp)
+        else
+          mor_s_v(1)=idmo(1,1,1,1,face,ntemp)
+          do i=2,lx1-1
+            mor_s_v(i)=mor_v(i-1)
+          end do
+        endif
+
+      elseif (edge.eq.1) then
+        if(nn.eq.1)then
+          do i=2,lx1-1
+            mor_s_v(i-1)=mor_v(i-1)
+          end do
+          mor_s_v(4)=idmo(lx1,1,1,2,face,ntemp)
+        else
+          mor_s_v(1)=idmo(1,1,1,1,face,ntemp)
+          do i=2,lx1-1
+            mor_s_v(i)=mor_v(i-1)
+          end do
+         endif
+
+      else if (edge.eq.2) then
+        if(nn.eq.1)then
+          do i=2,lx1-1
+             mor_s_v(i-1)=mor_v(i-1)
+          end do
+          mor_s_v(4)=idmo(lx1,lx1,2,2,face,ntemp)
+        else
+          mor_s_v(1)=idmo(lx1,1,1,2,face,ntemp)
+          do i=2,lx1-1
+             mor_s_v(i)=mor_v(i-1)
+          end do
+        endif
+      end if
+
+c.....copy mor_s_v to iel's local edge(op(edge)), on face jjface(face)
+      call mor_s_e_nn(op(edge),jjface(face),iel,mor_s_v,nn)
+c.....copy mor_s_v to iel's local edge(op(edge2)), on face jjface(face2)
+c     since this edge is shared by two faces on iel
+      call mor_s_e_nn(op(edge2),jjface(face2),iel,mor_s_v,nn)
+
+      return
+      end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/move.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/move.f
new file mode 100644
index 0000000..538d1e7
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/move.f
@@ -0,0 +1,82 @@
+c---------------------------------------------------------------
+      subroutine move
+c---------------------------------------------------------------
+c     move element to proper location in Morton space-filling curve
+c---------------------------------------------------------------
+
+      include 'header.h'
+
+      integer i,iside,jface,iel,ntemp,ii1,ii2,n1,n2,cb
+
+
+      n2=2*6*nelt
+      n1=n2*2
+      call nr_init(sje_new,n1,0)
+      call nr_init(ijel_new,n2,0)
+
+      do iel=1,nelt
+        i=mt_to_id(iel)
+        treenew(iel)=tree(i)
+        call copy(xc_new(1,iel),xc(1,i),8)
+        call copy(yc_new(1,iel),yc(1,i),8)
+        call copy(zc_new(1,iel),zc(1,i),8)
+
+        do iside=1,nsides
+          jface = jjface(iside)
+          cb=cbc(iside,i)
+          xc_new(iside,iel)=xc(iside,i)
+          yc_new(iside,iel)=yc(iside,i)
+          zc_new(iside,iel)=zc(iside,i)
+          cbc_new(iside,iel)=cb
+
+          if(cb.eq.2)then
+            ntemp=sje(1,1,iside,i)
+            ijel_new(1,iside,iel)=1
+            ijel_new(2,iside,iel)=1
+            sje_new(1,1,iside,iel)=id_to_mt(ntemp)
+
+          else if(cb.eq.1) then
+            ntemp=sje(1,1,iside,i)
+            ijel_new(1,iside,iel)=ijel(1,iside,i)
+            ijel_new(2,iside,iel)=ijel(2,iside,i)
+            sje_new(1,1,iside,iel)=id_to_mt(ntemp)
+         
+          else if(cb.eq.3) then
+            do ii2=1,2
+              do ii1=1,2
+                ntemp=sje(ii1,ii2,iside,i)
+                ijel_new(1,iside,iel)=1
+                ijel_new(2,iside,iel)=1
+                sje_new(ii1,ii2,iside,iel)=id_to_mt(ntemp)
+              end do
+            end do
+
+          else if(cb.eq.0)then
+            sje_new(1,1,iside,iel)=0
+            sje_new(1,2,iside,iel)=0
+            sje_new(2,1,iside,iel)=0
+            sje_new(2,2,iside,iel)=0       
+          end if 
+
+        end do
+
+        call copy(ta2(1,1,1,iel),ta1(1,1,1,i),nxyz)
+
+      end do
+
+      call copy(xc,xc_new,8*nelt)
+      call copy(yc,yc_new,8*nelt)
+      call copy(zc,zc_new,8*nelt)
+      call ncopy(sje,sje_new,4*6*nelt)
+      call ncopy(ijel,ijel_new,2*6*nelt)
+      call ncopy(cbc,cbc_new,6*nelt)
+      call ncopy(tree,treenew,nelt)
+      call copy(ta1,ta2,nxyz*nelt)
+
+      do iel=1,nelt
+        mt_to_id(iel)=iel
+        id_to_mt(iel)=iel
+      end do
+
+      return
+      end 
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/precond.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/precond.f
new file mode 100644
index 0000000..120300f
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/precond.f
@@ -0,0 +1,767 @@
+c------------------------------------------------------------------
+      subroutine setuppc
+c------------------------------------------------------------------
+c     Generate diagonal preconditioner for CG.
+c     Preconditioner computed in this subroutine is correct only
+c     for collocation points in element interiors, on conforming face
+c     interiors and on conforming edges.
+c------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision dxtm1_2(lx1,lx1), rdtime
+      integer ie,k,i,j,q,isize
+
+      do j=1,lx1
+        do i=1,lx1
+          dxtm1_2(i,j)=dxtm1(i,j)**2
+        end do
+      end do
+
+      rdtime=1.d0/dtime
+
+      do ie = 1, nelt
+        call r_init(dpcelm(1,1,1,ie),nxyz,0.d0)
+        isize=size_e(ie)
+        do k = 1, lx1
+          do j = 1, lx1
+            do i = 1, lx1
+              do q = 1, lx1
+                dpcelm(i,j,k,ie) = dpcelm(i,j,k,ie) + 
+     &                        g1m1_s(q,j,k,isize) * dxtm1_2(i,q) +
+     &                        g1m1_s(i,q,k,isize) * dxtm1_2(j,q) +
+     &                        g1m1_s(i,j,q,isize) * dxtm1_2(k,q)
+              end do
+              dpcelm(i,j,k,ie)=visc*dpcelm(i,j,k,ie)+
+     &                      rdtime*bm1_s(i,j,k,isize)
+            end do
+          end do
+        end do
+      end do
+
+c.....do the stiffness summation
+      call dssum
+
+c.....take inverse.
+
+      call reciprocal(dpcelm,ntot)
+
+c.....compute preconditioner on mortar points. NOTE:  dpcmor for 
+c     nonconforming cases will be corrected in subroutine setpcmo 
+      do i=1,nmor
+        dpcmor(i)=1.d0/dpcmor(i)
+      end do
+
+      return
+      end
+
+
+c--------------------------------------------------------------
+      subroutine setpcmo_pre
+c--------------------------------------------------------------
+c     pre-compute elemental contribution to preconditioner  
+c     for all situations
+c--------------------------------------------------------------
+      
+      include 'header.h'
+
+      integer element_size, i, j, ii, jj, col
+      double precision
+     &       p(lx1,lx1,lx1), p0(lx1,lx1,lx1), mtemp(lx1,lx1), 
+     &       temp(lx1,lx1,lx1), temp1(lx1,lx1), tmp(lx1,lx1),tig(lx1)
+
+c.....corners on face of type 3 
+
+      call r_init(tcpre,lx1*lx1,0.d0)
+      call r_init(tmp,lx1*lx1,0.d0)
+      call r_init(tig,5,0.d0)
+      tig(1)   =1.d0
+      tmp(1,1) =1.d0 
+
+c.....tcpre results from mapping a unit spike field (unity at 
+c     collocation point (1,1), zero elsewhere) on an entire element
+c     face to the (1,1) segment of a nonconforming face
+      do i=2,lx1-1
+        do j=1,lx1
+          tmp(i,1) = tmp(i,1)+ qbnew(i-1,j,1)*tig(j)
+        end do
+      end do
+ 
+      do col=1,lx1
+        tcpre(col,1)=tmp(col,1)
+
+        do j=2,lx1-1
+          do i=1,lx1
+            tcpre(col,j) = tcpre(col,j) + qbnew(j-1,i,1)*
+     &                                     tmp(col,i)
+          end do
+        end do
+      end do
+
+      do element_size=1,refine_max
+
+c.......for conforming cases
+
+c.......pcmor_c (i,j,element_size) records the intermediate value 
+c       (preconditioner=1/pcmor_c) of the preconditioner on collocation
+c       point (i,j) on a conforming face of an element of size 
+c       element_size.
+
+        do j=1,lx1/2+1
+          do i=j,lx1/2+1
+            call r_init(p,nxyz,0.d0)
+            p(i,j,1)=1.d0
+            call laplacian(temp,p,element_size)
+            pcmor_c(i,j,element_size)=temp(i,j,1)
+            pcmor_c(lx1+1-i,j,element_size)=temp(i,j,1)
+            pcmor_c(j,i,element_size)=temp(i,j,1)
+            pcmor_c(lx1+1-j,i,element_size)=temp(i,j,1)
+            pcmor_c(j,lx1+1-i,element_size)=temp(i,j,1)
+            pcmor_c(lx1+1-j,lx1+1-i,element_size)=temp(i,j,1)
+            pcmor_c(i,lx1+1-j,element_size)=temp(i,j,1)
+            pcmor_c(lx1+1-i,lx1+1-j,element_size)=temp(i,j,1)
+          end do
+        end do
+
+c.......for nonconforming cases 
+
+c.......nonconforming face interior
+
+c.......pcmor_nc1(i,j,ii,jj,element_size) records the intermediate 
+c       preconditioner value on collocation point (i,j) on mortar 
+c       (ii,jj) on a nonconforming face of an element of size
+c       element_size
+        do j=2,lx1
+          do i=j,lx1
+            call r_init(mtemp,lx1*lx1,0.d0)
+            call r_init(p,nxyz,0.d0)
+            mtemp(i,j)=1.d0
+c...........when i, j=lx1, mortar points are duplicated, so mtemp needs
+c           to be doubled.
+            if(i.eq.lx1)mtemp(i,j)=mtemp(i,j)*2.d0
+            if(j.eq.lx1)mtemp(i,j)=mtemp(i,j)*2.d0
+            call transf_nc(mtemp,p)
+            call laplacian(temp,p,element_size)
+            call transfb_nc1(temp1,temp)
+
+c...........values at points (i,j) and (j,i) are the same
+            pcmor_nc1(i,j,1,1,element_size)=temp1(i,j)
+            pcmor_nc1(j,i,1,1,element_size)=temp1(i,j)
+          end do
+
+c.........when i, j=lx1, mortar points are duplicated. so pcmor_nc1 needs
+c         to be doubled on those points
+          pcmor_nc1(lx1,j,1,1,element_size)=
+     &          pcmor_nc1(lx1,j,1,1,element_size)*2.d0
+          pcmor_nc1(j,lx1,1,1,element_size)=
+     &          pcmor_nc1(lx1,j,1,1,element_size)
+
+        end do
+        pcmor_nc1(lx1,lx1,1,1,element_size)=
+     &      pcmor_nc1(lx1,lx1,1,1,element_size)*2.d0
+
+c.......nonconforming edges
+        j=1
+        do i=2,lx1
+          call r_init(mtemp,lx1*lx1,0.d0)
+          call r_init(p,nxyz,0.d0)
+          call r_init(p0,nxyz,0.d0)
+          mtemp(i,j)=1.d0
+          if(i.eq.lx1)mtemp(i,j)=2.d0
+          call transf_nc(mtemp,p)
+          call laplacian(temp,p,element_size)                          
+          call transfb_nc1(temp1,temp)                   
+          pcmor_nc1(i,j,1,1,element_size)=temp1(i,j)      
+          pcmor_nc1(j,i,1,1,element_size)=temp1(i,j)                              
+          do ii=1,lx1
+c...........p0 is for the case that a nonconforming edge is shared by
+c           two conforming faces
+            p0(ii,1,1)=p(ii,1,1)
+            do jj=1,lx1 
+c.............now p is for the case that a nonconforming edge is shared
+c             by nonconforming faces
+              p(ii,1,jj)=p(ii,jj,1)
+            end do
+          end do
+
+          call laplacian(temp,p,element_size)
+          call transfb_nc2(temp1,temp)                
+
+c.........pcmor_nc2(i,j,ii,jj,element_size) gives the intermediate
+c         preconditioner value on collocation point (i,j) on a 
+c         nonconforming face of an element with size element_size
+
+          pcmor_nc2(i,j,1,1,element_size)=temp1(i,j)*2.d0 
+          pcmor_nc2(j,i,1,1,element_size)=
+     &          pcmor_nc2(i,j,1,1,element_size)
+
+          call laplacian(temp,p0,element_size) 
+          call transfb_nc0(temp1,temp)                  
+
+c.........pcmor_nc0(i,j,ii,jj,element_size) gives the intermediate
+c         preconditioner value on collocation point (i,j) on a 
+c         conforming face of an element, which shares a nonconforming 
+c         edge with another conforming face
+          pcmor_nc0(i,j,1,1,element_size)=temp1(i,j)
+          pcmor_nc0(j,i,1,1,element_size)=temp1(i,j)
+        end do
+        pcmor_nc1(lx1,j,1,1,element_size)=
+     &        pcmor_nc1(lx1,j,1,1,element_size)*2.d0
+        pcmor_nc1(j,lx1,1,1,element_size)=
+     &        pcmor_nc1(lx1,j,1,1,element_size)
+        pcmor_nc2(lx1,j,1,1,element_size)=
+     &        pcmor_nc2(lx1,j,1,1,element_size)*2.d0
+        pcmor_nc2(j,lx1,1,1,element_size)=
+     &        pcmor_nc2(lx1,j,1,1,element_size)
+        pcmor_nc0(lx1,j,1,1,element_size)=
+     &        pcmor_nc0(lx1,j,1,1,element_size)*2.d0
+        pcmor_nc0(j,lx1,1,1,element_size)=
+     &        pcmor_nc0(lx1,j,1,1,element_size)
+
+c.......symmetrical copy
+        do i=1,lx1-1
+          pcmor_nc1(i,j,1,2,element_size)=
+     &          pcmor_nc1(lx1+1-i,j,1,1,element_size)
+          pcmor_nc0(i,j,1,2,element_size)=                                           
+     &          pcmor_nc0(lx1+1-i,j,1,1,element_size)                                      
+          pcmor_nc2(i,j,1,2,element_size)=                                           
+     &          pcmor_nc2(lx1+1-i,j,1,1,element_size)                                      
+        end do
+
+        do j=2,lx1                                            
+          do i=1,lx1-1
+            pcmor_nc1(i,j,1,2,element_size)=
+     &            pcmor_nc1(lx1+1-i,j,1,1,element_size)
+          end do
+          i=lx1
+          pcmor_nc1(i,j,1,2,element_size)=
+     &          pcmor_nc1(lx1+1-i,j,1,1,element_size)
+          pcmor_nc0(i,j,1,2,element_size)=                                           
+     &          pcmor_nc0(lx1+1-i,j,1,1,element_size)                                      
+          pcmor_nc2(i,j,1,2,element_size)=                                           
+     &          pcmor_nc2(lx1+1-i,j,1,1,element_size)                                      
+        end do                                                
+
+        j=1
+        i=1
+        pcmor_nc1(i,j,2,1,element_size)=
+     &        pcmor_nc1(i,lx1+1-j,1,1,element_size)
+        pcmor_nc0(i,j,2,1,element_size)=
+     &        pcmor_nc0(i,lx1+1-j,1,1,element_size)
+        pcmor_nc2(i,j,2,1,element_size)=
+     &        pcmor_nc2(i,lx1+1-j,1,1,element_size)
+        do j=2,lx1-1
+          i=1
+          pcmor_nc1(i,j,2,1,element_size)=
+     &          pcmor_nc1(i,lx1+1-j,1,1,element_size)
+          pcmor_nc0(i,j,2,1,element_size)=
+     &          pcmor_nc0(i,lx1+1-j,1,1,element_size)
+          pcmor_nc2(i,j,2,1,element_size)=
+     &          pcmor_nc2(i,lx1+1-j,1,1,element_size)
+          do i=2,lx1
+            pcmor_nc1(i,j,2,1,element_size)=
+     &            pcmor_nc1(i,lx1+1-j,1,1,element_size)
+          end do
+        end do
+
+        j=lx1
+        do i=2,lx1
+          pcmor_nc1(i,j,2,1,element_size)=
+     &          pcmor_nc1(i,lx1+1-j,1,1,element_size)
+          pcmor_nc0(i,j,2,1,element_size)=
+     &          pcmor_nc0(i,lx1+1-j,1,1,element_size)
+          pcmor_nc2(i,j,2,1,element_size)=
+     &          pcmor_nc2(i,lx1+1-j,1,1,element_size)
+        end do
+
+        j=1
+        i=lx1
+        pcmor_nc1(i,j,2,2,element_size)=
+     &        pcmor_nc1(lx1+1-i,lx1+1-j,1,1,element_size)
+        pcmor_nc0(i,j,2,2,element_size)=
+     &        pcmor_nc0(lx1+1-i,lx1+1-j,1,1,element_size)
+        pcmor_nc2(i,j,2,2,element_size)=
+     &        pcmor_nc2(lx1+1-i,lx1+1-j,1,1,element_size)
+          
+        do j=2,lx1-1                                            
+          do i=2,lx1-1
+            pcmor_nc1(i,j,2,2,element_size)=
+     &            pcmor_nc1(lx1+1-i,lx1+1-j,1,1,element_size)
+          end do
+          i=lx1
+          pcmor_nc1(i,j,2,2,element_size)=                                       
+     &          pcmor_nc1(lx1+1-i,lx1+1-j,1,1,element_size)                               
+          pcmor_nc0(i,j,2,2,element_size)=                                       
+     &          pcmor_nc0(lx1+1-i,lx1+1-j,1,1,element_size)   
+          pcmor_nc2(i,j,2,2,element_size)=                                       
+     &          pcmor_nc2(lx1+1-i,lx1+1-j,1,1,element_size)                     
+        end do                                                
+        j=lx1
+        do i=2,lx1-1
+          pcmor_nc1(i,j,2,2,element_size)=                                       
+     &          pcmor_nc1(lx1+1-i,lx1+1-j,1,1,element_size)          
+          pcmor_nc0(i,j,2,2,element_size)=
+     &          pcmor_nc0(lx1+1-i,lx1+1-j,1,1,element_size)          
+          pcmor_nc2(i,j,2,2,element_size)=                                       
+     &          pcmor_nc2(lx1+1-i,lx1+1-j,1,1,element_size)    
+        end do
+
+
+c.......vertices shared by at least one nonconforming face or edge
+
+c.......Among three edges and three faces sharing a vertex on an element
+c       situation 1: only one edge is nonconforming
+c       situation 2: two edges are nonconforming
+c       situation 3: three edges are nonconforming
+c       situation 4: one face is nonconforming 
+c       situation 5: one face and one edge are nonconforming 
+c       situation 6: two faces are nonconforming
+c       situation 7: three faces are nonconforming
+
+        call r_init(p0,nxyz,0.d0)
+        p0(1,1,1)=1.d0
+        call laplacian(temp,p0,element_size)
+        pcmor_cor(8,element_size)=temp(1,1,1)
+
+c.......situation 1
+        call r_init(p0,nxyz,0.d0)
+        do i=1,lx1
+           p0(i,1,1)=tcpre(i,1)
+        end do
+        call laplacian(temp,p0,element_size) 
+        call transfb_cor_e(1,pcmor_cor(1,element_size),temp)                  
+
+c.......situation 2
+        call r_init(p0,nxyz,0.d0)
+        do i=1,lx1
+           p0(i,1,1)=tcpre(i,1)
+           p0(1,i,1)=tcpre(i,1)
+        end do
+        call laplacian(temp,p0,element_size)
+        call transfb_cor_e(2,pcmor_cor(2,element_size),temp)                  
+
+c.......situation 3
+        call r_init(p0,nxyz,0.d0)
+        do i=1,lx1
+           p0(i,1,1)=tcpre(i,1)
+           p0(1,i,1)=tcpre(i,1)
+           p0(1,1,i)=tcpre(i,1)
+        end do
+        call laplacian(temp,p0,element_size)
+        call transfb_cor_e(3,pcmor_cor(3,element_size),temp)                  
+
+c.......situation 4
+        call r_init(p0,nxyz,0.d0)
+        do j=1,lx1
+          do i=1,lx1
+            p0(i,j,1)=tcpre(i,j)
+          end do
+        end do
+        call laplacian(temp,p0,element_size)
+        call transfb_cor_f(4,pcmor_cor(4,element_size),temp)
+
+c.......situation 5
+        call r_init(p0,nxyz,0.d0)
+        do j=1,lx1
+          do i=1,lx1
+            p0(i,j,1)=tcpre(i,j)
+          end do
+        end do
+        do i=1,lx1
+           p0(1,1,i)=tcpre(i,1)
+        end do
+        call laplacian(temp,p0,element_size)
+        call transfb_cor_f(5,pcmor_cor(5,element_size),temp)
+ 
+c.......situation 6
+        call r_init(p0,nxyz,0.d0)
+        do j=1,lx1
+          do i=1,lx1
+            p0(i,j,1)=tcpre(i,j)
+            p0(i,1,j)=tcpre(i,j)
+          end do
+        end do
+        call laplacian(temp,p0,element_size)
+        call transfb_cor_f(6,pcmor_cor(6,element_size),temp)
+
+c.......situation 7
+        do j=1,lx1
+          do i=1,lx1
+            p0(i,j,1)=tcpre(i,j)
+            p0(i,1,j)=tcpre(i,j)
+            p0(1,i,j)=tcpre(i,j)
+          end do
+        end do
+        call laplacian(temp,p0,element_size)
+        call transfb_cor_f(7,pcmor_cor(7,element_size),temp)
+
+      end do    
+      return
+      end 
+
+
+c------------------------------------------------------------------------
+      subroutine setpcmo
+c------------------------------------------------------------------------
+c     compute the preconditioner by identifying its geometry configuration
+c     and sum the values from the precomputed elemental contributions
+c------------------------------------------------------------------------
+      
+      include 'header.h'
+
+      integer face2, nb1, nb2, sizei, imor, enum, i,j, 
+     &        iel, iside, nn1, nn2
+
+      call l_init(ifpcmor,nvertex,.false.)
+      call l_init(edgevis,24*nelt,.false.)
+
+
+      do iel=1,nelt
+        do iside=1,nsides
+c.........for nonconforming faces
+          if(cbc(iside,iel).eq.3)then
+            sizei=size_e(iel)
+
+c...........vertices
+
+c...........ifpcmor(imor)=.true. indicates that mortar point imor has 
+c           been visited
+            imor=idmo(1,1,1,1,iside,iel)
+            if(.not.ifpcmor(imor))then
+c.............compute the preconditioner on mortar point imor
+              call pc_corner(imor)
+              ifpcmor(imor)=.true.
+            end if
+
+            imor=idmo(lx1,1,1,2,iside,iel)
+            if(.not.ifpcmor(imor))then
+              call pc_corner(imor)
+              ifpcmor(imor)=.true.
+            end if
+
+            imor=idmo(1,lx1,2,1,iside,iel)
+            if(.not.ifpcmor(imor))then
+              call pc_corner(imor)
+              ifpcmor(imor)=.true.
+            end if
+
+            imor=idmo(lx1,lx1,2,2,iside,iel)
+            if(.not.ifpcmor(imor))then
+              call pc_corner(imor)
+              ifpcmor(imor)=.true.
+            end if
+
+c...........edges on nonconforming faces, enum is local edge number
+            do enum=1,4
+
+c.............edgevis(enum,iside,iel)=.true. indicates that local edge 
+c             enum of face iside of iel has been visited
+              if(.not.edgevis(enum,iside,iel))then
+                edgevis(enum,iside,iel)=.true.
+
+c...............Examining neighbor element information,
+c               calculating the preconditioner value.
+                face2= f_e_ef(enum,iside)
+                if(cbc(face2,iel).eq.2)then
+                  nb1=sje(1,1,face2,iel)
+                  if(cbc(iside,nb1).eq.2)then
+
+c...................Compute the preconditioner on local edge enum on face
+c                   iside of element iel. 1 is the neighborhood information
+c                   obtained by examining neighbors (nb1). For the detailed
+c                   meaning of 1, see subroutine com_dpc.
+
+                    call com_dpc(iside,iel,enum,1,sizei)
+                    nb2=sje(1,1,iside,nb1)
+                    edgevis(op(e_face2(enum,iside)),
+     &                      jjface(face2),nb2)=.true.
+
+                  elseif(cbc(iside,nb1).eq.3)then
+                    call com_dpc(iside,iel,enum,2,sizei)
+                    edgevis(op(enum),iside,nb1)=.true.
+                  end if
+
+                elseif(cbc(face2,iel).eq.3)then
+                  edgevis(e_face2(enum,iside),face2,iel)=.true.
+                  nb1=sje(1,2,face2,iel)
+                  if(cbc(iside,nb1).eq.1)then
+                    call com_dpc(iside,iel,enum,3,sizei)
+                    nb2=sje(1,1,iside,nb1)
+                    edgevis(op(enum),jjface(iside),nb2)=.true.
+                    edgevis(op(e_face2(enum,iside)),
+     &                      jjface(face2),nb2)=.true.
+                  elseif(cbc(iside,nb1).eq.2)then
+                    call com_dpc(iside,iel,enum,4,sizei)
+                  end if
+                else if (cbc(face2,iel).eq.0)then
+                  call com_dpc(iside,iel,enum,0,sizei)
+                end if
+              end if
+            end do
+
+c...........mortar element interior (not edge of mortar) 
+
+            do nn1=1,2
+              do nn2=1,2
+                do j=2,lx1-1
+                  do i=2,lx1-1
+                    imor=idmo(i,j,nn1,nn2,iside,iel) 
+                    dpcmor(imor) = 1.d0/(pcmor_nc1(i,j,nn1,nn2,sizei)+
+     &                                pcmor_c(i,j,sizei+1))
+                  end do
+                end do
+              end do
+            end do
+
+c...........for i,j=lx1 there are duplicated mortar points, so 
+c           pcmor_c needs to be doubled or quadrupled
+            i=lx1
+            do j=2,lx1-1
+              imor=idmo(i,j,1,1,iside,iel)            
+              dpcmor(imor) = 1.d0/(pcmor_nc1(i,j,1,1,sizei)+
+     &                          pcmor_c(i,j,sizei+1)*2.d0)
+              imor=idmo(i,j,2,1,iside,iel)                
+              dpcmor(imor) = 1.d0/(pcmor_nc1(i,j,2,1,sizei)+
+     &                          pcmor_c(i,j,sizei+1)*2.d0)
+            end do      
+
+            j=lx1
+            imor=idmo(i,j,1,1,iside,iel)                                         
+            dpcmor(imor) = 1.d0/(pcmor_nc1(i,j,1,1,sizei)+
+     &                        pcmor_c(i,j,sizei+1)*4.d0)
+            do i=2,lx1-1
+              imor=idmo(i,j,1,1,iside,iel)  
+              dpcmor(imor) = 1.d0/(pcmor_nc1(i,j,1,1,sizei)+
+     &                          pcmor_c(i,j,sizei+1)*2.d0)
+              imor=idmo(i,j,1,2,iside,iel) 
+              dpcmor(imor) = 1.d0/(pcmor_nc1(i,j,1,2,sizei)+
+     &                          pcmor_c(i,j,sizei+1)*2.d0)
+            end do
+
+          end if 
+        end do
+      end do
+
+      return
+      end
+
+c--------------------------------------------------------------------------
+      subroutine pc_corner(imor)
+c------------------------------------------------------------------------
+c     calculate preconditioner value for vertex with mortar index imor
+c------------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision tmortemp
+      integer imor, inemo,ie, sizei,cornernumber,
+     &        sface,sedge,iiface,iface,iiedge,iedge,n
+
+      tmortemp=0.d0
+c.....loop over all elements sharing this vertex
+      do inemo=1,nemo(imor)
+        ie=emo(1,inemo,imor)
+        sizei=size_e(ie)
+        cornernumber=emo(2,inemo,imor)
+        sface=0
+        sedge=0
+        do iiface=1,3
+          iface=f_c(iiface,cornernumber)
+c.........sface sums the number of nonconforming faces sharing this vertex on
+c         one element
+          if(cbc(iface,ie).eq.3)then
+            sface=sface+1
+          end if
+        end do
+c.......sedge sums the number of nonconforming edges sharing this vertex on
+c       one element
+        do iiedge=1,3
+          iedge=e_c(iiedge,cornernumber)
+          if(ncon_edge(iedge,ie))sedge=sedge+1
+        end do
+
+c.......n indexes the configuration: how many nonconforming faces and
+c       nonconforming edges share this vertex on an element
+
+        if(sface.eq.0)then
+          if(sedge.eq.0)then
+             n=8
+          elseif(sedge.eq.1)then
+             n=1
+          elseif(sedge.eq.2)then
+             n=2
+          elseif(sedge.eq.3)then
+             n=3
+          end if 
+        elseif (sface.eq.1)then
+          if (sedge.eq.1)then
+           n=5
+          else
+           n=4
+          end if
+        else if (sface.eq.2)then
+           n=6
+        else if(sface.eq.3)then
+           n=7
+        end if
+          
+c.......sum the intermediate pre-computed preconditioner values for 
+c       all elements
+        tmortemp=tmortemp+pcmor_cor(n,sizei)
+
+      end do
+
+c.....dpcmor(imor) is the value of the preconditioner on mortar point imor
+      dpcmor(imor)=1.d0/tmortemp
+
+      return
+      end 
+
+c------------------------------------------------------------------------
+      subroutine com_dpc(iside,iel,enumber,n,isize)
+c------------------------------------------------------------------------
+c     Compute preconditioner for local edge enumber of face iside 
+c     on element iel.
+c     isize is element size,
+c     n is one of five different configurations
+c     anc1, ac, anc2, anc0 are coefficients for different edges. 
+c     nc0 refers to nonconforming edge shared by two conforming faces
+c     nc1 refers to nonconforming edge shared by one nonconforming face
+c     nc2 refers to nonconforming edges shared by two nonconforming faces
+c     c refers to conforming edge
+c------------------------------------------------------------------------
+
+      include 'header.h'
+
+      integer n, isize,iside,iel, enumber, nn1start, nn1end, nn2start, 
+     &        nn2end, jstart, jend, istart, iend, i, j, nn1, nn2, imor
+      double precision anc1,ac,anc2,anc0,temp
+
+c.....different local edges have different loop ranges 
+      if(enumber.eq.1)then
+        nn1start=1
+        nn1end=1
+        nn2start=1
+        nn2end=2
+        jstart=1
+        jend=1
+        istart=2
+        iend=lx1-1
+      elseif (enumber.eq.2) then
+        nn1start=1
+        nn1end=2
+        nn2start=2
+        nn2end=2
+        jstart=2
+        jend=lx1-1
+        istart=lx1
+        iend=lx1
+      elseif (enumber.eq.3) then
+        nn1start=2
+        nn1end=2
+        nn2start=1
+        nn2end=2
+        jstart=lx1
+        jend=lx1
+        istart=2
+        iend=lx1-1
+      elseif (enumber.eq.4) then
+        nn1start=1
+        nn1end=2
+        nn2start=1
+        nn2end=1
+        jstart=2
+        jend=lx1-1
+        istart=1
+        iend=1
+      end if
+
+c.....among the four elements sharing this edge
+
+c.....one has a smaller size
+      if(n.eq.1)then
+        anc1=2.d0
+        ac=1.d0
+        anc0=1.d0
+        anc2=0.d0
+
+c.....two (neighbored by a face) are of smaller size
+      else if (n.eq.2)then
+        anc1=2.d0
+        ac=2.d0
+        anc0=0.d0
+        anc2=0.d0
+
+c.....two (neighbored by an edge) are of smaller size
+      else if (n.eq.3)then
+        anc2=2.d0
+        ac=2.d0
+        anc1=0.d0
+        anc0=0.d0
+
+c.....three are of smaller size
+      else if (n.eq.4)then
+        anc1=0.d0
+        ac=3.d0
+        anc2=1.d0
+        anc0=0.d0
+
+c.....on the boundary
+      else if (n.eq.0)then
+        anc1=1.d0
+        ac=1.d0
+        anc2=0.d0
+        anc0=0.d0
+      end if
+
+c.....edge interior
+      do nn2=nn2start,nn2end
+        do nn1=nn1start,nn1end
+          do j=jstart,jend
+            do i=istart,iend
+              imor=idmo(i,j,nn1,nn2,iside,iel)
+              temp=anc1* pcmor_nc1(i,j,nn1,nn2,isize) +
+     &             ac*  pcmor_c(i,j,isize+1)+
+     &             anc0*  pcmor_nc0(i,j,nn1,nn2,isize)+
+     &             anc2*pcmor_nc2(i,j,nn1,nn2,isize)
+              dpcmor(imor)=1.d0/temp
+            end do
+          end do
+        end do
+      end do
+
+c.....local edge 1
+      if (enumber.eq.1) then
+        imor=idmo(lx1,1,1,1,iside,iel)
+        temp=anc1*pcmor_nc1(lx1,1,1,1,isize)+
+     &       ac*pcmor_c(lx1,1,isize+1)*2.d0+
+     &       anc0*pcmor_nc0(lx1,1,1,1,isize)+
+     &       anc2*pcmor_nc2(lx1,1,1,1,isize)
+c.....local edge 2
+      elseif (enumber.eq.2) then
+        imor=idmo(lx1,lx1,1,2,iside,iel)
+        temp=anc1*pcmor_nc1(lx1,lx1,1,2,isize)+
+     &       ac*pcmor_c(lx1,lx1,isize+1)*2.d0+
+     &       anc0*pcmor_nc0(lx1,lx1,1,2,isize)+
+     &       anc2*pcmor_nc2(lx1,lx1,1,2,isize)
+c.....local edge 3
+      elseif (enumber.eq.3) then
+        imor=idmo(lx1,lx1,2,1,iside,iel)
+        temp=anc1*pcmor_nc1(lx1,lx1,2,1,isize)+
+     &       ac*pcmor_c(lx1,lx1,isize+1)*2.d0+
+     &       anc0*pcmor_nc0(lx1,lx1,2,1,isize)+
+     &       anc2*pcmor_nc2(lx1,lx1,2,1,isize)
+c.....local edge 4
+      elseif (enumber.eq.4) then
+        imor=idmo(1,lx1,1,1,iside,iel)
+        temp=anc1*pcmor_nc1(1,lx1,1,1,isize)+
+     &       ac*pcmor_c(1,lx1,isize+1)*2.d0+
+     &       anc0*pcmor_nc0(1,lx1,1,1,isize)+
+     &       anc2*pcmor_nc2(1,lx1,1,1,isize)
+      end if
+
+      dpcmor(imor)=1.d0/temp
+
+      return
+      end 
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/setup.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/setup.f
new file mode 100644
index 0000000..1c8dd90
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/setup.f
@@ -0,0 +1,517 @@
+c-----------------------------------------------------------------
+      subroutine create_initial_grid        
+c------------------------------------------------------------------
+    
+      include 'header.h'
+
+      integer i
+
+      nelt=1
+      ntot=nelt*lx1*lx1*lx1 
+      tree(1)=1
+      mt_to_id(1)=1
+      do i=1,7,2
+        xc(i,1)=0.d0
+        xc(i+1,1)=1.d0
+      end do
+
+      do i=1,2
+        yc(i,1)=0.d0
+        yc(2+i,1)=1.d0
+        yc(4+i,1)=0.d0
+        yc(6+i,1)=1.d0
+      end do
+     
+      do i=1,4
+        zc(i,1)=0.d0
+        zc(4+i,1)=1.d0
+      end do
+  
+      do i=1,6
+        cbc(i,1)=0
+      end do
+
+      return
+
+      end
+
+c-----------------------------------------------------------------
+      subroutine coef
+c-----------------------------------------------------------------
+c
+c     generate 
+c
+c            - collocation points
+c            - weights
+c            - derivative matrices 
+c            - projection matrices
+c            - interpolation matrices 
+c
+c     associated with the 
+c
+c            - gauss-legendre lobatto mesh (suffix m1)
+c
+c----------------------------------------------------------------
+
+      include 'header.h'
+
+      integer i,j,k
+
+c.....for gauss-legendre lobatto mesh (suffix m1)
+c.....generate collocation points and weights 
+
+      zgm1(1)=-1.d0
+      zgm1(2)=-0.6546536707079771d0
+      zgm1(3)=0.d0
+      zgm1(4)= 0.6546536707079771d0
+      zgm1(5)=1.d0
+      wxm1(1)=0.1d0
+      wxm1(2)=49.d0/90.d0
+      wxm1(3)=32.d0/45.d0
+      wxm1(4)=wxm1(2)
+      wxm1(5)=0.1d0 
+
+      do k=1,lx1
+        do j=1,lx1
+          do i=1,lx1
+            w3m1(i,j,k)=wxm1(i)*wxm1(j)*wxm1(k)
+          end do
+        end do
+      end do
+
+c.....generate derivative matrices
+
+      dxm1(1,1)=-5.0d0
+      dxm1(2,1)=-1.240990253030982d0
+      dxm1(3,1)= 0.375d0
+      dxm1(4,1)=-0.2590097469690172d0
+      dxm1(5,1)= 0.5d0
+      dxm1(1,2)= 6.756502488724238d0
+      dxm1(2,2)= 0.d0
+      dxm1(3,2)=-1.336584577695453d0
+      dxm1(4,2)= 0.7637626158259734d0
+      dxm1(5,2)=-1.410164177942427d0
+      dxm1(1,3)=-2.666666666666667d0
+      dxm1(2,3)= 1.745743121887939d0
+      dxm1(3,3)= 0.d0
+      dxm1(4,3)=-dxm1(2,3)
+      dxm1(5,3)=-dxm1(1,3)
+      do j=4,lx1
+        do i=1,lx1
+          dxm1(i,j)=-dxm1(lx1+1-i,lx1+1-j)
+        end do
+      end do
+      do j=1,lx1
+        do i=1,lx1
+          dxtm1(i,j)=dxm1(j,i)
+        end do
+      end do
+
+c.....generate projection (mapping) matrices
+
+      qbnew(1,1,1)=-0.1772843218615690d0
+      qbnew(2,1,1)=9.375d-02
+      qbnew(3,1,1)=-3.700139242414530d-02
+      qbnew(1,2,1)= 0.7152146412463197d0
+      qbnew(2,2,1)=-0.2285757930375471d0
+      qbnew(3,2,1)= 8.333333333333333d-02
+      qbnew(1,3,1)= 0.4398680650316104d0
+      qbnew(2,3,1)= 0.2083333333333333d0
+      qbnew(3,3,1)=-5.891568407922938d-02
+      qbnew(1,4,1)= 8.333333333333333d-02
+      qbnew(2,4,1)= 0.3561799597042137d0
+      qbnew(3,4,1)=-4.854797457965334d-02
+      qbnew(1,5,1)= 0.d0
+      qbnew(2,5,1)=7.03125d-02
+      qbnew(3,5,1)=0.d0
+      
+      do j=1,lx1
+        do i=1,3
+          qbnew(i,j,2)=qbnew(4-i,lx1+1-j,1)
+        end do
+      end do 
+
+c.....generate interpolation matrices for mesh refinement
+
+      ixtmc1(1,1)=1.d0
+      ixtmc1(2,1)=0.d0
+      ixtmc1(3,1)=0.d0
+      ixtmc1(4,1)=0.d0
+      ixtmc1(5,1)=0.d0 
+      ixtmc1(1,2)= 0.3385078435248143d0
+      ixtmc1(2,2)= 0.7898516348912331d0
+      ixtmc1(3,2)=-0.1884018684471238d0
+      ixtmc1(4,2)= 9.202967302175333d-02
+      ixtmc1(5,2)=-3.198728299067715d-02
+      ixtmc1(1,3)=-0.1171875d0
+      ixtmc1(2,3)= 0.8840317166357952d0
+      ixtmc1(3,3)= 0.3125d0    
+      ixtmc1(4,3)=-0.118406716635795d0 
+      ixtmc1(5,3)= 0.0390625d0   
+      ixtmc1(1,4)=-7.065070066767144d-02
+      ixtmc1(2,4)= 0.2829703269782467d0 
+      ixtmc1(3,4)= 0.902687582732838d0
+      ixtmc1(4,4)=-0.1648516348912333d0 
+      ixtmc1(5,4)= 4.984442584781999d-02
+      ixtmc1(1,5)=0.d0
+      ixtmc1(2,5)=0.d0
+      ixtmc1(3,5)=1.d0 
+      ixtmc1(4,5)=0.d0
+      ixtmc1(5,5)=0.d0  
+      do j=1,lx1
+        do i=1,lx1
+          ixmc1(i,j)=ixtmc1(j,i)
+        end do
+      end do
+
+      do j=1,lx1
+        do i=1,lx1
+          ixtmc2(i,j)=ixtmc1(lx1+1-i,lx1+1-j)
+        end do
+      end do
+
+      do j=1,lx1
+        do i=1,lx1
+          ixmc2(i,j)=ixtmc2(j,i)
+        end do
+      end do
+
+c.....solution interpolation matrix for mesh coarsening
+
+      map2(1)=-0.1179652785083428d0
+      map2(2)= 0.5505046330389332d0
+      map2(3)= 0.7024534364259963d0
+      map2(4)=-0.1972224518285866d0
+      map2(5)= 6.222966087199998d-02
+
+      do i=1,lx1
+        map4(i)=map2(lx1+1-i)
+      end do
+
+      return
+      end
+
+c-------------------------------------------------------------------
+      subroutine geom1
+c-------------------------------------------------------------------
+c
+c     routine to generate elemental geometry information on mesh m1,
+c     (gauss-legendre lobatto mesh).
+c
+c         xrm1_s   -   dx/dr, dy/dr, dz/dr
+c         rxm1_s   -   dr/dx, dr/dy, dr/dz
+c         g1m1_s   -   geometric factors used in preconditioner computation
+c         g4m1_s, g5m1_s, g6m1_s
+c                  -   geometric factors used in laplacian operator
+c         jacm1    -   jacobian
+c         bm1      -   mass matrix
+c         xfrac    -   will be used in prepwork for calculating collocation
+c                      coordinates
+c         idel     -   collocation points index on element boundaries 
+c------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision temp,temp1,temp2,dtemp
+      integer isize,i,j,k,ntemp,iel
+ 
+      do i=1,lx1
+        xfrac(i)=zgm1(i)*0.5d0 + 0.5d0
+      end do
+
+ 
+      do isize=1,refine_max
+        temp=2.d0**(-isize-1)
+        dtemp=1.d0/temp
+        temp1=temp**3
+        temp2=temp**2
+        do k=1,lx1
+          do j=1,lx1
+            do i=1,lx1
+              xrm1_s(i,j,k,isize)=dtemp
+              jacm1_s(i,j,k,isize)=temp1
+              rxm1_s(i,j,k,isize)=temp2
+              g1m1_s(i,j,k,isize)=w3m1(i,j,k)*temp
+              bm1_s(i,j,k,isize)=w3m1(i,j,k)*temp1
+              g4m1_s(i,j,k,isize)=g1m1_s(i,j,k,isize)/wxm1(i)
+              g5m1_s(i,j,k,isize)=g1m1_s(i,j,k,isize)/wxm1(j)
+              g6m1_s(i,j,k,isize)=g1m1_s(i,j,k,isize)/wxm1(k)
+            end do
+          end do
+        end do
+      end do
+
+      do iel = 1, lelt
+        ntemp=lx1*lx1*lx1*(iel-1)
+        do j = 1, lx1
+          do i = 1, lx1
+            idel(i,j,1,iel)=ntemp+(i-1)*lx1 + (j-1)*lx1*lx1+lx1
+            idel(i,j,2,iel)=ntemp+(i-1)*lx1 + (j-1)*lx1*lx1+1
+            idel(i,j,3,iel)=ntemp+(i-1)*1 + (j-1)*lx1*lx1+lx1*(lx1-1)+1
+            idel(i,j,4,iel)=ntemp+(i-1)*1 + (j-1)*lx1*lx1+1
+            idel(i,j,5,iel)=ntemp+(i-1)*1 + (j-1)*lx1+lx1*lx1*(lx1-1)+1
+            idel(i,j,6,iel)=ntemp+(i-1)*1 + (j-1)*lx1+1
+          end do
+        end do
+      end do
+
+      return
+      end
+
+c------------------------------------------------------------------
+      subroutine setdef
+c------------------------------------------------------------------
+c     compute the discrete laplacian operators
+c------------------------------------------------------------------
+
+      include 'header.h'
+
+      integer i,j,ip
+ 
+      call r_init(wdtdr(1,1),lx1*lx1,0.d0)
+
+      do i=1,lx1
+        do j=1,lx1
+          do ip=1,lx1
+            wdtdr(i,j) = wdtdr(i,j) + wxm1(ip)*dxm1(ip,i)*dxm1(ip,j)
+          end do
+        end do
+      end do
+
+      return 
+      end
+
+
+c------------------------------------------------------------------
+      subroutine prepwork
+c------------------------------------------------------------------
+c     mesh information preparations: calculate refinement levels of
+c     each element, mask matrix for domain boundary and element 
+c     boundaries
+c------------------------------------------------------------------
+
+      include 'header.h'
+
+      integer i, j, k, iel, iface, cb
+      double precision rdlog2
+
+      ntot = nelt*nxyz
+      rdlog2 = 1.d0/dlog(2.d0)
+
+
+c.....calculate the refinement levels of each element
+
+      do iel = 1, nelt
+        size_e(iel)=-dlog(xc(2,iel)-xc(1,iel))*rdlog2+1.d-8
+      end do
+
+c.....mask matrix for element boundary
+
+      do iel = 1, nelt
+        call r_init(tmult(1,1,1,iel),nxyz,1.d0)   
+        do iface=1,nsides
+          call facev(tmult(1,1,1,iel),iface,0.0d0)
+        end do
+      end do
+
+c.....masks for domain boundary at mortar 
+
+      call r_init(tmmor,nmor,1.d0)
+
+      do iel = 1, nelt
+        do iface = 1,nsides
+          cb=cbc(iface,iel)
+          if(cb.eq.0) then
+            do j=2,lx1-1
+              do i=2,lx1-1
+               tmmor(idmo(i,j,1,1,iface,iel))=0.d0
+              end do
+            end do
+
+            j=1
+            do i = 1, lx1-1
+               tmmor(idmo(i,j,1,1,iface,iel))=0.d0
+            end do
+
+            if(idmo(lx1,1,1,1,iface,iel).eq.0)then
+              tmmor(idmo(lx1,1,1,2,iface,iel))=0.d0
+            else
+              tmmor(idmo(lx1,1,1,1,iface,iel))=0.d0
+              do i=1,lx1
+                tmmor(idmo(i,j,1,2,iface,iel))=0.d0
+              end do
+            end if
+
+            i=lx1
+            if(idmo(lx1,2,1,2,iface,iel).eq.0)then
+              do j=2,lx1-1
+                tmmor(idmo(i,j,1,1,iface,iel))=0.d0
+              end do
+              tmmor(idmo(lx1,lx1,2,2,iface,iel))=0.d0
+            else
+              do j=2,lx1
+                tmmor(idmo(i,j,1,2,iface,iel))=0.d0
+              end do
+              do j=1,lx1
+                tmmor(idmo(i,j,2,2,iface,iel))=0.d0
+              end do
+            end if
+            
+            j=lx1
+            tmmor(idmo(1,lx1,2,1,iface,iel))=0.d0
+            if(idmo(2,lx1,2,1,iface,iel).eq.0)then
+              do i=2,lx1-1
+                tmmor(idmo(i,j,1,1,iface,iel))=0.d0
+              end do
+            else
+              do i=2,lx1
+                tmmor(idmo(i,j,2,1,iface,iel))=0.d0
+              end do
+              do i=1,lx1-1
+                tmmor(idmo(i,j,2,2,iface,iel))=0.d0
+              end do
+            end if
+
+            i=1
+            do j=2,lx1-1
+             tmmor(idmo(i,j,1,1,iface,iel))=0.d0
+            end do
+            if(idmo(1,lx1,1,1,iface,iel).ne.0)then
+              tmmor(idmo(i,lx1,1,1,iface,iel))=0.d0
+              do j=1,lx1-1
+               tmmor(idmo(i,j,2,1,iface,iel))=0.d0
+              end do
+            end if
+
+          endif
+        end do
+       end do
+            
+      return
+      end 
+    
+
+c------------------------------------------------------------------
+      block data top_constants
+
+c------------------------------------------------------------------
+c.....We store some tables of useful topological constants
+c------------------------------------------------------------------
+      include 'header.h'
+
+c     f_e_ef(e,f) returns the other face sharing the e'th local edge of face f.
+      data f_e_ef/6,3,5,4, 6,3,5,4, 6,1,5,2, 6,1,5,2, 4,1,3,2, 4,1,3,2/
+
+c.....e_c(n,j) returns n'th edge sharing the vertex j of an element
+      data e_c /5,8,11, 1,4,11,  5,6,9, 1,2,9,
+     &          7,8,12, 3,4,12, 6,7,10, 2,3,10/
+
+c.....local_corner(n,i) returns the local corner index of vertex n on face i
+      data local_corner /0,1,0,2,0,3,0,4, 1,0,2,0,3,0,4,0,
+     &                   0,0,1,2,0,0,3,4, 1,2,0,0,3,4,0,0,
+     &                   0,0,0,0,1,2,3,4, 1,2,3,4,0,0,0,0/
+
+c.....cal_nnb(n,i) returns the neighbor element adjacent by the n'th edge
+c     among the three edges sharing vertex i.
+c     the elements are the eight child elements, numbered 1 to 8.
+      data cal_nnb/5,2,3, 6,1,4, 7,4,1, 8,3,2,
+     &             1,6,7, 2,5,8, 3,8,5, 4,7,6/
+
+c.....oplc(n) returns the opposite local corner index: 1-4,2-3
+      data oplc /4,3,2,1/
+
+c.....cal_iijj(i,n) returns the location of local corner number n on a face 
+c     i=1 to get ii, i=2 to get jj
+c     (ii,jj) is defined the same as in mortar location (ii,jj)
+      data cal_iijj /1,1, 1,2, 2,1, 2,2/
+
+c.....returns the adjacent (neighbored by a face) element's children,
+c     assuming a vertex is shared by eight child elements 1-8. 
+c     index n is the local corner number on the face which is being 
+c     assigned the mortar index number
+      data cal_intempx /8,6,4,2, 7,5,3,1, 8,7,4,3, 
+     $                  6,5,2,1, 8,7,6,5, 4,3,2,1/
+
+c.....c_f(i,f) returns the vertex number of i'th local corner on face f
+      data c_f /2,4,6,8, 1,3,5,7, 3,4,7,8, 1,2,5,6, 5,6,7,8, 1,2,3,4/
+
+c.....on each face of the parent element, there are four child elements.
+c     le_arr(i,j,n) returns the i'th element among the four child elements.
+c     n refers to the direction: 1 for x, 2 for y and 3 for z direction. 
+c     j refers to the positive(0) or negative(1) side in the x, y or z
+c     direction. n=1,j=0 refers to face 1, n=1,j=1 refers to face 2,
+c     n=2,j=0 refers to face 3, and so on.
+c     The current eight children are ordered as 8,1,2,3,4,5,6,7 
+      data    le_arr/8,2,4,6, 1,3,5,7, 
+     $               8,1,4,5, 2,3,6,7, 
+     $               8,1,2,3, 4,5,6,7/
+
+c.....jjface(n) returns the face opposite to face n
+      data jjface /2,1,4,3,6,5/
+
+cc.....edgeface(n,f) returns OTHER face which shares local edge n on face f
+c      integer edgeface(4,6)
+c      data edgeface /6,3,5,4, 6,3,5,4, 6,1,5,2, 
+c     $               6,1,5,2, 4,1,3,2, 4,1,3,2/
+
+c.....e_face2(n,f) returns the local edge number of edge n on the
+c     other face sharing local edge n on face f
+      data e_face2 /2,2,2,2, 4,4,4,4, 3,2,3,2, 
+     $              1,4,1,4, 3,3,3,3, 1,1,1,1/
+
+c.....op(n) returns the local edge number of the edge which 
+c     is opposite to local edge n on the same face
+      data op /3,4,1,2/
+
+c.....localedgenumber(f,e) returns the local edge number for edge e
+c     on face f. A zero result value signifies illegal input
+      data localedgenumber /1,0,0,0,0,2, 2,0,2,0,0,0, 3,0,0,0,2,0, 
+     $                      4,0,0,2,0,0, 0,1,0,0,0,4, 0,2,4,0,0,0, 
+     $                      0,3,0,0,4,0, 0,4,0,4,0,0, 0,0,1,0,0,3, 
+     $                      0,0,3,0,3,0, 0,0,0,1,0,1, 0,0,0,3,1,0/
+
+c.....edgenumber(e,f) returns the edge index of local edge e on face f
+      data edgenumber / 1,2, 3,4,  5,6, 7,8,  9,2,10,6, 
+     $                 11,4,12,8, 12,3,10,7, 11,1, 9,5/
+
+c.....f_c(i,n) returns the face index of the i'th face sharing vertex n 
+      data f_c /2,4,6, 1,4,6, 2,3,6, 1,3,6,
+     &          2,4,5, 1,4,5, 2,3,5, 1,3,5/
+
+c.....if two elements are neighbors by one edge, 
+c     e1v1(f1,f2) returns the smaller index of the two vertices on this 
+c     edge on one element
+c     e1v2 returns the larger index of the two vertices of this edge on 
+c     one element
+c     e2v1 returns the smaller index of the two vertices on this edge on 
+c     another element
+c     e2v2 returns the larger index of the two vertices on this edge on
+c     another element
+      data e1v1/0,0,4,2,6,2, 0,0,3,1,5,1, 4,3,0,0,7,3,
+     &          2,1,0,0,5,1, 6,5,7,5,0,0, 2,1,3,1,0,0/
+      data e2v1/0,0,1,3,1,5, 0,0,2,4,2,6, 1,2,0,0,1,5,
+     &          3,4,0,0,3,7, 1,2,1,3,0,0, 5,6,5,7,0,0/
+      data e1v2/0,0,8,6,8,4, 0,0,7,5,7,3, 8,7,0,0,8,4,
+     &          6,5,0,0,6,2, 8,7,8,6,0,0, 4,3,4,2,0,0/
+      data e2v2/0,0,5,7,3,7, 0,0,6,8,4,8, 5,6,0,0,2,6,
+     &          7,8,0,0,4,8, 3,4,2,4,0,0, 7,8,6,8,0,0/
+
+c.....children(n1,n) returns the four elements among the eight child 
+c     elements to be merged on face n of the parent element.
+c     the IDs for the eight children are 1,2,3,4,5,6,7,8
+      data children/2,4,6,8, 1,3,5,7, 3,4,7,8, 
+     &              1,2,5,6, 5,6,7,8, 1,2,3,4/
+
+c.....iijj(n1,n) returns the location of the n'th mortar on an element face
+c     n1=1 refers to x direction location and n1=2 refers to y direction
+      data iijj/1,1,1,2,2,1,2,2/
+
+c.....v_end(n) returns the index of collocation points at two ends of each
+c     direction
+      data v_end /1,lx1/
+
+c.....face_l1, face_l2, face_ld give the start, end and stride of a loop
+c     over faces, used in subroutine mortar_vertex
+      data face_l1 /2,3,1/, face_l2 /3,1,2/, face_ld /1,-2,1/
+
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/transfer.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/transfer.f
new file mode 100644
index 0000000..134579c
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/transfer.f
@@ -0,0 +1,944 @@
+c------------------------------------------------------------------
+      subroutine transf(tmor,tx)
+c------------------------------------------------------------------
+c     Map values from mortar(tmor) to element(tx)
+c------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision tmor(*),tx(*), tmp(lx1,lx1,2)
+      integer ig1,ig2,ig3,ig4,ie,iface,il1,il2,il3,il4,
+     &        nnje,ije1,ije2,col,i,j,ig,il
+
+
+c.....zero out tx on element boundaries
+      call col2(tx,tmult,ntot)     
+
+      do ie=1,nelt
+        do iface=1,nsides
+
+c.........get the collocation point index of the four local corners on the
+c         face iface of element ie
+          il1=idel(1,1,iface,ie)
+          il2=idel(lx1,1,iface,ie)
+          il3=idel(1,lx1,iface,ie)
+          il4=idel(lx1,lx1,iface,ie)
+
+c.........get the mortar indices of the four local corners
+          ig1= idmo(1,  1  ,1,1,iface,ie)
+          ig2= idmo(lx1,1  ,1,2,iface,ie)
+          ig3= idmo(1,  lx1,2,1,iface,ie)
+          ig4= idmo(lx1,lx1,2,2,iface,ie)
+  
+c.........copy the value from tmor to tx for these four local corners
+          tx(il1) = tmor(ig1)
+          tx(il2) = tmor(ig2)
+          tx(il3) = tmor(ig3)
+          tx(il4) = tmor(ig4)
+ 
+c.........nnje=1 for conforming faces, nnje=2 for nonconforming faces
+          if(cbc(iface,ie).eq.3) then
+            nnje=2
+          else
+            nnje=1 
+          end if
+
+c.........for nonconforming faces
+          if(nnje.eq.2) then
+
+c...........nonconforming faces have four pieces of mortar, first map them to
+c           two intermediate mortars, stored in tmp
+            call r_init(tmp,lx1*lx1*2,0.d0)
+   
+            do ije1=1,nnje
+              do ije2=1,nnje
+                do col=1,lx1
+
+c.................in each row col, when column i=1 or lx1, the value
+c                 in tmor is copied to tmp
+                  i = v_end(ije2)
+                  ig=idmo(i,col,ije1,ije2,iface,ie)
+                  tmp(i,col,ije1)=tmor(ig)
+
+c.................in each row col, the value at the interior three collocation
+c                 points is computed by applying mapping matrix qbnew to tmor
+                  do i=2,lx1-1
+                    il= idel(i,col,iface,ie)
+                    do j=1,lx1
+                      ig=idmo(j,col,ije1,ije2,iface,ie)
+                      tmp(i,col,ije1) = tmp(i,col,ije1) + 
+     &                qbnew(i-1,j,ije2)*tmor(ig)
+                    end do
+                  end do
+
+                end do
+              end do
+            end do
+      
+c...........mapping from two pieces of intermediate mortar tmp to element 
+c           face tx
+
+            do ije1=1, nnje
+
+c.............the first column, col=1, is an edge of face iface.
+c             the value on the three interior collocation points, tx, is 
+c             computed by applying mapping matrices qbnew to tmp.
+c             the mapping result is divided by 2, because there will be 
+c             duplicated contribution from another face sharing this edge.
+              col=1
+              do i=2,lx1-1
+                il= idel(col,i,iface,ie)
+                do j=1,lx1
+                    tx(il) = tx(il) + qbnew(i-1,j,ije1)*
+     &                       tmp(col,j,ije1)*0.5d0
+                end do 
+              end do 
+
+c.............for columns 2 ~ lx1-1 
+              do col=2,lx1-1
+
+c...............when i=1 or lx1, the collocation points are also on an edge of
+c               the face, so the mapping result also needs to be divided by 2
+                i = v_end(ije1)
+                il= idel(col,i,iface,ie)
+                tx(il)=tx(il)+tmp(col,i,ije1)*0.5d0
+
+c...............compute the value at interior collocation points in 
+c               columns 2 ~ lx1-1
+                do i=2,lx1-1
+                  il= idel(col,i,iface,ie)
+                  do j=1,lx1
+                    tx(il) = tx(il) + qbnew(i-1,j,ije1)* tmp(col,j,ije1)
+                  end do 
+                end do
+              end do
+
+c.............same as col=1
+              col=lx1
+              do  i=2,lx1-1
+                il= idel(col,i,iface,ie)
+                do j=1,lx1
+                  tx(il) = tx(il) + qbnew(i-1,j,ije1)*
+     &                     tmp(col,j,ije1)*0.5d0
+                end do 
+              end do
+            end do
+
+c.........for conforming faces
+          else
+
+c.........face interior
+            do col=2,lx1-1
+              do i=2,lx1-1  
+                il= idel(i,col,iface,ie)
+                ig= idmo(i,col,1,1,iface,ie)
+                tx(il)=tmor(ig)
+              end do
+            end do
+
+        
+c...........edges of conforming faces
+
+c...........if local edge 1 is a nonconforming edge
+            if(idmo(lx1,1,1,1,iface,ie).ne.0)then
+              do i=2,lx1-1               
+                il= idel(i,1,iface,ie)
+                do ije1=1,2
+                  do j=1,lx1
+                    ig=idmo(j,1,1,ije1,iface,ie)
+                    tx(il) = tx(il) + qbnew(i-1,j,ije1)*tmor(ig)*0.5d0
+                  end do
+                end do
+              end do
+
+c...........if local edge 1 is a conforming edge
+            else
+              do i=2,lx1-1
+                il= idel(i,1,iface,ie)
+                ig= idmo(i,1,1,1,iface,ie)
+                tx(il)=tmor(ig)
+              end do
+            end if 
+
+c...........if local edge 2 is a nonconforming edge
+            if(idmo(lx1,2,1,2,iface,ie).ne.0)then
+              do i=2,lx1-1               
+                il= idel(lx1,i,iface,ie)
+                do ije1=1,2
+                  do j=1,lx1
+                    ig=idmo(lx1,j,ije1,2,iface,ie)
+                    tx(il) = tx(il) + qbnew(i-1,j,ije1)*tmor(ig)*0.5d0
+                  end do
+                end do
+              end do
+
+c...........if local edge 2 is a conforming edge
+            else
+              do i=2,lx1-1
+                il= idel(lx1,i,iface,ie)
+                ig= idmo(lx1,i,1,1,iface,ie)
+                tx(il)=tmor(ig)
+              end do
+            end if 
+
+c...........if local edge 3 is a nonconforming edge
+            if(idmo(2,lx1,2,1,iface,ie).ne.0)then
+              do  i=2,lx1-1               
+                il= idel(i,lx1,iface,ie)
+                do ije1=1,2
+                  do j=1,lx1
+                    ig=idmo(j,lx1,2,ije1,iface,ie)
+                    tx(il) = tx(il) + qbnew(i-1,j,ije1)*tmor(ig)*0.5d0
+                  end do
+                end do
+              end do
+
+c...........if local edge 3 is a conforming edge
+            else
+              do i=2,lx1-1
+                il= idel(i,lx1,iface,ie)
+                ig= idmo(i,lx1,1,1,iface,ie)
+                tx(il)=tmor(ig)
+              end do
+            end if 
+
+c...........if local edge 4 is a nonconforming edge
+            if(idmo(1,lx1,1,1,iface,ie).ne.0)then
+              do i=2,lx1-1               
+                il= idel(1,i,iface,ie)
+                do ije1=1,2
+                  do j=1,lx1
+                    ig=idmo(1,j,ije1,1,iface,ie)
+                    tx(il) = tx(il) + qbnew(i-1,j,ije1)*tmor(ig)*0.5d0
+                  end do
+                end do
+              end do
+c...........if local edge 4 is a conforming edge
+            else
+              do i=2,lx1-1
+                il= idel(1,i,iface,ie)
+                ig= idmo(1,i,1,1,iface,ie)
+                tx(il)=tmor(ig)
+              end do
+            end if 
+          end if
+          
+        end do
+      end do
+
+      return
+      end
+
+
+c------------------------------------------------------------------
+      subroutine transfb(tmor,tx)
+c------------------------------------------------------------------
+c     Map from element(tx) to mortar(tmor).
+c     tmor sums contributions from all elements.
+c------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision third
+      parameter (third=1.d0/3.d0)
+      integer shift
+
+      double precision tmp,tmp1,tx(*),tmor(*),temp(lx1,lx1,2),
+     &                 top(lx1,2)
+      integer il1,il2,il3,il4,ig1,ig2,ig3,ig4,ie,iface,nnje,
+     &        ije1,ije2,col,i,j,ije,ig,il
+
+
+      call r_init(tmor,nmor,0.d0)
+
+      do ie=1,nelt
+        do iface=1,nsides
+c.........nnje=1 for conforming faces, nnje=2 for nonconforming faces
+          if(cbc(iface,ie).eq.3) then
+            nnje=2
+          else
+            nnje=1 
+          end if
+
+c.........get collocation point index of four local corners on the face
+          il1 = idel(1,  1,  iface,ie)
+          il2 = idel(lx1,1,  iface,ie)
+          il3 = idel(1,  lx1,iface,ie)
+          il4 = idel(lx1,lx1,iface,ie)
+
+c.........get the mortar indices of the four local corners
+          ig1 = idmo(1,  1,  1,1,iface,ie)
+          ig2 = idmo(lx1,1,  1,2,iface,ie)
+          ig3 = idmo(1,  lx1,2,1,iface,ie )
+          ig4 = idmo(lx1,lx1,2,2,iface,ie)
+
+c.........sum the values from tx to tmor for these four local corners
+c         only 1/3 of the value is summed, since there will be two duplicated
+c         contributions from the other two faces sharing this vertex 
+          tmor(ig1) = tmor(ig1)+tx(il1)*third
+          tmor(ig2) = tmor(ig2)+tx(il2)*third
+          tmor(ig3) = tmor(ig3)+tx(il3)*third
+          tmor(ig4) = tmor(ig4)+tx(il4)*third
+
+c.........for nonconforming faces
+          if(nnje.eq.2) then       
+            call r_init(temp,lx1*lx1*2,0.d0)
+
+c...........nonconforming faces have four pieces of mortar, first map tx to
+c           two intermediate mortars stored in temp
+
+            do ije2 = 1, nnje
+              shift = ije2-1
+              do col=1,lx1
+c...............For mortar points on face edge (top and bottom), copy the 
+c               value from tx to temp
+                il=idel(col,v_end(ije2),iface,ie)
+                temp(col,v_end(ije2),ije2)=tx(il)
+
+c...............For mortar points on face edge (top and bottom), calculate 
+c               the interior points' contribution to them, i.e. top()
+                j = v_end(ije2)
+                tmp=0.d0
+                do i=2,lx1-1 
+                  il=idel(col,i,iface,ie)
+                  tmp = tmp + qbnew(i-1,j,ije2)*tx(il)
+                end do
+
+                top(col,ije2)=tmp
+
+c...............Use mapping matrices qbnew to map the value from tx to temp 
+c               for mortar points not on the top or bottom face edge.
+                do j=2-shift,lx1-shift
+                  tmp=0.d0
+                  do i=2,lx1-1 
+                    il=idel(col,i,iface,ie)
+                    tmp = tmp + qbnew(i-1,j,ije2)*tx(il)
+                  end do
+                  temp(col,j,ije2) = tmp + temp(col,j,ije2)
+                end do
+              end do
+            end do
+
+c...........mapping from temp to tmor
+
+            do ije1=1, nnje
+              shift = ije1-1
+              do ije2=1,nnje
+
+c...............for each column of collocation points on a piece of mortar
+                do col=2-shift,lx1-shift
+
+c.................For the end point, which is on an edge (local edge 2,4), 
+c                 the contribution is halved since there will be duplicated 
+c                 contribution from another face sharing this edge.
+
+                  ig=idmo(v_end(ije2),col,ije1,ije2,iface,ie)
+                  tmor(ig)=tmor(ig)+temp(v_end(ije2),col,ije1)*0.5d0
+
+c.................In each row of collocation points on a piece of mortar, 
+c                 sum the contributions from interior collocation points 
+c                 (i=2,lx1-1)
+
+                  do  j=1,lx1
+                    tmp=0.d0
+                    do i=2,lx1-1
+                      tmp = tmp + qbnew(i-1,j,ije2) * temp(i,col,ije1)
+                    end do
+                    ig=idmo(j,col,ije1,ije2,iface,ie)
+                    tmor(ig)=tmor(ig)+tmp
+                  end do
+                end do
+
+c...............For tmor on local edge 1 and 3, tmp is the contribution from
+c               an edge, so it is halved because of duplicated contribution
+c               from another face sharing this edge. tmp1 is contribution 
+c               from face interior. 
+
+                col = v_end(ije1)
+                ig=idmo(v_end(ije2),col,ije1,ije2,iface,ie)
+                tmor(ig)=tmor(ig)+top(v_end(ije2),ije1)*0.5d0
+                do  j=1,lx1
+                  tmp=0.d0
+                  tmp1=0.d0
+                  do i=2,lx1-1
+                    tmp  = tmp  + qbnew(i-1,j,ije2) * temp(i,col,ije1)
+                    tmp1 = tmp1 + qbnew(i-1,j,ije2) * top(i,ije1)
+                  end do
+                  ig=idmo(j,col,ije1,ije2,iface,ie)
+                  tmor(ig)=tmor(ig)+tmp*0.5d0+tmp1 
+                end do
+              end do
+            end do
+
+c.........for conforming faces
+          else
+
+c.........face interior
+            do col=2,lx1-1
+              do j=2,lx1-1
+                il=idel(j,col,iface,ie)
+                ig=idmo(j,col,1,1,iface,ie)
+                tmor(ig)=tmor(ig)+tx(il)
+              end do
+            end do
+
+c...........edges of conforming faces
+
+c...........if local edge 1 is a nonconforming edge
+            if(idmo(lx1,1,1,1,iface,ie).ne.0)then
+              do ije=1,2
+                do j=1,lx1
+                  tmp=0.d0
+                  do i=2,lx1-1
+                    il=idel(i,1,iface,ie)
+                    tmp= tmp + qbnew(i-1,j,ije)*tx(il)
+                  end do
+                  ig=idmo(j,1,1,ije,iface,ie)
+                  tmor(ig)=tmor(ig)+tmp*0.5d0
+                end do
+              end do
+
+c...........if local edge 1 is a conforming edge
+            else
+              do j=2,lx1-1
+                il=idel(j,1,iface,ie)
+                ig=idmo(j,1,1,1,iface,ie)
+                tmor(ig)=tmor(ig)+tx(il)*0.5d0
+              end do
+            end if 
+
+c...........if local edge 2 is a nonconforming edge
+            if(idmo(lx1,2,1,2,iface,ie).ne.0)then
+              do ije=1,2
+                do j=1,lx1
+                  tmp=0.d0
+                  do i=2,lx1-1
+                    il=idel(lx1,i,iface,ie)
+                    tmp = tmp + qbnew(i-1,j,ije)*tx(il)
+                  end do
+                  ig=idmo(lx1,j,ije,2,iface,ie)
+                  tmor(ig)=tmor(ig)+tmp*0.5d0
+                end do
+              end do
+
+c...........if local edge 2 is a conforming edge
+            else
+              do j=2,lx1-1
+                il=idel(lx1,j,iface,ie)
+                ig=idmo(lx1,j,1,1,iface,ie)
+                tmor(ig)=tmor(ig)+tx(il)*0.5d0
+              end do
+            end if 
+
+c...........if local edge 3 is a nonconforming edge
+            if(idmo(2,lx1,2,1,iface,ie).ne.0)then
+              do ije=1,2
+                do j=1,lx1
+                  tmp=0.d0
+                  do i=2,lx1-1
+                    il=idel(i,lx1,iface,ie)
+                    tmp = tmp + qbnew(i-1,j,ije)*tx(il)
+                  end do
+                  ig=idmo(j,lx1,2,ije,iface,ie)
+                  tmor(ig)=tmor(ig)+tmp*0.5d0
+                end do
+              end do
+
+c...........if local edge 3 is a conforming edge
+            else
+              do j=2,lx1-1
+                il=idel(j,lx1,iface,ie)
+                ig=idmo(j,lx1,1,1,iface,ie)
+                tmor(ig)=tmor(ig)+tx(il)*0.5d0
+              end do
+            end if 
+
+c...........if local edge 4 is a nonconforming edge
+            if(idmo(1,lx1,1,1,iface,ie).ne.0)then
+              do ije=1,2
+                do j=1,lx1
+                  tmp=0.d0
+                  do i=2,lx1-1
+                    il=idel(1,i,iface,ie)
+                    tmp = tmp + qbnew(i-1,j,ije)*tx(il)
+                  end do
+                  ig=idmo(1,j,ije,1,iface,ie)
+                  tmor(ig)=tmor(ig)+tmp*0.5d0
+                end do
+              end do
+
+c...........if local edge 4 is a conforming edge
+            else
+              do j=2,lx1-1
+                il=idel(1,j,iface,ie)
+                ig=idmo(1,j,1,1,iface,ie)
+                tmor(ig)=tmor(ig)+tx(il)*0.5d0
+              end do
+            end if 
+          end if!nnje=1
+        end do
+      end do
+
+      return
+      end
+
+
+c--------------------------------------------------------------
+      subroutine transfb_cor_e(n,tmor,tx)
+c--------------------------------------------------------------
+c     This subroutine performs the edge to mortar mapping and
+c     calculates the mapping result on the mortar point at a vertex
+c     under situation 1,2, or 3.
+c     n refers to the configuration of three edges sharing a vertex, 
+c     n = 1: only one edge is nonconforming
+c     n = 2: two edges are nonconforming 
+c     n = 3: three edges are nonconforming 
+c-------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision tmor,tx(lx1,lx1,lx1),tmp
+      integer i,n
+
+      tmor=tx(1,1,1)
+
+      do i=2,lx1-1
+        tmor= tmor + qbnew(i-1,1,1)*tx(i,1,1)
+      end do
+
+      if(n.gt.1)then
+        do i=2,lx1-1
+          tmor= tmor + qbnew(i-1,1,1)*tx(1,i,1)
+        end do
+      end if
+
+      if(n.eq.3)then
+        do i=2,lx1-1
+          tmor= tmor + qbnew(i-1,1,1)*tx(1,1,i)
+        end do
+      end if
+
+      return
+      end
+
+c--------------------------------------------------------------
+      subroutine transfb_cor_f(n,tmor,tx)
+c--------------------------------------------------------------
+c     This subroutine performs the mapping from face to mortar.
+c     Output tmor is the mapping result on a mortar vertex for the
+c     following cases of three edges and three faces sharing a vertex:
+c     n=4: only one face is nonconforming 
+c     n=5: one face and one edge are nonconforming
+c     n=6: two faces are nonconforming 
+c     n=7: three faces are nonconforming 
+c--------------------------------------------------------------
+      include 'header.h'
+
+      double precision tx(lx1,lx1,lx1),tmor,temp(lx1)
+      integer col,i,n
+
+      call r_init(temp,lx1,0.d0)
+
+      do col=1,lx1
+        temp(col)=tx(col,1,1)
+        do i=2,lx1-1
+          temp(col) = temp(col) + qbnew(i-1,1,1)*tx(col,i,1)
+        end do
+      end do
+      tmor=temp(1)
+
+      do i=2,lx1-1
+        tmor = tmor + qbnew(i-1,1,1) *temp(i)
+      end do
+
+      if(n.eq.5)then
+        do i=2,lx1-1
+          tmor = tmor + qbnew(i-1,1,1) *tx(1,1,i)
+        end do
+      end if
+ 
+      if(n.ge.6)then
+        call r_init(temp,lx1,0.d0)
+        do col=1,lx1
+          do i=2,lx1-1
+            temp(col) = temp(col) + qbnew(i-1,1,1)*tx(col,1,i)
+          end do
+        end do
+        tmor=tmor+temp(1)
+        do i=2,lx1-1
+          tmor = tmor +qbnew(i-1,1,1) *temp(i)
+        end do
+      end if
+        
+      if(n.eq.7)then
+        call r_init(temp,lx1,0.d0)
+        do col=2,lx1-1
+          do i=2,lx1-1
+            temp(col) = temp(col) + qbnew(i-1,1,1)*tx(1,col,i)
+          end do
+        end do
+        do i=2,lx1-1
+          tmor = tmor + qbnew(i-1,1,1) *temp(i)
+        end do
+      end if
+
+      return
+      end
+
+
+c-------------------------------------------------------------------------
+      subroutine transf_nc(tmor,tx)
+c------------------------------------------------------------------------
+c     Perform mortar to element mapping on a nonconforming face. 
+c     This subroutine is used when all entries in tmor are zero except
+c     one tmor(i,j)=1. So this routine is simplified. Only one piece of 
+c     mortar  (tmor only has two indices) and one piece of intermediate 
+c     mortar (tmp) are involved.
+c------------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision tmor(lx1,lx1), tx(lx1,lx1), tmp(lx1,lx1)
+      integer col,i,j
+
+      call r_init(tmp,lx1*lx1,0.d0)
+      do col=1,lx1
+        i = 1
+        tmp(i,col)=tmor(i,col)                           
+        do i=2,lx1-1
+          do j=1,lx1
+            tmp(i,col) = tmp(i,col) + qbnew(i-1,j,1)*tmor(j,col)
+          end do
+        end do
+      end do
+
+      do col=1,lx1
+        i = 1
+        tx(col,i)   = tx(col,i)   + tmp(col,i)
+        do i=2,lx1-1
+          do j=1,lx1
+            tx(col,i) = tx(col,i) + qbnew(i-1,j,1)*tmp(col,j)
+          end do
+        end do
+      end do
+
+      return                                                  
+      end                                                     
+
+c------------------------------------------------------------------------
+      subroutine transfb_nc0(tmor,tx)
+c------------------------------------------------------------------------
+c     Performs mapping from element to mortar when the nonconforming 
+c     edges are shared by two conforming faces of an element.
+c------------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision tmor(lx1,lx1),tx(lx1,lx1,lx1)
+      integer i,j
+
+      call r_init(tmor,lx1*lx1,0.d0)
+      do j=1,lx1
+        do i=2,lx1-1
+          tmor(j,1)= tmor(j,1) + qbnew(i-1,j  ,1)*tx(i,1,1)
+        end do
+      end do
+
+      return
+      end 
+
+c------------------------------------------------------------------------
+      subroutine transfb_nc2(tmor,tx)
+c------------------------------------------------------------------------
+c     Maps values from element to mortar when the nonconforming edges are
+c     shared by two nonconforming faces of an element.
+c     Although each face has four pieces of mortar, only the value in
+c     one piece (location (1,1)) is used in the calling routine, so only
+c     the value in the first mortar is calculated in this subroutine.
+c------------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision tx(lx1,lx1),tmor(lx1,lx1),bottom(lx1), 
+     &                 temp(lx1,lx1)
+      integer col,j,i
+
+      call r_init(tmor,lx1*lx1,0.d0)
+      call r_init(temp,lx1*lx1,0.d0)
+      tmor(1,1)=tx(1,1)
+
+c.....mapping from tx to intermediate mortar temp + bottom
+      do col=1,lx1
+        temp(col,1)=tx(col,1)
+        j=1
+        bottom(col)= 0.d0
+        do i=2,lx1-1 
+          bottom(col) = bottom(col) + qbnew(i-1,j,1)*tx(col,i)
+        end do
+
+        do j=2,lx1
+          do i=2,lx1-1 
+            temp(col,j) = temp(col,j) + qbnew(i-1,j,1)*tx(col,i)
+          end do
+        end do
+      end do
+
+c.....from intermediate mortar to mortar
+
+c.....On the nonconforming edge, temp is divided by 2 as there will be
+c     a duplicate contribution from another face sharing this edge
+      col=1
+      do j=1,lx1
+        do i=2,lx1-1
+          tmor(j,col)=tmor(j,col)+ qbnew(i-1,j,1) * bottom(i) +
+     &                             qbnew(i-1,j,1) * temp(i,col) * 0.5d0 
+        end do
+      end do
+
+      do col=2,lx1
+        tmor(1,col)=tmor(1,col)+temp(1,col)
+        do j=1,lx1
+          do i=2,lx1-1
+            tmor(j,col) = tmor(j,col) + qbnew(i-1,j,1) *temp(i,col)
+          end do
+        end do
+      end do
+
+      return
+      end 
+
+
+c------------------------------------------------------------------------
+      subroutine transfb_nc1(tmor,tx)
+c------------------------------------------------------------------------
+c     Maps values from element to mortar when the nonconforming edges are
+c     shared by a nonconforming face and a conforming face of an element
+c------------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision tx(lx1,lx1),tmor(lx1,lx1),bottom(lx1), 
+     &                 temp(lx1,lx1)
+      integer col,j,i
+
+      call r_init(tmor,lx1*lx1,0.d0)
+      call r_init(temp,lx1*lx1,0.d0)
+
+      tmor(1,1)=tx(1,1)
+c.....Contribution from the nonconforming faces
+c     Since the calling subroutine is only interested in the value on the
+c     mortar (location (1,1)), only this piece of mortar is calculated.
+
+      do col=1,lx1
+        temp(col,1)=tx(col,1)
+        j = 1
+        bottom(col)= 0.d0
+        do i=2,lx1-1 
+          bottom(col)=bottom(col) + qbnew(i-1,j,1)*tx(col,i)
+        end do
+
+        do j=2,lx1
+          do i=2,lx1-1 
+            temp(col,j) = temp(col,j) + qbnew(i-1,j,1)*tx(col,i)
+          end do
+
+        end do
+      end do
+
+      col=1
+      tmor(1,col)=tmor(1,col)+bottom(1)
+      do j=1,lx1
+        do i=2,lx1-1
+
+c.........temp is not divided by 2 here. It includes the contribution
+c         from the other conforming face.
+
+          tmor(j,col)=tmor(j,col) + qbnew(i-1,j,1) *bottom(i) +
+     &                              qbnew(i-1,j,1) *temp(i,col) 
+        end do
+      end do
+
+      do col=2,lx1
+        tmor(1,col)=tmor(1,col)+temp(1,col)
+        do j=1,lx1
+          do i=2,lx1-1
+            tmor(j,col) = tmor(j,col) + qbnew(i-1,j,1) *temp(i,col)
+          end do
+        end do
+      end do
+
+      return
+      end
+
+
+c-------------------------------------------------------------------
+      subroutine transfb_c(tx)
+c-------------------------------------------------------------------
+c     Prepare initial guess for cg. All values from conforming 
+c     boundary are copied and summed in tmort.
+c-------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision third
+      parameter (third = 1.d0/3.d0)
+      double precision tx(*)
+      integer il1,il2,il3,il4,ig1,ig2,ig3,ig4,ie,iface,col,j,ig,il
+
+      call r_init(tmort,nmor,0.d0)
+
+
+      do ie=1,nelt
+        do iface=1,nsides
+          if(cbc(iface,ie).ne.3)then
+            il1 = idel(1,1,iface,ie)
+            il2 = idel(lx1,1,iface,ie)
+            il3 = idel(1,lx1,iface,ie)
+            il4 = idel(lx1,lx1,iface,ie)
+            ig1 = idmo(1,  1,  1,1,iface,ie)
+            ig2 = idmo(lx1,1,  1,2,iface,ie)
+            ig3 = idmo(1,  lx1,2,1,iface,ie)
+            ig4 = idmo(lx1,lx1,2,2,iface,ie)
+
+            tmort(ig1) = tmort(ig1)+tx(il1)*third
+            tmort(ig2) = tmort(ig2)+tx(il2)*third
+            tmort(ig3) = tmort(ig3)+tx(il3)*third
+            tmort(ig4) = tmort(ig4)+tx(il4)*third
+
+            do  col=2,lx1-1
+              do j=2,lx1-1
+                il=idel(j,col,iface,ie)
+                ig=idmo(j,col,1,1,iface,ie)
+                tmort(ig)=tmort(ig)+tx(il)
+              end do
+            end do
+
+            if(idmo(lx1,1,1,1,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(j,1,iface,ie)
+                ig=idmo(j,1,1,1,iface,ie)
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+              end do
+            end if
+
+            if(idmo(lx1,2,1,2,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(lx1,j,iface,ie)
+                ig=idmo(lx1,j,1,1,iface,ie)
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+              end do
+            end if
+
+            if(idmo(2,lx1,2,1,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(j,lx1,iface,ie)
+                ig=idmo(j,lx1,1,1,iface,ie)
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+              end do
+            end if
+
+            if(idmo(1,lx1,1,1,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(1,j,iface,ie)
+                ig=idmo(1,j,1,1,iface,ie)
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+              end do
+            end if
+          end if!
+        end do
+      end do
+      return
+      end
+
+c-------------------------------------------------------------------
+      subroutine transfb_c_2(tx)
+c-------------------------------------------------------------------
+c     Prepare initial guess for CG. All values from conforming 
+c     boundary are copied and summed in tmort. 
+c     mormult is multiplicity, which is used to average tmort.
+c-------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision third
+      parameter (third = 1.d0/3.d0)
+      double precision tx(*)
+      integer il1,il2,il3,il4,ig1,ig2,ig3,ig4,ie,iface,col,j,ig,il
+
+      call r_init(tmort,nmor,0.d0)
+      call r_init(mormult,nmor,0.d0)
+
+
+      do ie=1,nelt
+        do iface=1,nsides
+          
+          if(cbc(iface,ie).ne.3)then
+            il1 = idel(1,  1,  iface,ie)
+            il2 = idel(lx1,1,  iface,ie)
+            il3 = idel(1,  lx1,iface,ie)
+            il4 = idel(lx1,lx1,iface,ie)
+            ig1 = idmo(1,  1,  1,1,iface,ie)
+            ig2 = idmo(lx1,1,  1,2,iface,ie)
+            ig3 = idmo(1,  lx1,2,1,iface,ie)
+            ig4 = idmo(lx1,lx1,2,2,iface,ie)
+
+            tmort(ig1) = tmort(ig1)+tx(il1)*third
+            tmort(ig2) = tmort(ig2)+tx(il2)*third
+            tmort(ig3) = tmort(ig3)+tx(il3)*third
+            tmort(ig4) = tmort(ig4)+tx(il4)*third
+            mormult(ig1) = mormult(ig1)+third
+            mormult(ig2) = mormult(ig2)+third
+            mormult(ig3) = mormult(ig3)+third
+            mormult(ig4) = mormult(ig4)+third
+
+            do  col=2,lx1-1
+              do j=2,lx1-1
+                il=idel(j,col,iface,ie)
+                ig=idmo(j,col,1,1,iface,ie)
+                tmort(ig)=tmort(ig)+tx(il)
+                mormult(ig)=mormult(ig)+1.d0
+              end do
+            end do
+
+            if(idmo(lx1,1,1,1,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(j,1,iface,ie)
+                ig=idmo(j,1,1,1,iface,ie)
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+                mormult(ig)=mormult(ig)+0.5d0
+              end do
+            end if
+
+            if(idmo(lx1,2,1,2,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(lx1,j,iface,ie)
+                ig=idmo(lx1,j,1,1,iface,ie)
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+                mormult(ig)=mormult(ig)+0.5d0
+              end do
+            end if
+
+            if(idmo(2,lx1,2,1,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(j,lx1,iface,ie)
+                ig=idmo(j,lx1,1,1,iface,ie)
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+                mormult(ig)=mormult(ig)+0.5d0
+               end do
+            end if
+
+            if(idmo(1,lx1,1,1,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(1,j,iface,ie)
+                ig=idmo(1,j,1,1,iface,ie)
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+                mormult(ig)=mormult(ig)+0.5d0
+              end do
+            end if
+          end if
+        end do
+      end do
+
+      return
+      end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/ua.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/ua.f
new file mode 100644
index 0000000..5d45c16
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/ua.f
@@ -0,0 +1,264 @@
+c-------------------------------------------------------------------------c
+c                                                                         c
+c        N  A  S     P A R A L L E L     B E N C H M A R K S  3.3         c
+c                                                                         c
+c                      S E R I A L     V E R S I O N                      c
+c                                                                         c
+c                                   U A                                   c
+c                                                                         c
+c-------------------------------------------------------------------------c
+c                                                                         c
+c    This benchmark is a serial version of the NPB UA code.               c
+c    Refer to NAS Technical Report NAS-04-006 for details                 c
+c                                                                         c
+c    Permission to use, copy, distribute and modify this software         c
+c    for any purpose with or without fee is hereby granted.  We           c
+c    request, however, that all derived work reference the NAS            c
+c    Parallel Benchmarks 3.3. This software is provided "as is"           c
+c    without express or implied warranty.                                 c
+c                                                                         c
+c    Information on NPB 3.3, including the technical report, the          c
+c    original specifications, source code, results and information        c
+c    on how to submit new results, is available at:                       c
+c                                                                         c
+c           http://www.nas.nasa.gov/Software/NPB/                         c
+c                                                                         c
+c    Send comments or suggestions to  npb@nas.nasa.gov                    c
+c                                                                         c
+c          NAS Parallel Benchmarks Group                                  c
+c          NASA Ames Research Center                                      c
+c          Mail Stop: T27A-1                                              c
+c          Moffett Field, CA   94035-1000                                 c
+c                                                                         c
+c          E-mail:  npb@nas.nasa.gov                                      c
+c          Fax:     (650) 604-3957                                        c
+c                                                                         c
+c-------------------------------------------------------------------------c
+
+c---------------------------------------------------------------------
+c
+c Author: H. Feng
+c         R. Van der Wijngaart
+c---------------------------------------------------------------------
+
+      program ua
+      include 'header.h'
+
+      integer          step, ie,iside,i,j,k, fstatus
+      external         timer_read
+      double precision timer_read, mflops, tmax, nelt_tot
+      character        class
+      logical          ifmortar, verified
+
+      double precision t2, trecs(t_last)
+      character t_names(t_last)*10
+
+c---------------------------------------------------------------------
+c     Read input file (if it exists), else take
+c     defaults from parameters
+c---------------------------------------------------------------------
+          
+      open (unit=2,file='timer.flag',status='old', iostat=fstatus)
+      if (fstatus .eq. 0) then
+         timeron = .true.
+         t_names(t_total) = 'total'
+         t_names(t_init) = 'init'
+         t_names(t_convect) = 'convect'
+         t_names(t_transfb_c) = 'transfb_c'
+         t_names(t_diffusion) = 'diffusion'
+         t_names(t_transf) = 'transf'
+         t_names(t_transfb) = 'transfb'
+         t_names(t_adaptation) = 'adaptation'
+         t_names(t_transf2) = 'transf+b'
+         t_names(t_add2) = 'add2'
+         close(2)
+      else
+         timeron = .false.
+      endif
+
+      write (*,1000) 
+      open (unit=2,file='inputua.data',status='old', iostat=fstatus)
+
+      if (fstatus .eq. 0) then
+        write(*,233) 
+ 233    format(' Reading from input file inputua.data')
+        read (2,*) fre
+        read (2,*) niter
+        read (2,*) nmxh
+        read (2,*) alpha
+        class = 'U'
+        close(2)
+      else
+        write(*,234) 
+        fre        = fre_default
+        niter      = niter_default
+        nmxh       = nmxh_default
+        alpha      = alpha_default
+        class      = class_default
+      endif
+ 234  format(' No input file inputua.data. Using compiled defaults')
+
+      dlmin = 0.5d0**refine_max
+      dtime = 0.04d0*dlmin
+
+      write (*,1001) refine_max
+      write (*,1002) fre
+      write (*,1003) niter, dtime
+      write (*,1004) nmxh
+      write (*,1005) alpha
+
+ 1000 format(//,' NAS Parallel Benchmarks (NPB3.3-SER)',
+     >          ' - UA Benchmark', /)
+ 1001 format(' Levels of refinement: ', i8)
+ 1002 format(' Adaptation frequency: ', i8)
+ 1003 format(' Time steps:           ', i8, '    dt: ', g15.6)
+ 1004 format(' CG iterations:        ', i8)
+ 1005 format(' Heat source radius:   ', f8.4,/)
+
+      do i = 1, t_last
+         call timer_clear(i)
+      end do
+      if (timeron) call timer_start(t_init)
+
+c.....set up initial mesh (single element) and solution (all zero)
+      call create_initial_grid
+
+      call r_init(ta1,ntot,0.d0)
+      call nr_init(sje,4*6*nelt,0)
+
+c.....compute tables of coefficients and weights      
+      call coef 
+      call geom1
+
+c.....compute the discrete laplacian operators
+      call setdef
+
+c.....prepare for the preconditioner
+      call setpcmo_pre
+
+c.....refine initial mesh and do some preliminary work
+      time = 0.d0
+      call mortar
+      call prepwork
+      call adaptation(ifmortar,0)
+      if (timeron) call timer_stop(t_init)
+
+      call timer_clear(1)
+
+      time = 0.d0
+      do step= 0, niter
+
+        if (step .eq. 1) then
+c.........reset the solution and start the timer, keep track of total no elms
+          call r_init(ta1, ntot, 0.d0)
+          time = 0.d0
+          nelt_tot = 0.d0
+          do i = 1, t_last
+             if (i.ne.t_init) call timer_clear(i)
+          end do
+          call timer_start(1)          
+        endif
+
+c.......advance the convection step 
+        call convect(ifmortar)
+
+        if (timeron) call timer_start(t_transf2)
+c.......prepare the initial guess for cg
+        call transf(tmort,ta1)
+
+c.......compute residual for diffusion term based on initial guess
+
+c.......compute the left hand side of equation, laplacian t
+        do ie = 1,nelt
+          call laplacian(ta2(1,1,1,ie),ta1(1,1,1,ie),size_e(ie))
+        end do
+c.......compute the residual 
+        do ie = 1, nelt
+          do k=1,lx1
+            do j=1,lx1
+              do i=1,lx1
+                trhs(i,j,k,ie) = trhs(i,j,k,ie) - ta2(i,j,k,ie)
+              end do
+            end do
+          end do
+        end do
+c.......get the residual on mortar 
+        call transfb(rmor,trhs)
+        if (timeron) call timer_stop(t_transf2)
+
+c.......apply boundary condition: zero out the residual on domain boundaries
+
+c.......apply boundary condition to trhs
+        do ie=1,nelt  
+          do iside=1,nsides
+            if (cbc(iside,ie).eq.0) then
+              call facev(trhs(1,1,1,ie),iside,0.d0)
+            end if
+          end do
+        end do
+c.......apply boundary condition to rmor
+        call col2(rmor,tmmor,nmor)
+
+c.......call the conjugate gradient iterative solver
+        call diffusion(ifmortar)
+
+c.......add convection and diffusion
+        if (timeron) call timer_start(t_add2)
+        call add2(ta1,t,ntot)
+        if (timeron) call timer_stop(t_add2)
+
+c.......perform mesh adaptation
+        time=time+dtime
+        if ((step.ne.0).and.(step/fre*fre .eq. step)) then
+           if (step .ne. niter) then
+             call adaptation(ifmortar,step)
+           end if
+        else
+          ifmortar = .false.
+        end if
+        nelt_tot = nelt_tot + dble(nelt)
+      end do
+
+      call timer_stop(1)
+      tmax = timer_read(1)
+       
+      call verify(class, verified)
+
+c.....compute millions of collocation points advanced per second.
+c.....diffusion: nmxh advancements, convection: 1 advancement
+      mflops = nelt_tot*dble(lx1*lx1*lx1*(nmxh+1))/(tmax*1.d6)
+
+      call print_results('UA', class, refine_max, 0, 0, niter, 
+     &     tmax, mflops, '    coll. point advanced', 
+     &     verified, npbversion,compiletime, cs1, cs2, cs3, cs4, cs5, 
+     &     cs6, '(none)')
+
+c---------------------------------------------------------------------
+c      More timers
+c---------------------------------------------------------------------
+      if (.not.timeron) goto 999
+
+      do i=1, t_last
+         trecs(i) = timer_read(i)
+      end do
+      if (tmax .eq. 0.0) tmax = 1.0
+
+      write(*,800)
+ 800  format('  SECTION     Time (secs)')
+      do i=1, t_last
+         write(*,810) t_names(i), trecs(i), trecs(i)*100./tmax
+         if (i.eq.t_transfb_c) then
+            t2 = trecs(t_convect) - trecs(t_transfb_c)
+            write(*,820) 'sub-convect', t2, t2*100./tmax
+         else if (i.eq.t_transfb) then
+            t2 = trecs(t_diffusion) - trecs(t_transf) - trecs(t_transfb)
+            write(*,820) 'sub-diffuse', t2, t2*100./tmax
+         endif
+ 810     format(2x,a10,':',f9.3,'  (',f6.2,'%)')
+ 820     format('    --> ',a11,':',f9.3,'  (',f6.2,'%)')
+      end do
+
+ 999  continue
+
+      end 
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/utils.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/utils.f
new file mode 100644
index 0000000..101a951
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/utils.f
@@ -0,0 +1,317 @@
+c------------------------------------------------------------------
+      subroutine reciprocal (a, n)
+c------------------------------------------------------------------
+c     replace each of the n elements of array a with its reciprocal
+c------------------------------------------------------------------
+
+      implicit none
+
+      integer n, i
+      double precision a(n)
+
+      do i = 1, n
+        a(i) = 1.d0/a(i)
+      end do
+
+      return
+      end
+
+c------------------------------------------------------------------
+      subroutine r_init (a, n, const)
+c------------------------------------------------------------------
+c     initialize double precision array a with length of n
+c------------------------------------------------------------------
+
+      implicit none
+
+      integer n, i
+      double precision a(n), const
+
+      do i = 1, n
+        a(i) = const
+      end do
+
+      return
+      end
+
+c------------------------------------------------------------------
+      subroutine nr_init (a, n, const)
+c------------------------------------------------------------------
+c     initialize integer array a with length of n
+c------------------------------------------------------------------
+
+      implicit none
+
+      integer n, i, a(n), const
+
+      do i = 1, n
+        a(i) = const
+      end do
+
+      return
+      end
+c------------------------------------------------------------------
+      subroutine l_init (a, n, const)
+c------------------------------------------------------------------
+c     initialize logical array a with length of n
+c------------------------------------------------------------------
+
+      implicit none
+
+      integer n, i
+      logical a(n), const
+
+      do i = 1, n
+        a(i) = const
+      end do
+
+      return
+      end
+
+
+c-----------------------------------------------------------------
+      subroutine ncopy (a,b,n)
+c------------------------------------------------------------------
+c     copy array of integers b to a, the length of array is n
+c------------------------------------------------------------------
+
+      implicit none
+
+      integer n,i
+      integer a(n),b(n)
+
+      do i = 1, n
+        a(i) = b(i)
+      end do
+
+      return
+      end
+
+c-----------------------------------------------------------------
+      subroutine copy (a,b,n)
+c------------------------------------------------------------------
+c     copy double precision array b to a, the length of array is n
+c------------------------------------------------------------------
+
+      implicit none
+
+      integer n,i
+      double precision a(n),b(n)
+
+      do i = 1, n
+         a(i) = b(i)
+      end do
+
+      return
+      end
+
+c-----------------------------------------------------------------
+      subroutine adds2m1(a,b,c1,n)
+c-----------------------------------------------------------------
+c     a=a+c1*b
+c-----------------------------------------------------------------
+      implicit none
+
+      integer n,i
+      double precision a(n),b(n),c1
+      do i=1,n
+        a(i)=a(i)+c1*b(i)
+      end do
+
+      return
+      end
+
+c-----------------------------------------------------------------
+      subroutine adds1m1(a,b,c1,n )
+c-----------------------------------------------------------------
+c     a=c1*a+b
+c-----------------------------------------------------------------
+
+      implicit none
+
+      integer n,i
+      double precision a(n),b(n),c1
+      do i=1,n
+        a(i)=c1*a(i)+b(i)
+      end do
+
+      return
+      end
+
+c-----------------------------------------------------------------
+      subroutine col2(a,b,n)
+c------------------------------------------------------------------
+c     a=a*b
+c------------------------------------------------------------------
+
+      implicit none
+
+      integer n,i
+      double precision a(n),b(n)
+
+      do i=1,n
+        a(i)=a(i)*b(i)
+      end do
+
+      return
+      end
+
+c-----------------------------------------------------------------
+      subroutine nrzero (na,n)
+c------------------------------------------------------------------
+c     zero out array of integers 
+c------------------------------------------------------------------
+
+      implicit none
+
+      integer n,i,na(n)
+
+      do i = 1, n
+        na(i ) = 0
+      end do
+
+      return
+      end
+
+c-----------------------------------------------------------------
+      subroutine add2(a,b,n)
+c------------------------------------------------------------------
+c     a=a+b
+c------------------------------------------------------------------
+
+      implicit none
+
+      integer n,i
+      double precision a(n),b(n)
+      do i=1,n
+        a(i)=a(i)+b(i)
+      end do
+
+      return
+      end
+
+c-----------------------------------------------------------------
+      double precision function calc_norm()
+c------------------------------------------------------------------
+c     calculate the integral of ta1 over the whole domain
+c------------------------------------------------------------------
+
+      include 'header.h'
+
+      double precision total,ieltotal
+      integer iel,k,j,i,isize
+
+      total=0.d0
+
+      do iel=1,nelt
+        ieltotal=0.d0
+        isize=size_e(iel)
+        do k=1,lx1
+          do j=1,lx1
+            do i=1,lx1
+              ieltotal=ieltotal+ta1(i,j,k,iel)*w3m1(i,j,k)
+     &                               *jacm1_s(i,j,k,isize)
+            end do
+          end do
+        end do
+      total=total+ieltotal
+      end do
+
+      calc_norm = total
+
+      return
+      end
+c-----------------------------------------------------------------
+      subroutine parallel_add(frontier)
+c-----------------------------------------------------------------
+c     input array frontier, perform (potentially) parallel add so that
+c     the output frontier(i) has sum of frontier(1)+frontier(2)+...+frontier(i)
+c-----------------------------------------------------------------
+      include 'header.h'
+      integer nellog,i,ahead,ii,ntemp,n1,ntemp1,n2,frontier(lelt),iel
+
+      if (nelt.le.1) return
+
+      nellog=0
+      iel=1
+   10 iel=iel*2
+      nellog=nellog+1
+      if (iel.lt.nelt) goto 10
+
+      ntemp=1
+      do i=1,nellog
+        n1=ntemp*2
+        n2=n1
+        do iel=n1, nelt,n1
+          ahead=frontier(iel-ntemp)
+          do ii=ntemp-1,0,-1
+            frontier(iel-ii)=frontier(iel-ii)+ahead
+          end do
+          n2=iel
+        end do
+        if (n2.le.nelt) n2=n2+n1
+
+        ntemp1=n2-nelt
+        if (ntemp1.lt.ntemp) then
+          ahead=frontier(n2-ntemp)
+          do ii=ntemp-1,ntemp1,-1
+            frontier(n2-ii)=frontier(n2-ii)+ahead
+          end do
+        endif
+
+        ntemp=n1
+      end do
+
+      return
+      end 
+
+c------------------------------------------------------------------
+      subroutine dssum
+
+c------------------------------------------------------------------
+c     Perform stiffness summation: element-mortar-element mapping
+c------------------------------------------------------------------
+
+      include 'header.h'
+
+      call transfb(dpcmor,dpcelm)
+      call transf (dpcmor,dpcelm)
+
+      return
+      end
+
+c------------------------------------------------------------------
+      subroutine facev(a,iface,val)
+c------------------------------------------------------------------
+c     assign the value val to face(iface,iel) of array a.
+c------------------------------------------------------------------
+      include 'header.h'
+
+      double precision a(lx1,lx1,lx1), val
+      integer iface, kx1, kx2, ky1, ky2, kz1, kz2, ix, iy, iz
+
+      kx1=1
+      ky1=1
+      kz1=1
+      kx2=lx1
+      ky2=lx1
+      kz2=lx1
+      if (iface.eq.1) kx1=lx1
+      if (iface.eq.2) kx2=1
+      if (iface.eq.3) ky1=lx1
+      if (iface.eq.4) ky2=1
+      if (iface.eq.5) kz1=lx1
+      if (iface.eq.6) kz2=1
+
+      do ix = kx1, kx2
+        do iy = ky1, ky2
+          do iz = kz1, kz2
+            a(ix,iy,iz)=val
+          end do
+        end do
+      end do
+
+      return
+      end
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/verify.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/verify.f
new file mode 100644
index 0000000..189080a
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/UA/verify.f
@@ -0,0 +1,88 @@
+      subroutine verify(class, verified)
+
+      include 'header.h'
+
+      double precision norm, calc_norm, epsilon, norm_dif, norm_ref
+      external         calc_norm
+      character        class
+      logical          verified
+       
+c.....tolerance level
+      epsilon = 1.0d-08
+
+c.....compute the temperature integral over the whole domain
+      norm = calc_norm()
+
+      verified = .true.
+      if     ( class .eq. 'S' ) then
+        norm_ref = 0.1890013110962D-02
+      elseif ( class .eq. 'W' ) then
+        norm_ref = 0.2569794837076D-04
+      elseif ( class .eq. 'A' ) then
+        norm_ref = 0.8939996281443D-04
+      elseif ( class .eq. 'B' ) then
+        norm_ref = 0.4507561922901D-04
+      elseif ( class .eq. 'C' ) then
+        norm_ref = 0.1544736587100D-04
+      elseif ( class .eq. 'D' ) then
+        norm_ref = 0.1577586272355D-05
+      else
+        class = 'U'
+        norm_ref = 1.d0
+        verified = .false.
+      endif         
+
+      norm_dif = dabs((norm - norm_ref)/norm_ref)
+
+c---------------------------------------------------------------------
+c    Output the comparison of computed results to known cases.
+c---------------------------------------------------------------------
+
+      print *
+
+      if (class .ne. 'U') then
+         write(*, 1990) class
+ 1990    format(' Verification being performed for class ', a)
+         write (*,2000) epsilon
+ 2000    format(' accuracy setting for epsilon = ', E20.13)
+      else 
+         write(*, 1995)
+ 1995    format(' Unknown class')
+      endif
+
+      if (class .ne. 'U') then
+         write (*,2001) 
+      else
+         write (*, 2005)
+      endif
+
+ 2001 format(' Comparison of temperature integrals')
+ 2005 format(' Temperature integral')
+      if (class .eq. 'U') then
+         write(*, 2015) norm
+      else if (norm_dif .le. epsilon) then
+         write (*,2011) norm, norm_ref, norm_dif
+      else 
+         verified = .false.
+         write (*,2010) norm, norm_ref, norm_dif
+      endif
+
+ 2010 format(' FAILURE: ', E20.13, E20.13, E20.13)
+ 2011 format('          ', E20.13, E20.13, E20.13)
+ 2015 format('          ', E20.13)
+        
+      if (class .eq. 'U') then
+        write(*, 2022)
+        write(*, 2023)
+ 2022   format(' No reference values provided')
+ 2023   format(' No verification performed')
+      else if (verified) then
+        write(*, 2020)
+ 2020   format(' Verification Successful')
+      else
+        write(*, 2021)
+ 2021   format(' Verification failed')
+      endif
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/common/c_print_results.c b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/common/c_print_results.c
new file mode 100644
index 0000000..34d7e5f
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/common/c_print_results.c
@@ -0,0 +1,88 @@
+/*****************************************************************/
+/******     C  _  P  R  I  N  T  _  R  E  S  U  L  T  S     ******/
+/*****************************************************************/
+#include <stdlib.h>
+#include <stdio.h>
+
+void c_print_results( char   *name,
+                      char   class,
+                      int    n1, 
+                      int    n2,
+                      int    n3,
+                      int    niter,
+                      double t,
+                      double mops,
+		      char   *optype,
+                      int    passed_verification,
+                      char   *npbversion,
+                      char   *compiletime,
+                      char   *cc,
+                      char   *clink,
+                      char   *c_lib,
+                      char   *c_inc,
+                      char   *cflags,
+                      char   *clinkflags )
+{
+    printf( "\n\n %s Benchmark Completed\n", name ); 
+
+    printf( " Class           =                        %c\n", class );
+
+    if( n3 == 0 ) {
+        long nn = n1;
+        if ( n2 != 0 ) nn *= n2;
+        printf( " Size            =             %12ld\n", nn );   /* as in IS */
+    }
+    else
+        printf( " Size            =             %4dx%4dx%4d\n", n1,n2,n3 );
+
+    printf( " Iterations      =             %12d\n", niter );
+ 
+    printf( " Time in seconds =             %12.2f\n", t );
+
+    printf( " Mop/s total     =             %12.2f\n", mops );
+
+    printf( " Operation type  = %24s\n", optype);
+
+    if( passed_verification < 0 )
+        printf( " Verification    =            NOT PERFORMED\n" );
+    else if( passed_verification )
+        printf( " Verification    =               SUCCESSFUL\n" );
+    else
+        printf( " Verification    =             UNSUCCESSFUL\n" );
+
+    printf( " Version         =             %12s\n", npbversion );
+
+    printf( " Compile date    =             %12s\n", compiletime );
+
+    printf( "\n Compile options:\n" );
+
+    printf( "    CC           = %s\n", cc );
+
+    printf( "    CLINK        = %s\n", clink );
+
+    printf( "    C_LIB        = %s\n", c_lib );
+
+    printf( "    C_INC        = %s\n", c_inc );
+
+    printf( "    CFLAGS       = %s\n", cflags );
+
+    printf( "    CLINKFLAGS   = %s\n", clinkflags );
+#ifdef SMP
+    char *evalue = getenv("MP_SET_NUMTHREADS");
+    printf( "   MULTICPUS = %s\n", evalue );
+#endif
+
+    printf( "\n\n" );
+    printf( " Please send all errors/feedbacks to:\n\n" );
+    printf( " NPB Development Team\n" );
+    printf( " npb@nas.nasa.gov\n\n\n" );
+/*    printf( " Please send the results of this run to:\n\n" );
+    printf( " NPB Development Team\n" );
+    printf( " Internet: npb@nas.nasa.gov\n \n" );
+    printf( " If email is not available, send this to:\n\n" );
+    printf( " MS T27A-1\n" );
+    printf( " NASA Ames Research Center\n" );
+    printf( " Moffett Field, CA  94035-1000\n\n" );
+    printf( " Fax: 650-604-3957\n\n" ); */
+}
+ 
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/common/c_timers.c b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/common/c_timers.c
new file mode 100644
index 0000000..995d5d6
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/common/c_timers.c
@@ -0,0 +1,61 @@
+#include "wtime.h"
+#include <stdlib.h>
+
+/*  Prototype  */
+void wtime( double * );
+
+
+/*****************************************************************/
+/******         E  L  A  P  S  E  D  _  T  I  M  E          ******/
+/*****************************************************************/
+double elapsed_time( void )
+{
+    double t;
+
+    wtime( &t );
+    return( t );
+}
+
+
+double start[64], elapsed[64];
+
+/*****************************************************************/
+/******            T  I  M  E  R  _  C  L  E  A  R          ******/
+/*****************************************************************/
+void timer_clear( int n )
+{
+    elapsed[n] = 0.0;
+}
+
+
+/*****************************************************************/
+/******            T  I  M  E  R  _  S  T  A  R  T          ******/
+/*****************************************************************/
+void timer_start( int n )
+{
+    start[n] = elapsed_time();
+}
+
+
+/*****************************************************************/
+/******            T  I  M  E  R  _  S  T  O  P             ******/
+/*****************************************************************/
+void timer_stop( int n )
+{
+    double t, now;
+
+    now = elapsed_time();
+    t = now - start[n];
+    elapsed[n] += t;
+
+}
+
+
+/*****************************************************************/
+/******            T  I  M  E  R  _  R  E  A  D             ******/
+/*****************************************************************/
+double timer_read( int n )
+{
+    return( elapsed[n] );
+}
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/common/print_results.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/common/print_results.f
new file mode 100644
index 0000000..d2fe91e
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/common/print_results.f
@@ -0,0 +1,111 @@
+
+      subroutine print_results(name, class, n1, n2, n3, niter, 
+     >               t, mops, optype, verified, npbversion, 
+     >               compiletime, cs1, cs2, cs3, cs4, cs5, cs6, cs7)
+      
+      implicit none
+      character name*(*)
+      character class*1
+      integer   n1, n2, n3, niter, j
+      double precision t, mops
+      character optype*24, size*15
+      logical   verified
+      character*(*) npbversion, compiletime, 
+     >              cs1, cs2, cs3, cs4, cs5, cs6, cs7
+
+         write (*, 2) name 
+ 2       format(//, ' ', A, ' Benchmark Completed.')
+
+         write (*, 3) Class
+ 3       format(' Class           = ', 12x, a12)
+
+c   If this is not a grid-based problem (EP, FT, CG), then
+c   we only print n1, which contains some measure of the
+c   problem size. In that case, n2 and n3 are both zero.
+c   Otherwise, we print the grid size n1xn2xn3
+
+         if ((n2 .eq. 0) .and. (n3 .eq. 0)) then
+            if (name(1:2) .eq. 'EP') then
+               write(size, '(f15.0)' ) 2.d0**n1
+               j = 15
+               if (size(j:j) .eq. '.') then
+                  size(j:j) = ' '
+                  j = j - 1
+               endif
+               write (*,42) size(1:j)
+ 42            format(' Size            = ',9x, a15)
+            else
+               write (*,44) n1
+ 44            format(' Size            = ',12x, i12)
+            endif
+         else
+            write (*, 4) n1,n2,n3
+ 4          format(' Size            =  ',9x, i4,'x',i4,'x',i4)
+         endif
+
+         write (*, 5) niter
+ 5       format(' Iterations      = ', 12x, i12)
+         
+         write (*, 6) t
+ 6       format(' Time in seconds = ',12x, f12.2)
+         
+         write (*,9) mops
+ 9       format(' Mop/s total     = ',12x, f12.2)
+
+         write(*, 11) optype
+ 11      format(' Operation type  = ', a24)
+
+         if (verified) then 
+            write(*,12) '  SUCCESSFUL'
+         else
+            write(*,12) 'UNSUCCESSFUL'
+         endif
+ 12      format(' Verification    = ', 12x, a)
+
+         write(*,13) npbversion
+ 13      format(' Version         = ', 12x, a12)
+
+         write(*,14) compiletime
+ 14      format(' Compile date    = ', 12x, a12)
+
+
+         write (*,121) cs1
+ 121     format(/, ' Compile options:', /, 
+     >          '    F77          = ', A)
+
+         write (*,122) cs2
+ 122     format('    FLINK        = ', A)
+
+         write (*,123) cs3
+ 123     format('    F_LIB        = ', A)
+
+         write (*,124) cs4
+ 124     format('    F_INC        = ', A)
+
+         write (*,125) cs5
+ 125     format('    FFLAGS       = ', A)
+
+         write (*,126) cs6
+ 126     format('    FLINKFLAGS   = ', A)
+
+         write(*, 127) cs7
+ 127     format('    RAND         = ', A)
+        
+         write (*,130)
+ 130     format(//' Please send all errors/feedbacks to:'//
+     >            ' NPB Development Team'/
+     >            ' npb@nas.nasa.gov'//)
+c 130     format(//' Please send the results of this run to:'//
+c     >            ' NPB Development Team '/
+c     >            ' Internet: npb@nas.nasa.gov'/
+c     >            ' '/
+c     >            ' If email is not available, send this to:'//
+c     >            ' MS T27A-1'/
+c     >            ' NASA Ames Research Center'/
+c     >            ' Moffett Field, CA  94035-1000'//
+c     >            ' Fax: 650-604-3957'//)
+
+
+         return
+         end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/common/randdp.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/common/randdp.f
new file mode 100644
index 0000000..64860d9
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/common/randdp.f
@@ -0,0 +1,137 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      double precision function randlc (x, a)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c
+c   This routine returns a uniform pseudorandom double precision number in the
+c   range (0, 1) by using the linear congruential generator
+c
+c   x_{k+1} = a x_k  (mod 2^46)
+c
+c   where 0 < x_k < 2^46 and 0 < a < 2^46.  This scheme generates 2^44 numbers
+c   before repeating.  The argument A is the same as 'a' in the above formula,
+c   and X is the same as x_0.  A and X must be odd double precision integers
+c   in the range (1, 2^46).  The returned value RANDLC is normalized to be
+c   between 0 and 1, i.e. RANDLC = 2^(-46) * x_1.  X is updated to contain
+c   the new seed x_1, so that subsequent calls to RANDLC using the same
+c   arguments will generate a continuous sequence.
+c
+c   This routine should produce the same results on any computer with at least
+c   48 mantissa bits in double precision floating point data.  On 64 bit
+c   systems, double precision should be disabled.
+c
+c   David H. Bailey     October 26, 1990
+c
+c---------------------------------------------------------------------
+
+      implicit none
+
+      double precision r23,r46,t23,t46,a,x,t1,t2,t3,t4,a1,a2,x1,x2,z
+      parameter (r23 = 0.5d0 ** 23, r46 = r23 ** 2, t23 = 2.d0 ** 23,
+     >  t46 = t23 ** 2)
+
+c---------------------------------------------------------------------
+c   Break A into two parts such that A = 2^23 * A1 + A2.
+c---------------------------------------------------------------------
+      t1 = r23 * a
+      a1 = int (t1)
+      a2 = a - t23 * a1
+
+c---------------------------------------------------------------------
+c   Break X into two parts such that X = 2^23 * X1 + X2, compute
+c   Z = A1 * X2 + A2 * X1  (mod 2^23), and then
+c   X = 2^23 * Z + A2 * X2  (mod 2^46).
+c---------------------------------------------------------------------
+      t1 = r23 * x
+      x1 = int (t1)
+      x2 = x - t23 * x1
+      t1 = a1 * x2 + a2 * x1
+      t2 = int (r23 * t1)
+      z = t1 - t23 * t2
+      t3 = t23 * z + a2 * x2
+      t4 = int (r46 * t3)
+      x = t3 - t46 * t4
+      randlc = r46 * x
+
+      return
+      end
+
+
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine vranlc (n, x, a, y)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c
+c   This routine generates N uniform pseudorandom double precision numbers in
+c   the range (0, 1) by using the linear congruential generator
+c
+c   x_{k+1} = a x_k  (mod 2^46)
+c
+c   where 0 < x_k < 2^46 and 0 < a < 2^46.  This scheme generates 2^44 numbers
+c   before repeating.  The argument A is the same as 'a' in the above formula,
+c   and X is the same as x_0.  A and X must be odd double precision integers
+c   in the range (1, 2^46).  The N results are placed in Y and are normalized
+c   to be between 0 and 1.  X is updated to contain the new seed, so that
+c   subsequent calls to VRANLC using the same arguments will generate a
+c   continuous sequence.  If N is zero, only initialization is performed, and
+c   the variables X, A and Y are ignored.
+c
+c   This routine is the standard version designed for scalar or RISC systems.
+c   However, it should produce the same results on any single processor
+c   computer with at least 48 mantissa bits in double precision floating point
+c   data.  On 64 bit systems, double precision should be disabled.
+c
+c---------------------------------------------------------------------
+
+      implicit none
+
+      integer i,n
+      double precision y,r23,r46,t23,t46,a,x,t1,t2,t3,t4,a1,a2,x1,x2,z
+      dimension y(*)
+      parameter (r23 = 0.5d0 ** 23, r46 = r23 ** 2, t23 = 2.d0 ** 23,
+     >  t46 = t23 ** 2)
+
+
+c---------------------------------------------------------------------
+c   Break A into two parts such that A = 2^23 * A1 + A2.
+c---------------------------------------------------------------------
+      t1 = r23 * a
+      a1 = int (t1)
+      a2 = a - t23 * a1
+
+c---------------------------------------------------------------------
+c   Generate N results.   This loop is not vectorizable.
+c---------------------------------------------------------------------
+      do i = 1, n
+
+c---------------------------------------------------------------------
+c   Break X into two parts such that X = 2^23 * X1 + X2, compute
+c   Z = A1 * X2 + A2 * X1  (mod 2^23), and then
+c   X = 2^23 * Z + A2 * X2  (mod 2^46).
+c---------------------------------------------------------------------
+        t1 = r23 * x
+        x1 = int (t1)
+        x2 = x - t23 * x1
+        t1 = a1 * x2 + a2 * x1
+        t2 = int (r23 * t1)
+        z = t1 - t23 * t2
+        t3 = t23 * z + a2 * x2
+        t4 = int (r46 * t3)
+        x = t3 - t46 * t4
+        y(i) = r46 * x
+      enddo
+
+      return
+      end
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/common/randdpvec.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/common/randdpvec.f
new file mode 100644
index 0000000..c708071
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/common/randdpvec.f
@@ -0,0 +1,186 @@
+c---------------------------------------------------------------------
+      double precision function randlc (x, a)
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c
+c   This routine returns a uniform pseudorandom double precision number in the
+c   range (0, 1) by using the linear congruential generator
+c
+c   x_{k+1} = a x_k  (mod 2^46)
+c
+c   where 0 < x_k < 2^46 and 0 < a < 2^46.  This scheme generates 2^44 numbers
+c   before repeating.  The argument A is the same as 'a' in the above formula,
+c   and X is the same as x_0.  A and X must be odd double precision integers
+c   in the range (1, 2^46).  The returned value RANDLC is normalized to be
+c   between 0 and 1, i.e. RANDLC = 2^(-46) * x_1.  X is updated to contain
+c   the new seed x_1, so that subsequent calls to RANDLC using the same
+c   arguments will generate a continuous sequence.
+c
+c   This routine should produce the same results on any computer with at least
+c   48 mantissa bits in double precision floating point data.  On 64 bit
+c   systems, double precision should be disabled.
+c
+c   David H. Bailey     October 26, 1990
+c
+c---------------------------------------------------------------------
+
+      implicit none
+
+      double precision r23,r46,t23,t46,a,x,t1,t2,t3,t4,a1,a2,x1,x2,z
+      parameter (r23 = 0.5d0 ** 23, r46 = r23 ** 2, t23 = 2.d0 ** 23,
+     >  t46 = t23 ** 2)
+
+c---------------------------------------------------------------------
+c   Break A into two parts such that A = 2^23 * A1 + A2.
+c---------------------------------------------------------------------
+      t1 = r23 * a
+      a1 = int (t1)
+      a2 = a - t23 * a1
+
+c---------------------------------------------------------------------
+c   Break X into two parts such that X = 2^23 * X1 + X2, compute
+c   Z = A1 * X2 + A2 * X1  (mod 2^23), and then
+c   X = 2^23 * Z + A2 * X2  (mod 2^46).
+c---------------------------------------------------------------------
+      t1 = r23 * x
+      x1 = int (t1)
+      x2 = x - t23 * x1
+
+
+      t1 = a1 * x2 + a2 * x1
+      t2 = int (r23 * t1)
+      z = t1 - t23 * t2
+      t3 = t23 * z + a2 * x2
+      t4 = int (r46 * t3)
+      x = t3 - t46 * t4
+      randlc = r46 * x
+      return
+      end
+
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine vranlc (n, x, a, y)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+c---------------------------------------------------------------------
+c   This routine generates N uniform pseudorandom double precision numbers in
+c   the range (0, 1) by using the linear congruential generator
+c   
+c   x_{k+1} = a x_k  (mod 2^46)
+c   
+c   where 0 < x_k < 2^46 and 0 < a < 2^46.  This scheme generates 2^44 numbers
+c   before repeating.  The argument A is the same as 'a' in the above formula,
+c   and X is the same as x_0.  A and X must be odd double precision integers
+c   in the range (1, 2^46).  The N results are placed in Y and are normalized
+c   to be between 0 and 1.  X is updated to contain the new seed, so that
+c   subsequent calls to VRANLC using the same arguments will generate a
+c   continuous sequence.
+c   
+c   This routine generates the output sequence in batches of length NV, for
+c   convenience on vector computers.  This routine should produce the same
+c   results on any computer with at least 48 mantissa bits in double precision
+c   floating point data.  On Cray systems, double precision should be disabled.
+c   
+c   David H. Bailey    August 30, 1990
+c---------------------------------------------------------------------
+
+      integer n
+      double precision x, a, y(*)
+      
+      double precision r23, r46, t23, t46
+      integer nv
+      parameter (r23 = 2.d0 ** (-23), r46 = r23 * r23, t23 = 2.d0 ** 23,
+     >     t46 = t23 * t23, nv = 64)
+      double precision  xv(nv), t1, t2, t3, t4, an, a1, a2, x1, x2, yy
+      integer n1, i, j
+      external randlc
+      double precision randlc
+
+c---------------------------------------------------------------------
+c     Compute the first NV elements of the sequence using RANDLC.
+c---------------------------------------------------------------------
+      t1 = x
+      n1 = min (n, nv)
+
+      do  i = 1, n1
+         xv(i) = t46 * randlc (t1, a)
+      enddo
+
+c---------------------------------------------------------------------
+c     It is not necessary to compute AN, A1 or A2 unless N is greater than NV.
+c---------------------------------------------------------------------
+      if (n .gt. nv) then
+
+c---------------------------------------------------------------------
+c     Compute AN = A ^ NV (mod 2^46) using successive calls to RANDLC.
+c---------------------------------------------------------------------
+         t1 = a
+         t2 = r46 * a
+
+         do  i = 1, nv - 1
+            t2 = randlc (t1, a)
+         enddo
+
+         an = t46 * t2
+
+c---------------------------------------------------------------------
+c     Break AN into two parts such that AN = 2^23 * A1 + A2.
+c---------------------------------------------------------------------
+         t1 = r23 * an
+         a1 = aint (t1)
+         a2 = an - t23 * a1
+      endif
+
+c---------------------------------------------------------------------
+c     Compute N pseudorandom results in batches of size NV.
+c---------------------------------------------------------------------
+      do  j = 0, n - 1, nv
+         n1 = min (nv, n - j)
+
+c---------------------------------------------------------------------
+c     Compute up to NV results based on the current seed vector XV.
+c---------------------------------------------------------------------
+         do  i = 1, n1
+            y(i+j) = r46 * xv(i)
+         enddo
+
+c---------------------------------------------------------------------
+c     If this is the last pass through the outer J loop, it is not necessary
+c     to update the XV vector.
+c---------------------------------------------------------------------
+         if (j + n1 .eq. n) goto 150
+
+c---------------------------------------------------------------------
+c     Update the XV vector by multiplying each element by AN (mod 2^46).
+c---------------------------------------------------------------------
+         do  i = 1, nv
+            t1 = r23 * xv(i)
+            x1 = aint (t1)
+            x2 = xv(i) - t23 * x1
+            t1 = a1 * x2 + a2 * x1
+            t2 = aint (r23 * t1)
+            yy = t1 - t23 * t2
+            t3 = t23 * yy + a2 * x2
+            t4 = aint (r46 * t3)
+            xv(i) = t3 - t46 * t4
+         enddo
+
+      enddo
+
+c---------------------------------------------------------------------
+c     Save the last seed in X so that subsequent calls to VRANLC will generate
+c     a continuous sequence.
+c---------------------------------------------------------------------
+ 150  x = xv(n1)
+
+      return
+      end
+
+c----- end of program ------------------------------------------------
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/common/randi8.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/common/randi8.f
new file mode 100644
index 0000000..21ab881
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/common/randi8.f
@@ -0,0 +1,79 @@
+      double precision function randlc(x, a)
+
+c---------------------------------------------------------------------
+c
+c   This routine returns a uniform pseudorandom double precision number in the
+c   range (0, 1) by using the linear congruential generator
+c
+c   x_{k+1} = a x_k  (mod 2^46)
+c
+c   where 0 < x_k < 2^46 and 0 < a < 2^46.  This scheme generates 2^44 numbers
+c   before repeating.  The argument A is the same as 'a' in the above formula,
+c   and X is the same as x_0.  A and X must be odd double precision integers
+c   in the range (1, 2^46).  The returned value RANDLC is normalized to be
+c   between 0 and 1, i.e. RANDLC = 2^(-46) * x_1.  X is updated to contain
+c   the new seed x_1, so that subsequent calls to RANDLC using the same
+c   arguments will generate a continuous sequence.
+
+      implicit none
+      double precision x, a
+      integer*8 i246m1, Lx, La
+      double precision d2m46
+
+      parameter(d2m46=0.5d0**46)
+
+      save i246m1
+      data i246m1/X'00003FFFFFFFFFFF'/
+
+      Lx = X
+      La = A
+
+      Lx   = iand(Lx*La,i246m1)
+      randlc = d2m46*dble(Lx)
+      x    = dble(Lx)
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+
+      SUBROUTINE VRANLC (N, X, A, Y)
+      implicit none
+      integer n, i
+      double precision x, a, y(*)
+      integer*8 i246m1, Lx, La
+      double precision d2m46
+
+c This doesn't work, because the compiler does the calculation in 32
+c bits and overflows. No standard way (without f90 stuff) to specify
+c that the rhs should be done in 64 bit arithmetic. 
+c      parameter(i246m1=2**46-1)
+
+      parameter(d2m46=0.5d0**46)
+
+      save i246m1
+      data i246m1/X'00003FFFFFFFFFFF'/
+
+c Note that the v6 compiler on an R8000 does something stupid with
+c the above. Using the following instead (or various other things)
+c makes the calculation run almost 10 times as fast. 
+c 
+c      save d2m46
+c      data d2m46/0.0d0/
+c      if (d2m46 .eq. 0.0d0) then
+c         d2m46 = 0.5d0**46
+c      endif
+
+      Lx = X
+      La = A
+      do i = 1, N
+         Lx   = iand(Lx*La,i246m1)
+         y(i) = d2m46*dble(Lx)
+      end do
+      x    = dble(Lx)
+
+      return
+      end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/common/randi8_safe.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/common/randi8_safe.f
new file mode 100644
index 0000000..f725b6a
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/common/randi8_safe.f
@@ -0,0 +1,64 @@
+      double precision function randlc(x, a)
+
+c---------------------------------------------------------------------
+c
+c   This routine returns a uniform pseudorandom double precision number in the
+c   range (0, 1) by using the linear congruential generator
+c
+c   x_{k+1} = a x_k  (mod 2^46)
+c
+c   where 0 < x_k < 2^46 and 0 < a < 2^46.  This scheme generates 2^44 numbers
+c   before repeating.  The argument A is the same as 'a' in the above formula,
+c   and X is the same as x_0.  A and X must be odd double precision integers
+c   in the range (1, 2^46).  The returned value RANDLC is normalized to be
+c   between 0 and 1, i.e. RANDLC = 2^(-46) * x_1.  X is updated to contain
+c   the new seed x_1, so that subsequent calls to RANDLC using the same
+c   arguments will generate a continuous sequence.
+
+      implicit none
+      double precision x, a
+      integer*8 Lx, La, a1, a2, x1, x2, xa
+      double precision d2m46
+      parameter(d2m46=0.5d0**46)
+
+      Lx = x
+      La = A
+      a1 = ibits(La, 23, 23)
+      a2 = ibits(La, 0, 23)
+      x1 = ibits(Lx, 23, 23)
+      x2 = ibits(Lx, 0, 23)
+      xa = ishft(ibits(a1*x2+a2*x1, 0, 23), 23) + a2*x2
+      Lx   = ibits(xa,0, 46)
+      x    = dble(Lx)
+      randlc = d2m46*x
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+
+      SUBROUTINE VRANLC (N, X, A, Y)
+      implicit none
+      integer n, i
+      double precision x, a, y(*)
+      integer*8 Lx, La, a1, a2, x1, x2, xa
+      double precision d2m46
+      parameter(d2m46=0.5d0**46)
+
+      Lx = X
+      La = A
+      a1 = ibits(La, 23, 23)
+      a2 = ibits(La, 0, 23)
+      do i = 1, N
+         x1 = ibits(Lx, 23, 23)
+         x2 = ibits(Lx, 0, 23)
+         xa = ishft(ibits(a1*x2+a2*x1, 0, 23), 23) + a2*x2
+         Lx   = ibits(xa,0, 46)
+         y(i) = d2m46*dble(Lx)
+      end do
+      x = dble(Lx)
+      return
+      end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/common/timers.f b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/common/timers.f
new file mode 100644
index 0000000..59a0888
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/common/timers.f
@@ -0,0 +1,108 @@
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+      
+      subroutine timer_clear(n)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+      integer n
+      
+      double precision start(64), elapsed(64)
+      common /tt/ start, elapsed
+
+      elapsed(n) = 0.0d0
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine timer_start(n)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+      external         elapsed_time
+      double precision elapsed_time
+      integer n
+      double precision start(64), elapsed(64)
+      common /tt/ start, elapsed
+
+      start(n) = elapsed_time()
+
+      return
+      end
+      
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      subroutine timer_stop(n)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+      external         elapsed_time
+      double precision elapsed_time
+      integer n
+      double precision start(64), elapsed(64)
+      common /tt/ start, elapsed
+      double precision t, now
+      now = elapsed_time()
+      t = now - start(n)
+      elapsed(n) = elapsed(n) + t
+
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      double precision function timer_read(n)
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+      integer n
+      double precision start(64), elapsed(64)
+      common /tt/ start, elapsed
+      
+      timer_read = elapsed(n)
+      return
+      end
+
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      double precision function elapsed_time()
+
+c---------------------------------------------------------------------
+c---------------------------------------------------------------------
+
+      implicit none
+
+      double precision t
+
+c This function must measure wall clock time, not CPU time. 
+c Since there is no portable timer in Fortran (77)
+c we call a routine compiled in C (though the C source may have
+c to be tweaked). 
+      call wtime(t)
+c The following is not ok for "official" results because it reports
+c CPU time not wall clock time. It may be useful for developing/testing
+c on timeshared Crays, though. 
+c     call second(t)
+
+      elapsed_time = t
+
+      return
+      end
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/common/wtime.c b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/common/wtime.c
new file mode 100644
index 0000000..53fea18
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/common/wtime.c
@@ -0,0 +1,16 @@
+#include "wtime.h"
+#include <time.h>
+#ifndef DOS
+#include <sys/time.h>
+#endif
+
+void wtime(double *t)
+{
+  static int sec = -1;
+  struct timeval tv;
+  gettimeofday(&tv, (void *)0);
+  if (sec < 0) sec = tv.tv_sec;
+  *t = (tv.tv_sec - sec) + 1.0e-6*tv.tv_usec;
+}
+
+    
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/common/wtime.h b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/common/wtime.h
new file mode 100644
index 0000000..12eb0cb
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/common/wtime.h
@@ -0,0 +1,12 @@
+/* C/Fortran interface is different on different machines. 
+ * You may need to tweak this.
+ */
+
+
+#if defined(IBM)
+#define wtime wtime
+#elif defined(CRAY)
+#define wtime WTIME
+#else
+#define wtime wtime_
+#endif
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/common/wtime_sgi64.c b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/common/wtime_sgi64.c
new file mode 100644
index 0000000..d08d50c
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/common/wtime_sgi64.c
@@ -0,0 +1,74 @@
+#include <sys/types.h>
+#include <fcntl.h>
+#include <sys/mman.h>
+#include <sys/syssgi.h>
+#include <sys/immu.h>
+#include <errno.h>
+#include <stdio.h>
+
+/* The following works on SGI Power Challenge systems */
+
+typedef unsigned long iotimer_t;
+
+unsigned int cycleval;
+volatile iotimer_t *iotimer_addr, base_counter;
+double resolution;
+
+/* address_t is an integer type big enough to hold an address */
+typedef unsigned long address_t;
+
+
+
+void timer_init() 
+{
+  
+  int fd;
+  char *virt_addr;
+  address_t phys_addr, page_offset, pagemask, pagebase_addr;
+  
+  pagemask = getpagesize() - 1;
+  errno = 0;
+  phys_addr = syssgi(SGI_QUERY_CYCLECNTR, &cycleval);
+  if (errno != 0) {
+    perror("SGI_QUERY_CYCLECNTR");
+    exit(1);
+  }
+  /* page_offset = page offset of physical address */
+  page_offset = phys_addr & pagemask;
+  pagebase_addr = phys_addr - page_offset;
+  fd = open("/dev/mmem", O_RDONLY);
+
+  virt_addr = mmap(0, pagemask, PROT_READ, MAP_PRIVATE, fd, pagebase_addr);
+  virt_addr = virt_addr + page_offset;
+  iotimer_addr = (iotimer_t *)virt_addr;
+  /* cycleval is in picoseconds; scaling by 1.0e-12 gives seconds */
+  resolution = 1.0e-12*cycleval; 
+  base_counter = *iotimer_addr;
+}
+
+void wtime_(double *time) 
+{
+  static int initialized = 0;
+  volatile iotimer_t counter_value;
+  if (!initialized) { 
+    timer_init();
+    initialized = 1;
+  }
+  counter_value = *iotimer_addr - base_counter;
+  *time = (double)counter_value * resolution;
+}
+
+
+void wtime(double *time) 
+{
+  static int initialized = 0;
+  volatile iotimer_t counter_value;
+  if (!initialized) { 
+    timer_init();
+    initialized = 1;
+  }
+  counter_value = *iotimer_addr - base_counter;
+  *time = (double)counter_value * resolution;
+}
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/README b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/README
new file mode 100644
index 0000000..ae535e9
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/README
@@ -0,0 +1,7 @@
+This directory contains examples of make.def files that were used 
+by the NPB team in testing the benchmarks on different platforms. 
+They can be used as starting points for make.def files for your 
+own platform, but you may need to tailor them for best performance 
+on your installation. A clean template can be found in directory 
+`config'.
+Some examples of suite.def files are also provided.
\ No newline at end of file
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/make.def_crayx1 b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/make.def_crayx1
new file mode 100644
index 0000000..1e89d2d
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/make.def_crayx1
@@ -0,0 +1,143 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP and BT, which are in Fortran, the following must 
+# be defined:
+#
+# F77        - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# F_INC      - any -I arguments required for compiling Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# F_LIB      - any -L and -l arguments required for linking Fortran 
+# 
+# compilations are done with $(F77) $(F_INC) $(FFLAGS) or
+#                            $(F77) $(FFLAGS)
+# linking is done with       $(FLINK) $(F_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the fortran compiler used for Fortran programs
+#---------------------------------------------------------------------------
+F77 = f77
+# This links fortran programs; usually the same as ${F77}
+FLINK	= $(F77)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+F_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+F_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -O3 
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = -O3
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS, which is in C, the following must be defined:
+#
+# CC         - C compiler 
+# CFLAGS     - C compilation arguments
+# C_INC      - any -I arguments required for compiling C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# C_LIB      - any -L and -l arguments required for linking C 
+#
+# compilations are done with $(CC) $(C_INC) $(CFLAGS) or
+#                            $(CC) $(CFLAGS)
+# linking is done with       $(CLINK) $(C_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for C programs
+#---------------------------------------------------------------------------
+CC = cc
+# This links C programs; usually the same as ${CC}
+CLINK	= $(CC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+C_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+C_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+#---------------------------------------------------------------------------
+CFLAGS	= -O3
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = -O3
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+UCC	= cc
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+
+#---------------------------------------------------------------------------
+# The variable WTIME is the name of the wtime source code module in the
+# NPB3.x/common directory.  
+#---------------------------------------------------------------------------
+WTIME  = wtime.c
+
+
+#---------------------------------------------------------------------------
+# Enable if either Cray or IBM: 
+# (no such flag for most machines: see common/wtime.h)
+# This is used by the C compiler to pass the machine name to common/wtime.h,
+# where the C/Fortran binding interface format is determined
+#---------------------------------------------------------------------------
+# MACHINE	=	-DCRAYX1
+
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/make.def_gcc_x86 b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/make.def_gcc_x86
new file mode 100644
index 0000000..8039cbf
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/make.def_gcc_x86
@@ -0,0 +1,167 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP, BT and UA, which are in Fortran, the following 
+# must be defined:
+#
+# F77        - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# F_INC      - any -I arguments required for compiling Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# F_LIB      - any -L and -l arguments required for linking Fortran 
+# 
+# compilations are done with $(F77) $(F_INC) $(FFLAGS) or
+#                            $(F77) $(FFLAGS)
+# linking is done with       $(FLINK) $(F_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the fortran compiler used for Fortran programs
+#---------------------------------------------------------------------------
+F77 = gfortran
+
+#---------------------------------------------------------------------------
+# This links fortran programs; usually the same as ${F77}
+#---------------------------------------------------------------------------
+FLINK	= $(F77)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+F_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+F_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -O3 -mcmodel=medium
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = -O3 -mcmodel=medium
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS and DC, which are in C, the following must be defined:
+#
+# CC         - C compiler 
+# CFLAGS     - C compilation arguments
+# C_INC      - any -I arguments required for compiling C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# C_LIB      - any -L and -l arguments required for linking C 
+#
+# compilations are done with $(CC) $(C_INC) $(CFLAGS) or
+#                            $(CC) $(CFLAGS)
+# linking is done with       $(CLINK) $(C_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for C programs
+#---------------------------------------------------------------------------
+CC = gcc
+
+#---------------------------------------------------------------------------
+# This links C programs; usually the same as ${CC}
+#---------------------------------------------------------------------------
+CLINK	= $(CC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+C_LIB  = -lm
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+C_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+# DC inspects the following flags (preceded by "-D"):
+#
+# IN_CORE - computes all views and checksums in main memory (if there is 
+# enough memory)
+#
+# VIEW_FILE_OUTPUT - forces DC to write the generated views to disk
+#
+# OPTIMIZATION - turns on some nonstandard DC optimizations
+#
+# _FILE_OFFSET_BITS=64 
+# _LARGEFILE64_SOURCE - are standard compiler flags that allow working with
+# files larger than 2GB.
+#---------------------------------------------------------------------------
+CFLAGS	= -O3 -mcmodel=medium
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = -O3 -mcmodel=medium
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+UCC	= gcc
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+# RAND   = randi8
+# The following is highly reliable but may be slow:
+RAND   = randdp
+
+
+#---------------------------------------------------------------------------
+# The variable WTIME is the name of the wtime source code module in the
+# common directory.  
+# For most machines,       use wtime.c
+# For SGI power challenge: use wtime_sgi64.c
+#---------------------------------------------------------------------------
+WTIME  = wtime.c
+
+
+#---------------------------------------------------------------------------
+# Enable if either Cray or IBM: 
+# (no such flag for most machines: see common/wtime.h)
+# This is used by the C compiler to pass the machine name to common/wtime.h,
+# where the C/Fortran binding interface format is determined
+#---------------------------------------------------------------------------
+# MACHINE	=	-DCRAY
+# MACHINE	=	-DIBM
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/make.def_ibm b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/make.def_ibm
new file mode 100644
index 0000000..fc85c8f
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/make.def_ibm
@@ -0,0 +1,152 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP and BT, which are in Fortran, the following must 
+# be defined:
+#
+# F77        - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# F_INC      - any -I arguments required for compiling Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# F_LIB      - any -L and -l arguments required for linking Fortran 
+# 
+# compilations are done with $(F77) $(F_INC) $(FFLAGS) or
+#                            $(F77) $(FFLAGS)
+# linking is done with       $(FLINK) $(F_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the fortran compiler used for Fortran programs
+#---------------------------------------------------------------------------
+F77 = xlf_r
+# This links fortran programs; usually the same as ${F77}
+FLINK	= $(F77)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+#F_LIB  = -lmass
+F_LIB  = 
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+F_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -O3 -qnosave
+# FFLAGS = -g
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = -O3 -bmaxdata:0x80000000 -bmaxstack:0x10000000
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS, which is in C, the following must be defined:
+#
+# CC         - C compiler 
+# CFLAGS     - C compilation arguments
+# C_INC      - any -I arguments required for compiling C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# C_LIB      - any -L and -l arguments required for linking C 
+#
+# compilations are done with $(CC) $(C_INC) $(CFLAGS) or
+#                            $(CC) $(CFLAGS)
+# linking is done with       $(CLINK) $(C_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for C programs
+#---------------------------------------------------------------------------
+CC = xlc_r
+# This links C programs; usually the same as ${CC}
+CLINK	= $(CC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+C_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+C_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+#---------------------------------------------------------------------------
+CFLAGS	= -O3
+# CFLAGS = -g
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = -O3
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+UCC	= cc
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
+
+#---------------------------------------------------------------------------
+# The variable WTIME is the name of the wtime source code module in the
+# NPB3.x/common directory.  
+# For most machines,       use wtime.c
+# For SGI power challenge: use wtime_sgi64.c
+#---------------------------------------------------------------------------
+WTIME  = wtime.c
+
+
+#---------------------------------------------------------------------------
+# Enable if either Cray or IBM: 
+# (no such flag for most machines: see common/wtime.h)
+# This is used by the C compiler to pass the machine name to common/wtime.h,
+# where the C/Fortran binding interface format is determined
+#---------------------------------------------------------------------------
+# MACHINE	=	-DCRAY
+MACHINE	=	-DIBM
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/make.def_ibm64 b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/make.def_ibm64
new file mode 100644
index 0000000..d5fd22d
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/make.def_ibm64
@@ -0,0 +1,167 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP, BT and UA, which are in Fortran, the following 
+# must be defined:
+#
+# F77        - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# F_INC      - any -I arguments required for compiling Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# F_LIB      - any -L and -l arguments required for linking Fortran 
+# 
+# compilations are done with $(F77) $(F_INC) $(FFLAGS) or
+#                            $(F77) $(FFLAGS)
+# linking is done with       $(FLINK) $(F_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the Fortran compiler used for Fortran programs
+#---------------------------------------------------------------------------
+F77 = xlf -q64
+
+#---------------------------------------------------------------------------
+# This links Fortran programs; usually the same as ${F77}
+#---------------------------------------------------------------------------
+FLINK	= $(F77)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+F_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+F_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -O3 -qarch=auto -qtune=auto -qhot -qnosave
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = -O3 -qarch=auto
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS and DC, which are in C, the following must be defined:
+#
+# CC         - C compiler 
+# CFLAGS     - C compilation arguments
+# C_INC      - any -I arguments required for compiling C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# C_LIB      - any -L and -l arguments required for linking C 
+#
+# compilations are done with $(CC) $(C_INC) $(CFLAGS) or
+#                            $(CC) $(CFLAGS)
+# linking is done with       $(CLINK) $(C_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for C programs
+#---------------------------------------------------------------------------
+CC = xlc -q64
+
+#---------------------------------------------------------------------------
+# This links C programs; usually the same as ${CC}
+#---------------------------------------------------------------------------
+CLINK	= $(CC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+C_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+C_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+# DC inspects the following flags (preceded by "-D"):
+#
+# IN_CORE - computes all views and checksums in main memory (if there is 
+# enough memory)
+#
+# VIEW_FILE_OUTPUT - forces DC to write the generated views to disk
+#
+# OPTIMIZATION - turns on some nonstandard DC optimizations
+#
+# _FILE_OFFSET_BITS=64 
+# _LARGEFILE64_SOURCE - are standard compiler flags which allow working with
+# files larger than 2GB.
+#---------------------------------------------------------------------------
+CFLAGS	= -O3 -qarch=auto -qtune=auto -qhot
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = -O3 -qarch=auto
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+UCC	= cc
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
+
+#---------------------------------------------------------------------------
+# The variable WTIME is the name of the wtime source code module in the
+# NPB3.x/common directory.  
+# For most machines,       use wtime.c
+# For SGI power challenge: use wtime_sgi64.c
+#---------------------------------------------------------------------------
+WTIME  = wtime.c
+
+
+#---------------------------------------------------------------------------
+# Enable if either Cray or IBM: 
+# (no such flag for most machines: see common/wtime.h)
+# This is used by the C compiler to pass the machine name to common/wtime.h,
+# where the C/Fortran binding interface format is determined
+#---------------------------------------------------------------------------
+# MACHINE	=	-DCRAY
+MACHINE	=	-DIBM
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/make.def_intel b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/make.def_intel
new file mode 100644
index 0000000..d6532f9
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/make.def_intel
@@ -0,0 +1,149 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP and BT, which are in Fortran, the following must 
+# be defined:
+#
+# F77        - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# F_INC      - any -I arguments required for compiling Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# F_LIB      - any -L and -l arguments required for linking Fortran 
+# 
+# compilations are done with $(F77) $(F_INC) $(FFLAGS) or
+#                            $(F77) $(FFLAGS)
+# linking is done with       $(FLINK) $(F_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the Fortran compiler used for Fortran programs
+#---------------------------------------------------------------------------
+F77 = ifort
+# This links Fortran programs; usually the same as ${F77}
+FLINK	= $(F77)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+F_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+F_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -O3
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = -O3
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS, which is in C, the following must be defined:
+#
+# CC         - C compiler 
+# CFLAGS     - C compilation arguments
+# C_INC      - any -I arguments required for compiling C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# C_LIB      - any -L and -l arguments required for linking C 
+#
+# compilations are done with $(CC) $(C_INC) $(CFLAGS) or
+#                            $(CC) $(CFLAGS)
+# linking is done with       $(CLINK) $(C_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for C programs
+#---------------------------------------------------------------------------
+CC = icc
+# This links C programs; usually the same as ${CC}
+CLINK	= $(CC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+C_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+C_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+#---------------------------------------------------------------------------
+CFLAGS	= -O3
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = -O3
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+UCC	= icc
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
+
+#---------------------------------------------------------------------------
+# The variable WTIME is the name of the wtime source code module in the
+# NPB3.x/common directory.  
+# For most machines,       use wtime.c
+# For SGI power challenge: use wtime_sgi64.c
+#---------------------------------------------------------------------------
+WTIME  = wtime.c
+
+
+#---------------------------------------------------------------------------
+# Enable if either Cray or IBM: 
+# (no such flag for most machines: see common/wtime.h)
+# This is used by the C compiler to pass the machine name to common/wtime.h,
+# where the C/Fortran binding interface format is determined
+#---------------------------------------------------------------------------
+# MACHINE	=	-DCRAY
+# MACHINE	=	-DIBM
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/make.def_pgi b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/make.def_pgi
new file mode 100644
index 0000000..1da658e
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/make.def_pgi
@@ -0,0 +1,149 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP and BT, which are in Fortran, the following must 
+# be defined:
+#
+# F77        - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# F_INC      - any -I arguments required for compiling Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# F_LIB      - any -L and -l arguments required for linking Fortran 
+# 
+# compilations are done with $(F77) $(F_INC) $(FFLAGS) or
+#                            $(F77) $(FFLAGS)
+# linking is done with       $(FLINK) $(F_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the Fortran compiler used for Fortran programs
+#---------------------------------------------------------------------------
+F77 = pgf90
+# This links Fortran programs; usually the same as ${F77}
+FLINK	= $(F77)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+F_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+F_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -O3
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = -O3
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS, which is in C, the following must be defined:
+#
+# CC         - C compiler 
+# CFLAGS     - C compilation arguments
+# C_INC      - any -I arguments required for compiling C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# C_LIB      - any -L and -l arguments required for linking C 
+#
+# compilations are done with $(CC) $(C_INC) $(CFLAGS) or
+#                            $(CC) $(CFLAGS)
+# linking is done with       $(CLINK) $(C_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for C programs
+#---------------------------------------------------------------------------
+CC = pgcc
+# This links C programs; usually the same as ${CC}
+CLINK	= $(CC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+C_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+C_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+#---------------------------------------------------------------------------
+CFLAGS	= -O3 
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = -O3
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+UCC	= pgcc
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
+
+#---------------------------------------------------------------------------
+# The variable WTIME is the name of the wtime source code module in the
+# NPB3.x/common directory.  
+# For most machines,       use wtime.c
+# For SGI power challenge: use wtime_sgi64.c
+#---------------------------------------------------------------------------
+WTIME  = wtime.c
+
+
+#---------------------------------------------------------------------------
+# Enable if either Cray or IBM: 
+# (no such flag for most machines: see common/wtime.h)
+# This is used by the C compiler to pass the machine name to common/wtime.h,
+# where the C/Fortran binding interface format is determined
+#---------------------------------------------------------------------------
+# MACHINE	=	-DCRAY
+# MACHINE	=	-DIBM
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/make.def_sgi b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/make.def_sgi
new file mode 100644
index 0000000..d04927e
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/make.def_sgi
@@ -0,0 +1,152 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP and BT, which are in Fortran, the following must 
+# be defined:
+#
+# F77        - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# F_INC      - any -I arguments required for compiling Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# F_LIB      - any -L and -l arguments required for linking Fortran 
+# 
+# compilations are done with $(F77) $(F_INC) $(FFLAGS) or
+#                            $(F77) $(FFLAGS)
+# linking is done with       $(FLINK) $(F_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the Fortran compiler used for Fortran programs
+#---------------------------------------------------------------------------
+F77 = f77
+# This links Fortran programs; usually the same as ${F77}
+FLINK	= $(F77)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+F_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+F_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -O3
+# FFLAGS = -g
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = -O3
+#FLINKFLAGS =
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS, which is in C, the following must be defined:
+#
+# CC         - C compiler 
+# CFLAGS     - C compilation arguments
+# C_INC      - any -I arguments required for compiling C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# C_LIB      - any -L and -l arguments required for linking C 
+#
+# compilations are done with $(CC) $(C_INC) $(CFLAGS) or
+#                            $(CC) $(CFLAGS)
+# linking is done with       $(CLINK) $(C_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for C programs
+#---------------------------------------------------------------------------
+CC = cc
+# This links C programs; usually the same as ${CC}
+CLINK	= $(CC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+C_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+C_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+#---------------------------------------------------------------------------
+CFLAGS	= -O3
+# CFLAGS = -g
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = -O3
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+UCC	= cc
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
+
+#---------------------------------------------------------------------------
+# The variable WTIME is the name of the wtime source code module in the
+# NPB3.x/common directory.  
+# For most machines,       use wtime.c
+# For SGI power challenge: use wtime_sgi64.c
+#---------------------------------------------------------------------------
+WTIME  = wtime.c
+
+
+#---------------------------------------------------------------------------
+# Enable if either Cray or IBM: 
+# (no such flag for most machines: see common/wtime.h)
+# This is used by the C compiler to pass the machine name to common/wtime.h,
+# where the C/Fortran binding interface format is determined
+#---------------------------------------------------------------------------
+# MACHINE	=	-DCRAY
+# MACHINE	=	-DIBM
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/make.def_sgi64 b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/make.def_sgi64
new file mode 100644
index 0000000..7abcfcc
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/make.def_sgi64
@@ -0,0 +1,153 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+# (Note these definitions are inconsistent with NPB2.1.)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP and BT, which are in Fortran, the following must 
+# be defined:
+#
+# F77        - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# F_INC      - any -I arguments required for compiling Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# F_LIB      - any -L and -l arguments required for linking Fortran 
+# 
+# compilations are done with $(F77) $(F_INC) $(FFLAGS) or
+#                            $(F77) $(FFLAGS)
+# linking is done with       $(FLINK) $(F_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the Fortran compiler used for Fortran programs
+#---------------------------------------------------------------------------
+F77 = f77 -64
+# This links Fortran programs; usually the same as ${F77}
+FLINK	= $(F77)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+F_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+F_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -O3
+# FFLAGS = -g
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = -O3
+#FLINKFLAGS =
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS, which is in C, the following must be defined:
+#
+# CC         - C compiler 
+# CFLAGS     - C compilation arguments
+# C_INC      - any -I arguments required for compiling C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# C_LIB      - any -L and -l arguments required for linking C 
+#
+# compilations are done with $(CC) $(C_INC) $(CFLAGS) or
+#                            $(CC) $(CFLAGS)
+# linking is done with       $(CLINK) $(C_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for C programs
+#---------------------------------------------------------------------------
+CC = cc -64
+# This links C programs; usually the same as ${CC}
+CLINK	= $(CC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+C_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+C_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+#---------------------------------------------------------------------------
+CFLAGS	= -O3
+# CFLAGS = -g
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = -O3
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+UCC	= cc
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
+
+#---------------------------------------------------------------------------
+# The variable WTIME is the name of the wtime source code module in the
+# NPB3.x/common directory.  
+# For most machines,       use wtime.c
+# For SGI power challenge: use wtime_sgi64.c
+#---------------------------------------------------------------------------
+WTIME  = wtime_sgi64.c
+
+
+#---------------------------------------------------------------------------
+# Enable if either Cray or IBM: 
+# (no such flag for most machines: see common/wtime.h)
+# This is used by the C compiler to pass the machine name to common/wtime.h,
+# where the C/Fortran binding interface format is determined
+#---------------------------------------------------------------------------
+# MACHINE	=	-DCRAY
+# MACHINE	=	-DIBM
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/make.def_sun b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/make.def_sun
new file mode 100644
index 0000000..78f360c
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/make.def_sun
@@ -0,0 +1,152 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP and BT, which are in Fortran, the following must 
+# be defined:
+#
+# F77        - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# F_INC      - any -I arguments required for compiling Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# F_LIB      - any -L and -l arguments required for linking Fortran 
+# 
+# compilations are done with $(F77) $(F_INC) $(FFLAGS) or
+#                            $(F77) $(FFLAGS)
+# linking is done with       $(FLINK) $(F_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the fortran compiler used for Fortran programs
+#---------------------------------------------------------------------------
+F77 = f90
+# This links fortran programs; usually the same as ${F77}
+FLINK	= $(F77)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+F_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+F_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -fast
+# FFLAGS = -g
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = -fast
+#FLINKFLAGS =
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS, which is in C, the following must be defined:
+#
+# CC         - C compiler 
+# CFLAGS     - C compilation arguments
+# C_INC      - any -I arguments required for compiling C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# C_LIB      - any -L and -l arguments required for linking C 
+#
+# compilations are done with $(CC) $(C_INC) $(CFLAGS) or
+#                            $(CC) $(CFLAGS)
+# linking is done with       $(CLINK) $(C_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for C programs
+#---------------------------------------------------------------------------
+CC = cc
+# This links C programs; usually the same as ${CC}
+CLINK	= $(CC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+C_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+C_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+#---------------------------------------------------------------------------
+CFLAGS	= -fast
+# CFLAGS = -g
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = -fast
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+UCC	= cc
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
+
+#---------------------------------------------------------------------------
+# The variable WTIME is the name of the wtime source code module in the
+# NPB3.x/common directory.  
+# For most machines,       use wtime.c
+# For SGI power challenge: use wtime_sgi64.c
+#---------------------------------------------------------------------------
+WTIME  = wtime.c
+
+
+#---------------------------------------------------------------------------
+# Enable if either Cray or IBM: 
+# (no such flag for most machines: see common/wtime.h)
+# This is used by the C compiler to pass the machine name to common/wtime.h,
+# where the C/Fortran binding interface format is determined
+#---------------------------------------------------------------------------
+# MACHINE	=	-DCRAY
+# MACHINE	=	-DIBM
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/make.def_sun64 b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/make.def_sun64
new file mode 100644
index 0000000..fa605d4
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/make.def_sun64
@@ -0,0 +1,151 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP and BT, which are in Fortran, the following must 
+# be defined:
+#
+# F77        - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# F_INC      - any -I arguments required for compiling Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# F_LIB      - any -L and -l arguments required for linking Fortran 
+# 
+# compilations are done with $(F77) $(F_INC) $(FFLAGS) or
+#                            $(F77) $(FFLAGS)
+# linking is done with       $(FLINK) $(F_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the fortran compiler used for Fortran programs
+#---------------------------------------------------------------------------
+F77 = f90
+# This links fortran programs; usually the same as ${F77}
+FLINK	= $(F77)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+F_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+F_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -fast -xarch=native64
+# FFLAGS = -g
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = -fast -xarch=native64
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS, which is in C, the following must be defined:
+#
+# CC         - C compiler 
+# CFLAGS     - C compilation arguments
+# C_INC      - any -I arguments required for compiling C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# C_LIB      - any -L and -l arguments required for linking C 
+#
+# compilations are done with $(CC) $(C_INC) $(CFLAGS) or
+#                            $(CC) $(CFLAGS)
+# linking is done with       $(CLINK) $(C_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for C programs
+#---------------------------------------------------------------------------
+CC = cc
+# This links C programs; usually the same as ${CC}
+CLINK	= $(CC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+C_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+C_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+#---------------------------------------------------------------------------
+CFLAGS	= -fast -xarch=native64
+# CFLAGS = -g
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = -fast -xarch=native64
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+UCC	= cc
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
+
+#---------------------------------------------------------------------------
+# The variable WTIME is the name of the wtime source code module in the
+# NPB3.x/common directory.  
+# For most machines,       use wtime.c
+# For SGI power challenge: use wtime_sgi64.c
+#---------------------------------------------------------------------------
+WTIME  = wtime.c
+
+
+#---------------------------------------------------------------------------
+# Enable if either Cray or IBM: 
+# (no such flag for most machines: see common/wtime.h)
+# This is used by the C compiler to pass the machine name to common/wtime.h,
+# where the C/Fortran binding interface format is determined
+#---------------------------------------------------------------------------
+# MACHINE	=	-DCRAY
+# MACHINE	=	-DIBM
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/suite.def.bt b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/suite.def.bt
new file mode 100644
index 0000000..66d59b0
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/suite.def.bt
@@ -0,0 +1,6 @@
+bt	S
+bt	W
+bt	A
+bt	B
+bt	C
+bt	D
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/suite.def.cg b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/suite.def.cg
new file mode 100644
index 0000000..c960817
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/suite.def.cg
@@ -0,0 +1,6 @@
+cg	S
+cg	W
+cg	A
+cg	B
+cg	C
+cg	D
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/suite.def.ep b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/suite.def.ep
new file mode 100644
index 0000000..a0491d3
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/suite.def.ep
@@ -0,0 +1,6 @@
+ep	S
+ep	W
+ep	A
+ep	B
+ep	C
+ep	D
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/suite.def.ft b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/suite.def.ft
new file mode 100644
index 0000000..100ae4f
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/suite.def.ft
@@ -0,0 +1,6 @@
+ft	S
+ft	W
+ft	A
+ft	B
+ft	C
+ft	D
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/suite.def.is b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/suite.def.is
new file mode 100644
index 0000000..3a0b05d
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/suite.def.is
@@ -0,0 +1,5 @@
+is	S
+is	W
+is	A
+is	B
+is	C
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/suite.def.lu b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/suite.def.lu
new file mode 100644
index 0000000..583de7e
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/suite.def.lu
@@ -0,0 +1,6 @@
+lu	S
+lu	W
+lu	A
+lu	B
+lu	C
+lu	D
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/suite.def.mg b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/suite.def.mg
new file mode 100644
index 0000000..1df86a9
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/suite.def.mg
@@ -0,0 +1,6 @@
+mg	S
+mg	W
+mg	A
+mg	B
+mg	C
+mg	D
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/suite.def.sp b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/suite.def.sp
new file mode 100644
index 0000000..8b5a9ba
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/NAS.samples/suite.def.sp
@@ -0,0 +1,6 @@
+sp	S
+sp	W
+sp	A
+sp	B
+sp	C
+sp	D
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/make.def.template b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/make.def.template
new file mode 100644
index 0000000..b4a5181
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/make.def.template
@@ -0,0 +1,167 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP, BT and UA, which are in Fortran, the following 
+# must be defined:
+#
+# F77        - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# F_INC      - any -I arguments required for compiling Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# F_LIB      - any -L and -l arguments required for linking Fortran 
+# 
+# compilations are done with $(F77) $(F_INC) $(FFLAGS) or
+#                            $(F77) $(FFLAGS)
+# linking is done with       $(FLINK) $(F_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the fortran compiler used for Fortran programs
+#---------------------------------------------------------------------------
+F77 = f77
+
+#---------------------------------------------------------------------------
+# This links fortran programs; usually the same as ${F77}
+#---------------------------------------------------------------------------
+FLINK	= $(F77)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+F_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+F_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -O
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = -O
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS and DC, which are in C, the following must be defined:
+#
+# CC         - C compiler 
+# CFLAGS     - C compilation arguments
+# C_INC      - any -I arguments required for compiling C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# C_LIB      - any -L and -l arguments required for linking C 
+#
+# compilations are done with $(CC) $(C_INC) $(CFLAGS) or
+#                            $(CC) $(CFLAGS)
+# linking is done with       $(CLINK) $(C_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for C programs
+#---------------------------------------------------------------------------
+CC = cc
+
+#---------------------------------------------------------------------------
+# This links C programs; usually the same as ${CC}
+#---------------------------------------------------------------------------
+CLINK	= $(CC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+C_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+C_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+# DC inspects the following flags (preceded by "-D"):
+#
+# IN_CORE - computes all views and checksums in main memory (if there is 
+# enough memory)
+#
+# VIEW_FILE_OUTPUT - forces DC to write the generated views to disk
+#
+# OPTIMIZATION - turns on some nonstandard DC optimizations
+#
+# _FILE_OFFSET_BITS=64 
+# _LARGEFILE64_SOURCE - are standard compiler flags that allow working with
+# files larger than 2GB.
+#---------------------------------------------------------------------------
+CFLAGS	= -O
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = -O
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+UCC	= cc
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
+
+#---------------------------------------------------------------------------
+# The variable WTIME is the name of the wtime source code module in the
+# common directory.  
+# For most machines,       use wtime.c
+# For SGI power challenge: use wtime_sgi64.c
+#---------------------------------------------------------------------------
+WTIME  = wtime.c
+
+
+#---------------------------------------------------------------------------
+# Enable if either Cray or IBM: 
+# (no such flag for most machines: see common/wtime.h)
+# This is used by the C compiler to pass the machine name to common/wtime.h,
+# where the C/Fortran binding interface format is determined
+#---------------------------------------------------------------------------
+# MACHINE	=	-DCRAY
+# MACHINE	=	-DIBM
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/suite.def.template b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/suite.def.template
new file mode 100644
index 0000000..fda5d0b
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/config/suite.def.template
@@ -0,0 +1,22 @@
+# config/suite.def
+# This file is used to build several benchmarks with a single command. 
+# Typing "make suite" in the main directory will build all the benchmarks
+# specified in this file. 
+# Each line of this file contains a benchmark name and a class.
+# The name is one of "cg", "is", "dc", "ep", "mg", "ft", "sp",
+#  "bt", "lu", and "ua". 
+# The class is one of "S", "W", "A", "B", and "C"
+# (classes D and E are defined for a number of benchmarks, but they
+#  are likely not practical to run in serial. See README.install).
+# No blank lines. 
+# The following example builds serial sample sizes of all benchmarks. 
+ft	S
+mg	S
+sp	S
+lu	S
+bt	S
+is	S
+ep	S
+cg	S
+ua	S
+dc      S
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/sys/Makefile b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/sys/Makefile
new file mode 100644
index 0000000..b0bf4e9
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/sys/Makefile
@@ -0,0 +1,22 @@
+UCC = cc
+include ../config/make.def
+
+# Note that FCOMPILE is also defined in make.common and should
+# be the same. We can't include make.common because it has a lot
+# of other garbage. 
+FCOMPILE = $(F77) -c $(F_INC) $(FFLAGS)
+
+all: setparams 
+
+# setparams creates an npbparams.h file for each benchmark 
+# configuration. npbparams.h also contains info about how a benchmark
+# was compiled and linked
+
+setparams: setparams.c ../config/make.def
+	$(UCC) ${CONVERTFLAG} -o setparams setparams.c
+
+
+clean: 
+	-rm -f setparams setparams.h npbparams.h
+	-rm -f *~ *.o
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/sys/README b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/sys/README
new file mode 100644
index 0000000..ede69b5
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/sys/README
@@ -0,0 +1,41 @@
+This directory contains utilities and files used by the 
+build process. You should not need to change anything
+in this directory. 
+
+Original Files
+--------------
+setparams.c:
+        Source for the setparams program. This program is used internally
+        in the build process to create the file "npbparams.h" for each 
+        benchmark. npbparams.h contains Fortran or C parameters to build a 
+        benchmark for a specific class. The setparams program is never run 
+        directly by a user. Its invocation syntax is 
+
+            "setparams benchmark-name class". 
+
+        It examines the file "npbparams.h" in the current directory. If 
+        the specified parameters are the same as those in the npbparams.h 
+        file, nothing is changed. If the file does not exist or corresponds 
+        to a different class/number of nodes, it is (re)built. 
+        One of the more complicated things in npbparams.h is that it 
+        contains, in a Fortran string, the compiler flags used to build a 
+        benchmark, so that a benchmark can print out how it was compiled. 
+
+make.common
+        A makefile segment that is included in each individual benchmark
+        program makefile. It sets up some standard macros (COMPILE, etc) 
+        and makes sure everything is configured correctly (npbparams.h)
+
+Makefile
+        Builds  setparams
+
+README
+        This file. 
+
+
+Created files
+-------------
+
+setparams
+	See descriptions above
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/sys/make.common b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/sys/make.common
new file mode 100644
index 0000000..e089415
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/sys/make.common
@@ -0,0 +1,63 @@
+PROGRAM  = $(BINDIR)/$(BENCHMARK).$(CLASS).x
+FCOMPILE = $(F77) -c $(F_INC) $(FFLAGS)
+CCOMPILE = $(CC)  -c $(C_INC) $(CFLAGS)
+CCOMPILE_pp = $(CC_pp)  -c $(C_INC_pp) $(CFLAGS_pp)
+
+# Class "U" is used internally by the setparams program to mean
+# "unknown". This means that if you don't specify CLASS=
+# on the command line, you'll get an error. It would be nice
+# to be able to avoid this, but we'd have to get information
+# from the setparams back to the make program, which isn't easy. 
+CLASS=U
+
+default:: ${PROGRAM}
+
+# This makes sure the configuration utility setparams 
+# is up to date. 
+# Note that this must be run every time, which is why the
+# target does not exist and is not created. 
+# If you create a file called "config" you will break things. 
+config:
+	@cd ../sys; ${MAKE} all
+	../sys/setparams ${BENCHMARK} ${CLASS}
+
+COMMON=../common
+${COMMON}/${RAND}.o: ${COMMON}/${RAND}.f ../config/make.def
+	cd ${COMMON}; ${FCOMPILE} ${RAND}.f
+
+${COMMON}/print_results.o: ${COMMON}/print_results.f ../config/make.def
+	cd ${COMMON}; ${FCOMPILE} print_results.f
+
+${COMMON}/c_print_results.o: ${COMMON}/c_print_results.c ../config/make.def
+	cd ${COMMON}; ${CCOMPILE} c_print_results.c
+
+${COMMON}/timers.o: ${COMMON}/timers.f ../config/make.def
+	cd ${COMMON}; ${FCOMPILE} timers.f
+
+${COMMON}/c_timers.o: ${COMMON}/c_timers.c ../config/make.def
+	cd ${COMMON}; ${CCOMPILE} c_timers.c
+
+${COMMON}/wtime.o: ${COMMON}/${WTIME} ../config/make.def
+	cd ${COMMON}; ${CCOMPILE} ${MACHINE} -o wtime.o ${COMMON}/${WTIME}
+# For most machines or CRAY or IBM
+#	cd ${COMMON}; ${CCOMPILE} ${MACHINE} ${COMMON}/wtime.c
+# For a precise timer on an SGI Power Challenge, try:
+#	cd ${COMMON}; ${CCOMPILE} -o wtime.o ${COMMON}/wtime_sgi64.c
+
+${COMMON}/c_wtime.o: ${COMMON}/${WTIME} ../config/make.def
+	cd ${COMMON}; ${CCOMPILE} -o c_wtime.o ${COMMON}/${WTIME}
+
+
+# Normally setparams updates npbparams.h only if the settings (CLASS)
+# have changed. However, we also want to update if the compile options
+# may have changed (set in ../config/make.def). 
+npbparams.h: ../config/make.def
+	@ echo make.def modified. Rebuilding npbparams.h just in case
+	rm -f npbparams.h
+	../sys/setparams ${BENCHMARK} ${CLASS}
+
+# So that "make benchmark-name" works
+${BENCHMARK}:  default
+${BENCHMARKU}: default
+
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/sys/print_header b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/sys/print_header
new file mode 100755
index 0000000..82661ad
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/sys/print_header
@@ -0,0 +1,6 @@
+echo '   ==========================================='
+echo '   =      NAS PARALLEL BENCHMARKS 3.3        ='
+echo '   =      Serial Versions                    ='
+echo '   =      F77/C                              ='
+echo '   ==========================================='
+echo ''
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/sys/print_instructions b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/sys/print_instructions
new file mode 100755
index 0000000..b197366
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/sys/print_instructions
@@ -0,0 +1,19 @@
+echo ''
+echo '   To make a NAS benchmark type '
+echo ''
+echo '         make <benchmark-name> CLASS=<class>'
+echo ''
+echo '   where <benchmark-name> is "bt", "cg", "ep", "ft", "is", "lu",'
+echo '                             "lu-hp", "mg", "sp", or "ua"'
+echo '         <class>          is "S", "W", "A", "B", "C" or "D"'
+echo ''
+echo '   To make a set of benchmarks, create the file config/suite.def'
+echo '   according to the instructions in config/suite.def.template and type'
+echo ''
+echo '         make suite'
+echo ''
+echo ' ***************************************************************'
+echo ' * Remember to edit the file config/make.def for site specific *'
+echo ' * information as described in the README file                 *'
+echo ' ***************************************************************'
+
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/sys/setparams.c b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/sys/setparams.c
new file mode 100644
index 0000000..98725a5
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/sys/setparams.c
@@ -0,0 +1,1053 @@
+/* 
+ * This utility configures an NPB to be built for a specific class.
+ * It creates a file "npbparams.h"
+ * in the source directory. This file keeps state information about
+ * which size of benchmark is currently being built (so that nothing
+ * is unnecessarily rebuilt) and defines (through PARAMETER statements)
+ * the number of nodes and class for which a benchmark is being built.
+ *
+ * The utility takes 2 arguments:
+ *       setparams benchmark-name class
+ *    benchmark-name is "sp", "bt", etc
+ *    class is the size of the benchmark
+ * These parameters are checked for the current benchmark. If they
+ * are invalid, this program prints a message and aborts. 
+ * If the parameters are ok, the current npbparams.h (actually just
+ * the first line) is read in. If the class is "U" (unspecified), a
+ * nonzero exit code forces the user to specify one (otherwise make
+ * succeeds but builds a binary of the wrong name). If the new class
+ * matches the old, nothing is done. Otherwise the file is rewritten.
+ * Errors write a message (to stdout) and abort. 
+ * 
+ * This program makes use of two extra benchmark "classes"
+ * class "X" means an invalid specification. It is returned if
+ * there is an error parsing the config file. 
+ * class "U" is an external specification meaning "unknown class"
+ * 
+ * Unfortunately everything has to be case sensitive. This is
+ * because we can always convert lower to upper or v.v. but
+ * can't feed this information back to the makefile, so typing
+ * make CLASS=a and make CLASS=A will produce different binaries.
+ *
+ * 
+ */
+
+#include <sys/types.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <ctype.h>
+#include <string.h>
+#include <time.h>
+
+/*
+ * This is the master version number for this set of 
+ * NPB benchmarks. It is in an obscure place so people
+ * won't accidentally change it. 
+ */
+
+#define VERSION "3.3.1"
+
+/* controls verbose output from setparams */
+/* #define VERBOSE */
+
+#define FILENAME "npbparams.h"
+#define DESC_LINE "c CLASS = %c\n"
+#define DEF_CLASS_LINE     "#define CLASS '%c'\n"
+#define FINDENT  "        "
+#define CONTINUE "     > "
+
+void get_info(char *argv[], int *typep, char *classp);
+void check_info(int type, char class);
+void read_info(int type, char *classp);
+void write_info(int type, char class);
+void write_sp_info(FILE *fp, char class);
+void write_bt_info(FILE *fp, char class);
+void write_lu_info(FILE *fp, char class);
+void write_mg_info(FILE *fp, char class);
+void write_cg_info(FILE *fp, char class);
+void write_ft_info(FILE *fp, char class);
+void write_ep_info(FILE *fp, char class);
+void write_dc_info(FILE *fp, char class);
+void write_is_info(FILE *fp, char class);
+void write_ua_info(FILE *fp, char class);
+void write_compiler_info(int type, FILE *fp);
+void write_convertdouble_info(int type, FILE *fp);
+void check_line(char *line, char *label, char *val);
+int  check_include_line(char *line, char *filename);
+void put_string(FILE *fp, char *name, char *val);
+void put_def_string(FILE *fp, char *name, char *val);
+void put_def_variable(FILE *fp, char *name, char *val);
+int ilog2(int i);
+double power(double base, int i);
+
+enum benchmark_types {SP, BT, LU, MG, FT, IS, EP, CG, UA, DC};
+
+int main(int argc, char *argv[])
+{
+  int type;
+  char class, class_old;
+  
+  if (argc != 3) {
+    printf("Usage: %s benchmark-name class\n", argv[0]);
+    exit(1);
+  }
+
+  /* Get command line arguments. Make sure they're ok. */
+  get_info(argv, &type, &class);
+  if (class != 'U') {
+#ifdef VERBOSE
+    printf("setparams: For benchmark %s: class = %c\n", 
+	   argv[1], class); 
+#endif
+    check_info(type, class);
+  }
+
+  /* Get old information. */
+  read_info(type, &class_old);
+  if (class != 'U') {
+    if (class_old != 'X') {
+#ifdef VERBOSE
+      printf("setparams:     old settings: class = %c\n", 
+	     class_old); 
+#endif
+    }
+  } else {
+    printf("setparams:\n\
+  *********************************************************************\n\
+  * You must specify CLASS to build this benchmark                    *\n\
+  * For example, to build a class A benchmark, type                   *\n\
+  *       make {benchmark-name} CLASS=A                               *\n\
+  *********************************************************************\n\n"); 
+
+    if (class_old != 'X') {
+#ifdef VERBOSE
+      printf("setparams: Previous settings were CLASS=%c \n", class_old); 
+#endif
+    }
+    exit(1); /* exit on class==U */
+  }
+
+  /* Write out new information if it's different. */
+  if (class != class_old) {
+#ifdef VERBOSE
+    printf("setparams: Writing %s\n", FILENAME); 
+#endif
+    write_info(type, class);
+  } else {
+#ifdef VERBOSE
+    printf("setparams: Settings unchanged. %s unmodified\n", FILENAME); 
+#endif
+  }
+
+  return 0;
+}
+
+
+/*
+ *  get_info(): Get parameters from command line 
+ */
+
+void get_info(char *argv[], int *typep, char *classp) 
+{
+
+  *classp = *argv[2];
+
+  if      (!strcmp(argv[1], "sp") || !strcmp(argv[1], "SP")) *typep = SP;
+  else if (!strcmp(argv[1], "bt") || !strcmp(argv[1], "BT")) *typep = BT;
+  else if (!strcmp(argv[1], "ft") || !strcmp(argv[1], "FT")) *typep = FT;
+  else if (!strcmp(argv[1], "lu") || !strcmp(argv[1], "LU")) *typep = LU;
+  else if (!strcmp(argv[1], "mg") || !strcmp(argv[1], "MG")) *typep = MG;
+  else if (!strcmp(argv[1], "is") || !strcmp(argv[1], "IS")) *typep = IS;
+  else if (!strcmp(argv[1], "ep") || !strcmp(argv[1], "EP")) *typep = EP;
+  else if (!strcmp(argv[1], "cg") || !strcmp(argv[1], "CG")) *typep = CG;
+  else if (!strcmp(argv[1], "ua") || !strcmp(argv[1], "UA")) *typep = UA;
+  else if (!strcmp(argv[1], "dc") || !strcmp(argv[1], "DC")) *typep = DC;
+  else {
+    printf("setparams: Error: unknown benchmark type %s\n", argv[1]);
+    exit(1);
+  }
+}
+
+/*
+ *  check_info(): Make sure command line data is ok for this benchmark 
+ */
+
+void check_info(int type, char class) 
+{
+
+  /* check class */
+  if (class != 'S' && 
+      class != 'W' && 
+      class != 'A' && 
+      class != 'B' && 
+      class != 'C' && 
+      class != 'D' && 
+      class != 'E') {
+    printf("setparams: Unknown benchmark class %c\n", class); 
+    printf("setparams: Allowed classes are \"S\", \"W\", and \"A\" through \"E\"\n");
+    exit(1);
+  }
+
+  if (class == 'E' && (type == IS || type == UA || type == DC)) {
+    printf("setparams: Benchmark class %c not defined for IS, UA, or DC\n", class);
+    exit(1);
+  }
+  if ((class == 'C' || class == 'D') && type == DC) {
+    printf("setparams: Benchmark class %c not defined for DC\n", class);
+    exit(1);
+  }
+
+}
+
+
+/* 
+ * read_info(): Read previous information from file. 
+ *              Not an error if file doesn't exist, because this
+ *              may be the first time we're running. 
+ *              Assumes the first line of the file is in a special
+ *              format that we understand (since we wrote it). 
+ */
+
+void read_info(int type, char *classp)
+{
+  int nread;
+  FILE *fp;
+  fp = fopen(FILENAME, "r");
+  if (fp == NULL) {
+#ifdef VERBOSE
+    printf("setparams: INFO: configuration file %s does not exist (yet)\n", FILENAME); 
+#endif
+    goto abort;
+  }
+  
+  /* first line of file contains info (fortran), first two lines (C) */
+
+  switch(type) {
+      case SP:
+      case BT:
+      case FT:
+      case MG:
+      case LU:
+      case EP:
+      case CG:
+      case UA:
+          nread = fscanf(fp, DESC_LINE, classp);
+          if (nread != 1) {
+            printf("setparams: Error parsing config file %s. Ignoring previous settings\n", FILENAME);
+            goto abort;
+          }
+          break;
+      case IS:
+      case DC:
+          nread = fscanf(fp, DEF_CLASS_LINE, classp);
+          if (nread != 1) {
+            printf("setparams: Error parsing config file %s. Ignoring previous settings\n", FILENAME);
+            goto abort;
+          }
+          break;
+      default:
+        /* never should have gotten this far with a bad name */
+        printf("setparams: (Internal Error) Benchmark type %d unknown to this program\n", type); 
+        exit(1);
+  }
+
+  fclose(fp);
+
+
+  return;
+
+ abort:
+  if (fp != NULL) fclose(fp); /* don't leak the handle on a parse error */
+  *classp = 'X';
+}
+
+
+/* 
+ * write_info(): Write new information to config file. 
+ *               First line is in a special format so we can read
+ *               it in again. Then comes a warning. The rest is all
+ *               specific to a particular benchmark. 
+ */
+
+void write_info(int type, char class) 
+{
+  FILE *fp;
+  fp = fopen(FILENAME, "w");
+  if (fp == NULL) {
+    printf("setparams: Can't open file %s for writing\n", FILENAME);
+    exit(1);
+  }
+
+  switch(type) {
+      case SP:
+      case BT:
+      case FT:
+      case MG:
+      case LU:
+      case EP:
+      case CG:
+      case UA:
+          /* Write out the header */
+          fprintf(fp, DESC_LINE, class);
+          /* Print out a warning so bozos don't mess with the file */
+          fprintf(fp, "\
+c  \n\
+c  \n\
+c  This file is generated automatically by the setparams utility.\n\
+c  It sets the number of processors and the class of the NPB\n\
+c  in this directory. Do not modify it by hand.\n\
+c  \n");
+          break;
+      case IS:
+          fprintf(fp, DEF_CLASS_LINE, class);
+          fprintf(fp, "\
+/*\n\
+   This file is generated automatically by the setparams utility.\n\
+   It sets the number of processors and the class of the NPB\n\
+   in this directory. Do not modify it by hand.   */\n\
+   \n");
+          break;
+      case DC:
+          fprintf(fp, DEF_CLASS_LINE, class);
+          fprintf(fp, "\
+/*\n\
+   This file is generated automatically by the setparams utility.\n\
+   It sets the number of processors and the class of the NPB\n\
+   in this directory. Do not modify it by hand.\n\
+   This file provided for backward compatibility.\n\
+   It is not used in DC benchmark.   */\n\
+   \n");
+          break;
+      default:
+          printf("setparams: (Internal error): Unknown benchmark type %d\n", 
+                                                                         type);
+          exit(1);
+  }
+
+  /* Now do benchmark-specific stuff */
+  switch(type) {
+  case SP:
+    write_sp_info(fp, class);
+    break;	      
+  case BT:	      
+    write_bt_info(fp, class);
+    break;	      
+  case DC:	      
+    write_dc_info(fp, class);
+    break;	      
+  case LU:	      
+    write_lu_info(fp, class);
+    break;	      
+  case MG:	      
+    write_mg_info(fp, class);
+    break;	      
+  case IS:	      
+    write_is_info(fp, class);  
+    break;	      
+  case FT:	      
+    write_ft_info(fp, class);
+    break;	      
+  case EP:	      
+    write_ep_info(fp, class);
+    break;	      
+  case CG:	      
+    write_cg_info(fp, class);
+    break;
+  case UA:	      
+    write_ua_info(fp, class);
+    break;
+  default:
+    printf("setparams: (Internal error): Unknown benchmark type %d\n", type);
+    exit(1);
+  }
+  write_convertdouble_info(type, fp);
+  write_compiler_info(type, fp);
+  fclose(fp);
+  return;
+}
+
+
+/* 
+ * write_sp_info(): Write SP specific info to config file
+ */
+
+void write_sp_info(FILE *fp, char class) 
+{
+  int problem_size, niter;
+  char *dt;
+  if      (class == 'S') { problem_size = 12;  dt = "0.015d0";   niter = 100; }
+  else if (class == 'W') { problem_size = 36;  dt = "0.0015d0";  niter = 400; }
+  else if (class == 'A') { problem_size = 64;  dt = "0.0015d0";  niter = 400; }
+  else if (class == 'B') { problem_size = 102; dt = "0.001d0";   niter = 400; }
+  else if (class == 'C') { problem_size = 162; dt = "0.00067d0"; niter = 400; }
+  else if (class == 'D') { problem_size = 408; dt = "0.00030d0"; niter = 500; }
+  else if (class == 'E') { problem_size = 1020; dt = "0.0001d0"; niter = 500; }
+  else {
+    printf("setparams: Internal error: invalid class %c\n", class);
+    exit(1);
+  }
+  fprintf(fp, "%sinteger problem_size, niter_default\n", FINDENT);
+  fprintf(fp, "%sparameter (problem_size=%d, niter_default=%d)\n", 
+	       FINDENT, problem_size, niter);
+  fprintf(fp, "%sdouble precision dt_default\n", FINDENT);
+  fprintf(fp, "%sparameter (dt_default = %s)\n", FINDENT, dt);
+}
+  
+/* 
+ * write_bt_info(): Write BT specific info to config file
+ */
+
+void write_bt_info(FILE *fp, char class) 
+{
+  int problem_size, niter;
+  char *dt;
+  if      (class == 'S') { problem_size = 12;  dt = "0.010d0";   niter = 60; }
+  else if (class == 'W') { problem_size = 24;  dt = "0.0008d0";  niter = 200; }
+  else if (class == 'A') { problem_size = 64;  dt = "0.0008d0";  niter = 200; }
+  else if (class == 'B') { problem_size = 102; dt = "0.0003d0";  niter = 200; }
+  else if (class == 'C') { problem_size = 162; dt = "0.0001d0";  niter = 200; }
+  else if (class == 'D') { problem_size = 408; dt = "0.00002d0";  niter = 250; }
+  else if (class == 'E') { problem_size = 1020; dt = "0.4d-5";    niter = 250; }
+  else {
+    printf("setparams: Internal error: invalid class %c\n", class);
+    exit(1);
+  }
+  fprintf(fp, "%sinteger problem_size, niter_default\n", FINDENT);
+  fprintf(fp, "%sparameter (problem_size=%d, niter_default=%d)\n", 
+	       FINDENT, problem_size, niter);
+  fprintf(fp, "%sdouble precision dt_default\n", FINDENT);
+  fprintf(fp, "%sparameter (dt_default = %s)\n", FINDENT, dt);
+}
+  
+/* 
+ * write_dc_info(): Write DC specific info to config file
+ */
+
+
+void write_dc_info(FILE *fp, char class) 
+{
+  long int input_tuples, attrnum;
+  if      (class == 'S') { input_tuples = 1000;     attrnum = 5; }
+  else if (class == 'W') { input_tuples = 100000;   attrnum = 10; }
+  else if (class == 'A') { input_tuples = 1000000;  attrnum = 15; }
+  else if (class == 'B') { input_tuples = 10000000; attrnum = 20; }
+  else {
+    printf("setparams: Internal error: invalid class %c\n", class);
+    exit(1);
+  }
+  fprintf(fp, "long long int input_tuples=%ld, attrnum=%ld;\n",
+              input_tuples, attrnum);
+}
+
+/* 
+ * write_lu_info(): Write LU specific info to config file
+ */
+
+void write_lu_info(FILE *fp, char class) 
+{
+  int isiz1, isiz2, itmax, inorm, problem_size;
+  char *dt_default;
+
+  if      (class == 'S') { problem_size = 12;  dt_default = "0.5d0"; itmax = 50; }
+  else if (class == 'W') { problem_size = 33;  dt_default = "1.5d-3"; itmax = 300; }
+  else if (class == 'A') { problem_size = 64;  dt_default = "2.0d0"; itmax = 250; }
+  else if (class == 'B') { problem_size = 102; dt_default = "2.0d0"; itmax = 250; }
+  else if (class == 'C') { problem_size = 162; dt_default = "2.0d0"; itmax = 250; }
+  else if (class == 'D') { problem_size = 408; dt_default = "1.0d0"; itmax = 300; }
+  else if (class == 'E') { problem_size = 1020; dt_default = "0.5d0"; itmax = 300; }
+  else {
+    printf("setparams: Internal error: invalid class %c\n", class);
+    exit(1);
+  }
+  inorm = itmax;
+  isiz1 = problem_size;
+  isiz2 = problem_size;
+  
+
+  fprintf(fp, "\nc full problem size\n");
+  fprintf(fp, "%sinteger isiz1, isiz2, isiz3\n", FINDENT);
+  fprintf(fp, "%sparameter (isiz1=%d, isiz2=%d, isiz3=%d)\n", 
+	       FINDENT, isiz1, isiz2, problem_size );
+
+  fprintf(fp, "\nc number of iterations and how often to print the norm\n");
+  fprintf(fp, "%sinteger itmax_default, inorm_default\n", FINDENT);
+  fprintf(fp, "%sparameter (itmax_default=%d, inorm_default=%d)\n", 
+	  FINDENT, itmax, inorm);
+
+  fprintf(fp, "%sdouble precision dt_default\n", FINDENT);
+  fprintf(fp, "%sparameter (dt_default = %s)\n", FINDENT, dt_default);
+  
+}
+
+/* 
+ * write_mg_info(): Write MG specific info to config file
+ */
+
+void write_mg_info(FILE *fp, char class) 
+{
+  int problem_size, nit, log2_size, lt_default, lm;
+  int ndim1, ndim2, ndim3;
+  if      (class == 'S') { problem_size = 32; nit = 4; }
+/*  else if (class == 'W') { problem_size = 64; nit = 40; }*/
+  else if (class == 'W') { problem_size = 128; nit = 4; }
+  else if (class == 'A') { problem_size = 256; nit = 4; }
+  else if (class == 'B') { problem_size = 256; nit = 20; }
+  else if (class == 'C') { problem_size = 512; nit = 20; }
+  else if (class == 'D') { problem_size = 1024; nit = 50; }
+  else if (class == 'E') { problem_size = 2048; nit = 50; }
+  else {
+    printf("setparams: Internal error: invalid class type %c\n", class);
+    exit(1);
+  }
+  log2_size = ilog2(problem_size);
+  /* lt is log of largest total dimension */
+  lt_default = log2_size;
+  /* log of maximum dimension on a node */
+  lm = log2_size;
+  ndim1 = lm;
+  ndim3 = log2_size;
+  ndim2 = log2_size;
+
+  fprintf(fp, "%sinteger nx_default, ny_default, nz_default\n", FINDENT);
+  fprintf(fp, "%sparameter (nx_default=%d, ny_default=%d, nz_default=%d)\n", 
+	  FINDENT, problem_size, problem_size, problem_size);
+  fprintf(fp, "%sinteger nit_default, lm, lt_default\n", FINDENT);
+  fprintf(fp, "%sparameter (nit_default=%d, lm = %d, lt_default=%d)\n", 
+	  FINDENT, nit, lm, lt_default);
+  fprintf(fp, "%sinteger debug_default\n", FINDENT);
+  fprintf(fp, "%sparameter (debug_default=%d)\n", FINDENT, 0);
+  fprintf(fp, "%sinteger ndim1, ndim2, ndim3\n", FINDENT);
+  fprintf(fp, "%sparameter (ndim1 = %d, ndim2 = %d, ndim3 = %d)\n", 
+	  FINDENT, ndim1, ndim2, ndim3);
+  fprintf(fp, "%sinteger%s one, nv, nr, ir\n", 
+          FINDENT, (problem_size > 1024)? "*8" : "");
+  fprintf(fp, "%sparameter (one=1)\n", FINDENT);
+}
+
+
+/* 
+ * write_is_info(): Write IS specific info to config file
+ */
+
+void write_is_info(FILE *fp, char class) 
+{
+  if( class != 'S' &&
+      class != 'W' &&
+      class != 'A' &&
+      class != 'B' &&
+      class != 'C' &&
+      class != 'D')
+  {
+    printf("setparams: Internal error: invalid class type %c\n", class);
+    exit(1);
+  }
+}
+
+
+/* 
+ * write_cg_info(): Write CG specific info to config file
+ */
+
+void write_cg_info(FILE *fp, char class) 
+{
+  int na,nonzer,niter;
+  char *shift,*rcond="1.0d-1";
+  char *shiftS="10.",
+       *shiftW="12.",
+       *shiftA="20.",
+       *shiftB="60.",
+       *shiftC="110.",
+       *shiftD="500.",
+       *shiftE="1.5d3";
+
+
+  if( class == 'S' )
+  { na=1400; nonzer=7; niter=15; shift=shiftS; }
+  else if( class == 'W' )
+  { na=7000; nonzer=8; niter=15; shift=shiftW; }
+  else if( class == 'A' )
+  { na=14000; nonzer=11; niter=15; shift=shiftA; }
+  else if( class == 'B' )
+  { na=75000; nonzer=13; niter=75; shift=shiftB; }
+  else if( class == 'C' )
+  { na=150000; nonzer=15; niter=75; shift=shiftC; }
+  else if( class == 'D' )
+  { na=1500000; nonzer=21; niter=100; shift=shiftD; }
+  else if( class == 'E' )
+  { na=9000000; nonzer=26; niter=100; shift=shiftE; }
+  else
+  {
+    printf("setparams: Internal error: invalid class type %c\n", class);
+    exit(1);
+  }
+  fprintf( fp, "%sinteger            na, nonzer, niter\n", FINDENT );
+  fprintf( fp, "%sdouble precision   shift, rcond\n", FINDENT );
+  fprintf( fp, "%sparameter(  na=%d,\n", FINDENT, na );
+  fprintf( fp, "%s             nonzer=%d,\n", CONTINUE, nonzer );
+  fprintf( fp, "%s             niter=%d,\n", CONTINUE, niter );
+  fprintf( fp, "%s             shift=%s,\n", CONTINUE, shift );
+  fprintf( fp, "%s             rcond=%s )\n", CONTINUE, rcond );
+  
+}
+
+/* 
+ * write_ua_info(): Write UA specific info to config file
+ */
+
+void write_ua_info(FILE *fp, char class) 
+{
+  int lelt, lmor,refine_max, niter, nmxh, fre;
+  char *alpha;
+
+  fre = 5;
+  if( class == 'S' )
+  { lelt=250;lmor=11600;       refine_max=4;  niter=50;  nmxh=10; alpha="0.040d0"; }
+  else if( class == 'W' )
+  { lelt=700;lmor=26700;       refine_max=5;  niter=100; nmxh=10; alpha="0.060d0"; }
+  else if( class == 'A' )
+  { lelt=2400;lmor=92700;      refine_max=6;  niter=200; nmxh=10; alpha="0.076d0"; }
+  else if( class == 'B' )
+  { lelt=8800;  lmor=334600;   refine_max=7;  niter=200; nmxh=10; alpha="0.076d0"; }
+  else if( class == 'C' )
+  { lelt=33500; lmor=1262100;  refine_max=8;  niter=200; nmxh=10; alpha="0.067d0"; }
+  else if( class == 'D' )
+  { lelt=515000;lmor=19500000; refine_max=10; niter=250; nmxh=10; alpha="0.046d0"; }
+  else
+  {
+    printf("setparams: Internal error: invalid class type %c\n", class);
+    exit(1);
+  }
+  
+  fprintf( fp, "%sinteger          lelt, lmor, refine_max, fre_default\n", FINDENT );
+  fprintf( fp, "%sinteger          niter_default, nmxh_default\n", FINDENT );
+  fprintf( fp, "%scharacter        class_default\n", FINDENT );
+  fprintf( fp, "%sdouble precision alpha_default\n", FINDENT );
+  fprintf( fp, "%sparameter(  lelt=%d,\n", FINDENT, lelt );
+  fprintf( fp, "%s            lmor=%d,\n", CONTINUE, lmor );
+  fprintf( fp, "%s             refine_max=%d,\n", CONTINUE, refine_max );
+  fprintf( fp, "%s             fre_default=%d,\n", CONTINUE, fre );
+  fprintf( fp, "%s             niter_default=%d,\n", CONTINUE, niter );
+  fprintf( fp, "%s             nmxh_default=%d,\n", CONTINUE, nmxh );
+  fprintf( fp, "%s             class_default=\"%c\",\n", CONTINUE, class );
+  fprintf( fp, "%s             alpha_default=%s )\n", CONTINUE, alpha );
+  
+}
+
+/* 
+ * write_ft_info(): Write FT specific info to config file
+ */
+
+void write_ft_info(FILE *fp, char class) 
+{
+  /* easiest way (given the way the benchmark is written)
+   * is to specify the number of grid points in each
+   * direction nx, ny, nz. niter is the number of iterations
+   */
+  int nx, ny, nz, maxdim, niter;
+  if      (class == 'S') { nx = 64; ny = 64; nz = 64; niter = 6;}
+  else if (class == 'W') { nx = 128; ny = 128; nz = 32; niter = 6;}
+  else if (class == 'A') { nx = 256; ny = 256; nz = 128; niter = 6;}
+  else if (class == 'B') { nx = 512; ny = 256; nz = 256; niter =20;}
+  else if (class == 'C') { nx = 512; ny = 512; nz = 512; niter =20;}
+  else if (class == 'D') { nx = 2048; ny = 1024; nz = 1024; niter =25;}
+  else if (class == 'E') { nx = 4096; ny = 2048; nz = 2048; niter =25;}
+  else {
+    printf("setparams: Internal error: invalid class type %c\n", class);
+    exit(1);
+  }
+  maxdim = nx;
+  if (ny > maxdim) maxdim = ny;
+  if (nz > maxdim) maxdim = nz;
+  fprintf(fp, "%sinteger nx, ny, nz, maxdim, niter_default\n", FINDENT);
+  fprintf(fp, "%sinteger%s ntotal, nxp, nyp, ntotalp\n", FINDENT,
+          (nx > 1024)? "*8" : "");
+  fprintf(fp, "%sparameter (nx=%d, ny=%d, nz=%d, maxdim=%d)\n", 
+          FINDENT, nx, ny, nz, maxdim);
+  fprintf(fp, "%sparameter (niter_default=%d)\n", FINDENT, niter);
+  fprintf(fp, "%sparameter (nxp=nx+1, nyp=ny)\n", FINDENT);
+  fprintf(fp, "%sparameter (ntotal=nx*nyp*nz)\n", FINDENT);
+  fprintf(fp, "%sparameter (ntotalp=nxp*nyp*nz)\n", FINDENT);
+
+}
+
+/*
+ * write_ep_info(): Write EP specific info to config file
+ */
+
+void write_ep_info(FILE *fp, char class)
+{
+  /* easiest way (given the way the benchmark is written)
+   * is to specify m, the log (base 2) of the number of
+   * random-number pairs to generate
+   */
+  int m;
+  if      (class == 'S') { m = 24; }
+  else if (class == 'W') { m = 25; }
+  else if (class == 'A') { m = 28; }
+  else if (class == 'B') { m = 30; }
+  else if (class == 'C') { m = 32; }
+  else if (class == 'D') { m = 36; }
+  else if (class == 'E') { m = 40; }
+  else {
+    printf("setparams: Internal error: invalid class type %c\n", class);
+    exit(1);
+  }
+
+  fprintf(fp, "%scharacter class\n",FINDENT);
+  fprintf(fp, "%sparameter (class =\'%c\')\n",
+                  FINDENT, class);
+  fprintf(fp, "%sinteger m\n", FINDENT);
+  fprintf(fp, "%sparameter (m=%d)\n", FINDENT, m);
+}
+
+
+/* 
+ * This is a gross hack to allow the benchmarks to 
+ * print out how they were compiled. Various other ways
+ * of doing this have been tried and they all fail on
+ * some machine - due to a broken "make" program, or
+ * F77 limitations, or whatever. Hopefully this will
+ * always work because it uses very portable C. Unfortunately
+ * it relies on parsing the make.def file - YUK. 
+ * If your machine doesn't have <string.h> or <ctype.h>, happy hacking!
+ * 
+ */
+
+#define VERBOSE
+#define LL 400
+#include <stdio.h>
+#define DEFFILE "../config/make.def"
+#define DEFAULT_MESSAGE "(none)"
+FILE *deffile;
+void write_compiler_info(int type, FILE *fp)
+{
+  char line[LL];
+  char f77[LL], flink[LL], f_lib[LL], f_inc[LL], fflags[LL], flinkflags[LL];
+  char compiletime[LL], randfile[LL];
+  char cc[LL], cflags[LL], clink[LL], clinkflags[LL],
+       c_lib[LL], c_inc[LL];
+  struct tm *tmp;
+  time_t t;
+  deffile = fopen(DEFFILE, "r");
+  if (deffile == NULL) {
+    printf("\n\
+setparams: File %s doesn't exist. To build the NAS benchmarks\n\
+           you need to create it according to the instructions\n\
+           in the README in the main directory and comments in \n\
+           the file config/make.def.template\n", DEFFILE);
+    exit(1);
+  }
+  strcpy(f77, DEFAULT_MESSAGE);
+  strcpy(flink, DEFAULT_MESSAGE);
+  strcpy(f_lib, DEFAULT_MESSAGE);
+  strcpy(f_inc, DEFAULT_MESSAGE);
+  strcpy(fflags, DEFAULT_MESSAGE);
+  strcpy(flinkflags, DEFAULT_MESSAGE);
+  strcpy(randfile, DEFAULT_MESSAGE);
+  strcpy(cc, DEFAULT_MESSAGE);
+  strcpy(cflags, DEFAULT_MESSAGE);
+  strcpy(clink, DEFAULT_MESSAGE);
+  strcpy(clinkflags, DEFAULT_MESSAGE);
+  strcpy(c_lib, DEFAULT_MESSAGE);
+  strcpy(c_inc, DEFAULT_MESSAGE);
+
+  while (fgets(line, LL, deffile) != NULL) {
+    if (*line == '#') continue;
+    /* yes, this is inefficient. but it's simple! */
+    check_line(line, "F77", f77);
+    check_line(line, "FLINK", flink);
+    check_line(line, "F_LIB", f_lib);
+    check_line(line, "F_INC", f_inc);
+    check_line(line, "FFLAGS", fflags);
+    check_line(line, "FLINKFLAGS", flinkflags);
+    check_line(line, "RAND", randfile);
+    check_line(line, "CC", cc);
+    check_line(line, "CFLAGS", cflags);
+    check_line(line, "CLINK", clink);
+    check_line(line, "CLINKFLAGS", clinkflags);
+    check_line(line, "C_LIB", c_lib);
+    check_line(line, "C_INC", c_inc);
+  }
+
+  
+  (void) time(&t);
+  tmp = localtime(&t);
+  (void) strftime(compiletime, (size_t)LL, "%d %b %Y", tmp);
+
+
+  switch(type) {
+      case FT:
+      case SP:
+      case BT:
+      case MG:
+      case LU:
+      case EP:
+      case CG:
+      case UA:
+          put_string(fp, "compiletime", compiletime);
+          put_string(fp, "npbversion", VERSION);
+          put_string(fp, "cs1", f77);
+          put_string(fp, "cs2", flink);
+          put_string(fp, "cs3", f_lib);
+          put_string(fp, "cs4", f_inc);
+          put_string(fp, "cs5", fflags);
+          put_string(fp, "cs6", flinkflags);
+	  put_string(fp, "cs7", randfile);
+          break;
+      case IS:
+      case DC:
+          put_def_string(fp, "COMPILETIME", compiletime);
+          put_def_string(fp, "NPBVERSION", VERSION);
+          put_def_string(fp, "CC", cc);
+          put_def_string(fp, "CFLAGS", cflags);
+          put_def_string(fp, "CLINK", clink);
+          put_def_string(fp, "CLINKFLAGS", clinkflags);
+          put_def_string(fp, "C_LIB", c_lib);
+          put_def_string(fp, "C_INC", c_inc);
+          break;
+      default:
+          printf("setparams: (Internal error): Unknown benchmark type %d\n", 
+                                                                         type);
+          exit(1);
+  }
+
+}
+
+void check_line(char *line, char *label, char *val)
+{
+  char *original_line;
+  int n;
+  original_line = line;
+  /* compare beginning of line and label */
+  while (*label != '\0' && *line == *label) {
+    line++; label++; 
+  }
+  /* if *label is not EOS, we must have had a mismatch */
+  if (*label != '\0') return;
+  /* if *line is not a space, actual label is longer than test label */
+  if (!isspace(*line) && *line != '=') return ; 
+  /* skip over white space */
+  while (isspace(*line)) line++;
+  /* next char should be '=' */
+  if (*line != '=') return;
+  /* skip over white space */
+  while (isspace(*++line));
+  /* if EOS, nothing was specified */
+  if (*line == '\0') return;
+  /* finally we've come to the value */
+  strcpy(val, line);
+  /* chop off the newline at the end */
+  n = strlen(val)-1;
+  if (n >= 0 && val[n] == '\n')
+    val[n--] = '\0';
+  if (n >= 0 && val[n] == '\r')
+    val[n--] = '\0';
+  /* treat continuation */
+  while (val[n] == '\\' && fgets(original_line, LL, deffile)) {
+     line = original_line;
+     while (isspace(*line)) line++;
+     if (isspace(*original_line)) val[n++] = ' ';
+     while (*line && *line != '\n' && *line != '\r' && n < LL-1)
+       val[n++] = *line++;
+     val[n] = '\0';
+     n--;
+  }
+/*  if (val[n] == '\\') {
+    printf("\n\
+setparams: Error in file make.def. Because of the way in which\n\
+           command line arguments are incorporated into the\n\
+           executable benchmark, you can't have any continued\n\
+           lines in the file make.def, that is, lines ending\n\
+           with the character \"\\\". Although it may be ugly, \n\
+           you should be able to reformat without continuation\n\
+           lines. The offending line is\n\
+  %s\n", original_line);
+    exit(1);
+  } */
+}
+
+int check_include_line(char *line, char *filename)
+{
+  char *include_string = "include";
+  /* compare beginning of line and "include" */
+  while (*include_string != '\0' && *line == *include_string) {
+    line++; include_string++; 
+  }
+  /* if *include_string is not EOS, we must have had a mismatch */
+  if (*include_string != '\0') return(0);
+  /* if *line is not a space, first word is not "include" */
+  if (!isspace(*line)) return(0); 
+  /* skip over white space */
+  while (isspace(*++line));
+  /* if EOS, nothing was specified */
+  if (*line == '\0') return(0);
+  /* next keyword should be name of include file in *filename */
+  while (*filename != '\0' && *line == *filename) {
+    line++; filename++; 
+  }  
+  if (*filename != '\0' || 
+      (*line != ' ' && *line != '\0' && *line !='\n')) return(0);
+  else return(1);
+}
+
+
+#define MAXL 46
+void put_string(FILE *fp, char *name, char *val)
+{
+  int len;
+  len = strlen(val);
+  if (len > MAXL) {
+    val[MAXL] = '\0';
+    val[MAXL-1] = '.';
+    val[MAXL-2] = '.';
+    val[MAXL-3] = '.';
+    len = MAXL;
+  }
+  fprintf(fp, "%scharacter %s*%d\n", FINDENT, name, len);
+  fprintf(fp, "%sparameter (%s=\'%s\')\n", FINDENT, name, val);
+}
+
+/* need to escape quote (") in val */
+int fix_string_quote(char *val, char *newval, int maxl)
+{
+  int len;
+  int i, j;
+  len = strlen(val);
+  i = j = 0;
+  while (i < len && j < maxl) {
+    if (val[i] == '"')
+      newval[j++] = '\\';
+    if (j < maxl)
+      newval[j++] = val[i++];
+  }
+  newval[j] = '\0';
+  return j;
+}
+
+/* NOTE: is the ... stuff necessary in C? */
+void put_def_string(FILE *fp, char *name, char *val0)
+{
+  int len;
+  char val[MAXL+3];
+  len = fix_string_quote(val0, val, MAXL+2);
+  if (len > MAXL) {
+    val[MAXL] = '\0';
+    val[MAXL-1] = '.';
+    val[MAXL-2] = '.';
+    val[MAXL-3] = '.';
+    len = MAXL;
+  }
+  fprintf(fp, "#define %s \"%s\"\n", name, val);
+}
+
+void put_def_variable(FILE *fp, char *name, char *val)
+{
+  int len;
+  len = strlen(val);
+  if (len > MAXL) {
+    val[MAXL] = '\0';
+    val[MAXL-1] = '.';
+    val[MAXL-2] = '.';
+    val[MAXL-3] = '.';
+    len = MAXL;
+  }
+  fprintf(fp, "#define %s %s\n", name, val);
+}
+
+
+
+#if 0
+
+/* this version allows arbitrarily long lines but 
+ * some compilers don't like that and they're rarely
+ * useful 
+ */
+
+#define LINELEN 65
+void put_string(FILE *fp, char *name, char *val)
+{
+  int len, nlines, pos, i;
+  char line[100];
+  len = strlen(val);
+  nlines = len/LINELEN;
+  if (nlines*LINELEN < len) nlines++;
+  fprintf(fp, "%scharacter*%d %s\n", FINDENT, nlines*LINELEN, name);
+  fprintf(fp, "%sparameter (%s = \n", FINDENT, name);
+  for (i = 0; i < nlines; i++) {
+    pos = i*LINELEN;
+    if (i == 0) fprintf(fp, "%s\'", CONTINUE);
+    else        fprintf(fp, "%s", CONTINUE);
+    /* number should be same as LINELEN */
+    fprintf(fp, "%.65s", val+pos);
+    if (i == nlines-1) fprintf(fp, "\')\n");
+    else             fprintf(fp, "\n");
+  }
+}
+
+#endif
+
+
+/* integer log base two. Return error if argument isn't
+ * a power of two or is less than or equal to zero 
+ */
+
+int ilog2(int i)
+{
+  int log2;
+  int exp2 = 1;
+  if (i <= 0) return(-1);
+
+  for (log2 = 0; log2 < 30; log2++) {
+    if (exp2 == i) return(log2);
+    if (exp2 > i) break;
+    exp2 *= 2;
+  }
+  return(-1);
+}
+
+
+
+/* Power function. We could use pow from the math library, but then
+ * we would have to insist on always linking with the math library, just
+ * for this function. Since we only need pow with integer exponents,
+ * we'll code it ourselves here.
+ */
+
+double power(double base, int i)
+{
+  double x;
+
+  if (i==0) return (1.0);
+  else if (i<0) {
+    base = 1.0/base;
+    i = -i;
+  }
+  x = 1.0;
+  while (i>0) {
+    x *=base;
+    i--;
+  }
+  return (x);
+}
+    
+
+void write_convertdouble_info(int type, FILE *fp)
+{
+  switch(type) {
+  case SP:
+  case BT:
+  case LU:
+  case FT:
+  case MG:
+  case EP:
+  case CG:
+  case UA:
+    fprintf(fp, "%slogical  convertdouble\n", FINDENT);
+#ifdef CONVERTDOUBLE
+    fprintf(fp, "%sparameter (convertdouble = .true.)\n", FINDENT);
+#else
+    fprintf(fp, "%sparameter (convertdouble = .false.)\n", FINDENT);
+#endif
+    break;
+  }
+}
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/sys/suite.awk b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/sys/suite.awk
new file mode 100644
index 0000000..461adab
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/sys/suite.awk
@@ -0,0 +1,10 @@
+BEGIN { SMAKE = "make" } {
+  if ($1 !~ /^#/ &&  NF > 1) {
+    printf "cd `echo %s|tr '[a-z]' '[A-Z]'`; %s clean;", $1, SMAKE;
+    printf "%s CLASS=%s", SMAKE, $2;
+    if (NF > 2) {
+      printf " VERSION=%s", $3;
+    }
+    printf "; cd ..\n";
+  }
+}
diff --git a/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/README b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/README
new file mode 100644
index 0000000..f8a8ad8
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/NPB3.3.1/README
@@ -0,0 +1,115 @@
+NAS Parallel Benchmarks Version 3.3 (NPB3.3)
+------------------------------------------------
+
+  NAS Parallel Benchmarks Team
+  NASA Ames Research Center
+  Mail Stop: T27A-1
+  Moffett Field, CA   94035-1000
+
+  E-mail:  npb@nas.nasa.gov                                      
+  Fax:     (650) 604-3957                                        
+  http://www.nas.nasa.gov/Software/NPB/
+
+
+================================================
+INSTALLATION
+
+  For documentation on installing and running the NAS Parallel
+  Benchmarks, refer to subdirectory README files.
+
+
+================================================
+BACKGROUND
+
+  Information on NPB 3.3, including the technical reports, the          
+  original specifications, source code, results and information        
+  on how to submit new results, is available at:                       
+
+     http://www.nas.nasa.gov/Software/NPB/                              
+
+
+================================================
+Summary of New Features and Improvements
+ (Details are given in Changes.log.)
+
+
+ in NPB3.3.1 from NPB3.3:
+
+  - Bug fixes for:
+      MPI/FT - non-portable way of broadcasting input parameters
+      {OMP,SER}/DC - access to out-of-bound array elements
+      {OMP,SER}/UA - use of uninitialized array
+
+  - Code clean up in MPI/LU: avoid using MPI_ANY_SOURCE and delete
+      unused codes
+
+  - Additional timers are included in the MPI version
+
+  - Executables produced for OMP and SER now use ".x" as an extension
+
+
+ in NPB3.3 from NPB3.2.1:
+
+  - Introduction of the Class E problem in seven of the benchmarks
+    (BT, SP, LU, CG, MG, FT, and EP) to stress larger size parallel 
+    computers.
+
+  - Class D added to the IS benchmark in all three implementations.
+
+  - Enable the Bucket sort option for OMP/IS.
+
+  - Introduction of the "twiddle" array in the OpenMP FT benchmark
+    to improve performance
+
+  - Array padding in MPI/SP was adjusted to improve performance
+
+  - Merge the vector codes for the BT and LU benchmarks into this
+    release.
+
+  - The hyperplane version of LU (LU-HP) is no longer included 
+    in the distribution.  Download NPB3.2.1 if needed.
+
+
+ in NPB3.2.1 from NPB3.2:
+
+  - A number of bug fixes for the MPI versions of {FT, LU, MG, BT} and 
+    the OpenMP version of LU
+
+  - Improvements on the OpenMP versions of {EP, IS, UA}
+    (see *OMP/UA/README for a special note on UA)
+
+
+ in NPB3.2 from NPB3.1:
+
+  - Serial DC was converted to C from C++ (only classes S, W, A and B
+    are available)
+
+  - OpenMP version of DC was added (only classes S, W, A and B
+    are available)
+
+  - Inclusion of the new DT benchmark (MPI)
+
+
+ in NPB3.1 from NPB3.0 & NPB2.4:
+
+  - MPI, OpenMP, and Serial versions are now merged into one package
+
+  - Inclusion of the Class D problem in both serial and OpenMP versions
+
+  - Inclusion of the new UA benchmark (Serial & OpenMP)
+
+  - Inclusion of "LU-HP" in the OpenMP version
+
+  - Inclusion of the new DC benchmark (Serial)
+
+  - Use of relative errors for verification in both CG and MG
+
+  - Change in problem parameters for MG Class W
+
+
+The NPB IO benchmark is part of NPB3.3-MPI.  Check the README file
+in that subdirectory for additional information.
+
+The Java and HPF implementations are not included in this distribution.
+Please use the NPB3.0 distribution.
+
diff --git a/src/npb/disk-image/npb/npb-hooks/README.md b/src/npb/disk-image/npb/npb-hooks/README.md
new file mode 100644
index 0000000..bc81f9b
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/README.md
@@ -0,0 +1,26 @@
+# npb-hooks
+Annotating the region of interest for npb.
+
+IMPORTANT NOTE: This repo is not intended to be the canonical source for the benchmarks; it serves only as an example of annotating the ROI. The source code can be obtained from [NAS Parallel Benchmarks](https://www.nas.nasa.gov/publications/npb.html).
+
+This repo adds ROI hooks for the NAS Parallel Benchmarks (OMP version for now). In this particular implementation, the hooks are coupled with gem5-specific instructions (m5_dumpreststats) to collect the stats for the ROI. However, the hooks can be used with any other tool with minimal effort.
+
+To enable hooks, run make with HOOKS=1.
+
+## Summary of the steps taken:
+
+### For the suite:
+hooks.c defines the functions called by each benchmark, and the actions to be taken at the start/end of the ROI.
+
+Adding gem5 instructions to the hooks:
+In make.common we should add proper compilation options to create object files.
+
+In make.def we should define the path to the gem5 directory. Also, -cpp should be added to the Fortran compiler (FF) options to enable support for the C preprocessor.
+
+### For each benchmark in the suite:
+The source file (i.e. BENCH.f or BENCH.c) should be modified to call the roi_begin and roi_end functions. Here, we follow the methodology used by the developers: the function calls are placed right before and after the timing procedures.
+We use the preprocessor for conditional compilation of the added function calls (HOOKS).
+
+The make files should be modified to add the object files created (hooks.o and any other possible dependencies - in our case m5op_x86.o).
+Also, if hooks are enabled, the proper flag (-DHOOKS) should be set in the final step of the compilation process (creating the executable).
+Both are done conditionally under the HOOKS flag (ifeq ($(HOOKS), 1)).
diff --git a/src/npb/disk-image/npb/npb-hooks/disk-image/README.md b/src/npb/disk-image/npb/npb-hooks/disk-image/README.md
new file mode 100644
index 0000000..b2a90f1
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/disk-image/README.md
@@ -0,0 +1,26 @@
+# Steps to create the disk image  
+Step 1. Download `packer` at [packer.io](https://www.packer.io/downloads.html).  
+Step 2. Build the disk image  
+```bash
+./packer build ubuntu.json
+```
+The output will be in the folder `output-ubuntu1804`. The disk image is in RAW format.  
+# Customize the disk image
+`scripts/post-installation.sh`: the script that runs after Ubuntu Server is installed.
+## How to execute a command as a root user?
+For example, if the password is `12345`,
+```bash
+echo 12345 | sudo -S [command];
+```
+## How to access the disk image after the building process (e.g. to install packages/to inspect the image manually)?
+Adding the following to the post-installation script makes it sleep until a file exists, so you can create that file to let the script exit. For example, if the file is `/tmp/quit`, add the following to the post-installation script,
+```bash
+while [ ! -f /tmp/quit ]
+do
+  sleep 1m
+done
+```
+and to exit,
+```bash
+touch /tmp/quit
+```
diff --git a/src/npb/disk-image/npb/npb-hooks/disk-image/http/preseed.cfg b/src/npb/disk-image/npb/npb-hooks/disk-image/http/preseed.cfg
new file mode 100644
index 0000000..4cdccfc
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/disk-image/http/preseed.cfg
@@ -0,0 +1,494 @@
+#### Contents of the preconfiguration file (for stretch)
+### Localization
+# Preseeding only locale sets language, country and locale.
+d-i debian-installer/locale string en_US
+
+# The values can also be preseeded individually for greater flexibility.
+#d-i debian-installer/language string en
+#d-i debian-installer/country string NL
+#d-i debian-installer/locale string en_GB.UTF-8
+# Optionally specify additional locales to be generated.
+#d-i localechooser/supported-locales multiselect en_US.UTF-8, nl_NL.UTF-8
+
+# Keyboard selection.
+# Disable automatic (interactive) keymap detection.
+d-i console-setup/ask_detect boolean false
+d-i keyboard-configuration/xkb-keymap select us
+# To select a variant of the selected layout:
+#d-i keyboard-configuration/xkb-keymap select us(dvorak)
+# d-i keyboard-configuration/toggle select No toggling
+
+### Network configuration
+# Disable network configuration entirely. This is useful for cdrom
+# installations on non-networked devices where the network questions,
+# warning and long timeouts are a nuisance.
+#d-i netcfg/enable boolean false
+
+# netcfg will choose an interface that has link if possible. This makes it
+# skip displaying a list if there is more than one interface.
+d-i netcfg/choose_interface select auto
+
+
+
+# To set a different link detection timeout (default is 3 seconds).
+# Values are interpreted as seconds.
+#d-i netcfg/link_wait_timeout string 10
+
+# If you have a slow dhcp server and the installer times out waiting for
+# it, this might be useful.
+#d-i netcfg/dhcp_timeout string 60
+#d-i netcfg/dhcpv6_timeout string 60
+
+# If you prefer to configure the network manually, uncomment this line and
+# the static network configuration below.
+#d-i netcfg/disable_autoconfig boolean true
+
+# If you want the preconfiguration file to work on systems both with and
+# without a dhcp server, uncomment these lines and the static network
+# configuration below.
+#d-i netcfg/dhcp_failed note
+#d-i netcfg/dhcp_options select Configure network manually
+
+# Static network configuration.
+#
+# IPv4 example
+#d-i netcfg/get_ipaddress string 192.168.1.42
+#d-i netcfg/get_netmask string 255.255.255.0
+#d-i netcfg/get_gateway string 192.168.1.1
+#d-i netcfg/get_nameservers string 192.168.1.1
+#d-i netcfg/confirm_static boolean true
+#
+# IPv6 example
+#d-i netcfg/get_ipaddress string fc00::2
+#d-i netcfg/get_netmask string ffff:ffff:ffff:ffff::
+#d-i netcfg/get_gateway string fc00::1
+#d-i netcfg/get_nameservers string fc00::1
+#d-i netcfg/confirm_static boolean true
+
+# Any hostname and domain names assigned from dhcp take precedence over
+# values set here. However, setting the values still prevents the questions
+# from being shown, even if values come from dhcp.
+d-i netcfg/get_hostname string unassigned-hostname
+d-i netcfg/get_domain string unassigned-domain
+
+# If you want to force a hostname, regardless of what either the DHCP
+# server returns or what the reverse DNS entry for the IP is, uncomment
+# and adjust the following line.
+#d-i netcfg/hostname string somehost
+
+# Disable that annoying WEP key dialog.
+d-i netcfg/wireless_wep string
+# The wacky dhcp hostname that some ISPs use as a password of sorts.
+#d-i netcfg/dhcp_hostname string radish
+
+# If non-free firmware is needed for the network or other hardware, you can
+# configure the installer to always try to load it, without prompting. Or
+# change to false to disable asking.
+#d-i hw-detect/load_firmware boolean true
+
+### Network console
+# Use the following settings if you wish to make use of the network-console
+# component for remote installation over SSH. This only makes sense if you
+# intend to perform the remainder of the installation manually.
+#d-i anna/choose_modules string network-console
+#d-i network-console/authorized_keys_url string http://10.0.0.1/openssh-key
+#d-i network-console/password password r00tme
+#d-i network-console/password-again password r00tme
+# Use this instead if you prefer to use key-based authentication
+#d-i network-console/authorized_keys_url http://host/authorized_keys
+
+### Mirror settings
+# If you select ftp, the mirror/country string does not need to be set.
+#d-i mirror/protocol string ftp
+d-i mirror/country string manual
+d-i mirror/http/hostname string archive.ubuntu.com
+d-i mirror/http/directory string /ubuntu
+d-i mirror/http/proxy string
+
+# Alternatively: by default, the installer uses CC.archive.ubuntu.com where
+# CC is the ISO-3166-2 code for the selected country. You can preseed this
+# so that it does so without asking.
+#d-i mirror/http/mirror select CC.archive.ubuntu.com
+
+# Suite to install.
+#d-i mirror/suite string stretch
+# Suite to use for loading installer components (optional).
+#d-i mirror/udeb/suite string stretch
+# Components to use for loading installer components (optional).
+#d-i mirror/udeb/components multiselect main, restricted
+
+### Account setup
+# Skip creation of a root account (normal user account will be able to
+# use sudo). The default is false; preseed this to true if you want to set
+# a root password.
+d-i passwd/root-login boolean false
+# Alternatively, to skip creation of a normal user account.
+#d-i passwd/make-user boolean false
+
+# Root password, either in clear text
+#d-i passwd/root-password password r00tme
+#d-i passwd/root-password-again password r00tme
+# or encrypted using a crypt(3)  hash.
+#d-i passwd/root-password-crypted password [crypt(3) hash]
+
+# To create a normal user account.
+d-i passwd/user-fullname string gem5
+d-i passwd/username string gem5
+# Normal user's password, either in clear text
+d-i passwd/user-password password 12345
+d-i passwd/user-password-again password 12345
+# or encrypted using a crypt(3) hash.
+#d-i passwd/user-password-crypted password [crypt(3) hash]
+# Create the first user with the specified UID instead of the default.
+#d-i passwd/user-uid string 1010
+# The installer will warn about weak passwords. If you are sure you know
+# what you're doing and want to override it, uncomment this.
+d-i user-setup/allow-password-weak boolean true
+
+# The user account will be added to some standard initial groups. To
+# override that, use this.
+#d-i passwd/user-default-groups string audio cdrom video
+
+# Set to true if you want to encrypt the first user's home directory.
+d-i user-setup/encrypt-home boolean false
+
+### Clock and time zone setup
+# Controls whether or not the hardware clock is set to UTC.
+d-i clock-setup/utc boolean true
+
+# You may set this to any valid setting for $TZ; see the contents of
+# /usr/share/zoneinfo/ for valid values.
+d-i time/zone string US/Eastern
+
+# Controls whether to use NTP to set the clock during the install
+d-i clock-setup/ntp boolean true
+# NTP server to use. The default is almost always fine here.
+#d-i clock-setup/ntp-server string ntp.example.com
+
+### i386 specific disk storage
+# Activate DASD disks
+#d-i s390-dasd/dasd string 0.0.0200,0.0.0300,0.0.0400
+
+# DASD configuration; by default dasdfmt (low-level format) if needed
+#d-i s390-dasd/auto-format boolean true
+#d-i s390-dasd/force-format boolean true
+
+# zFCP activation and configuration
+# d-i s390-zfcp/zfcp string 0.0.1b34:0x400870075678a1b2:0x201480c800000000, \
+#                           0.0.1b34:0x400870075679a1b2:0x201480c800000000
+
+### Partitioning
+## Partitioning example
+# If the system has free space you can choose to only partition that space.
+# This is only honoured if partman-auto/method (below) is not set.
+# Alternatives: custom, some_device, some_device_crypto, some_device_lvm.
+#d-i partman-auto/init_automatically_partition select biggest_free
+
+# Alternatively, you may specify a disk to partition. If the system has only
+# one disk the installer will default to using that, but otherwise the device
+# name must be given in traditional, non-devfs format (so e.g. /dev/sda
+# and not e.g. /dev/discs/disc0/disc).
+# For example, to use the first SCSI/SATA hard disk:
+#d-i partman-auto/disk string /dev/sda
+# In addition, you'll need to specify the method to use.
+# The presently available methods are:
+# - regular: use the usual partition types for your architecture
+# - lvm:     use LVM to partition the disk
+# - crypto:  use LVM within an encrypted partition
+d-i partman-auto/method string regular
+
+# If one of the disks that are going to be automatically partitioned
+# contains an old LVM configuration, the user will normally receive a
+# warning. This can be preseeded away...
+d-i partman-lvm/device_remove_lvm boolean true
+# The same applies to pre-existing software RAID array:
+d-i partman-md/device_remove_md boolean true
+# And the same goes for the confirmation to write the lvm partitions.
+d-i partman-lvm/confirm boolean true
+d-i partman-lvm/confirm_nooverwrite boolean true
+
+# For LVM partitioning, you can select how much of the volume group to use
+# for logical volumes.
+#d-i partman-auto-lvm/guided_size string max
+#d-i partman-auto-lvm/guided_size string 10GB
+#d-i partman-auto-lvm/guided_size string 50%
+
+# You can choose one of the three predefined partitioning recipes:
+# - atomic: all files in one partition
+# - home:   separate /home partition
+# - multi:  separate /home, /var, and /tmp partitions
+d-i partman-auto/choose_recipe select atomic
+
+# Or provide a recipe of your own...
+# If you have a way to get a recipe file into the d-i environment, you can
+# just point at it.
+#d-i partman-auto/expert_recipe_file string /hd-media/recipe
+
+# If not, you can put an entire recipe into the preconfiguration file in one
+# (logical) line. This example creates a small /boot partition, suitable
+# swap, and uses the rest of the space for the root partition:
+#d-i partman-auto/expert_recipe string                         \
+#      boot-root ::                                            \
+#              40 50 100 ext3                                  \
+#                      $primary{ } $bootable{ }                \
+#                      method{ format } format{ }              \
+#                      use_filesystem{ } filesystem{ ext3 }    \
+#                      mountpoint{ /boot }                     \
+#              .                                               \
+#              500 10000 1000000000 ext3                       \
+#                      method{ format } format{ }              \
+#                      use_filesystem{ } filesystem{ ext3 }    \
+#                      mountpoint{ / }                         \
+#              .                                               \
+#              64 512 300% linux-swap                          \
+#                      method{ swap } format{ }                \
+#              .
+
+# If you just want to change the default filesystem from ext3 to something
+# else, you can do that without providing a full recipe.
+#d-i partman/default_filesystem string ext4
+
+# The full recipe format is documented in the file partman-auto-recipe.txt
+# included in the 'debian-installer' package or available from D-I source
+# repository. This also documents how to specify settings such as file
+# system labels, volume group names and which physical devices to include
+# in a volume group.
+
+# This makes partman automatically partition without confirmation, provided
+# that you told it what to do using one of the methods above.
+d-i partman-partitioning/confirm_write_new_label boolean true
+d-i partman/choose_partition select finish
+d-i partman/confirm boolean true
+d-i partman/confirm_nooverwrite boolean true
+
+## Partitioning using RAID
+# The method should be set to "raid".
+#d-i partman-auto/method string raid
+# Specify the disks to be partitioned. They will all get the same layout,
+# so this will only work if the disks are the same size.
+#d-i partman-auto/disk string /dev/sda /dev/sdb
+
+# Next you need to specify the physical partitions that will be used. 
+#d-i partman-auto/expert_recipe string \
+#      multiraid ::                                         \
+#              1000 5000 4000 raid                          \
+#                      $primary{ } method{ raid }           \
+#              .                                            \
+#              64 512 300% raid                             \
+#                      method{ raid }                       \
+#              .                                            \
+#              500 10000 1000000000 raid                    \
+#                      method{ raid }                       \
+#              .
+
+# Last you need to specify how the previously defined partitions will be
+# used in the RAID setup. Remember to use the correct partition numbers
+# for logical partitions. RAID levels 0, 1, 5, 6 and 10 are supported;
+# devices are separated using "#".
+# Parameters are:
+# <raidtype> <devcount> <sparecount> <fstype> <mountpoint> \
+#          <devices> <sparedevices>
+
+#d-i partman-auto-raid/recipe string \
+#    1 2 0 ext3 /                    \
+#          /dev/sda1#/dev/sdb1       \
+#    .                               \
+#    1 2 0 swap -                    \
+#          /dev/sda5#/dev/sdb5       \
+#    .                               \
+#    0 2 0 ext3 /home                \
+#          /dev/sda6#/dev/sdb6       \
+#    .
+
+# For additional information see the file partman-auto-raid-recipe.txt
+# included in the 'debian-installer' package or available from D-I source
+# repository.
+
+# This makes partman automatically partition without confirmation.
+d-i partman-md/confirm boolean true
+d-i partman-partitioning/confirm_write_new_label boolean true
+d-i partman/choose_partition select finish
+d-i partman/confirm boolean true
+d-i partman/confirm_nooverwrite boolean true
+
+## Controlling how partitions are mounted
+# The default is to mount by UUID, but you can also choose "traditional" to
+# use traditional device names, or "label" to try filesystem labels before
+# falling back to UUIDs.
+#d-i partman/mount_style select uuid
+
+### Base system installation
+# Configure a path to the preconfigured base filesystem. This can be used to
+# specify a path for the installer to retrieve the filesystem image that will
+# be deployed to disk and used as a base system for the installation.
+#d-i live-installer/net-image string /install/filesystem.squashfs
+ 
+# Configure APT to not install recommended packages by default. Use of this
+# option can result in an incomplete system and should only be used by very
+# experienced users.
+#d-i base-installer/install-recommends boolean false
+
+# The kernel image (meta) package to be installed; "none" can be used if no
+# kernel is to be installed.
+#d-i base-installer/kernel/image string linux-generic
+
+### Apt setup
+# You can choose to install restricted and universe software, or to install
+# software from the backports repository.
+#d-i apt-setup/restricted boolean true
+#d-i apt-setup/universe boolean true
+#d-i apt-setup/backports boolean true
+# Uncomment this if you don't want to use a network mirror.
+#d-i apt-setup/use_mirror boolean false
+# Select which update services to use; define the mirrors to be used.
+# Values shown below are the normal defaults.
+#d-i apt-setup/services-select multiselect security
+#d-i apt-setup/security_host string security.ubuntu.com
+#d-i apt-setup/security_path string /ubuntu
+
+# Additional repositories, local[0-9] available
+#d-i apt-setup/local0/repository string \
+#       http://local.server/ubuntu stretch main
+#d-i apt-setup/local0/comment string local server
+# Enable deb-src lines
+#d-i apt-setup/local0/source boolean true
+# URL to the public key of the local repository; you must provide a key or
+# apt will complain about the unauthenticated repository and so the
+# sources.list line will be left commented out
+#d-i apt-setup/local0/key string http://local.server/key
+
+# By default the installer requires that repositories be authenticated
+# using a known gpg key. This setting can be used to disable that
+# authentication. Warning: Insecure, not recommended.
+#d-i debian-installer/allow_unauthenticated boolean true
+
+# Uncomment this to add multiarch configuration for i386
+#d-i apt-setup/multiarch string i386
+
+
+### Package selection
+tasksel tasksel/first multiselect standard, ubuntu-server
+#tasksel tasksel/first multiselect lamp-server, print-server
+#tasksel tasksel/first multiselect kubuntu-desktop
+
+# Individual additional packages to install
+d-i pkgsel/include string openssh-server build-essential
+#d-i pkgsel/include string openssh-server
+# Whether to upgrade packages after debootstrap.
+# Allowed values: none, safe-upgrade, full-upgrade
+d-i pkgsel/upgrade select none
+
+# Language pack selection
+#d-i pkgsel/language-packs multiselect de, en, zh
+
+# Policy for applying updates. May be "none" (no automatic updates),
+# "unattended-upgrades" (install security updates automatically), or
+# "landscape" (manage system with Landscape).
+d-i pkgsel/update-policy select none
+
+# Some versions of the installer can report back on what software you have
+# installed, and what software you use. The default is not to report back,
+# but sending reports helps the project determine what software is most
+# popular and include it on CDs.
+popularity-contest popularity-contest/participate boolean false
+
+# By default, the system's locate database will be updated after the
+# installer has finished installing most packages. This may take a while, so
+# if you don't want it, you can set this to "false" to turn it off.
+#d-i pkgsel/updatedb boolean true
+
+### Boot loader installation
+# Grub is the default boot loader (for x86). If you want lilo installed
+# instead, uncomment this:
+#d-i grub-installer/skip boolean true
+# To also skip installing lilo, and install no bootloader, uncomment this
+# too:
+#d-i lilo-installer/skip boolean true
+
+
+# This is fairly safe to set, it makes grub install automatically to the MBR
+# if no other operating system is detected on the machine.
+d-i grub-installer/only_debian boolean true
+
+# This one makes grub-installer install to the MBR if it also finds some other
+# OS, which is less safe as it might not be able to boot that other OS.
+#d-i grub-installer/with_other_os boolean true
+
+# Due notably to potential USB sticks, the location of the MBR can not be
+# determined safely in general, so this needs to be specified:
+d-i grub-installer/bootdev  string /dev/sda
+# To install to the first device (assuming it is not a USB stick):
+d-i grub-installer/bootdev  string default
+
+# Alternatively, if you want to install to a location other than the mbr,
+# uncomment and edit these lines:
+#d-i grub-installer/only_debian boolean false
+#d-i grub-installer/with_other_os boolean false
+#d-i grub-installer/bootdev  string (hd0,1)
+# To install grub to multiple disks:
+#d-i grub-installer/bootdev  string (hd0,1) (hd1,1) (hd2,1)
+
+# Optional password for grub, either in clear text
+#d-i grub-installer/password password r00tme
+#d-i grub-installer/password-again password r00tme
+# or encrypted using an MD5 hash, see grub-md5-crypt(8).
+#d-i grub-installer/password-crypted password [MD5 hash]
+
+# Use the following option to add additional boot parameters for the
+# installed system (if supported by the bootloader installer).
+# Note: options passed to the installer will be added automatically.
+#d-i debian-installer/add-kernel-opts string nousb
+
+### Finishing up the installation
+# During installations from serial console, the regular virtual consoles
+# (VT1-VT6) are normally disabled in /etc/inittab. Uncomment the next
+# line to prevent this.
+#d-i finish-install/keep-consoles boolean true
+
+# Avoid that last message about the install being complete.
+d-i finish-install/reboot_in_progress note
+nobootloader nobootloader/confirmation_common note
+
+# This will prevent the installer from ejecting the CD during the reboot,
+# which is useful in some situations.
+#d-i cdrom-detect/eject boolean false
+
+# This is how to make the installer shutdown when finished, but not
+# reboot into the installed system.
+#d-i debian-installer/exit/halt boolean true
+# This will power off the machine instead of just halting it.
+#d-i debian-installer/exit/poweroff boolean true
+
+### Preseeding other packages
+# Depending on what software you choose to install, or if things go wrong
+# during the installation process, it's possible that other questions may
+# be asked. You can preseed those too, of course. To get a list of every
+# possible question that could be asked during an install, do an
+# installation, and then run these commands:
+#   debconf-get-selections --installer > file
+#   debconf-get-selections >> file
+
+
+#### Advanced options
+### Running custom commands during the installation
+## i386 Preseed Example
+# d-i preseeding is inherently not secure. Nothing in the installer checks
+# for attempts at buffer overflows or other exploits of the values of a
+# preconfiguration file like this one. Only use preconfiguration files from
+# trusted locations! To drive that home, and because it's generally useful,
+# here's a way to run any shell command you'd like inside the installer,
+# automatically.
+
+# This first command is run as early as possible, just after
+# preseeding is read.
+#d-i preseed/early_command string anna-install some-udeb
+# This command is run immediately before the partitioner starts. It may be
+# useful to apply dynamic partitioner preseeding that depends on the state
+# of the disks (which may not be visible when preseed/early_command runs).
+#d-i partman/early_command \
+#       string debconf-set partman-auto/disk "$(list-devices disk | head -n1)"
+# This command is run just before the install finishes, but when there is
+# still a usable /target directory. You can chroot to /target and use it
+# directly, or use the apt-install and in-target commands to easily install
+# packages and run commands in the target system.
+#d-i preseed/late_command string apt-get install openssh-server; in-target chsh -s /bin/zsh
diff --git a/src/npb/disk-image/npb/npb-hooks/disk-image/scripts/post-installation.sh b/src/npb/disk-image/npb/npb-hooks/disk-image/scripts/post-installation.sh
new file mode 100644
index 0000000..9621f06
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/disk-image/scripts/post-installation.sh
@@ -0,0 +1,8 @@
+echo 'Post Installation Started';
+echo '12345' | sudo -S apt-get install -y gfortran;
+echo 'Building NPB';
+cd ~/NPB3.3.1/NPB3.3-OMP/;
+mkdir -p bin;
+make suite -j 8;
+echo 'Building Done'
+echo 'Post Installation Done';
diff --git a/src/npb/disk-image/npb/npb-hooks/disk-image/ubuntu.json b/src/npb/disk-image/npb/npb-hooks/disk-image/ubuntu.json
new file mode 100644
index 0000000..40ba836
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-hooks/disk-image/ubuntu.json
@@ -0,0 +1,87 @@
+{
+    "builders":
+    [
+        {
+            "type": "qemu",
+            "format": "raw",
+            "accelerator": "kvm",
+            "boot_command":
+            [
+                "{{ user `boot_command_prefix` }}",
+                "debian-installer={{ user `locale` }} auto locale={{ user `locale` }} kbd-chooser/method=us ",
+                "file=/floppy/{{ user `preseed` }} ",
+                "fb=false debconf/frontend=noninteractive ",
+                "hostname={{ user `hostname` }} ",
+                "/install/vmlinuz noapic ",
+                "initrd=/install/initrd.gz ",
+                "keyboard-configuration/modelcode=SKIP keyboard-configuration/layout=USA ",
+                "keyboard-configuration/variant=USA console-setup/ask_detect=false ",
+                "passwd/user-fullname={{ user `ssh_fullname` }} ",
+                "passwd/user-password={{ user `ssh_password` }} ",
+                "passwd/user-password-again={{ user `ssh_password` }} ",
+                "passwd/username={{ user `ssh_username` }} ",
+                "-- <enter>"
+            ],
+            "cpus": "{{ user `vm_cpus`}}",
+            "disk_size": "{{ user `image_size` }}",
+            "floppy_files": 
+            [
+                "http/{{ user `preseed` }}"
+            ],
+            "headless": "{{ user `headless` }}",
+            "http_directory": "http",
+            "iso_checksum": "{{ user `iso_checksum` }}",
+            "iso_checksum_type": "{{ user `iso_checksum_type` }}",
+            "iso_urls": [ "{{ user `iso_url` }}" ],
+            "memory": "{{ user `vm_memory`}}",
+            "output_directory": "{{ user `image_name` }}-image",
+            "qemuargs": 
+            [
+                [ "-cpu", "host" ],
+                [ "-display", "none" ]
+            ],
+            "shutdown_command": "echo '{{ user `ssh_password` }}'|sudo -S shutdown -P now",
+            "ssh_password": "{{ user `ssh_password` }}",
+            "ssh_username": "{{ user `ssh_username` }}",
+            "ssh_wait_timeout": "60m",
+            "vm_name": "{{ user `image_name` }}"
+        }
+    ],
+    "provisioners": 
+    [
+        {
+            "type": "file",
+            "source": "../NPB3.3.1",
+            "destination": "/home/gem5"
+        },
+        {
+            "type": "shell",
+            "execute_command": "echo '{{ user `ssh_password` }}' | {{.Vars}} sudo -E -S bash '{{.Path}}'",
+            "scripts": 
+            [
+                "scripts/post-installation.sh"
+            ]
+        }
+    ],
+    "variables": 
+    {
+        "boot_command_prefix": "<enter><wait><f6><esc><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs>",
+        "desktop": "false",
+        "image_size": "8192",
+        "headless": "true",
+        "iso_checksum": "34416ff83179728d54583bf3f18d42d2",
+        "iso_checksum_type": "md5",
+        "iso_name": "ubuntu-18.04.2-server-amd64.iso",
+        "iso_url": "http://cdimage.ubuntu.com/releases/18.04.2/release/ubuntu-18.04.2-server-amd64.iso",
+        "locale": "en_US",
+        "preseed" : "preseed.cfg",
+        "hostname": "gem5",
+        "ssh_fullname": "gem5",
+        "ssh_password": "12345",
+        "ssh_username": "gem5",
+        "vm_cpus": "16",
+        "vm_memory": "8192",
+        "image_name": "ubuntu-NPB-benchmark"
+    }
+
+}
diff --git a/src/npb/disk-image/npb/npb-install.sh b/src/npb/disk-image/npb/npb-install.sh
new file mode 100755
index 0000000..3a88506
--- /dev/null
+++ b/src/npb/disk-image/npb/npb-install.sh
@@ -0,0 +1,16 @@
+#!/bin/sh
+
+# Copyright (c) 2020 The Regents of the University of California.
+# SPDX-License-Identifier: BSD-3-Clause
+
+# Install build-essential (gcc and g++ included) and gfortran
+
+echo "12345" | sudo -S apt-get install -y build-essential gfortran
+
+# Compile NPB
+
+cd /home/gem5/NPB3.3-OMP/
+
+mkdir -p bin
+
+make suite HOOKS=1
diff --git a/src/npb/disk-image/npb/npb.json b/src/npb/disk-image/npb/npb.json
new file mode 100755
index 0000000..1f8df4c
--- /dev/null
+++ b/src/npb/disk-image/npb/npb.json
@@ -0,0 +1,106 @@
+{
+    "_author": "Hoa Nguyen <hoanguyen@ucdavis.edu>, Ayaz Akram <yazakram@ucdavis.edu>",
+    "_license": "Copyright (c) 2020 The Regents of the University of California. SPDX-License-Identifier: BSD-3-Clause",
+    "builders":
+    [
+        {
+            "type": "qemu",
+            "format": "raw",
+            "accelerator": "kvm",
+            "boot_command":
+            [
+                "{{ user `boot_command_prefix` }}",
+                "debian-installer={{ user `locale` }} auto locale={{ user `locale` }} kbd-chooser/method=us ",
+                "file=/floppy/{{ user `preseed` }} ",
+                "fb=false debconf/frontend=noninteractive ",
+                "hostname={{ user `hostname` }} ",
+                "/install/vmlinuz noapic ",
+                "initrd=/install/initrd.gz ",
+                "keyboard-configuration/modelcode=SKIP keyboard-configuration/layout=USA ",
+                "keyboard-configuration/variant=USA console-setup/ask_detect=false ",
+                "passwd/user-fullname={{ user `ssh_fullname` }} ",
+                "passwd/user-password={{ user `ssh_password` }} ",
+                "passwd/user-password-again={{ user `ssh_password` }} ",
+                "passwd/username={{ user `ssh_username` }} ",
+                "-- <enter>"
+            ],
+            "cpus": "{{ user `vm_cpus`}}",
+            "disk_size": "{{ user `image_size` }}",
+            "floppy_files":
+            [
+                "shared/{{ user `preseed` }}"
+            ],
+            "headless": "{{ user `headless` }}",
+            "http_directory": "shared/",
+            "iso_checksum": "{{ user `iso_checksum` }}",
+            "iso_checksum_type": "{{ user `iso_checksum_type` }}",
+            "iso_urls": [ "{{ user `iso_url` }}" ],
+            "memory": "{{ user `vm_memory`}}",
+            "output_directory": "npb/{{ user `image_name` }}-image",
+            "qemuargs":
+            [
+                [ "-cpu", "host" ],
+                [ "-display", "none" ]
+            ],
+            "qemu_binary":"/usr/bin/qemu-system-x86_64",
+            "shutdown_command": "echo '{{ user `ssh_password` }}'|sudo -S shutdown -P now",
+            "ssh_password": "{{ user `ssh_password` }}",
+            "ssh_username": "{{ user `ssh_username` }}",
+            "ssh_wait_timeout": "60m",
+            "vm_name": "{{ user `image_name` }}"
+        }
+    ],
+    "provisioners":
+    [
+        {
+            "type": "file",
+            "source": "../gem5/util/m5/build/x86/out/m5",
+            "destination": "/home/gem5/"
+        },
+        {
+            "type": "file",
+            "source": "shared/serial-getty@.service",
+            "destination": "/home/gem5/"
+        },
+        {
+            "type": "file",
+            "source": "npb/runscript.sh",
+            "destination": "/home/gem5/"
+        },
+        {
+            "type": "file",
+            "source": "npb/npb-hooks/NPB3.3.1/NPB3.3-OMP",
+            "destination": "/home/gem5/"
+        },
+        {
+            "type": "shell",
+            "execute_command": "echo '{{ user `ssh_password` }}' | {{.Vars}} sudo -E -S bash '{{.Path}}'",
+            "scripts":
+            [
+                "npb/post-installation.sh",
+                "npb/npb-install.sh"
+            ]
+        }
+    ],
+    "variables":
+    {
+        "boot_command_prefix": "<enter><wait><f6><esc><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs>",
+        "desktop": "false",
+        "image_size": "12000",
+        "headless": "true",
+        "iso_checksum": "34416ff83179728d54583bf3f18d42d2",
+        "iso_checksum_type": "md5",
+        "iso_name": "ubuntu-18.04.2-server-amd64.iso",
+        "iso_url": "http://old-releases.ubuntu.com/releases/18.04.2/ubuntu-18.04.2-server-amd64.iso",
+        "locale": "en_US",
+        "preseed" : "preseed.cfg",
+        "hostname": "gem5",
+        "ssh_fullname": "gem5",
+        "ssh_password": "12345",
+        "ssh_username": "gem5",
+        "vm_cpus": "4",
+        "vm_memory": "8192",
+        "image_name": "npb"
+    }
+
+}
diff --git a/src/npb/disk-image/npb/post-installation.sh b/src/npb/disk-image/npb/post-installation.sh
new file mode 100755
index 0000000..0ecb806
--- /dev/null
+++ b/src/npb/disk-image/npb/post-installation.sh
@@ -0,0 +1,16 @@
+#!/bin/bash
+
+# Copyright (c) 2020 The Regents of the University of California.
+# SPDX-License-Identifier: BSD-3-Clause
+
+echo 'Post Installation Started'
+
+mv /home/gem5/serial-getty@.service /lib/systemd/system/
+
+mv /home/gem5/m5 /sbin
+ln -s /sbin/m5 /sbin/gem5
+
+# Append runscript.sh to root's .bashrc so that the host-provided script runs after booting
+cat /home/gem5/runscript.sh >> /root/.bashrc
+
+echo 'Post Installation Done'
diff --git a/src/npb/disk-image/npb/runscript.sh b/src/npb/disk-image/npb/runscript.sh
new file mode 100755
index 0000000..15e4377
--- /dev/null
+++ b/src/npb/disk-image/npb/runscript.sh
@@ -0,0 +1,13 @@
+#!/bin/sh
+
+# Copyright (c) 2020 The Regents of the University of California.
+# SPDX-License-Identifier: BSD-3-Clause
+
+m5 readfile > script.sh
+if [ -s script.sh ]; then
+    # if the file is not empty, execute it
+    chmod +x script.sh
+    ./script.sh
+    m5 exit
+fi
+# otherwise, drop to the terminal
diff --git a/src/npb/disk-image/shared/preseed.cfg b/src/npb/disk-image/shared/preseed.cfg
new file mode 100755
index 0000000..b5cd8a7
--- /dev/null
+++ b/src/npb/disk-image/shared/preseed.cfg
@@ -0,0 +1,96 @@
+# Copyright (c) 2020 The Regents of the University of California.
+# SPDX-License-Identifier: BSD-3-Clause
+
+# Choosing keyboard layout
+d-i debian-installer/locale string en_US
+d-i console-setup/ask_detect boolean false
+d-i keyboard-configuration/xkb-keymap select us
+
+# Choosing network interface
+d-i netcfg/choose_interface select auto
+
+# Assigning hostname and domain
+d-i netcfg/get_hostname string gem5-host
+d-i netcfg/get_domain string gem5-domain
+
+d-i netcfg/wireless_wep string
+
+# https://unix.stackexchange.com/q/216348
+# The above link says there is no way to avoid setting a mirror
+# Should choose a local mirror
+d-i mirror/country string manual
+d-i mirror/http/hostname string archive.ubuntu.com
+d-i mirror/http/directory string /ubuntu
+d-i mirror/http/proxy string
+
+# Disabling the `root` account login
+d-i passwd/root-login boolean false
+
+# Creating a normal user account. This account has sudo permission.
+d-i passwd/user-fullname string gem5
+d-i passwd/username string gem5
+d-i passwd/user-password password 12345
+d-i passwd/user-password-again password 12345
+d-i user-setup/allow-password-weak boolean true
+
+# No home folder encryption
+d-i user-setup/encrypt-home boolean false
+
+# Choosing the clock timezone
+d-i clock-setup/utc boolean true
+d-i time/zone string US/Eastern
+d-i clock-setup/ntp boolean true
+
+# Choosing partition scheme
+# This setting should result in MBR
+# gem5 doesn't work with logical volumes
+d-i partman-auto/method string regular
+d-i partman-lvm/device_remove_lvm boolean true
+d-i partman-md/device_remove_md boolean true
+d-i partman-lvm/confirm boolean true
+d-i partman-lvm/confirm_nooverwrite boolean true
+
+# Ignoring an option to set the home folder in another partition
+d-i partman-auto/choose_recipe select atomic
+
+# Finishing disk partition settings
+d-i partman-md/confirm boolean true
+d-i partman-partitioning/confirm_write_new_label boolean true
+d-i partman/choose_partition select finish
+d-i partman/confirm boolean true
+d-i partman/confirm_nooverwrite boolean true
+
+# Installing standard packages and ubuntu-server packages
+# More details about ubuntu standard packages:
+# https://packages.ubuntu.com/bionic/ubuntu-standard
+# More details about ubuntu-server packages:
+# https://packages.ubuntu.com/bionic/ubuntu-server
+tasksel tasksel/first multiselect standard, ubuntu-server
+
+# openssh-server is required for communicating with Packer
+# build-essential has standard compiling tools, could be removed
+d-i pkgsel/include string openssh-server build-essential
+# No package upgrade
+d-i pkgsel/upgrade select none
+
+# Updating packages automatically is unnecessary
+d-i pkgsel/update-policy select none
+
+# Choosing not to report installed software to some servers
+popularity-contest popularity-contest/participate boolean false
+
+# Installing grub
+d-i grub-installer/only_debian boolean true
+
+# Specifying the device on which to install grub
+d-i grub-installer/bootdev  string /dev/sda
+
+# Installing to the first device (assuming it is not a USB stick)
+d-i grub-installer/bootdev  string default
+
+# Answering the prompt saying the installation is finished
+d-i finish-install/reboot_in_progress note
+
+# Answering the prompt saying no bootloader is installed
+# This will appear if grub is not installed
+nobootloader nobootloader/confirmation_common note
diff --git a/src/npb/disk-image/shared/serial-getty@.service b/src/npb/disk-image/shared/serial-getty@.service
new file mode 100644
index 0000000..b0424f0
--- /dev/null
+++ b/src/npb/disk-image/shared/serial-getty@.service
@@ -0,0 +1,46 @@
+#  SPDX-License-Identifier: LGPL-2.1+
+#
+#  This file is part of systemd.
+#
+#  systemd is free software; you can redistribute it and/or modify it
+#  under the terms of the GNU Lesser General Public License as published by
+#  the Free Software Foundation; either version 2.1 of the License, or
+#  (at your option) any later version.
+
+[Unit]
+Description=Serial Getty on %I
+Documentation=man:agetty(8) man:systemd-getty-generator(8)
+Documentation=http://0pointer.de/blog/projects/serial-console.html
+BindsTo=dev-%i.device
+After=dev-%i.device systemd-user-sessions.service plymouth-quit-wait.service getty-pre.target
+After=rc-local.service
+
+# If additional gettys are spawned during boot then we should make
+# sure that this is synchronized before getty.target, even though
+# getty.target didn't actually pull it in.
+Before=getty.target
+IgnoreOnIsolate=yes
+
+# IgnoreOnIsolate causes issues with sulogin, if someone isolates
+# rescue.target or starts rescue.service from multi-user.target or
+# graphical.target.
+Conflicts=rescue.service
+Before=rescue.service
+
+[Service]
+# The '--autologin root' option tells agetty to log in as root automatically,
+# without prompting for a username or password, so that /root/.bashrc runs
+# on the serial console after boot.
+ExecStart=-/sbin/agetty --autologin root --keep-baud 115200,38400,9600 %I $TERM
+Type=idle
+Restart=always
+UtmpIdentifier=%I
+TTYPath=/dev/%I
+TTYReset=yes
+TTYVHangup=yes
+KillMode=process
+IgnoreSIGPIPE=no
+SendSIGHUP=yes
+
+[Install]
+WantedBy=getty.target