For more complete information on NetPIPE, visit the webpage at:

    http://www.scl.ameslab.gov/Projects/NetPIPE/

NetPIPE was originally developed by Quinn Snell, Armin Mikler,
John Gustafson, and Guy Helmer.  It is currently being developed and
maintained by Dave Turner with help from several graduate students
(Xuehua Chen, Adam Oline, Brian Smith, Bogdan Vasiliu).

Release 3.6.2 mainly fixes some bugs.  A number of portability issues
with 64-bit architectures were taken care of, especially in the
Infiniband module.  A small typecasting error was fixed that caused
segmentation faults on Red Hat Enterprise and Fedora Core systems
(and probably others).  The bi-directional mode was also tested with
the Infiniband module, and a subset of NetPIPE options is now supported
there.

Release 3.6.1 adds a bi-directional (-2) mode to allow data to be sent
in both directions simultaneously.  This has been tested with the TCP,
MPI, MPI-2, and GM modules.  You can also now test synchronous MPI
communications (MPI_SSend/MPI_SRecv) using -S.  A launch utility
(nplaunch) allows you to launch NPtcp, NPgm, NPib, and NPpvm from one
side, using ssh to start the remote executable.

Version 3.6 adds the ability to test with and without cache effects,
and the ability to offset both the source and destination buffers.
A memcpy module has also been added.

Release 3.5 removes the CPU utilization measurements.  Getrusage is
probably not very accurate, so a dummy workload will eventually be
used instead.  The streaming mode has also been fixed.  When run at
Gigabit speeds, the TCP window size would collapse, limiting the
performance of subsequent data points.  The sockets are now reset
between trials to prevent this.  A module to evaluate memory copy
rates has also been added.  -n now sets a constant number of repeats
for each trial, and -r resets the sockets between each trial
(automatic for streaming).

Release 3.3 includes an Infiniband module for the Mellanox VAPI.
It also has an integrity check (-i), which is still being developed.

Version 3.2 includes additional modules to test PVM, TCGMSG, SHMEM,
and MPI-2, as well as the GM, GPSHMEM, ARMCI, and LAPI software layers
they run upon.

If you have problems or comments, please email netpipe@scl.ameslab.gov

____________________________________________________________________________

  NetPIPE -- Network Protocol Independent Performance Evaluator, Release 2.3
  Copyright 1997, 1998 Iowa State University Research Foundation, Inc.

  This program is free software; you can redistribute it and/or modify it
  under the terms of the GNU General Public License as published by the
  Free Software Foundation.  You should have received a copy of the GNU
  General Public License along with this program; if not, write to the
  Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
____________________________________________________________________________


Building NetPIPE
----------------

NetPIPE requires an ANSI C compiler.  You are on your own for installing
the various libraries that NetPIPE can be used to test.

Review the provided makefile and change any necessary settings, such as
the CC compiler or CFLAGS flags, any required extra libraries, and the
PVM library and include-file pathnames if you have these communication
libraries.  Alternatively, you can specify these changes on the make
command line.  The line below would compile the NPtcp module using the
icc compiler instead of the default cc compiler:
    make CC=icc tcp

Compile NetPIPE with the desired communication interface by using:

    make mpi       (this will use the default MPI on the system)
    make pvm       (you may need to set some paths in the makefile)
    make tcgmsg    (you will need to set some paths in the makefile)
    make mpi2      (this will test 1-sided MPI_Put() functions)
    make shmem     (1-sided library for Cray and SGI systems)
    make tcp
    make tcp6      (for IPv6 enabled systems)
    make gm        (for Myrinet cards, you will need to set some paths)
    make gpshmem   (SHMEM interface for other machines)
    make armci     (still under development)
    make lapi      (for the IBM SP)
    make ib        (for Mellanox Infiniband adapters, uses the VAPI layer)
    make memcpy    (uses memcpy to copy data between buffers in 1 process)
    make MP_memcpy (uses an optimized copy in MP_memcpy.c to copy data
                    between buffers; this requires icc or gcc 3.x)


Running NetPIPE
---------------

NetPIPE dumps its output to the screen by default, and also to the file
np.out.  The following parameters can be used to change how NetPIPE is
run, and are listed in order of their general usefulness.

  -b: specify the send and receive TCP buffer sizes, e.g. "-b 32768"
      This can make a huge difference for Gigabit Ethernet cards.  You
      may need to tune the OS to set a larger maximum TCP buffer size
      for optimal performance (see the example after the TCP6 section
      below).
  -O: specify send and optionally receive buffer offsets, e.g. "-O 1,3"
  -l: lower bound (start value for the block size), e.g. "-l 1"
  -u: upper bound (stop value for the block size), e.g. "-u 1048576"
  -o: specify the output filename, e.g. "-o output.txt"
  -z: for MPI, receive messages using ANYSOURCE
  -g: MPI-2: use MPI_Get() instead of MPI_Put()
  -f: MPI-2: do not use a fence call (may not work for all packages)
  -I: Invalidate cache: take measures to eliminate the effects the
      cache has on performance.
  -a: asynchronous receive (a.k.a. pre-posted receive).
      May not have any effect, depending on your implementation.
  -B: burst all preposts before measuring performance.
      Normally only one receive is preposted at a time with -a.
  -p: set the perturbation offset of the buffer size, e.g. "-p 3"
  -i: Integrity check: check the integrity of the data transfer
      instead of measuring performance.
  -s: stream option (the default mode is "ping pong").
      If this option is used, it must be specified on both the sending
      and receiving processes.
  -S: use synchronous sends/receives for MPI.
  -2: bi-directional communications; transmit in both directions
      simultaneously.
  -P: set the port number used by TCP to something other than the default.


TCP
---

  Compile NetPIPE using 'make tcp'

  remote_host> NPtcp [options]

  local_host> NPtcp -h remote_host [options]
OR
  local_host> nplaunch NPtcp -h remote_host [options]


TCP6
----

  Compile NetPIPE using 'make tcp6'

  remote_host> NPtcp6 [options]

  local_host> NPtcp6 -h remote_host [options]
OR
  local_host> nplaunch NPtcp6 -h remote_host [options]
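
  As an illustration of the OS tuning mentioned for the -b option above,
  on a Linux system you might raise the kernel's maximum socket buffer
  sizes before running the TCP test.  The sysctl names below are Linux
  specific, the values are only examples, and other operating systems
  use different mechanisms:

    root@remote_host> sysctl -w net.core.rmem_max=8388608
    root@remote_host> sysctl -w net.core.wmem_max=8388608
    root@local_host>  sysctl -w net.core.rmem_max=8388608
    root@local_host>  sysctl -w net.core.wmem_max=8388608

    remote_host> NPtcp -b 4194304 [other options]
    local_host>  NPtcp -b 4194304 -h remote_host [other options]

  The -b value must fit within the kernel's maximum, so raise the
  maximum first and then experiment with the buffer size.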

MPICH
-----

  Install MPICH.

  Compile NetPIPE using 'make mpi'

  Use a p4pg file, or edit the mpich/util/mach/mach.{ARCH} file, to
  specify the machines to run on.

  mpirun [-nolocal] -np 2 NPmpi [options]

  'setenv P4_SOCKBUFSIZE 256000' can make a huge difference for MPICH
  on Unix systems.


LAM/MPI (comes on the RedHat Linux distributions now)
-------

  Install LAM.

  Compile NetPIPE using 'make mpi'

  Put the machine names into a lamhosts file.

  'lamboot -v -b lamhosts' to start the lamd daemons.

  mpirun -np 2 [-O] NPmpi [options]

  The -O parameter avoids data translation for homogeneous systems.


MPI/Pro (commercial version)
-------

  Install MPI/Pro.

  Compile NetPIPE using 'make mpi'

  Put the machine names into /etc/machines or a local machine file.

  mpirun -np 2 NPmpi [options]


MP_Lite (a lightweight version of MPI)
-------

  Install MP_Lite (http://www.scl.ameslab.gov/Projects/MP_Lite/).

  Compile NetPIPE using 'make MP_Lite'

  mprun -np 2 -h {host1} {host2} NPmplite [options]


PVM
---

  Install PVM (comes on the RedHat distributions now).

  Set the PVM paths in the makefile if necessary.

  Compile NetPIPE using 'make pvm'

  Use the 'pvm' utility to start the pvmd daemons:

    type 'pvm' to start it (this will also start pvmd on the local_host)

    pvm> help             --> lists all commands
    pvm> add remote_host  --> will start a pvmd on the machine called
                              'remote_host'
    pvm> quit             --> when you have all the pvmd machines started

  remote_host> NPpvm [options]

  local_host> NPpvm -h remote_host [options]
OR
  local_host> nplaunch NPpvm -h remote_host [options]

  Changing PVMDATA in netpipe.h and PvmRouteDirect in pvm.c can affect
  the performance greatly.


TCGMSG (unlikely anyone will try this who doesn't know TCGMSG well)
------

  Install the TCGMSG package.

  Set the TCGMSG paths in the makefile.

  Compile NetPIPE using 'make tcgmsg'

  Create an NPtcgmsg.p file with hosts and paths (see hosts/NPtcgmsg.p).

  parallel NPtcgmsg        (no options can be passed into this version)


MPI-2
-----

  Install the MPI package.

  Compile NetPIPE using 'make mpi2'

  Follow the directions for running the MPI package from above.

  The MPI_Put() function will be tested with fence calls by default.
  Use -g to test MPI_Get() instead, or -f to do MPI_Put() without
  fence calls (this will not work with LAM).


SHMEM
-----

  Must be run on a Cray or SGI system that supports SHMEM calls.

  Compile NetPIPE using 'make shmem'

  (Xuehua, fill out the rest)


GPSHMEM (a General Purpose SHMEM library)    (gpshmem.c in development)
-------

  Ask Ricky or Krzysztof for help :).


GM (test the raw performance of GM on Myrinet cards)
--

  Install the GM package and configure the Myrinet cards.

  Compile NetPIPE using 'make gm'

  remote_host> NPgm [options]

  local_host> NPgm -h remote_host [options]
OR
  local_host> nplaunch NPgm -h remote_host [options]


LAPI
----

  Log into an IBM SP machine at NERSC.

  Compile NetPIPE using 'make lapi'

  To run interactively at NERSC:

    Set the environment variable MP_MSG_API to lapi,
    e.g. 'setenv MP_MSG_API lapi' or 'export MP_MSG_API=lapi'.

    Run NPlapi with '-procs 2' to tell the parallel environment you want
    2 nodes.  Use any other options that are applicable to NetPIPE.

  To submit a batch job at NERSC:

    Copy the file batchLapi from the 'hosts' directory to the directory
    containing NPlapi.

    Edit the copy of batchLapi:

      job_name:        identifying name of the job, can be anything
      output:          file to send stdout to
      error:           file to send stderr to (most of NetPIPE's output
                       will go here)
      tasks_per_node:  number of tasks to be run on each node
      node:            number of nodes to run on

      (Use a combination of the above two options to determine how
      NetPIPE runs.  Use 1 task per node and 2 nodes to run the benchmark
      between nodes.  Use 2 tasks per node and 1 node to run the benchmark
      on a single node.)

      Use whatever command-line options are appropriate for NetPIPE.

    Submit the job with the command 'llsubmit batchLapi'.

    Check the status of all your jobs with 'llqs -u <username>'.

    You should receive an email when the job finishes.  The resulting
    output files will then be available.

ARMCI
-----

  Install the ARMCI package.

  Compile NetPIPE using 'make armci'

  Follow the directions for running the MPI package from above.

  If running on interfaces other than the default, create a file called
  armci_hosts, containing two lines, one for each hostname, then run the
  package.


Infiniband
----------

  This test will only work on machines connected via TCP/IP as well as
  Infiniband.

  Install the Mellanox Infiniband adapters and software.

  Make sure the adapters are up and running (e.g., check that the
  Mellanox-supplied bandwidth/latency program perf_main works, if you
  have it).

  Compile NetPIPE using 'make ib'
  (The environment variable MTHOME needs to be set to the directory
  containing the include and lib directories for the Mellanox software.)

  remote_host> NPib [options]

  local_host> NPib -h remote_host [options]
OR
  local_host> nplaunch NPib -h remote_host [options]

  (remote_host should be the IP address or hostname of the other host)

  Other options:

    Use -m to select the MTU size for the Infiniband adapter.
    Valid values are 256, 512, 1024, 2048, and 4096.  The default is 1024.

    Use -t to select the communications type.  Possible values are:
      send_recv:            basic send and receive
      send_recv_with_imm:   send and receive with immediate data
      rdma_write:           one-sided remote DMA write
      rdma_write_with_imm:  one-sided remote DMA write with immediate data
    The default is send_recv.

    Use -c to select the message completion type.  Possible values are:
      local_poll:  poll on the last byte of the receive buffer
      vapi_poll:   use the VAPI polling mechanism
      event:       use the VAPI event completion mechanism
    The default is local_poll.


Interpreting the Results
------------------------

NetPIPE generates a np.out file by default, which can be renamed using
the -o option.  This file contains 3 columns: the number of bytes, the
throughput in Mbps, and the round-trip time divided by two (in seconds).
The first 2 columns can therefore be used to produce a throughput vs
message size graph (see the plotting example after the sample output
below).  The screen output contains this same information, plus the
test number and the number of ping-pongs involved in the test.

  > more np.out
  1       0.136403   0.00005593
  2       0.274586   0.00005557
  3       0.402104   0.00005692
  4       0.545668   0.00005593
  6       0.805053   0.00005686
  8       1.039586   0.00005871
  12      1.598912   0.00005726
  13      1.700719   0.00005832
  16      2.098007   0.00005818
  19      2.340364   0.00006194
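
  For example, if you have gnuplot available (gnuplot is not part of
  NetPIPE; any plotting package that reads whitespace-separated columns
  will do), a quick throughput vs message size plot can be produced with:

    > gnuplot
    gnuplot> set logscale x
    gnuplot> set xlabel "Message size (bytes)"
    gnuplot> set ylabel "Throughput (Mbps)"
    gnuplot> plot "np.out" using 1:2 with linespoints

  Plotting column 3 against column 1 in the same way gives a latency vs
  message size graph instead.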

Invalidating Cache
------------------

The -I switch can be used to reduce the effects the cache has on
performance.  Without the switch, NetPIPE tests the performance of
communicating n-byte blocks by reading from an n-byte buffer on one
node, sending the data over the communications link, and writing to an
n-byte buffer on the other node.  For each block size, this trial is
repeated x times, where x typically starts out very large for small
block sizes and decreases as the block size grows.  The same buffers on
each node are used repeatedly, so after the first transfer the entire
buffer will be in cache on each node, provided the block size is smaller
than the available cache.  Thus each transfer after the first will be
read from cache on one end and written into cache on the other.
Depending on the cache architecture, a write to main memory may not
occur on the receiving end during the transfer loop.

While the performance measurements obtained from this method are
certainly useful, it is also interesting to use the -I switch to measure
performance when data is read from and written to main memory.  In order
to facilitate this, large pools of memory are allocated at startup, and
each n-byte transfer comes from a region of the pool that is not in
cache.

Before each series of n-byte transfers, every byte of a large dummy
buffer is individually accessed in order to flush the data for the
transfer out of cache.  After this step, the first n-byte transfer comes
from the beginning of the large pool, the second comes from n bytes
after the beginning of the pool, and so on (note that the stride between
n-byte transfers will depend on the buffer alignment setting).  In this
way we make sure each read is coming from main memory.  (A minimal
sketch of this general approach appears at the end of this file.)

On the receiving end, data is written into a large pool in the same
fashion that it was read on the transmitting end.  Data will first be
written into cache.  What happens next depends on the cache
architecture, but one possibility is that no transfer to main memory
occurs yet.  For moderately large block sizes, however, a large number
of transfer iterations will cause reuse of cache memory.  As this
occurs, data in the cache location to be replaced must be written back
to main memory, so we incur a performance penalty while we wait for the
write.

In summary, using the -I switch gives worst-case performance (i.e. all
data transfers involve reading from or writing to memory that is not in
cache) and not using the switch gives best-case performance (i.e. all
data transfers involve only reading from or writing to memory in cache).
Note that other combinations, such as reading from memory in cache and
writing to memory not in cache, would give intermediate results.  We
chose to implement the methods that measure the two extremes.


Changes needed
--------------

- We need to replace the getrusage stuff from version 2.4 with a dummy
  workload.
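
To make the cache-invalidation discussion above more concrete, here is a
minimal, simplified C sketch of the general approach (walk a dummy
buffer to evict cached data, then stride through a large pool so each
transfer reads cold memory).  This is only an illustration, not
NetPIPE's actual implementation; the buffer sizes are arbitrary and
send_block() is a placeholder for the real communication call.

  #include <stdlib.h>
  #include <string.h>

  #define POOL_SIZE   (64 * 1024 * 1024)  /* large pool the transfers stride through */
  #define FLUSH_SIZE  (16 * 1024 * 1024)  /* dummy buffer larger than any cache      */

  /* Touch every byte of a dummy buffer so previously cached data is evicted. */
  static void flush_cache(volatile char *dummy, size_t size)
  {
      size_t i;
      for (i = 0; i < size; i++)
          dummy[i] = (char)i;
  }

  /* Placeholder for the actual communication call (TCP, MPI, ...). */
  static void send_block(const char *buf, size_t nbytes)
  {
      (void)buf;
      (void)nbytes;   /* a real benchmark would transmit the block here */
  }

  int main(void)
  {
      char  *pool    = malloc(POOL_SIZE);
      char  *dummy   = malloc(FLUSH_SIZE);
      size_t nbytes  = 4096;     /* current block size under test  */
      size_t repeats = 1000;     /* repetitions for this block size */
      size_t offset  = 0, i;

      if (pool == NULL || dummy == NULL)
          return 1;
      memset(pool, 1, POOL_SIZE);

      /* Evict the pool from cache before the series of n-byte transfers. */
      flush_cache((volatile char *)dummy, FLUSH_SIZE);

      /* Each transfer reads from a different, cold region of the pool. */
      for (i = 0; i < repeats; i++) {
          send_block(pool + offset, nbytes);
          offset += nbytes;                 /* stride forward by n bytes */
          if (offset + nbytes > POOL_SIZE)
              offset = 0;                   /* simplification: wrap when the pool is used up */
      }

      free(pool);
      free(dummy);
      return 0;
  }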