Profiling MPI Applications

Content experts: Rupak Roy, Xiao Zhu

Ingredients

This section lists the hardware and software tools used for the performance analysis scenario.

Application: heart_demo sample application
Tools:
- Intel® C++ Compiler
- Intel® MPI Library 2021.11
- Intel VTune Profiler 2024.0 or newer
- Intel VTune Profiler - Application Performance Snapshot
NOTE:
- Get a free download of the Intel MPI Library from https://www.intel.com/content/www/us/en/developer/tools/oneapi/mpi-library.html.
- Download the latest version of VTune Profiler from the product download page.
Operating system: Linux*
CPU: Intel® Xeon® Platinum 8480+ Processor (formerly code named Sapphire Rapids)

Build the Application

Build your application with debug symbols so Intel VTune Profiler can correlate performance data with your source code and assembly.

Clone the application GitHub repository to your local system:
```
git clone https://github.com/CardiacDemo/Cardiac_demo.git
```

Set up the Intel C++ Compiler and Intel MPI Library environment:

 
source <compiler_install_dir>/oneapi/setvars.sh

In the root level of the sample package, create a build directory and open it:
```
mkdir build
cd build
```

Build the application:

mpiicpx ../heart_demo.cpp ../luo_rudy_1991.cpp ../rcm.cpp ../mesh.cpp -g -o heart_demo -O3 -std=c++17 -qopenmp -parallel-source-info=2

The executable heart_demo should be present in the current directory.

Establish Overall Performance Characteristics

Start tuning your MPI application by examining a snapshot of its performance, collected by Application Performance Snapshot in VTune Profiler. With this snapshot, you can understand the general properties of your application. Then focus on problematic areas using appropriate tools.

We begin by preparing a performance snapshot on a set of dual-socket nodes using the Intel® Xeon® Scalable processor (code named Sapphire Rapids). This example uses the Intel® Xeon® Platinum 8480+ Processor with 24 cores per socket. For this system, we configure the run with 4 MPI ranks per node and 12 threads per rank, so that 4 x 12 = 48 threads occupy the 48 cores of each node. Modify the rank and thread counts in this example to match your own system specification.

To obtain a performance snapshot on four nodes, run this command in an interactive session or in a batch script:

export OMP_NUM_THREADS=12
mpirun -np 16 -ppn 4 aps ./heart_demo -m ../mesh_mid -s ../setup_mid.txt -t 100
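The rank and thread counts above follow from the node layout. As a rough sketch of how you might derive them for a different system, the snippet below computes the values for -np and OMP_NUM_THREADS from a few variables. The variable names and the values assigned to them are illustrative assumptions; only the resulting commands come from this recipe.

```shell
# Illustrative values: adjust them to your own cluster.
NODES=4              # number of compute nodes
CORES_PER_NODE=48    # 2 sockets x 24 cores, as in this example
RANKS_PER_NODE=4     # MPI ranks per node (value for -ppn)

# Total ranks across the job and threads available to each rank.
TOTAL_RANKS=$((NODES * RANKS_PER_NODE))                # value for -np
THREADS_PER_RANK=$((CORES_PER_NODE / RANKS_PER_NODE))  # value for OMP_NUM_THREADS

# Print the commands instead of running them, so they can be reviewed first.
echo "export OMP_NUM_THREADS=${THREADS_PER_RANK}"
echo "mpirun -np ${TOTAL_RANKS} -ppn ${RANKS_PER_NODE} aps ./heart_demo -m ../mesh_mid -s ../setup_mid.txt -t 100"
```

With the values shown, the printed commands match the ones used in this recipe.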

When the analysis is complete, you can find profiling data in a directory named aps_result_YYYYMMDD, where YYYYMMDD is the date of collection.

For example, to produce a single-page HTML snapshot of the results collected on December 5, 2023, type:

aps --report ./aps_result_20231205

The aps_report_YYYYMMDD_<stamp>.html file is created in your working directory, where the <stamp> number is used to prevent overwriting existing reports. The report contains information on overall performance, including:

- MPI and OpenMP* imbalance
- Memory footprint and physical core utilization
- Floating-point throughput

A note at the top of the report highlights the main areas of concern for the application.

The snapshot indicates that this application is bound overall by MPI communication. The application also suffers from:

- OpenMP* imbalance
- Low physical core utilization
- Vectorization issues

The MPI Time section provides additional details, such as MPI imbalance and the top MPI function calls used. From this section, it appears that the code uses mainly point-to-point communication and that the imbalance is moderate.

This snapshot result points to complex issues in the code. To continue investigating the performance issues and isolate the problems, let us run the HPC Performance Characterization analysis in VTune Profiler next.

Configure and Run the HPC Performance Characterization Analysis

Most clusters are set up with login and compute nodes. Typically, a user connects to a login node and uses a scheduler to submit a job to the compute nodes, where it executes. In a cluster environment, the most practical way to profile an MPI application with VTune Profiler is to use the command line for data collection and, once the job has completed, the GUI for performance analysis.

To report MPI-related metrics in a distributed environment, type:

<mpi launcher> [options] vtune [options] -r <results dir> -- <application> [arguments]

NOTE:

- You can use the above command in an interactive session or include it in a batch submission script.
- You must specify the results directory for MPI applications.
- If you are not using the Intel MPI Library, add -trace-mpi to the above command.

Follow these steps to run the HPC Performance Characterization analysis in VTune Profiler from the command line:

Prepare your environment by sourcing the VTune Profiler files. For a default installation using the bash shell, use this command:
```
source /opt/intel/vtune_Profiler/vars.sh
```
Collect data for the heart_demo application using the hpc-performance analysis. The application uses both OpenMP and MPI. The run uses the configuration described earlier, with 16 MPI ranks over a total of 4 compute nodes using the Intel MPI Library. This example is run on Intel® Xeon® Platinum 8480+ Processors and uses 12 OpenMP threads per MPI rank:
```
export OMP_NUM_THREADS=12
mpirun -np 16 -ppn 4 vtune -collect hpc-performance -r vtune_mpi -- ./heart_demo -m ../mesh_mid -s ../setup_mid.txt -t 100
```
The analysis begins and generates four output directories using the following naming convention: vtune_mpi.<node host name>.

NOTE:

You can select specific MPI ranks to be profiled while running others simultaneously, but without collecting profiling data. For details, see Selective MPI Rank Profiling.

Analyze Results using the Intel VTune Profiler GUI

Open one of the collected results in the VTune Profiler user interface:

vtune-gui ./vtune_mpi.node_1

NOTE:

To display the Intel VTune Profiler GUI, you need an X11 manager running on the local system or a VNC session connected to the system. Since each system is different, consult with your local administrator for a recommended method.

The result opens in Intel VTune Profiler and shows the Summary window. This window provides an overview of the application performance. Because heart_demo is an MPI parallel application, the Summary window shows MPI Imbalance information and details regarding the MPI rank in the execution critical path in addition to the usual metrics.

- MPI Imbalance is the average MPI busy wait time across all ranks on the node. The value indicates how much time could be saved if the balance were ideal.
- MPI Rank on the Critical Path is the rank with minimal busy wait time.
- MPI Busy Wait Time and Top Serial Hotspots are shown for the rank on the critical path. This is a good way to identify severe deficiencies in scalability, since they typically correlate with high imbalance or busy wait metrics. Significant MPI Busy Wait Time for the rank on the critical path in a multi-node run could imply that the outlier rank is on a different node.

In our example, there is some imbalance and also a significant amount of time spent in serial regions of the code (not shown in the figure).

While you can collect profiles across nodes, the only way to view all MPI data is to load each node result independently. For detailed MPI traces, use Intel® Trace Analyzer and Collector.

In Intel VTune Profiler 2024.0 (and newer versions), the Summary window contains histograms of bandwidth utilization. The metrics show bandwidth and packet rate and indicate the percentage of the execution time for which the code was bound by high bandwidth or packet rate utilization. The histogram shows a maximum DRAM bandwidth utilization of 6 GB/s, which is low. This tells us that there is still room for improvement.

Switch to the Bottom-up tab to get more details. Set the Grouping to have Process at the top level. You should see this view:

Since this code uses both MPI and OpenMP, the Bottom-up window shows metrics related to both runtimes, in addition to the CPU and memory data. In our example, the OpenMP* Imbalance metric is highlighted in red. This hints that threading improvements could help performance.

Review the execution timeline for several metrics at the bottom of the Bottom-up window, including DDR and MCDRAM bandwidth, as well as CPU time. The UPI bandwidth timeline for this code shows continuous utilization at a moderate bandwidth (the scale is in GB/s).

Of more interest is the detailed execution time per thread and the breakdown of these metrics:

- Effective Time
- Spin and Overhead Time
- MPI Busy Wait Time

The default view uses the Super Tiny settings to show all processes and threads together in a visual map of performance.

In this case you should see that there is little effective time in most of the threads (green) and that the amount of MPI overhead is also small (yellow). This points to potential issues in the threading implementation.

To investigate this further:

1. Right-click on the grey area to the left of the graph.
2. Select the Rich view for the band height.
3. To the right of the graph, group results by Process/Thread.

This grouping clarifies the role of each MPI rank and each thread. The top bar for each process shows the average result for all child threads. Below that average, each thread is listed with its own thread number and process ID.

In our example, the primary thread takes care of all MPI communication for each MPI rank. This behavior is common in hybrid applications. A significant amount of time is spent in MPI communication (yellow) in the first ten seconds of the execution, likely to set up the problem and distribute data. After that period, there is regular MPI communication, which matches the results observed in the Bandwidth Utilization timeline and the Summary report.

The high amount of spin and overhead (shown in red by default) is noticeable. This indicates issues with the way threading was implemented in the application.

1. At the top of the Bottom-up window, group the data by OpenMP Region / Thread / Function / Call Stack.
2. Apply the filter at the bottom of the window to show Functions only.
3. Expand the tree to see that the function init_send_bufs is only called by thread 0 and is responsible for the low performance observed.
4. Double-click on a line to open the source code viewer.

Generate a Command Line Run from the Intel VTune Profiler GUI (optional)

You can configure an analysis in Intel VTune Profiler using the GUI and then save the equivalent command to run the analysis directly from the command line. Use this feature for heavily customized profiles or for quickly building a complex command.

1. Open Intel VTune Profiler.
2. Click New Project or open an existing project.
3. Click Configure Analysis.
4. In the Where pane, select Arbitrary Host (not connected) and specify the hardware platform.
5. In the What pane:
   1. Specify the application.
   2. Set the parameters and working directory.
   3. Select the Use MPI launcher option and provide information related to the MPI run.
   4. [Optional] Choose particular ranks to profile.
6. In the How pane, change the default Hotspots analysis to HPC Performance Characterization. Customize the available options.
7. Click the Command Line button at the bottom of the window. A pop-up window displays the equivalent command to run for the customized analysis you just configured in the GUI. You can add more MPI options to complete the command.

NOTE:

For Intel MPI, the command line is generated in terms of the -gtool option. Use this option to simplify selective rank profiling syntax.

Analyze Results with a Command Line Report (optional)

Intel VTune Profiler provides informative command line text reports. For example, to obtain a summary report, run:

vtune -report summary -r ./results_dir

A summary of the results prints to the screen. Options to save the output directly to a file and in other formats (csv, xml, html) are also available. For details on the full command line options, type vtune -help in the command line or see Intel® VTune™ Profiler Command Line Interface.
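Because a multi-node collection produces one result directory per node, the summary report has to be generated once per directory. The following is a minimal sketch of such a loop. It assumes the vtune_mpi.<node host name> naming used in this recipe and an environment where vtune is already on the PATH. The function name report_all_nodes and the .summary.txt suffix are illustrative choices, not part of VTune Profiler.

```shell
# Generate a text summary for every per-node result directory.
# $1 is the results prefix passed to -r during collection (for example, vtune_mpi).
report_all_nodes() {
    prefix="$1"
    for result_dir in "${prefix}".*/; do
        # If the pattern matched nothing, the literal pattern is not a directory: skip it.
        [ -d "$result_dir" ] || continue
        result_dir="${result_dir%/}"   # drop the trailing slash added by the glob
        echo "=== ${result_dir} ==="
        # Redirect the report printed on screen into a file next to the result.
        vtune -report summary -r "$result_dir" > "${result_dir}.summary.txt"
    done
}

report_all_nodes vtune_mpi
```

Each summary lands in a file named after its result directory, which makes it easy to compare nodes side by side with standard text tools.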

Selective Code Area Profiling (optional)

By default, Intel VTune Profiler collects performance statistics for the whole application. Versions 2019.3 and newer of Intel VTune Profiler let you control data collection for MPI applications. This capability has several advantages:

- You can generate smaller result files.
- Result files are processed more quickly.
- You can focus completely on a region of interest.

The region selection process is done using the standard MPI_Pcontrol function. Call MPI_Pcontrol(0) to pause data collection and call MPI_Pcontrol(1) to resume it again.

You can use the API together with the command line option -start-paused to exclude the application initialization phase. In this case, an MPI_Pcontrol(1) call should follow right after initialization to resume data collection. This method of controlling collection requires no changes in the application building process, unlike using ITT API calls, which require linking a static ITT API library.
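As a sketch of how -start-paused might fit into the collection command used earlier in this recipe, the snippet below assembles the command and prints it for review. It assumes that heart_demo has been modified to call MPI_Pcontrol(1) after its initialization phase, which the sample does not necessarily do out of the box. The results directory name vtune_mpi_paused and the variable names are hypothetical.

```shell
# Assumption: the application calls MPI_Pcontrol(1) after initialization;
# without that call, collection would stay paused for the whole run.
RESULT_DIR="vtune_mpi_paused"   # hypothetical results directory name

# Same launch as the earlier hpc-performance collection, plus -start-paused.
COLLECT_CMD="mpirun -np 16 -ppn 4 vtune -collect hpc-performance -start-paused -r ${RESULT_DIR} -- ./heart_demo -m ../mesh_mid -s ../setup_mid.txt -t 100"

# Print the command so it can be checked or pasted into a batch script.
echo "$COLLECT_CMD"
```

Remember to set OMP_NUM_THREADS as before when you run the printed command.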

Additional Resources

Product and Performance Information
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex. Notice revision #20201201
