

# USING INTEL® VTUNE™ AMPLIFIER FOR PROFILING AND OPTIMIZATIONS

Wang Yang, Intel IAGS CPDP, August 2019





# Intel® VTune™ Amplifier – Tool Suite Options



### INTEL® PARALLEL STUDIO

**PROFESSIONAL & CLUSTER EDITIONS** 

Improve performance, scalability, & reliability for parallel applications



intel.ly/parallel-studio-xe



### INTEL® SYSTEM STUDIO

PROFESSIONAL & ULTIMATE EDITIONS

Develop IoT/embedded & system solutions & apps faster



intel.ly/system-studio

Free/discounted versions are available for Students & Academia



### INTEL® MEDIA SERVER STUDIO

PROFESSIONAL EDITION

Deliver fast, high density & quality media/video processing



intel.ly/intel-media-server-studio

### INTEL® VTUNE™ AMPLIFIER AVAILABLE INDIVIDUALLY

**Analyze & Tune Application Performance & Scalability** 

Provides deep insight that saves time optimizing code



intel.ly/vtune-amplifier-xe

### What's Inside Intel® Parallel Studio XE

Comprehensive Software Development Tool Suite

### **COMPOSER EDITION**

#### **BUILD**

Compilers & Libraries

C / C++, Inte

Intel® Math Kernel Library

Intel® Data Analytics Acceleration Library

Intel® Threading Building Blocks
C++ Threading

Intel® Integrated Performance Primitives
Image, Signal & Data Processing

Intel® Distribution for Python\*
High Performance Python

### PROFESSIONAL EDITION

#### **ANALYZE**

Analysis Tools

Intel® VTune™ Amplifier
Performance Profiler

Intel® Inspector
Memory, Thread & Persistence Debugger

Intel® Advisor
Vectorization Optimization, Thread Prototyping & Flow Graph Analysis

### **CLUSTER EDITION**

#### **SCALE**

**Cluster Tools** 

Intel® MPI Library
Message Passing Interface Library

Intel® Trace Analyzer & Collector
MPI Tuning & Analysis

Intel® Cluster Checker
Cluster Diagnostic Expert System

Operating System: Windows\*, Linux\*, macOS\*

Intel® Architecture Platforms

Compilers







## Analyze & Tune Application Performance

Intel® VTune™ Amplifier—Performance Profiler



Learn More: software.intel.com/intel-vtune-amplifier-xe

### Save Time Optimizing Code

- Accurately profile C, C++, Fortran\*, Python\*, Go\*, Java\*, or any mix
- Optimize CPU, threading, memory, cache, storage & more
- Save time: rich analysis leads to insight
- Take advantage of Priority Support
  - Connects customers to Intel engineers for confidential inquiries (paid versions)

### What's New in 2019 Release (partial list)

- New Platform Profiler! Longer Data Collection
- A more accessible user interface provides a simplified profiling workflow
- Smarter, faster Application Performance Snapshot: Analyze
   CPU utilization of physical cores, pause/resume, more... (Linux\*)
- Improved JIT profiling for server-side/cloud applications



# Rich Set of Profiling Capabilities for Multiple Markets

Intel® VTune™ Amplifier









#### Single Thread

Optimize single-threaded performance.

#### Multithreaded

Effectively use all available cores.

#### System

See a system-level view of application performance.

#### Media & OpenCL™ Applications

Deliver high-performance image and video processing pipelines.





#### HPC & Cloud

Access specialized, in-depth analyses for HPC and cloud computing.



#### Memory & Storage Management

Diagnose memory, storage, and data plane bottlenecks.



#### Analyze & Filter Data

Mine data for answers.

#### Environment

Fits your environment and workflow.

### Find Answers Fast

Intel® VTune™ Amplifier

### Adjust Data Grouping

Function - Call Stack

Module - Function - Call Stack

Source File - Function - Call Stack

Thread - Function - Call Stack

... (Partial list shown)

Double Click Function to View Source

Click [▶] for Call Stack

Filter by Timeline Selection (or by Grid Selection)





Filter by Process & Other Controls

Tuning Opportunities Shown in Pink. Hover for Tips

# See Profile Data On Source / Asm

Double Click from Grid or Timeline

View Source / Asm or both

CPU Time

Right click for instruction reference manual



Scroll Bar "Heat Map" is an overview of hot spots

Click jump to scroll Asm

### Timeline Visualizes Thread Behavior

Intel® VTune™ Amplifier



Optional: Use the API to mark frames and user tasks



Optional: Add a mark during collection



# Intel® VTune™ Amplifier 2019

### Easier Setup, More Intelligible Results

### Fresh, Accessible Analysis Setup

- Simplified workflow
- More familiar terminology
- More logical groupings

### **Performance Insights**

Suggestions for further analysis

### Improved Displays

New hardware pipeline display



### Hotspots Insights

If you see significant hotspots in the Top Hotspots list, switch to the Bottom-up view for in-depth analysis per function. Otherwise, use the Caller/Callee view to track critical paths for these hotspots.

#### **Explore Additional Insights**

Parallelism: 17.8% (15.622 out of 88 logical CPUs)

Use Concurrency to explore more opportunities to increase parallelism in your application.



# Intel® VTune™ Amplifier Performance Profiler 2019

The new workflow is easier to learn and simplifies setup



# Hotspots Analysis – Your first step for optimization

#### User-Mode Sampling

Use this mode for:

- Profiles longer than a few seconds
- Profiling a single process or a process tree
- Profiling Python and Intel runtimes

#### Hardware Event-Based Sampling

Use this mode for:

- Profiles shorter than a few seconds
- Profiling all processes on a system, including the kernel



### Hotspots

Identify the most time-consuming functions and drill down to see time spent on each line of source code. Focus optimization efforts on hot code for the greatest performance impact.

- User-Mode Sampling
- Hardware Event-Based Sampling
- CPU sampling interval, ms: 1
- Collect stacks
- ✓ Show additional performance insights



# Case Study: Hotspot Analysis

- This case is about Adler32 checksum calculation
- A developer who works on storage applications needs to calculate the Adler32 checksum for big files
- An open-source Adler32 implementation (zlib) was used: https://github.com/madler/zlib/blob/master/adler32.c
- Run the VTune "Advanced Hotspots" analysis for the sample application and investigate the result

```
/* initial Adler-32 value (deferred check for len == 1 speed) */
if (buf == Z_NULL)
    return 1L;

/* in case short lengths are provided, keep it somewhat fast */
if (len < 16) {
    while (len--) {
        adler += *buf++;
        sum2 += adler;
    }
    if (adler >= BASE)
        adler -= BASE;
    MOD28(sum2);            /* only added so many BASE's */
    return adler | (sum2 << 16);
}

/* do length NMAX blocks -- requires just one modulo operation */
while (len >= NMAX) {
    len -= NMAX;
    n = NMAX / 16;          /* NMAX is divisible by 16 */
    do {
        DO16(buf);          /* 16 sums unrolled */
        buf += 16;
    } while (--n);
    MOD(adler);
    MOD(sum2);
}

/* do remaining bytes (less than NMAX, still just one modulo) */
if (len) {                  /* avoid modulos if none remaining */
```
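For reference, the checksum that the zlib excerpt above computes can be written in a few lines. The following is a minimal, unoptimized sketch; the name `adler32_simple` is made up for illustration and is not part of zlib or IPP. It applies the modulo after every byte, whereas zlib defers the modulo across blocks of NMAX bytes and unrolls the inner loop with DO16.

```c
#include <stddef.h>
#include <stdint.h>

/* Largest prime below 65536; zlib calls this constant BASE. */
#define ADLER_BASE 65521u

/* Reference Adler-32: 'a' is the running byte sum (starting at 1),
   'b' is the running sum of 'a'. Both are kept modulo ADLER_BASE. */
uint32_t adler32_simple(const unsigned char *buf, size_t len)
{
    uint32_t a = 1;
    uint32_t b = 0;

    for (size_t i = 0; i < len; ++i) {
        a = (a + buf[i]) % ADLER_BASE;
        b = (b + a) % ADLER_BASE;
    }
    /* The checksum packs b into the high 16 bits and a into the low 16 bits. */
    return (b << 16) | a;
}
```

A byte-at-a-time loop like this is easy to check against, but it is exactly the kind of scalar code that a hotspots profile would flag on large inputs.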

# Hotspots for the open-source Adler32 implementation



# Source/Assembly correlated view in VTune



Checking the assembly code shows that the loop is not vectorized.

Using the optimized Adler32 function from the Intel IPP library should help!

## Switch to the IPP implementation

```
uLong ZEXPORT adler32(adler, buf, len)
    uLong adler;
    const Bytef *buf;
    uInt len;
{
    unsigned long sum2;
    unsigned n;

    /* split Adler-32 into component sums */
    sum2 = (adler >> 16) & 0xffff;
    adler &= 0xffff;

    /* initial Adler-32 value (deferred check for len == 1 speed) */
    if (buf == Z_NULL)
        return 1L;

    /* (handling of short and remaining lengths is not shown on the slide) */

    /* do length NMAX blocks -- requires just one modulo operation */
    while (len >= NMAX) {
        len -= NMAX;
        n = NMAX / 16;          /* NMAX is divisible by 16 */
        do {
            DO16(buf);          /* 16 sums unrolled */
            buf += 16;
        } while (--n);
        MOD(adler);
        MOD(sum2);
    }

    /* return recombined sums */
    return adler | (sum2 << 16);
}
```

```
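#include "ippdc.h"

uLong ZEXPORT adler32_ipp(adler, buf, len)
    uLong adler;
    const Bytef *buf;
    uInt len;
{
    /* seed the IPP routine with the running checksum */
    Ipp32u resAdler32 = (Ipp32u)adler;

    /* same convention as zlib: a NULL buffer returns the initial value */
    if( Z_NULL == buf ) return 1L;

    /* IPP updates resAdler32 in place over len bytes of buf */
    ippsAdler32_8u(buf, len, &resAdler32);

    return ((uLong)resAdler32 & 0xffffffff);
}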
```

Use the optimized IPP function to take advantage of HW features!



# Vectorized code can deliver a significant performance gain



# Microarchitecture Exploration – The way to check CPU execution efficiency

https://software.intel.com/en-us/vtune-amplifier-help-microarchitecture-exploration-analysis

https://software.intel.com/en-us/vtune-amplifier-help-tuning-applications-using-a-top-down-microarchitecture-analysis-method

https://software.intel.com/en-us/articles/processor-specific-performance-analysis-papers

# Microarchitecture Exploration





Analyze CPU microarchitecture bottlenecks affecting the performance of your application. This analysis type is based on the hardware event-based sampling collection.

Warning: CPU frequency data collection is not supported on this platform.

CPU sampling interval, ms: 10

Extend granularity for the top-level metrics:

- ✓ Front-End Bound
- ✓ Bad Speculation
- ✓ Memory Bound
- ✓ Core Bound
- ✓ Retiring

## A simplified CPU execution pipeline flow



### Bottleneck Domain – A Top-Down hierarchy

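Performance is classified according to what happened in each pipeline slot available to the application or hotspot: Front-End Bound, Bad Speculation, Back-End Bound (Memory Bound or Core Bound), or Retiring.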
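The slot accounting can be sketched in code. The example below follows the level-1 formulas of the published Top-Down method as I understand them; the struct and function names are hypothetical, the pipeline width of 4 slots per cycle is an assumption that holds only for some microarchitectures, and the exact events and formulas VTune uses differ per processor generation.

```c
#include <stdint.h>

/* Assumed issue width: pipeline slots available per core cycle. */
enum { PIPELINE_WIDTH = 4 };

/* Raw counter totals for one scope (hypothetical field names). */
typedef struct {
    uint64_t clocks;             /* unhalted core cycles */
    uint64_t uops_not_delivered; /* slots the front end left empty */
    uint64_t uops_issued;        /* uops issued to the back end */
    uint64_t uops_retired_slots; /* slots used by retired uops */
    uint64_t recovery_cycles;    /* cycles spent recovering from mis-speculation */
} TopdownCounters;

/* Fractions of all pipeline slots; the four values sum to 1. */
typedef struct {
    double frontend_bound;
    double bad_speculation;
    double retiring;
    double backend_bound;
} TopdownLevel1;

/* Classifies slots into the four top-level categories.
   The caller must pass clocks > 0. */
TopdownLevel1 topdown_level1(const TopdownCounters *c)
{
    const double slots = (double)PIPELINE_WIDTH * (double)c->clocks;
    TopdownLevel1 r;

    r.frontend_bound = (double)c->uops_not_delivered / slots;
    r.bad_speculation = ((double)c->uops_issued - (double)c->uops_retired_slots
                         + (double)PIPELINE_WIDTH * (double)c->recovery_cycles) / slots;
    r.retiring = (double)c->uops_retired_slots / slots;
    /* Whatever is not explained by the other three categories. */
    r.backend_bound = 1.0 - (r.frontend_bound + r.bad_speculation + r.retiring);
    return r;
}
```

With 1000 cycles, 400 undelivered slots, 2200 issued uops, 2000 retired slots and 50 recovery cycles, this yields roughly 10% Front-End Bound, 10% Bad Speculation, 50% Retiring and 30% Back-End Bound.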



### Visualize the Micro-Architectural Bottleneck

### Intel® VTune™ Amplifier – Performance Profiler





# Case Study: Microarchitecture exploration analysis

This is a real case from Stack Overflow:

http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array

Run 'sumtest' with the "General Exploration" analysis (named "Microarchitecture Exploration" in VTune Amplifier 2019) and investigate the result

Why does a 'sort' on 'data' make a huge performance difference?

Here is a piece of **C++** code that seems very peculiar. For some strange reason, sorting the data miraculously makes the code almost six times faster.

```
#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>

int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! With this, the next loop runs faster
    std::sort(data, data + arraySize);

    // Test
    clock_t start = clock();
    long long sum = 0;

    for (unsigned i = 0; i < 100000; ++i)
    {
        // Primary loop
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

    std::cout << elapsedTime << std::endl;
    std::cout << "sum = " << sum << std::endl;
}
```

- Without the std::sort(data, data + arraySize); call, the code runs in 11.54 seconds.
- With the sorted data, the code runs in 1.93 seconds.



## How can VTune Amplifier help to identify the root cause?

- Build the code with and without the 'sort'
- Use VTune Amplifier 'Microarchitecture Exploration' analysis for profiling

### Microarchitecture Exploration

- ✓ Hardware event-based sampling
- ✓ Good starting point to triage hardware issues
- ✓ A complete list of events is collected for analysis
- ✓ It calculates a set of predefined ratios used for the metrics and helps identify hardware-level performance problems



# Microarchitecture Exploration - Summary





VS

## Microarchitecture Exploration for the case without 'sort'



- Performance issues are marked in pink
- VTune reports that this program has a large 'Branch Mispredict' penalty

# Which branch causes the problem?

Go to the source view to check which branch in the source code causes the problem

| Source Line | Source                                      | Clockticks     | Instructions Retired    | CPI Rate    | Retiring | Bad Speculation: Branch Mispredict | Bad Speculation: Machine Clears |
|------------|---------------------------------------------|----------------|-------------------------|-------------|-------|----------------------|-------------------|
| 6          | -                                           |                |                         |             |       |                      |                   |
| 7          | // Generate data                            |                |                         |             |       |                      |                   |
| 8          | const unsigned arraySize = 32768;           |                |                         |             |       |                      |                   |
| 9          | int data[arraySize];                        |                |                         |             |       |                      |                   |
| 10         |                                             |                |                         |             |       |                      |                   |
| 11         | for (unsigned c = 0; c < arraySize; ++c)    |                |                         |             |       |                      |                   |
| 12         | data[c] = std::rand() % 256;                |                |                         |             |       |                      |                   |
| 13         |                                             |                |                         |             |       |                      |                   |
| 14         | // !!! With this, the next loop runs faster |                |                         |             |       |                      |                   |
| 15         | //std::sort(data, data + arraySize);        |                |                         |             |       |                      |                   |
| 16         |                                             |                |                         |             |       |                      |                   |
| 17         | // Test                                     |                |                         |             |       |                      |                   |
| 18         | clock_t start = clock();                    |                |                         |             |       |                      |                   |
| 19         | long long sum = 0;                          |                |                         |             |       |                      |                   |
| 20         |                                             |                |                         |             |       |                      |                   |
| 21         | for (unsigned i = 0; i < 100000; ++i)       |                |                         |             |       |                      |                   |
| 22         | {                                           |                |                         |             |       |                      |                   |
| 23         | // Primary loop                             |                |                         |             |       |                      |                   |
| 24         | for (unsigned c = 0; c < arraySize; ++c)    | 4,872,007,308  | 1,444,002,166           | 3.374       | 14.0% | 42.1%                | 0.0%              |
| 25         | {                                           |                |                         |             |       |                      |                   |
| 26         | if (data[c] >= 128)                         | 12,200,018,300 | 4,198,006,297           | 2.906       | 5.3%  | 40.2%                | 0.0%              |
| 27         | sum += data[c];                             | 21,602,032,403 | 15,664,023,4            | 1.379       | 10.6% | 0.0%                 | 30.8%             |
| 28         | }                                           |                |                         |             |       |                      |                   |
| 29         | }                                           |                |                         |             |       |                      |                   |
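
The CPI Rate column in the table above is the ratio of the two columns before it. A minimal sketch of that arithmetic follows; `cpi_rate` is a hypothetical helper name, not a VTune API.

```c
#include <stdint.h>

/* Cycles per retired instruction.
   Returns 0.0 when no instructions retired, to avoid dividing by zero. */
double cpi_rate(uint64_t clockticks, uint64_t instructions_retired)
{
    if (instructions_retired == 0)
        return 0.0;
    return (double)clockticks / (double)instructions_retired;
}
```

For source line 24, `cpi_rate(4872007308ULL, 1444002166ULL)` is about 3.374, and for line 26, `cpi_rate(12200018300ULL, 4198006297ULL)` is about 2.906, matching the table. A CPI this far above 1 on such a simple loop suggests that the pipeline is stalling rather than retiring work.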

# Microarchitecture Exploration for the case with 'sort'



- The 'sort' helps the CPU make the right decision in branch prediction
- The 'Branch Mispredict' penalty disappears and the performance (clockticks) improves significantly
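
To see why sorted data is so much friendlier to the predictor, consider a deliberately simplified model. The sketch below simulates a toy predictor that always guesses the branch will go the same way as last time; this is an assumption for illustration only, since real branch predictors are far more sophisticated, and `count_mispredicts` is a made-up name.

```c
#include <stddef.h>

/* Counts wrong guesses of a toy "same as last outcome" predictor for the
   branch `data[i] >= threshold`. The first guess is "not taken". */
size_t count_mispredicts(const int *data, size_t n, int threshold)
{
    int predicted_taken = 0;
    size_t misses = 0;

    for (size_t i = 0; i < n; ++i) {
        int taken = data[i] >= threshold;
        if (taken != predicted_taken)
            ++misses;
        predicted_taken = taken;   /* remember the actual outcome */
    }
    return misses;
}
```

On sorted input such as {1, 2, 200, 201} with threshold 128 the outcome flips once, so the toy predictor misses once. On {200, 1, 200, 1} it misses on every element. Random values in 0..255 behave much like the second case, which is consistent with the high 'Branch Mispredict' share in the unsorted run.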

# Case Study: Microarchitecture exploration analysis

- A case from Stack Overflow: http://stackoverflow.com/questions/7327994/prefetching-examples
- 'array' (2 GiB) resides in main memory, far larger than the caches
- The index is not predictable, so it is hard for the hardware to prefetch array[mid]
- We may see a high number of LLC misses in this case
- Run a custom analysis based on Microarchitecture Exploration for the application "binary\_search" and investigate

```
#include <stdlib.h>
#include <time.h>

int binarySearch(int *array, int number_of_elements, int key) {
    int low = 0, high = number_of_elements - 1, mid;
    while (low <= high) {
        mid = (low + high) / 2;
        /* What is the performance issue here? */
        if (array[mid] < key)
            low = mid + 1;
        else if (array[mid] == key)
            return mid;
        else if (array[mid] > key)
            high = mid - 1;
    }
    return -1;
}

int main() {
    int SIZE = 1024*1024*512;
    int *array = malloc(SIZE * sizeof(int));
    if (array == NULL) return 1;
    for (int i = 0; i < SIZE; i++) {
        array[i] = i;
    }
    int NUM_LOOKUPS = 1024*1024*8;
    srand((unsigned)time(NULL));
    int *lookups = malloc(NUM_LOOKUPS * sizeof(int));
    if (lookups == NULL) return 1;
    for (int i = 0; i < NUM_LOOKUPS; i++) {
        lookups[i] = rand() % SIZE;
    }
    for (int i = 0; i < NUM_LOOKUPS; i++) {
        int result = binarySearch(array, SIZE, lookups[i]);
        (void)result;  /* the result is not used; only the lookup cost matters */
    }
    free(array);
    free(lookups);
    return 0;
}
```

Profile with VTune and check the results

# Microarchitecture exploration analysis Result







# Change the code with data prefetch

Add prefetches to reduce cache misses

Good performance gain

```
#include <stdlib.h>
#include <time.h>

int binarySearch(int *array, int number_of_elements, int key) {
    int low = 0, high = number_of_elements - 1, mid;
    while (low <= high) {
        mid = (low + high) / 2;
#ifdef DO_PREFETCH
        /* next midpoint if the search continues in the upper half (low = mid + 1) */
        __builtin_prefetch(&array[(mid + 1 + high)/2], 0, 1);
        /* next midpoint if the search continues in the lower half (high = mid - 1) */
        __builtin_prefetch(&array[(low + mid - 1)/2], 0, 1);
#endif
        if (array[mid] < key)
            low = mid + 1;
        else if (array[mid] == key)
            return mid;
        else if (array[mid] > key)
            high = mid - 1;
    }
    return -1;
}

int main() {
    int SIZE = 1024*1024*512;
    int *array = malloc(SIZE * sizeof(int));
    if (array == NULL) return 1;
    for (int i = 0; i < SIZE; i++) {
        array[i] = i;
    }
    int NUM_LOOKUPS = 1024*1024*8;
    srand((unsigned)time(NULL));
    int *lookups = malloc(NUM_LOOKUPS * sizeof(int));
    if (lookups == NULL) return 1;
    for (int i = 0; i < NUM_LOOKUPS; i++) {
        lookups[i] = rand() % SIZE;
    }
    for (int i = 0; i < NUM_LOOKUPS; i++) {
        int result = binarySearch(array, SIZE, lookups[i]);
        (void)result;  /* the result is not used; only the lookup cost matters */
    }
    free(array);
    free(lookups);
    return 0;
}
```

When I compile and run this example with DO\_PREFETCH enabled, I see a 20% reduction in runtime:

```
$ gcc c-binarysearch.c -DDO_PREFETCH -o with-prefetch -std=c11 -O3
$ gcc c-binarysearch.c -o no-prefetch -std=c11 -O3
```

# Case Study – Use VTune with DPDK IO API

- Using DPDK l3fwd to test the maximum forwarding performance: when the number of CPU cores is increased, the throughput does not increase.
- Testing with 1 CPU core: 32 Mpps for 64-byte packets.
- Testing with 2 CPU cores: still about 32 Mpps.
- What is the bottleneck? CPU or NIC?



# Intel® VTune™ Amplifier 2019 – "Input and Output" analysis

The "Input and Output" analysis from Intel® VTune™ Amplifier 2019 pinpoints a



The CPU is not fully utilized for receiving packets: 50% of CPU time is DPDK Rx Spin Time, where 0 packets are fetched → the CPU is not the bottleneck!
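
The idea behind a spin metric can be illustrated with a simplified model: count how many receive polls came back empty. The sketch below is an assumption-laden approximation, not how VTune or DPDK compute Rx Spin Time (the real metric is time-based and comes from DPDK instrumentation), and `empty_poll_fraction` is a hypothetical name.

```c
#include <stddef.h>

/* Fraction of receive polls that fetched zero packets.
   pkts_per_poll[i] is the number of packets returned by poll i.
   Returns 0.0 when there were no polls at all. */
double empty_poll_fraction(const unsigned *pkts_per_poll, size_t n_polls)
{
    size_t empty = 0;

    if (n_polls == 0)
        return 0.0;
    for (size_t i = 0; i < n_polls; ++i) {
        if (pkts_per_poll[i] == 0)
            ++empty;
    }
    return (double)empty / (double)n_polls;
}
```

A poll history such as {0, 32, 0, 32} gives 0.5: half of the polls find nothing to do, which points at the packet source rather than the core as the limiting factor.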

# Intel® VTune™ Amplifier 2019 – "Input and Output" analysis



- Add a new NIC and enable packet receiving with two NICs.
- DPDK Rx Spin Time is almost 0. The CPU always gets some packets.
- The throughput increased as expected.

### More Resources

### Intel® VTune™ Amplifier – Performance Profiler

- Product page: overview, features, FAQs...
- Training materials: tech briefs, documentation, eval guides...
- Reviews
- Support: forums, secure support...

### Additional Analysis Tools

- Intel® Inspector: memory and thread checker/debugger
- Intel® Advisor: vectorization optimization and thread prototyping
- Intel® Trace Analyzer and Collector: MPI analyzer and profiler

### Additional Development Products

Intel® Software Development Products

### **Webinars**

### Free in-depth presentations

- Register
- View Archives

### What's New?

Purchase includes a year of updates. Check out the latest improvements.



# Legal Disclaimer & Optimization Notice

Performance results are based on testing as of August 2017 to September 2018 and may not reflect all publicly available security updates. See configuration disclosure for details. No product can be absolutely secure.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Copyright © 2018, Intel Corporation. All rights reserved. Intel, the Intel logo, Pentium, Xeon, Core, VTune, OpenVINO, Cilk, are trademarks of Intel Corporation or its subsidiaries in the U.S. and other countries.

#### **Optimization Notice**

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804


