
CUDAMicroBench: Microbenchmark to Assist CUDA Performance Programming

Summarise Chapters

Abstract

  • GPU Complex Memory Hierarchy - BOTTLENECK

  • Microbenchmarks: a set of 14 microbenchmarks that highlight inefficient memory access patterns and suboptimal usage of parallelism.

  • Advanced CUDA Features

    1. Data Shuffling between Threads: exchange data between threads within the same warp to avoid redundant memory accesses.
    2. Dynamic Parallelism: a kernel can launch other kernels, enabling nested parallelism.
  • Evaluation Tool: can be used to evaluate the performance of GPU architectures and memory systems, and to assess the effectiveness of compilers and performance-analysis tools.

N.B. 

  • Memory Hierarchy on GPUs: GPUs have different memory types (a short kernel sketch after this list ties them to the thread hierarchy):

    1. Global Memory: Large but slow. Accessible by all threads but with high latency.
    2. Shared Memory: Faster but limited in size. Shared by threads in a block.
    3. Registers: The fastest memory, but very limited. Only accessible by individual threads.
  • Thread Hierarchy in CUDA:

    1. Threads are grouped into warps (usually 32 threads).
    2. Warps are grouped into blocks.
    3. Blocks form a grid.
  • Challenges in GPU Programming:

    1. Warp Divergence: Inefficient execution when threads in a warp follow different paths (e.g., due to if-else conditions).
    2. Memory Bottlenecks: Non-coalesced memory accesses lead to poor memory bandwidth utilization.
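
A minimal kernel sketch tying together the memory types and thread hierarchy listed above (the kernel name `scale` and the launch configuration are illustrative, not from the paper):

// Illustrative kernel: 'x' and 'y' live in global memory, 'tile' in per-block
// shared memory, and 'v' in a per-thread register.
__global__ void scale(const float *x, float *y, int n) {
    __shared__ float tile[256];                        // shared by all threads in this block
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (tid < n) {
        tile[threadIdx.x] = x[tid];                    // global -> shared
        float v = 2.0f * tile[threadIdx.x];            // held in a register
        y[tid] = v;                                    // register -> global
    }
}

// Launch: a grid of blocks, each block of 256 threads (8 warps of 32 threads):
// scale<<<(n + 255) / 256, 256>>>(d_x, d_y, n);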

Introduction

  1. CUDA Programming Complexity.

  2. Optimization Strategies: fine-tuning memory access patterns and shuffling data between threads.

  3. Heterogeneous Systems: CPUs and GPUs work together.

Motivation

  1. Massive GPU Parallelism: the latest Ampere A100 contains over 5,000 cores; each Streaming Multiprocessor (SM) holds multiple cores, each with its own Arithmetic Logic Units (ALUs). The GPU uses the Single Instruction, Multiple Threads (SIMT) execution model, in which groups of 32 threads (warps) execute instructions in lockstep.

  2. Memory Hierarchy Complexity: the deep memory hierarchy includes both on-chip (e.g., registers, shared memory) and off-chip (e.g., local, global, constant, texture) memory types, so programmers must use the right data access patterns. Discrete memory systems shared by the CPU and GPU also require efficient data transfer; Unified Memory, introduced in CUDA 6.0, simplifies this but can still introduce inefficiencies during data transfers.
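
A minimal sketch of the two transfer styles mentioned above, assuming CUDA 6.0+ for Unified Memory (the `axpy` kernel, helper function names, and sizes are illustrative):

#include <cuda_runtime.h>

__global__ void axpy(float a, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

void run_explicit(const float *h_x, float *h_y, int n) {
    // Discrete memories: the programmer copies data across the PCIe/NVLink bus.
    float *d_x, *d_y;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&d_x, bytes);  cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);
    axpy<<<(n + 255) / 256, 256>>>(2.0f, d_x, d_y, n);
    cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x);  cudaFree(d_y);
}

void run_unified(int n) {
    // Unified Memory: one pointer usable on CPU and GPU; the driver migrates
    // pages on demand, which is convenient but can hide costly transfers.
    float *x, *y;
    size_t bytes = n * sizeof(float);
    cudaMallocManaged(&x, bytes);  cudaMallocManaged(&y, bytes);
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
    axpy<<<(n + 255) / 256, 256>>>(2.0f, x, y, n);
    cudaDeviceSynchronize();   // wait before touching y on the host again
    cudaFree(x);  cudaFree(y);
}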

Guidelines

        Saturate: (1) the cores, (2) memory bandwidth, and (3) CPU-GPU transfer bandwidth.

Benchmark Overview

  • Warp-level optimizations (e.g., controlling warp divergence),

  • Memory access strategies (e.g., using shared memory to reduce global memory access latency),

  • Data movement improvements (e.g., asynchronous memory transfers to overlap communication with computation).

How to Optimise Kernels

Four key techniques: avoiding Warp Divergence, Dynamic Parallelism, Concurrent Kernels, and Task Graphs.

A. Warp Divergence

To avoid "threads in a warp following a different path than others, but all threads must still execute both paths before the relevant threads commit results."

// Divergent version: even and odd threads within the same warp take different
// branches, so each warp executes both branches serially.
__global__ void WD(float *x, float *y, float *z) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid % 2 == 0) {
        z[tid] = 2 * x[tid] + 3 * y[tid];
    } else {
        z[tid] = 3 * x[tid] + 2 * y[tid];
    }
}

// Divergence-free version: the branch condition depends only on the warp index,
// so all 32 threads of a warp take the same path.
__global__ void noWD(float *x, float *y, float *z) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if ((tid / warpSize) % 2 == 0) {
        z[tid] = 2 * x[tid] + 3 * y[tid];
    } else {
        z[tid] = 3 * x[tid] + 2 * y[tid];
    }
}

Use Cases:

Ideal for kernels whose conditional logic depends on thread IDs. The benchmark reports an 85.71%-100% improvement when divergence is removed.

B. Dynamic Parallelism

Since CUDA 5.0, a kernel can launch other kernels directly from the GPU rather than from the CPU. This is useful for nested parallelism (e.g., adaptive grids, recursive algorithms).

// Pseudocode sketch: each block inspects the dwell values on the border of its
// image region and decides whether to fill the region, subdivide (a recursive
// kernel launch from the GPU), or compute per pixel.
__global__ void mandelbrot_block_k() {
    int comm_dwell = border_dwell();
    if (/* all perimeter dwells are equal */) {
        dwell_fill_k<<<grid, bs>>>();        // fill the whole region with comm_dwell
    } else if (/* subdivision limit and depth limit not yet reached */) {
        mandelbrot_block_k<<<grid, bs>>>();  // kernel launches itself on sub-regions
    } else {
        pixel_calc<<<grid, bs>>>();          // evaluate each pixel individually
    }
}

Use Cases:

Suitable for rendering complex graphics or simulations where certain regions require more computational detail. Significant speedups are reported (e.g., a 3.26x improvement when generating large images), but the extra kernel launches add overhead for smaller workloads.

C. Concurrent Kernels

Since NVIDIA's Fermi architecture, multiple kernels can execute concurrently on a single GPU. This maximizes GPU resource utilisation and is especially useful for memory-bound kernels.

Example: launching kernels asynchronously into different CUDA streams allows them to execute concurrently, as can be visualized in the NVIDIA Visual Profiler (nvvp).
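
A minimal sketch of what that profile comes from, assuming two independent kernels (`kernelA`/`kernelB` are placeholders) launched into separate non-default streams:

#include <cuda_runtime.h>

__global__ void kernelA(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;
}

__global__ void kernelB(float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] += 1.0f;
}

void launch_concurrent(float *d_a, float *d_b, int n) {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Kernels in different streams may overlap on the GPU if resources allow;
    // nvvp / Nsight Systems shows them running side by side.
    kernelA<<<(n + 255) / 256, 256, 0, s1>>>(d_a, n);
    kernelB<<<(n + 255) / 256, 256, 0, s2>>>(d_b, n);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}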

Use Cases:

Workloads that need high concurrency and low latency (e.g., real-time data processing). The benchmark reports roughly a 7x speedup compared to serial kernel execution.

D. Task Graph

Introduced in CUDA 10, task graphs provide a structured way to define a series of operations and their dependencies (such as memory copies and kernel launches).

Use Cases:

Best suited for workloads that should not involve the CPU excessively; graphs enhance programmability and can reduce CPU-GPU launch and communication overhead, especially for repetitive tasks.
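
One way to build such a graph is stream capture (CUDA 10+). A hedged sketch, with `step1`/`step2` as stand-in kernels:

#include <cuda_runtime.h>

__global__ void step1(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

__global__ void step2(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

void run_with_graph(float *d_x, int n, int iterations) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Record the sequence of operations once...
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    step1<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n);
    step2<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n);
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

    // ...then replay it with a single launch call per iteration, avoiding
    // repeated per-kernel launch overhead on the CPU.
    for (int i = 0; i < iterations; ++i)
        cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
}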

Effectively Leveraging the Deep Memory Hierarchy Inside GPUs

A. Using Shared Memory to Improve Performance

Fast, programmable SRAM (Static Random Access Memory) on the GPU, accessible by all threads in the same block. Slower than registers but offers a much larger capacity. Often used as a software-managed cache. (matrix multiplication)
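
A minimal sketch of the matrix-multiplication pattern referenced above, assuming square TILE x TILE tiles and matrix dimensions divisible by TILE:

#define TILE 16

// Each block computes one TILE x TILE tile of C = A * B, staging tiles of A
// and B in shared memory so each global element is loaded once per tile
// instead of once per multiply-add.
__global__ void matmul_shared(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                       // tile fully loaded

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                       // done with this tile
    }
    C[row * n + col] = acc;
}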

B. Coalesced Memory Access

Data moves between global memory and on-chip storage in fixed-size transactions, so memory requests from adjacent threads should coalesce into as few transactions as possible. (AXPY kernel: block vs. cyclic distribution)
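
A hedged sketch of the block vs. cyclic contrast (the actual benchmark code may differ): with a cyclic distribution, consecutive threads touch consecutive elements and a warp's requests coalesce; with a block distribution, each thread walks its own contiguous chunk and a warp's requests are scattered.

// Cyclic (coalesced): thread i handles elements i, i+stride, i+2*stride, ...
__global__ void axpy_cyclic(float a, const float *x, float *y, int n) {
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        y[i] = a * x[i] + y[i];     // a warp touches 32 adjacent floats
}

// Block (non-coalesced): thread i handles one contiguous chunk of 'chunk' elements.
__global__ void axpy_block(float a, const float *x, float *y, int n, int chunk) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int begin = tid * chunk;
    for (int i = begin; i < begin + chunk && i < n; ++i)
        y[i] = a * x[i] + y[i];     // adjacent threads access elements 'chunk' apart,
                                    // so a warp's requests scatter across transactions
}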

The suite also suggests using compressed storage formats to improve memory-access density, and storing read-only data in read-only memory.
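
A small sketch of the read-only-data suggestion (compressed storage formats such as CSR are not shown): marking input pointers `const __restrict__`, or loading through `__ldg`, lets the compiler route loads through the read-only data cache on Kepler and later GPUs.

// Read-only input: 'const ... __restrict__' allows loads via the read-only cache path.
__global__ void axpy_ro(float a, const float * __restrict__ x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * __ldg(&x[i]) + y[i];   // __ldg forces a read-only cached load
}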

C. Memory Alignment for GPU Kernels

The first accessed memory address should be an exact multiple of the memory transaction size (32, 64, or 128 bytes); otherwise each warp's request spills into extra transactions. (AXPY)
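
A hedged sketch of the alignment effect (the offset is illustrative): starting a warp's accesses at an address that is not a multiple of the transaction size splits each request across additional transactions.

// Aligned: cudaMalloc returns pointers aligned to at least 256 bytes, so a warp
// reading x[0..31] fits in the minimum number of 128-byte transactions.
__global__ void axpy_aligned(float a, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Misaligned: launching the same kernel on a pointer shifted by one element
// starts each warp's accesses off a transaction boundary, costing an extra
// transaction per warp:
// axpy_aligned<<<grid, block>>>(a, d_x + 1, d_y + 1, n - 1);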

D. Overlapping and Pipelining Data Copy Between Global Memory and Shared Memory

Asynchronous memory copying (memcpy_async) allows the copy from global memory to shared memory to overlap with independent computation. (AXPY)
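
A minimal sketch using the cooperative-groups `memcpy_async` API (CUDA 11+); the tile size is illustrative and n is assumed to be a multiple of TILE:

#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

#define TILE 256

// Each block asynchronously stages a tile of x into shared memory; the copy
// can overlap with independent work issued before cg::wait().
__global__ void axpy_async(float a, const float *x, float *y, int n) {
    __shared__ float xs[TILE];
    cg::thread_block block = cg::this_thread_block();

    int base = blockIdx.x * TILE;
    cg::memcpy_async(block, xs, x + base, sizeof(float) * TILE);

    // ... independent computation could be placed here ...

    cg::wait(block);                 // copy into shared memory is complete
    int i = base + threadIdx.x;
    if (i < n) y[i] = a * xs[threadIdx.x] + y[i];
}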

E. Data Shuffle Between Threads

Supported on Kepler and later architectures: threads in a warp can exchange data directly through registers without going through shared memory, avoiding shared-memory bandwidth bottlenecks. (reduction algorithm)
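
A minimal sketch of a warp-level sum reduction using shuffle intrinsics (`__shfl_down_sync`, CUDA 9+); no shared memory is needed within the warp:

// Each warp reduces its 32 values; lane 0 ends up with the warp's sum.
__device__ float warp_reduce_sum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);  // read from lane (id + offset)
    return val;
}

__global__ void reduce_shfl(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;
    v = warp_reduce_sum(v);
    if ((threadIdx.x & (warpSize - 1)) == 0)   // one atomic per warp
        atomicAdd(out, v);
}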

F. Bank Conflicts Due to Strided Index

Shared memory is divided into multiple banks; when threads in a warp access different addresses that fall into the same bank, the accesses are serialized (a bank conflict). Use a continuous (sequential-addressing) reduction or adjust stride sizes. (reduction algorithm)
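
A hedged sketch of the two reduction indexings (block size illustrative): the strided-index version maps several active threads of a warp to the same bank, while sequential addressing keeps a warp's accesses in distinct banks.

#define BLOCK 256

// Strided index: thread tid accesses sdata[2*s*tid]; as the stride grows,
// active threads in a warp hit the same shared-memory bank (conflicts).
__global__ void reduce_strided(const float *in, float *out) {
    __shared__ float sdata[BLOCK];
    unsigned tid = threadIdx.x;
    sdata[tid] = in[blockIdx.x * BLOCK + tid];
    __syncthreads();
    for (unsigned s = 1; s < BLOCK; s *= 2) {
        unsigned idx = 2 * s * tid;
        if (idx + s < BLOCK)
            sdata[idx] += sdata[idx + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Sequential (continuous) addressing: active threads are contiguous and read
// contiguous shared-memory words, avoiding bank conflicts.
__global__ void reduce_sequential(const float *in, float *out) {
    __shared__ float sdata[BLOCK];
    unsigned tid = threadIdx.x;
    sdata[tid] = in[blockIdx.x * BLOCK + tid];
    __syncthreads();
    for (unsigned s = BLOCK / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}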

Related Work

A. Benchmark Suites for Evaluating GPUs

  1. Rodinia: targets heterogeneous computing with CUDA and OpenMP, covering data sharing between multi-core CPUs and GPUs.
  2. SPEC ACCEL: uses OpenCL, OpenACC, and OpenMP, measuring CPU and GPU performance along with memory and compiler performance.
  3. CUDAMicroBench: simpler kernels that demonstrate performance challenges and optimization techniques specific to CUDA.
