
CUDAMicroBench: Microbenchmark to Assist CUDA Performance Programming

Summarise Chapters

Abstract

  • GPU Complex Memory Hierarchy - BOTTLENECK

  • Microbenchmarks: a set of 14 microbenchmarks that highlight inefficient memory access patterns and suboptimal usage of parallelism.

  • Advanced CUDA Features

    1. Data Shuffling between Threads: exchange data between threads within the same warp to avoid redundant memory accesses.
    2. Dynamic Parallelism: a kernel can launch other kernels, enabling nested parallelism.
  • Evaluation Tool: can be used to evaluate the performance of GPU architectures and memory systems, and to assess the effectiveness of compilers and performance-analysis tools.

N.B. 

  • Memory Hierarchy on GPUs: GPUs have different memory types (a short kernel sketch after this list ties them to the thread hierarchy):

    1. Global Memory: Large but slow. Accessible by all threads but with high latency.
    2. Shared Memory: Faster but limited in size. Shared by threads in a block.
    3. Registers: The fastest memory, but very limited. Only accessible by individual threads.
  • Thread Hierarchy in CUDA:

    1. Threads are grouped into warps (usually 32 threads).
    2. Warps are grouped into blocks.
    3. Blocks form a grid.
  • Challenges in GPU Programming:

    1. Warp Divergence: Inefficient execution when threads in a warp follow different paths (e.g., due to if-else conditions).
    2. Memory Bottlenecks: Non-coalesced memory accesses lead to poor memory bandwidth utilization.
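
A minimal kernel sketch tying together the memory types and thread hierarchy listed above (the kernel name `scale` and the launch configuration are illustrative, not from the paper):

// Illustrative kernel: 'x' and 'y' live in global memory, 'tile' in per-block
// shared memory, and 'v' in a per-thread register.
__global__ void scale(const float *x, float *y, int n) {
    __shared__ float tile[256];                        // shared by all threads in this block
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (tid < n) {
        tile[threadIdx.x] = x[tid];                    // global -> shared
        float v = 2.0f * tile[threadIdx.x];            // held in a register
        y[tid] = v;                                    // register -> global
    }
}

// Launch: a grid of blocks, each block of 256 threads (8 warps of 32 threads):
// scale<<<(n + 255) / 256, 256>>>(d_x, d_y, n);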

Introduction

  1. CUDA Programming Complexity.

  2. Optimization Strategies: fine-tuning memory access patterns and shuffling data between threads.

  3. Heterogeneous Systems: CPUs and GPUs work together.

Motivation

  1. Massive GPU Parallelism: the latest Ampere A100 contains over 5,000 cores; each Streaming Multiprocessor (SM) holds multiple cores, each with its own Arithmetic Logic Units (ALUs). The GPU uses the Single Instruction, Multiple Threads (SIMT) execution model, in which groups of 32 threads (warps) execute instructions in lockstep.

  2. Memory Hierarchy Complexity: the deep memory hierarchy includes both on-chip (e.g., registers, shared memory) and off-chip (e.g., local, global, constant, texture) memory types, so programmers must use the right data access patterns. Discrete memory systems shared by the CPU and GPU also require efficient data transfer; Unified Memory, introduced in CUDA 6.0, simplifies this but can still introduce inefficiencies during data transfers.
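
A minimal sketch of the two transfer styles mentioned above, assuming CUDA 6.0+ for Unified Memory (the `axpy` kernel, helper function names, and sizes are illustrative):

#include <cuda_runtime.h>

__global__ void axpy(float a, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

void run_explicit(const float *h_x, float *h_y, int n) {
    // Discrete memories: the programmer copies data across the PCIe/NVLink bus.
    float *d_x, *d_y;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&d_x, bytes);  cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);
    axpy<<<(n + 255) / 256, 256>>>(2.0f, d_x, d_y, n);
    cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x);  cudaFree(d_y);
}

void run_unified(int n) {
    // Unified Memory: one pointer usable on CPU and GPU; the driver migrates
    // pages on demand, which is convenient but can hide costly transfers.
    float *x, *y;
    size_t bytes = n * sizeof(float);
    cudaMallocManaged(&x, bytes);  cudaMallocManaged(&y, bytes);
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
    axpy<<<(n + 255) / 256, 256>>>(2.0f, x, y, n);
    cudaDeviceSynchronize();   // wait before touching y on the host again
    cudaFree(x);  cudaFree(y);
}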

Guidelines

        Saturate: (1) the cores, (2) memory bandwidth, and (3) CPU-GPU transfer bandwidth.

Benchmark Overview

  • Warp-level optimizations (e.g., controlling warp divergence),

  • Memory access strategies (e.g., using shared memory to reduce global memory access latency),

  • Data movement improvements (e.g., asynchronous memory transfers to overlap communication with computation).

How to Optimise Kernels

Four key techniques: avoiding Warp Divergence, Dynamic Parallelism, Concurrent Kernels, and Task Graphs.

A. Warp Divergence

To avoid "threads in a warp following a different path than others, but all threads must still execute both paths before the relevant threads commit results."

// Divergent version: even and odd threads within the same warp take different
// branches, so each warp executes both branches serially.
__global__ void WD(float *x, float *y, float *z) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid % 2 == 0) {
        z[tid] = 2 * x[tid] + 3 * y[tid];
    } else {
        z[tid] = 3 * x[tid] + 2 * y[tid];
    }
}

// Divergence-free version: the branch condition depends only on the warp index,
// so all 32 threads of a warp take the same path.
__global__ void noWD(float *x, float *y, float *z) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if ((tid / warpSize) % 2 == 0) {
        z[tid] = 2 * x[tid] + 3 * y[tid];
    } else {
        z[tid] = 3 * x[tid] + 2 * y[tid];
    }
}

Use Cases:

Ideal for kernels whose conditional logic depends on thread IDs. The benchmark reports an 85.71%-100% improvement when divergence is removed.

B. Dynamic Parallelism

Since CUDA 5.0, a kernel can launch other kernels directly from the GPU rather than from the CPU. This is useful for nested parallelism (e.g., adaptive grids, recursive algorithms).

// Pseudocode sketch: each block inspects the dwell values on the border of its
// image region and decides whether to fill the region, subdivide (a recursive
// kernel launch from the GPU), or compute per pixel.
__global__ void mandelbrot_block_k() {
    int comm_dwell = border_dwell();
    if (/* all perimeter dwells are equal */) {
        dwell_fill_k<<<grid, bs>>>();        // fill the whole region with comm_dwell
    } else if (/* subdivision limit and depth limit not yet reached */) {
        mandelbrot_block_k<<<grid, bs>>>();  // kernel launches itself on sub-regions
    } else {
        pixel_calc<<<grid, bs>>>();          // evaluate each pixel individually
    }
}

Use Cases:

Suitable for rendering complex graphics or simulations where certain regions require more computational detail. Significant speedups are reported (e.g., a 3.26x improvement when generating large images), but the extra kernel launches add overhead for smaller workloads.

C. Concurrent Kernels

Since NVIDIA's Fermi architecture, multiple kernels can execute concurrently on a single GPU. This maximizes GPU resource utilisation and is especially useful for memory-bound kernels.

Example: launching kernels asynchronously into different CUDA streams allows them to execute concurrently, as can be visualized in the NVIDIA Visual Profiler (nvvp).
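
A minimal sketch of what that profile comes from, assuming two independent kernels (`kernelA`/`kernelB` are placeholders) launched into separate non-default streams:

#include <cuda_runtime.h>

__global__ void kernelA(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;
}

__global__ void kernelB(float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] += 1.0f;
}

void launch_concurrent(float *d_a, float *d_b, int n) {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Kernels in different streams may overlap on the GPU if resources allow;
    // nvvp / Nsight Systems shows them running side by side.
    kernelA<<<(n + 255) / 256, 256, 0, s1>>>(d_a, n);
    kernelB<<<(n + 255) / 256, 256, 0, s2>>>(d_b, n);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}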

Use Cases:

Workloads that need high concurrency and low latency (e.g., real-time data processing). The benchmark reports roughly a 7x speedup compared to serial kernel execution.

D. Task Graph

Introduced in CUDA 10, task graphs provide a structured way to define a series of operations and their dependencies (such as memory copies and kernel launches).

Use Cases:

Best suited for workloads that should not involve the CPU excessively; graphs enhance programmability and can reduce CPU-GPU launch and communication overhead, especially for repetitive tasks.
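
One way to build such a graph is stream capture (CUDA 10+). A hedged sketch, with `step1`/`step2` as stand-in kernels:

#include <cuda_runtime.h>

__global__ void step1(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

__global__ void step2(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

void run_with_graph(float *d_x, int n, int iterations) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Record the sequence of operations once...
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    step1<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n);
    step2<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n);
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

    // ...then replay it with a single launch call per iteration, avoiding
    // repeated per-kernel launch overhead on the CPU.
    for (int i = 0; i < iterations; ++i)
        cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
}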

Effectively Leveraging the Deep Memory Hierarchy Inside GPUs

A. Using Shared Memory to Improve Performance

Fast, programmable SRAM (Static Random Access Memory) on the GPU, accessible by all threads in the same block. Slower than registers but offers a much larger capacity. Often used as a software-managed cache. (matrix multiplication)
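
A minimal sketch of the matrix-multiplication pattern referenced above, assuming square TILE x TILE tiles and matrix dimensions divisible by TILE:

#define TILE 16

// Each block computes one TILE x TILE tile of C = A * B, staging tiles of A
// and B in shared memory so each global element is loaded once per tile
// instead of once per multiply-add.
__global__ void matmul_shared(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                       // tile fully loaded

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                       // done with this tile
    }
    C[row * n + col] = acc;
}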

B. Coalesced Memory Access

Data moves between global memory and on-chip storage in fixed-size transactions, so memory requests from adjacent threads should coalesce into as few transactions as possible. (AXPY kernel: block vs. cyclic distribution)
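
A hedged sketch of the block vs. cyclic contrast (the actual benchmark code may differ): with a cyclic distribution, consecutive threads touch consecutive elements and a warp's requests coalesce; with a block distribution, each thread walks its own contiguous chunk and a warp's requests are scattered.

// Cyclic (coalesced): thread i handles elements i, i+stride, i+2*stride, ...
__global__ void axpy_cyclic(float a, const float *x, float *y, int n) {
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        y[i] = a * x[i] + y[i];     // a warp touches 32 adjacent floats
}

// Block (non-coalesced): thread i handles one contiguous chunk of 'chunk' elements.
__global__ void axpy_block(float a, const float *x, float *y, int n, int chunk) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int begin = tid * chunk;
    for (int i = begin; i < begin + chunk && i < n; ++i)
        y[i] = a * x[i] + y[i];     // adjacent threads access elements 'chunk' apart,
                                    // so a warp's requests scatter across transactions
}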

The suite also suggests using compressed storage formats to improve memory-access density, and storing read-only data in read-only memory.
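
A small sketch of the read-only-data suggestion (compressed storage formats such as CSR are not shown): marking input pointers `const __restrict__`, or loading through `__ldg`, lets the compiler route loads through the read-only data cache on Kepler and later GPUs.

// Read-only input: 'const ... __restrict__' allows loads via the read-only cache path.
__global__ void axpy_ro(float a, const float * __restrict__ x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * __ldg(&x[i]) + y[i];   // __ldg forces a read-only cached load
}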

C. Memory Alignment for GPU Kernels

The first accessed memory address should be an exact multiple of the memory transaction size (32, 64, or 128 bytes); otherwise each warp's request spills into extra transactions. (AXPY)
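
A hedged sketch of the alignment effect (the offset is illustrative): starting a warp's accesses at an address that is not a multiple of the transaction size splits each request across additional transactions.

// Aligned: cudaMalloc returns pointers aligned to at least 256 bytes, so a warp
// reading x[0..31] fits in the minimum number of 128-byte transactions.
__global__ void axpy_aligned(float a, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Misaligned: launching the same kernel on a pointer shifted by one element
// starts each warp's accesses off a transaction boundary, costing an extra
// transaction per warp:
// axpy_aligned<<<grid, block>>>(a, d_x + 1, d_y + 1, n - 1);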

D. Overlapping and Pipelining Data Copy Between Global Memory and Shared Memory

Asynchronous memory copying (memcpy_async) allows the copy from global memory to shared memory to overlap with independent computation. (AXPY)
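
A minimal sketch using the cooperative-groups `memcpy_async` API (CUDA 11+); the tile size is illustrative and n is assumed to be a multiple of TILE:

#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

#define TILE 256

// Each block asynchronously stages a tile of x into shared memory; the copy
// can overlap with independent work issued before cg::wait().
__global__ void axpy_async(float a, const float *x, float *y, int n) {
    __shared__ float xs[TILE];
    cg::thread_block block = cg::this_thread_block();

    int base = blockIdx.x * TILE;
    cg::memcpy_async(block, xs, x + base, sizeof(float) * TILE);

    // ... independent computation could be placed here ...

    cg::wait(block);                 // copy into shared memory is complete
    int i = base + threadIdx.x;
    if (i < n) y[i] = a * xs[threadIdx.x] + y[i];
}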

E. Data Shuffle Between Threads

Supported on Kepler and later architectures: threads in a warp can exchange data directly through registers without going through shared memory, avoiding shared-memory bandwidth bottlenecks. (reduction algorithm)
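
A minimal sketch of a warp-level sum reduction using shuffle intrinsics (`__shfl_down_sync`, CUDA 9+); no shared memory is needed within the warp:

// Each warp reduces its 32 values; lane 0 ends up with the warp's sum.
__device__ float warp_reduce_sum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);  // read from lane (id + offset)
    return val;
}

__global__ void reduce_shfl(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;
    v = warp_reduce_sum(v);
    if ((threadIdx.x & (warpSize - 1)) == 0)   // one atomic per warp
        atomicAdd(out, v);
}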

F. Bank Conflicts Due to Strided Index

Shared memory is divided into multiple banks; when threads in a warp access different addresses that fall into the same bank, the accesses are serialized (a bank conflict). Use a continuous (sequential-addressing) reduction or adjust stride sizes. (reduction algorithm)
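
A hedged sketch of the two reduction indexings (block size illustrative): the strided-index version maps several active threads of a warp to the same bank, while sequential addressing keeps a warp's accesses in distinct banks.

#define BLOCK 256

// Strided index: thread tid accesses sdata[2*s*tid]; as the stride grows,
// active threads in a warp hit the same shared-memory bank (conflicts).
__global__ void reduce_strided(const float *in, float *out) {
    __shared__ float sdata[BLOCK];
    unsigned tid = threadIdx.x;
    sdata[tid] = in[blockIdx.x * BLOCK + tid];
    __syncthreads();
    for (unsigned s = 1; s < BLOCK; s *= 2) {
        unsigned idx = 2 * s * tid;
        if (idx + s < BLOCK)
            sdata[idx] += sdata[idx + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Sequential (continuous) addressing: active threads are contiguous and read
// contiguous shared-memory words, avoiding bank conflicts.
__global__ void reduce_sequential(const float *in, float *out) {
    __shared__ float sdata[BLOCK];
    unsigned tid = threadIdx.x;
    sdata[tid] = in[blockIdx.x * BLOCK + tid];
    __syncthreads();
    for (unsigned s = BLOCK / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}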

Related Work

A. Benchmark Suites for Evaluating GPUs

  1. Rodinia: targets heterogeneous computing with CUDA and OpenMP, covering data sharing between multi-core CPUs and GPUs.
  2. SPEC ACCEL: uses OpenCL, OpenACC, and OpenMP, measuring CPU and GPU performance along with memory and compiler performance.
  3. CUDAMicroBench: simpler kernels that demonstrate performance challenges and optimization techniques specific to CUDA.
