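CUDA Parallel Computing | The GPU Programming Model

These are notes from a Udacity course taught by NVIDIA engineers. The material is fairly basic and easy to follow.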


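Characteristics of CUDA programs

Compared with single-threaded serial computation on a CPU, running many threads at once in a CUDA program gives a large speedup.
This is the difference between optimizing for latency (how long one task takes) and optimizing for throughput (how many tasks finish per unit of time).

  1. After compilation, part of the program runs on the CPU and part on the GPU;
  2. The CPU is the host and the GPU is the device;
  3. Each has its own memory, and in this basic model neither can directly access the other's;
  4. The GPU does not initiate computation; it only responds to requests from the CPU.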

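Execution steps of a CUDA program

  1. The CPU allocates memory on the GPU
    cudaMalloc(address of the device pointer, size in bytes)
  2. The CPU copies the input data from host memory to device memory
    cudaMemcpy(destination, source, size in bytes, cudaMemcpyHostToDevice)
  3. The CPU launches a kernel on the GPU to do the computation
    kernel_name<<<blocks, threads>>>(kernel arguments)
  4. The CPU copies the results from device memory back to host memory
    cudaMemcpy(destination, source, size in bytes, cudaMemcpyDeviceToHost)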

Kernel functions

__global__ void square(float *d_in, float *d_out)
{
    int idx = threadIdx.x;      // index of this thread within its block
    float f = d_in[idx];
    d_out[idx] = f * f;
}

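Inside a kernel, the code that each thread runs reads just like serial code. The parallelism comes from launching many threads that all execute it.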
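The four execution steps listed earlier can be put together around this kernel. The following is a minimal sketch, not the course's code: the helper name square_on_gpu is made up for illustration, error checking is left out for brevity, and it assumes the input fits in a single block (at most 1024 threads on current GPUs).

```cuda
#include <cuda_runtime.h>
#include <vector>

// Same kernel as in the notes: thread i squares element i.
__global__ void square(float *d_in, float *d_out)
{
    int idx = threadIdx.x;
    float f = d_in[idx];
    d_out[idx] = f * f;
}

// Hypothetical host helper (not from the course): squares h_in on the GPU.
// It launches one block with one thread per element, so h_in.size()
// is assumed to be no larger than the per-block thread limit.
std::vector<float> square_on_gpu(const std::vector<float>& h_in)
{
    const int n = static_cast<int>(h_in.size());
    const size_t bytes = n * sizeof(float);
    std::vector<float> h_out(n);
    if (n == 0) return h_out;

    // 1. Allocate device memory for input and output.
    float *d_in = nullptr;
    float *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);

    // 2. Copy the input from host memory to device memory.
    //    Note the argument order: destination first, then source.
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    // 3. Launch the kernel: 1 block of n threads.
    square<<<1, n>>>(d_in, d_out);

    // 4. Copy the result back. This copy waits for the kernel to finish.
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Calling square_on_gpu on a vector holding 1, 2, 3 would be expected to return 1, 4, 9. In real code each CUDA call's return value should be checked.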


Blocks and threads

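When the CPU launches a kernel with kernel_name<<<blocks, threads>>>, the two launch parameters set the number of blocks and the number of threads per block. Each can be a 1D, 2D, or 3D structure.
The type is dim3(x, y, z); a one-dimensional size w is equivalent to dim3(w), which is in turn equivalent to dim3(w, 1, 1).
The total number of threads is the product of the two: number of blocks times threads per block.
Each thread needs to know its own index, which it works out from built-in variables:

  • threadIdx: the thread's index within its block
  • blockDim: the size of a block, in threads
  • blockIdx: the block's index within the grid

among others, such as gridDim (the size of the grid, in blocks).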
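As an illustration of how these variables combine, here is a hedged sketch of a one-dimensional kernel that spans several blocks. The names square_any_size, blocks_for and launch_square_any_size are invented for this example, and 256 threads per block is just a common choice, not something the course prescribes.

```cuda
#include <cuda_runtime.h>

// Each thread derives a global index from its block index, the block
// size, and its index inside the block. The bounds check is needed
// because the last block may extend past the end of the data.
__global__ void square_any_size(const float *d_in, float *d_out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        d_out[idx] = d_in[idx] * d_in[idx];
    }
}

// Hypothetical helper: number of blocks needed to cover n elements,
// i.e. n / threadsPerBlock rounded up.
inline int blocks_for(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical launcher; d_in and d_out must already be device pointers.
void launch_square_any_size(const float *d_in, float *d_out, int n)
{
    if (n <= 0) return;                       // a grid of 0 blocks is invalid
    const int threadsPerBlock = 256;
    const dim3 block(threadsPerBlock);        // same as dim3(256, 1, 1)
    const dim3 grid(blocks_for(n, threadsPerBlock));
    square_any_size<<<grid, block>>>(d_in, d_out, n);
}
```

With 1000 elements and 256 threads per block this launches 4 blocks, so 1024 threads in total, and the last 24 threads do nothing because of the bounds check.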


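Map

A map describes a relationship between data and an operation (in practice it is just a function applied to every element).
Map(Elements, Function)
Elements is the set of data to process, and Function is the operation applied to each element.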
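One possible way to write Map(Elements, Function) directly in CUDA is a templated kernel that takes the function as a parameter. This is only a sketch under assumptions: map_kernel, map_on_device and the Square functor are illustrative names, not part of CUDA or of the course material.

```cuda
#include <cuda_runtime.h>

// The "Function": a small functor that can be called on the device.
struct Square {
    __device__ float operator()(float x) const { return x * x; }
};

// The map itself: one thread per element, each applying f independently.
// No thread reads or writes any element other than its own.
template <typename F>
__global__ void map_kernel(const float *d_in, float *d_out, int n, F f)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        d_out[idx] = f(d_in[idx]);
    }
}

// Hypothetical launcher; d_in and d_out are device pointers of length n.
template <typename F>
void map_on_device(const float *d_in, float *d_out, int n, F f)
{
    if (n <= 0) return;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;   // round up
    map_kernel<<<blocks, threads>>>(d_in, d_out, n, f);
}
```

A call such as map_on_device(d_in, d_out, n, Square{}) then plays the role of Map(Elements, Function). Because every element is handled independently, a map fits the GPU's many-thread model well.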


Color to greyscale conversion

This is the homework for Lesson 1. It is not very hard.

// Homework 1
// Color to Greyscale Conversion
//A common way to represent color images is known as RGBA - the color
//is specified by how much Red, Green, and Blue is in it.
//The 'A' stands for Alpha and is used for transparency; it will be
//ignored in this homework.
//Each channel Red, Blue, Green, and Alpha is represented by one byte.
//Since we are using one byte for each color there are 256 different
//possible values for each color. This means we use 4 bytes per pixel.
//Greyscale images are represented by a single intensity value per pixel
//which is one byte in size.
//To convert an image from color to grayscale one simple method is to
//set the intensity to the average of the RGB channels. But we will
//use a more sophisticated method that takes into account how the eye
//perceives color and weights the channels unequally.
//The eye responds most strongly to green followed by red and then blue.
//The NTSC (National Television System Committee) recommends the following
//formula for color to greyscale conversion:
//I = .299f * R + .587f * G + .114f * B
//Notice the trailing f's on the numbers which indicate that they are
//single precision floating point constants and not double precision
//constants.
//You should fill in the kernel as well as set the block and grid sizes
//so that the entire image is processed.
#include "reference_calc.cpp"
#include "utils.h"
#include <stdio.h>
__global__
void rgba_to_greyscale(const uchar4* const rgbaImage,
                       unsigned char* const greyImage,
                       int numRows, int numCols)
{
    //TODO
    //Fill in the kernel to convert from color to greyscale
    //the mapping from components of a uchar4 to RGBA is:
    // .x -> R ; .y -> G ; .z -> B ; .w -> A
    //
    //The output (greyImage) at each pixel should be the result of
    //applying the formula: output = .299f * R + .587f * G + .114f * B;
    //Note: We will be ignoring the alpha channel for this conversion

    //First create a mapping from the 2D block and grid locations
    //to an absolute 2D location in the image, then use that to
    //calculate a 1D offset

    //Each block holds a single thread in this solution, so the block
    //index alone identifies the pixel: blockIdx.x is the row and
    //blockIdx.y is the column.
    int row = blockIdx.x;
    int col = blockIdx.y;
    int offset = row * numCols + col;   // row-major 1D offset

    uchar4 pixel_in = rgbaImage[offset];
    unsigned char R = pixel_in.x;
    unsigned char G = pixel_in.y;
    unsigned char B = pixel_in.z;
    //The float result is truncated when stored in an unsigned char
    unsigned char output = .299f * R + .587f * G + .114f * B;
    greyImage[offset] = output;
}

void your_rgba_to_greyscale(const uchar4 * const h_rgbaImage, uchar4 * const d_rgbaImage,
                            unsigned char* const d_greyImage, size_t numRows, size_t numCols)
{
    //You must fill in the correct sizes for the blockSize and gridSize
    //This solution launches one block per pixel with one thread per block.
    //It is simple and gives correct results, but it uses the hardware
    //poorly, and the grid's y dimension is limited to 65535 blocks.
    const dim3 blockSize(1, 1, 1);              //one thread per block
    const dim3 gridSize(numRows, numCols, 1);   //one block per pixel
    rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);
    cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());
}
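
A variant that is closer to what the TODO comments ask for gives each block a 2D tile of threads instead of a single thread. The following is a sketch under assumptions, not the course's reference solution: the names rgba_to_greyscale_tiled and launch_rgba_to_greyscale_tiled are made up here, and the 16 x 16 tile is just a common starting point.

```cuda
#include <cuda_runtime.h>

// One thread per pixel. Threads are grouped into 2D blocks, and x is
// mapped to columns so that neighbouring threads touch neighbouring pixels.
__global__ void rgba_to_greyscale_tiled(const uchar4* const rgbaImage,
                                        unsigned char* const greyImage,
                                        int numRows, int numCols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    // Edge blocks may stick out past the image, so guard the access.
    if (row >= numRows || col >= numCols) return;

    int offset = row * numCols + col;   // row-major 1D offset
    uchar4 p = rgbaImage[offset];
    greyImage[offset] =
        static_cast<unsigned char>(.299f * p.x + .587f * p.y + .114f * p.z);
}

// Hypothetical launcher; both pointers must refer to device memory.
void launch_rgba_to_greyscale_tiled(const uchar4* d_rgbaImage,
                                    unsigned char* d_greyImage,
                                    size_t numRows, size_t numCols)
{
    if (numRows == 0 || numCols == 0) return;

    const dim3 blockSize(16, 16, 1);    // 256 threads per block (assumed tile size)
    // Round up so that partial tiles at the right and bottom edges are covered.
    const dim3 gridSize(
        static_cast<unsigned int>((numCols + blockSize.x - 1) / blockSize.x),
        static_cast<unsigned int>((numRows + blockSize.y - 1) / blockSize.y),
        1);

    rgba_to_greyscale_tiled<<<gridSize, blockSize>>>(
        d_rgbaImage, d_greyImage,
        static_cast<int>(numRows), static_cast<int>(numCols));
}
```

Compared with one block per pixel, this launches far fewer blocks and lets each block keep many threads busy, which is usually the better fit for how the GPU schedules work.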

Intro to Parallel Programming