



**GPU** Teaching Kit

#### Module 4.3 - Memory Model and Locality

## Objective

- To understand the design of a tiled parallel algorithm for matrix multiplication
  - Loading a tile
  - Phased execution
  - Barrier Synchronization



# **Matrix Multiplication**



## **Tiled Matrix Multiplication**

- Break up the execution of each thread into phases
- so that the data accesses by the thread block in each phase are focused on one tile of M and one tile of N
- The tile is of BLOCK\_SIZE elements in each dimension

Row



Ν

### Loading a Tile

- All threads in a block participate
  - Each thread loads one M element and one N element in tiled code

## Phase 0 Load for Block (0,0)

| N <sub>0,0</sub> N <sub>0,1</sub> N <sub>0,2</sub> N <sub>0,3</sub><br>N <sub>1,0</sub> N <sub>1,1</sub> N <sub>1,2</sub> N <sub>1,3</sub> |            | N <sub>0,0</sub> N <sub>0,1</sub>                                              | Shared Memory                                                          |
|--------------------------------------------------------------------------------------------------------------------------------------------|------------|--------------------------------------------------------------------------------|------------------------------------------------------------------------|
| $\begin{array}{c c c c c c c c c c c c c c c c c c c $                                                                                     | ared Memor | у                                                                              |                                                                        |
|                                                                                                                                            |            | $\begin{array}{c c} P_{0,0} & P_{0,1} \\ \hline P_{1,0} & P_{1,1} \end{array}$ | $\frac{P_{0,2}}{P_{1,2}} \frac{P_{0,3}}{P_{1,3}}$                      |
| $\begin{array}{c c} M_{2,0} & M_{2,1} & M_{2,2} & M_{2,3} \\ \hline M_{3,0} & M_{3,1} & M_{3,2} & M_{3,3} \end{array}$                     |            |                                                                                | P <sub>2,2</sub> P <sub>2,3</sub><br>P <sub>3,2</sub> P <sub>3,3</sub> |

## Phase 0 Use for Block (0,0) (iteration 0)



## Phase 0 Use for Block (0,0) (iteration 1)

| $\frac{N_{0,0}}{N_{1,0}} \frac{N_{0,1}}{N_{1,1}}$ | $\frac{N_{0,2}}{N_{1,2}} \frac{N_{0,3}}{N_{1,3}}$                              |                                   | N <sub>0,0</sub><br>N <sub>1,0</sub> | N <sub>0,1</sub><br>N <sub>1,1</sub> | Sha              | ared I                               | Memory |
|---------------------------------------------------|--------------------------------------------------------------------------------|-----------------------------------|--------------------------------------|--------------------------------------|------------------|--------------------------------------|--------|
| $\frac{N_{2,0}}{N_{3,0}} \frac{N_{2,1}}{N_{3,1}}$ | $\frac{N_{2,2}}{N_{3,2}} \frac{N_{2,3}}{N_{3,3}}$                              | Shared Memor                      | 7                                    | Π                                    | -                |                                      |        |
|                                                   | $\begin{array}{c c} M_{0,2} & M_{0,3} \\ \hline M_{1,2} & M_{1,3} \end{array}$ | $M_{0,0} M_{0,1} M_{1,0} M_{1,1}$ | ن<br>ورف<br>ک<br>۱٫۵                 | <b>2</b> 01                          |                  | P <sub>0,3</sub><br>P <sub>1,3</sub> |        |
| $\frac{M_{2,0}}{M_{3,0}} \frac{M_{2,1}}{M_{3,1}}$ | M <sub>2,2</sub> M <sub>2,3</sub><br>M <sub>3,2</sub> M <sub>3,3</sub>         |                                   | P <sub>2,0</sub><br>P <sub>3,0</sub> |                                      | P <sub>2,2</sub> | P <sub>2,3</sub>                     |        |

## Phase 1 Load for Block (0,0)



Phase 1 Use for Block (0,0) (iteration 0)

| $\begin{array}{c c c c c c c c c c c c c c c c c c c $ | $\frac{N_{2 0} N_{2,1}}{N_{3 0} N_{3,1}}$ Shared Memory<br>Shared Memory |
|--------------------------------------------------------|--------------------------------------------------------------------------|
| $\begin{array}{c c c c c c c c c c c c c c c c c c c $ | $\begin{array}{c ccccccccccccccccccccccccccccccccccc$                    |

# Phase 1 Use for Block (0,0) (iteration 1)



💿 INVIDIA 🛛 🚺 ILLINO:

## **Execution Phases of Toy Example**

|                       | Phase 0                                                                             |                                                                                     | Phase 1                                                                                          |                                                            |                                                                                     |                                                                                                    |
|-----------------------|-------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------|------------------------------------------------------------|-------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------|
| thread <sub>0,0</sub> | $egin{array}{c} \mathbf{M}_{0,0} \ \downarrow \ \mathbf{Mds}_{0,0} \end{array}$     | $egin{array}{c} \mathbf{N}_{0,0} \ \downarrow \ \mathbf{Nds}_{0,0} \end{array}$     | $\begin{array}{l} PValue_{0,0} += \\ Mds_{0,0}*Nds_{0,0} + \\ Mds_{0,1}*Nds_{1,0} \end{array}$   | M <sub>0,2</sub><br>↓<br>Mds <sub>0,0</sub>                | $\begin{matrix} \mathbf{N_{2,0}} \\ \downarrow \\ \mathbf{Nds}_{0,0} \end{matrix}$  | $\begin{array}{l} PValue_{0,0} += \\ Mds_{0,0}*Nds_{0,0} + \\ Mds_{0,1}*Nds_{1,0} \end{array}$     |
| thread <sub>0,1</sub> | $egin{array}{c} \mathbf{M}_{0,1} \ \downarrow \ \mathbf{Mds}_{0,1} \end{array}$     | $egin{array}{c} \mathbf{N_{0,1}} \ \downarrow \ \mathbf{Nds_{1,0}} \end{array}$     | $\begin{array}{l} PValue_{0,1} += \\ Mds_{0,0}*Nds_{0,1} + \\ Mds_{0,1}*Nds_{1,1} \end{array}$   | M <sub>0,3</sub><br>↓<br>Mds <sub>0,1</sub>                | $\mathbf{N}_{2,1}$<br>$\downarrow$<br>$\mathrm{Nds}_{0,1}$                          | $\begin{array}{l} PValue_{0,1} += \\ Mds_{0,0}*Nds_{0,1} + \\ Mds_{0,1}*Nds_{1,1} \end{array}$     |
| thread <sub>1,0</sub> | $\begin{array}{c} \mathbf{M_{1,0}} \\ \downarrow \\ \mathbf{Mds_{1,0}} \end{array}$ | $\begin{array}{c} \mathbf{N_{1,0}} \\ \downarrow \\ \mathbf{Nds}_{1,0} \end{array}$ | $\begin{array}{l} PValue_{1,0} += \\ Mds_{1,0}*Nds_{0,0} + \\ Mds_{1,1}*Nds_{1,0} \end{array}$   | $\mathbf{M_{1,2}} \\\downarrow \\ \mathbf{Mds}_{1,0}$      | $\begin{array}{c} \mathbf{N_{3,0}} \\ \downarrow \\ \mathbf{Nds}_{1,0} \end{array}$ | $\begin{array}{l} PValue_{1,0} += \\ Mds_{1,0} * Nds_{0,0} + \\ Mds_{1,1} * Nds_{1,0} \end{array}$ |
| thread <sub>1,1</sub> | $\mathbf{M}_{1,1} \\ \downarrow \\ Mds_{1,1}$                                       | $\begin{array}{c} \mathbf{N_{1,1}} \\ \downarrow \\ \mathbf{Nds}_{1,1} \end{array}$ | $\begin{array}{l} PValue_{1,1} += \\ Mds_{1,0}^*Nds_{0,1} + \\ Mds_{1,1}^*Nds_{1,1} \end{array}$ | $\mathbf{M}_{1,3}$<br>$\downarrow$<br>$\mathrm{Mds}_{1,1}$ | $\begin{array}{c} \mathbf{N_{3,1}} \\ \downarrow \\ \mathrm{Nds}_{1,1} \end{array}$ | $\begin{array}{l} PValue_{1,1} += \\ Mds_{1,0}^*Nds_{0,1} + \\ Mds_{1,1}^*Nds_{1,1} \end{array}$   |

time

## Execution Phases of Toy Example (cont.)

|                       | Phase 0                                                                             |                                                                                     | Phase 1                                                                                            |                                                            |                                                                                    |                                                                                                    |
|-----------------------|-------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------|------------------------------------------------------------|------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------|
| thread <sub>0,0</sub> | $\begin{array}{c} \mathbf{M_{0,0}} \\ \downarrow \\ \mathbf{Mds_{0,0}} \end{array}$ | $egin{array}{c} \mathbf{N}_{0,0} \ \downarrow \ \mathrm{Nds}_{0,0} \end{array}$     | $\begin{array}{l} PValue_{0,0} += \\ Mds_{0,0} *Nds_{0,0} + \\ Mds_{0,1} *Nds_{1,0} \end{array}$   | $\mathbf{M}_{0,2}$<br>$\downarrow$<br>$\mathrm{Mds}_{0,0}$ | $\begin{matrix} \mathbf{N_{2,0}} \\ \downarrow \\ \mathbf{Nds}_{0,0} \end{matrix}$ | $\begin{array}{l} PValue_{0,0} += \\ Mds_{0,0}^*Nds_{0,0} + \\ Mds_{0,1}^*Nds_{1,0} \end{array}$   |
| thread <sub>0,1</sub> | $\begin{matrix} \mathbf{M_{0,1}} \\ \downarrow \\ \mathbf{Mds}_{0,1} \end{matrix}$  | $\begin{array}{c} \mathbf{N_{0,1}} \\ \downarrow \\ \mathbf{Nds_{1,0}} \end{array}$ | $\begin{array}{l} PValue_{0,1} += \\ Mds_{0,0} * Nds_{0,1} + \\ Mds_{0,1} * Nds_{1,1} \end{array}$ | $\mathbf{M}_{0,3}$<br>$\downarrow$<br>$\mathrm{Mds}_{0,1}$ | $\begin{matrix} \mathbf{N_{2,1}} \\ \downarrow \\ \mathbf{Nds}_{0,1} \end{matrix}$ | $\begin{array}{l} PValue_{0,1} += \\ Mds_{0,0}*Nds_{0,1} + \\ Mds_{0,1}*Nds_{1,1} \end{array}$     |
| thread <sub>1,0</sub> | $\begin{array}{c} \mathbf{M_{1,0}} \\ \downarrow \\ \mathbf{Mds_{1,0}} \end{array}$ | $\begin{array}{c} \mathbf{N_{1,0}} \\ \downarrow \\ Nds_{1,0} \end{array}$          | $\begin{array}{l} PValue_{1,0} += \\ Mds_{1,0}*Nds_{0,0} + \\ Mds_{1,1}*Nds_{1,0} \end{array}$     | $\mathbf{M}_{1,2}$<br>$\downarrow$<br>$\mathrm{Mds}_{1,0}$ | $\begin{matrix} \mathbf{N_{3,0}} \\ \downarrow \\ \mathbf{Nds}_{1,0} \end{matrix}$ | $\begin{array}{l} PValue_{1,0} += \\ Mds_{1,0} * Nds_{0,0} + \\ Mds_{1,1} * Nds_{1,0} \end{array}$ |
| thread <sub>1,1</sub> | $\begin{array}{c} \mathbf{M_{1,1}} \\ \downarrow \\ \mathbf{Mds_{1,1}} \end{array}$ | $\begin{array}{c} \mathbf{N_{1,1}} \\ \downarrow \\ \mathbf{Nds}_{1,1} \end{array}$ | $\begin{array}{l} PValue_{1,1} += \\ Mds_{1,0}*Nds_{0,1} + \\ Mds_{1,1}*Nds_{1,1} \end{array}$     | $\mathbf{M}_{1,3} \\\downarrow \\ \mathbf{Mds}_{1,1}$      | $\begin{matrix} \mathbf{N_{3,1}} \\ \downarrow \\ \mathbf{Nds}_{1,1} \end{matrix}$ | $\begin{array}{l} PValue_{1,1} += \\ Mds_{1,0} * Nds_{0,1} + \\ Mds_{1,1} * Nds_{1,1} \end{array}$ |

time

Shared memory allows each value to be accessed by multiple threads

## **Barrier Synchronization**

- Synchronize all threads in a block
   \_\_\_syncthreads()
- All threads in the same block must reach the \_\_\_\_syncthreads() before any of the them can move on
- Best used to coordinate the phased execution tiled algorithms
  - To ensure that all elements of a tile are loaded at the beginning of a phase
  - To ensure that all elements of a tile are consumed at the end of a phase







#### Module 4.4 - Memory and Data Locality

**Tiled Matrix Multiplication Kerne** 

## Objective

- To learn to write a tiled matrix-multiplication kernel
  - Loading and using tiles for matrix multiplication
  - Barrier synchronization, shared memory
  - Resource Considerations
  - Assume that Width is a multiple of tile size for simplicity

### Loading Input Tile 0 of M (Phase 0)



### Loading Input Tile 0 of N (Phase 0)



### Loading Input Tile 1 of M (Phase 1)



### Loading Input Tile 1 of N (Phase 1)



#### M and N are dynamically allocated - use 1D indexing



M[Row][p\*TILE\_WIDTH+tx] M[Row\*Width + p\*TILE\_WIDTH + tx]



N[p\*TILE\_WIDTH+ty][Col] N[(p\*TILE\_WIDTH+ty)\*Width + Col]

where p is the sequence number of the current phase



### **Tiled Matrix Multiplication Kernel**

\_global\_\_ void MatrixMulKernel(float\* M, float\* N, float\* P, Int Width)

```
__shared__ float ds_M[TILE_WIDTH][TILE_WIDTH];
 ___shared__ float ds_N[TILE_WIDTH][TILE WIDTH];
 int bx = blockIdx.x; int by = blockIdx.y;
 int tx = threadIdx.x; int ty = threadIdx.y;
 int Row = by * blockDim.y + ty;
 int Col = bx * blockDim.x + tx;
 float Pvalue = 0;
// Loop over the M and N tiles required to compute the P element
for (int p = 0; p < n/TILE WIDTH; ++p) {
   // Collaborative loading of M and N tiles into shared memory
   ds M[ty][tx] = M[Row*Width + p*TILE_WIDTH+tx];
   ds_N[ty][tx] = N[(t*TILE_WIDTH+ty)*Width + Col];
   syncthreads();
   for (int i = 0; i < TILE WIDTH; ++i)Pvalue += ds M[ty][i] * ds N[i][tx];
   ____synchthreads();
 }
 P[Row*Width+Col] = Pvalue;
}
```

### **Tiled Matrix Multiplication Kernel**

```
__global__ void MatrixMulKernel(float* M, float* N, float* P, Int Width)
 ___shared__ float ds_M[TILE_WIDTH][TILE WIDTH];
 ___shared__ float ds_N[TILE_WIDTH][TILE WIDTH];
 int bx = blockIdx.x; int by = blockIdx.y;
 int tx = threadIdx.x; int ty = threadIdx.y;
 int Row = by * blockDim.y + ty;
 int Col = bx * blockDim.x + tx;
 float Pvalue = 0;
 // Loop over the M and N tiles required to compute the P element
for (int p = 0; p < n/TILE WIDTH; ++p) {
   // Collaborative loading of M and N tiles into shared memory
   ds M[ty][tx] = M[Row*Width + p*TILE WIDTH+tx];
   ds_N[ty][tx] = N[(t*TILE_WIDTH+ty)*Width + Col];
    syncthreads();
   for (int i = 0; i < TILE WIDTH; ++i)Pvalue += ds M[ty][i] * ds N[i][tx];
   ____synchthreads();
 P[Row*Width+Col] = Pvalue;
```

23

P[Row\*Width+Col] = Pvalue;

### **Tiled Matrix Multiplication Kernel**

```
_global__ void MatrixMulKernel(float* M, float* N, float* P, Int Width)
{
  __shared__ float ds_M[TILE_WIDTH][TILE_WIDTH];
  ___shared__ float ds_N[TILE_WIDTH][TILE WIDTH];
  int bx = blockIdx.x; int by = blockIdx.y;
  int tx = threadIdx.x; int ty = threadIdx.y;
  int Row = by * blockDim.y + ty;
  int Col = bx * blockDim.x + tx;
  float Pvalue = 0;
 // Loop over the M and N tiles required to compute the P element
 for (int p = 0; p < n/TILE WIDTH; ++p) {
    // Collaborative loading of M and N tiles into shared memory
    ds M[ty][tx] = M[Row*Width + p*TILE WIDTH+tx];
    ds_N[ty][tx] = N[(t*TILE_WIDTH+ty)*Width + Col];
     syncthreads();
    for (int i = 0; i < TILE WIDTH; ++i)Pvalue += ds_M[ty][i] * ds_N[i][tx];</pre>
      synchthreads();
```

## Tile (Thread Block) Size Considerations

- Each thread block should have many threads
  - TILE\_WIDTH of 16 gives 16\*16 = 256 threads
  - TILE\_WIDTH of 32 gives 32\*32 = 1024 threads
- For 16, in each phase, each block performs 2\*256 = 512 float loads from global memory for 256 \* (2\*16) = 8,192 mul/add operations. (16 floating-point operations for each memory load)
- For 32, in each phase, each block performs 2\*1024 = 2048 float loads from global memory for 1024 \* (2\*32) = 65,536 mul/add operations. (32 floating-point operation for each memory load)

💿 nvidia 💦 🚺 illinois

# **Shared Memory and Threading**

- For an SM with 16KB shared memory
  - Shared memory size is implementation dependent!
  - For TILE\_WIDTH = 16, each thread block uses 2\*256\*4B = 2KB of shared memory.
  - For 16KB shared memory, one can potentially have up to 8 thread blocks executing
    - This allows up to 8\*512 = 4,096 pending loads. (2 per thread, 256 threads per block)
  - The next TILE\_WIDTH 32 would lead to 2\*32\*32\*4 Byte= 8K Byte shared memory usage per thread block, allowing 2 thread blocks active at the same time
    - However, in a GPU where the thread count is limited to 1536 threads per SM, the number of blocks per SM is reduced to one!
- Each \_\_\_syncthread() can reduce the number of active threads for a block
  - More thread blocks can be advantageous





#### GPU Teaching Kit

Accelerated Computing

#### Module 4.5 - Memory and Data Locality

Handling Arbitrary Matrix Sizes in Tiled Algorithms

## Objective

- To learn to handle arbitrary matrix sizes in tiled matrix multiplication
  - Boundary condition checking
  - Regularizing tile contents
  - Rectangular matrices

## Handling Matrix of Arbitrary Size

- The tiled matrix multiplication kernel we presented so far can handle only square matrices whose dimensions (Width) are multiples of the tile width (TILE\_WIDTH)
  - However, real applications need to handle arbitrary sized matrices.
  - One could pad (add elements to) the rows and columns into multiples of the tile size, but would have significant space and data transfer time overhead.
- We will take a different approach.

Phase 1 Loads for Block (0,0) for a 3x3 Example



Threads (0,1) and (1,1) need special treatment in loading M tile

#### Phase 1 Use for Block (0,0) (iteration 0)



#### Phase 1 Use for Block (0,0) (iteration 1)

| N <sub>0,0</sub> | $N_{0,1}$        | N <sub>0,2</sub> |  |
|------------------|------------------|------------------|--|
| $N_{1,0}$        |                  |                  |  |
| N <sub>2,0</sub> | N <sub>2,1</sub> | N <sub>2,2</sub> |  |
|                  |                  |                  |  |

| $M_{0,0}$ | $M_{0,1}$ | M <sub>0,2</sub> |  |
|-----------|-----------|------------------|--|
| $M_{1,0}$ | $M_{1,1}$ | $M_{1,2}$        |  |
| $M_{2,0}$ | $M_{2,1}$ | M <sub>2,2</sub> |  |
|           |           |                  |  |



All Threads need special treatment. None of them should introduce invalidate contributions to their P elements.

#### Phase 0 Loads for Block (1,1) for a 3x3 Example

Threads (0,1) and (1,1) need special treatment in loading N tile



Threads (1,0) and (1,1) need special treatment in loading M tile

## Major Cases in Toy Example

- Threads that do not calculate valid P elements but still need to participate in loading the input tiles
  - Phase 0 of Block(1,1), Thread(1,0), assigned to calculate non-existent P[3,2] but need to participate in loading tile element N[1,2]
- Threads that calculate valid P elements may attempt to load nonexisting input elements when loading input tiles
  - Phase 0 of Block(0,0), Thread(1,0), assigned to calculate valid P[1,0] but attempts to load non-existing N[3,0]

## A "Simple" Solution

- When a thread is to load any input element, test if it is in the valid index range
  - If valid, proceed to load
  - Else, do not load, just write a 0
- Rationale: a 0 value will ensure that that the multiply-add step does not affect the final value of the output element
- The condition tested for loading input elements is different from the test for calculating output P element
  - A thread that does not calculate valid P element can still participate in loading input tile elements

#### Phase 1 Use for Block (0,0) (iteration 1)

| $\begin{array}{c c c c c c c c c c c c c c c c c c c $                                                                                             | N <sub>2,0</sub> N <sub>2,1</sub> Shared Memory<br>Shared Memory |
|----------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------|
| $\begin{array}{c c} M_{0,0} & M_{0,1} & M_{0,2} \\ \hline M_{1,0} & M_{1,1} & M_{1,2} \\ \hline M_{2,0} & M_{2,1} & M_{2,2} \\ \hline \end{array}$ | $\begin{array}{c c c c c c c c c c c c c c c c c c c $           |

# **Boundary Condition for Input M Tile**

- Each thread loads
  - M[Row][p\*TILE WIDTH+tx]
  - M[Row\*Width + p\*TILE\_WIDTH+tx]
- Need to test
  - (Row < Width) && (p\*TILE\_WIDTH+tx < Width)</p>
  - If true, load M element
  - Else, load 0



# **Boundary Condition for Input N Tile**

- Each thread loads
  - N[p\*TILE\_WIDTH+ty][Col]
  - N[(p\*TILE\_WIDTH+ty)\*Width+ Col]
- Need to test
  - (p\*TILE\_WIDTH+ty < Width) && (Col< Width)</p>
  - If true, load N element
  - Else, load 0



## Loading Elements – with boundary check

```
8 for (int p = 0; p < (Width-1) / TILE_WIDTH + 1; ++p) {
_
           if(Row < Width && t * TILE_WIDTH+tx < Width) {
_
   ++
   9
               ds_M[ty][tx] = M[Row * Width + p * TILE_WIDTH + tx];
          } else {
   ^{++}
               ds_M[ty][tx] = 0.0;
   ++
           }
   ++
          if (p*TILE_WIDTH+ty < Width && Col < Width) {
   ^{++}
   10
               ds_N[ty][tx] = N[(p*TILE_WIDTH + ty) * Width + Col];
   ^{++}
          } else {
               ds_N[ty][tx] = 0.0;
   ++
_
   ^{++}
          }
   11
          ____syncthreads();
```



## Inner Product – Before and After

- ++ if(Row < Width && Col < Width) {
- for (int i = 0; i < TILE\_WIDTH; ++i) { 12
- 13 Pvalue += ds\_M[ty][i] \* ds\_N[i][tx]; \_
- } \_
- \_\_\_\_syncthreads(); 14 \_
- 15 } /\* end of outer for loop \*/
- ++ if (Row < Width && Col < Width)
- P[Row\*Width + Col] = Pvalue; - 16
- } /\* end of kernel \*/

## **Some Important Points**

- For each thread the conditions are different for
  - Loading M element
  - Loading N element
  - Calculating and storing output elements
- The effect of control divergence should be small for large matrices

## Handling General Rectangular Matrices

In general, the matrix multiplication is defined in terms of rectangular matrices

- A j x k M matrix multiplied with a k x I N matrix results in a j x I P matrix

- We have presented square matrix multiplication, a special case
- The kernel function needs to be generalized to handle general rectangular matrices
  - The Width argument is replaced by three arguments: j, k, l
  - When Width is used to refer to the height of M or height of P, replace it with j
  - When Width is used to refer to the width of M or height of N, replace it with k
  - When Width is used to refer to the width of N or width of P, replace it with I



#### GPU Teaching Kit

Accelerated Computing



The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.