

|   | Chapter 4 – Global and Detailed Placement                                                                                                                                                 |
|---|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|   |                                                                                                                                                                                           |
| 2 | 4.1 Introduction                                                                                                                                                                          |
| 2 | 4.2 Optimization Objectives                                                                                                                                                               |
| 2 | <ul> <li>4.3 Global Placement</li> <li>4.3.1 Min-Cut Placement</li> <li>4.3.2 Analytic Placement</li> <li>4.3.3 Simulated Annealing</li> <li>4.3.4 Modern Placement Algorithms</li> </ul> |
| 2 | 4.4 Legalization and Detailed Placement                                                                                                                                                   |

#### 4.1 Introduction





VLSI Physical Design: From Graph Partitioning to Timing Closure










#### 4.2 Optimization Objectives – Total Wirelength

#### Wirelength estimation for a given placement (cont'd.)




Preferred method: Half-perimeter wirelength (HPWL)

- Fast (an order of magnitude faster than computing the rectilinear Steiner minimum tree, RSMT)
- Equal to length of RSMT for 2- and 3-pin nets
- Margin of error for real circuits approx. 8% [Chu, ICCAD 04]
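As a minimal sketch of the HPWL estimate, the half-perimeter of a net's bounding box can be computed directly from its pin coordinates. The function name `hpwl` and the pin coordinates below are illustrative assumptions, not taken from the slides.

```python
def hpwl(pins):
    """Half-perimeter wirelength of one net, given its pin coordinates (x, y)."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    # Bounding-box width plus bounding-box height.
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


# Hypothetical three-pin net; the coordinates are made up for illustration.
net = [(1, 1), (4, 2), (2, 5)]
print(hpwl(net))  # width 3 + height 4 = 7
```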



#### 4.2 Optimization Objectives – Total Wirelength

Total wirelength with net weights (weighted wirelength)

• For a placement P, an estimate of total weighted wirelength is

$$L(P) = \sum_{net \in P} w(net) \cdot L(net)$$

where w(net) is the weight of net, and L(net) is the estimated wirelength of net.

• Example:

| Nets                     | Weights      |
|--------------------------|--------------|
| $N_1 = (a_1, b_1, d_2)$  | $w(N_1) = 2$ |
| $N_2 = (c_1, d_1, f_1)$  | $w(N_2) = 4$ |
| $N_3 = (e_1, f_2)$       | $w(N_3) = 1$ |



$$L(P) = \sum_{net \in P} w(net) \cdot L(net) = 2 \cdot 7 + 4 \cdot 4 + 1 \cdot 3 = 33$$
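The weighted sum can be checked with a few lines of code. This sketch assumes the per-net lengths 7, 4 and 3 used in the example; the function name `weighted_wirelength` and the dictionary layout are hypothetical.

```python
def weighted_wirelength(nets):
    """Sum of w(net) * L(net); nets maps a net name to (weight, length)."""
    return sum(weight * length for weight, length in nets.values())


# Weights and estimated lengths as used in the example above.
nets = {"N1": (2, 7), "N2": (4, 4), "N3": (1, 3)}
print(weighted_wirelength(nets))  # 2*7 + 4*4 + 1*3 = 33
```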


#### 4.2 Optimization Objectives – Number of Cut Nets

#### Cut sizes of a placement

• To improve total wirelength of a placement *P*, separately calculate the number of crossings of global vertical and horizontal cutlines, and minimize

$$L(P) = \sum_{v \in V_P} \Psi_P(v) + \sum_{h \in H_P} \Psi_P(h)$$

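where $\Psi_P(cut)$ denotes the set of nets cut by a cutline *cut* (each term of the sums counts the nets in that set), and $V_P$ and $H_P$ are the sets of global vertical and horizontal cutlines of *P*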
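The cut-based objective can be sketched as follows. This is an illustration under stated assumptions: a net counts as cut when it has cells strictly on both sides of a cutline, and the names `nets_cut` and `cut_cost` as well as the 2 x 2 placement are hypothetical.

```python
def nets_cut(nets, pos, axis, cut):
    """Nets with cells on both sides of a cutline.

    axis = 0 means a vertical cutline x = cut, axis = 1 a horizontal one y = cut.
    """
    cut_nets = []
    for net in nets:
        coords = [pos[cell][axis] for cell in net]
        if min(coords) < cut < max(coords):
            cut_nets.append(net)
    return cut_nets


def cut_cost(nets, pos, vertical_cuts, horizontal_cuts):
    """Total number of net crossings over all vertical and horizontal cutlines."""
    total = sum(len(nets_cut(nets, pos, 0, v)) for v in vertical_cuts)
    total += sum(len(nets_cut(nets, pos, 1, h)) for h in horizontal_cuts)
    return total


# Hypothetical 2 x 2 placement with one vertical and one horizontal cutline.
pos = {"a": (0, 0), "b": (1, 0), "c": (0, 1), "d": (1, 1)}
nets = [("a", "b"), ("a", "c", "d"), ("c", "d")]
print(cut_cost(nets, pos, [0.5], [0.5]))  # 3 vertical crossings + 1 horizontal = 4
```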




#### 4.2 Optimization Objectives – Wire Congestion

#### Routing congestion of a placement

- Ratio of demand for routing tracks to the supply of available routing tracks
- Estimated by the number of nets that pass through the boundaries of individual routing regions



- Formally, the local wire density $\varphi_P(e)$ of an edge *e* between two neighboring grid cells is

$$\varphi_P(e) = \frac{\eta_P(e)}{\sigma_P(e)}$$

where $\eta_P(e)$ is the estimated number of nets that cross *e* and $\sigma_P(e)$ is the maximum number of nets that can cross *e*

- If $\varphi_P(e) > 1$, then too many nets are estimated to cross *e*, making *P* more likely to be unroutable
- The wire density of *P* is

$$\Phi(P) = \max_{e \in E} (\varphi_P(e))$$

where *E* is the set of all edges

- If $\Phi(P) \le 1$, then the design is estimated to be fully routable; otherwise, routing will need to detour some nets through less-congested edges
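A small sketch of the congestion estimate, assuming the per-edge demand and capacity are already known. The edge names and numbers are made up, and `wire_density` is a hypothetical helper.

```python
def wire_density(demand, capacity):
    """Return the per-edge densities phi = eta / sigma and the overall maximum Phi."""
    phi = {edge: demand[edge] / capacity[edge] for edge in demand}
    return phi, max(phi.values())


# Hypothetical routing-grid edges.
demand = {"e1": 2, "e2": 3, "e3": 1}    # eta: estimated nets crossing each edge
capacity = {"e1": 3, "e2": 2, "e3": 3}  # sigma: maximum nets that can cross
phi, Phi = wire_density(demand, capacity)
overfull = [edge for edge, density in phi.items() if density > 1]
print(Phi)       # 1.5, so the placement is estimated to need detours
print(overfull)  # ['e2']
```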








#### 4.3 Global Placement

- Partitioning-based algorithms:
  - The netlist and the layout are divided into smaller sub-netlists and sub-regions, respectively
  - Process is repeated until each sub-netlist and sub-region is small enough to be handled optimally
  - Detailed placement often performed by optimal solvers, facilitating a natural transition from global placement to detailed placement
  - Example: min-cut placement
- Analytic techniques:
  - Model the placement problem using an objective (cost) function, which can be optimized via numerical analysis
  - Examples: quadratic placement and force-directed placement
- Stochastic algorithms:
  - Randomized moves that allow hill-climbing are used to optimize the cost function
  - Example: simulated annealing


#### 4.3.1 Min-Cut Placement

- Uses partitioning algorithms to divide (1) the netlist and (2) the layout region into smaller sub-netlists and sub-regions
- Conceptually, each sub-region is assigned a portion of the original netlist
- Each cut heuristically minimizes the number of cut nets using, for example,
  - Kernighan-Lin (KL) algorithm
  - Fiduccia-Mattheyses (FM) algorithm




#### 4.3.1 Min-Cut Placement

**Input:** netlist *Netlist*, layout area *LA*, minimum number of cells per region *cells\_min*

**Output:** placement *P*

    P = Ø
    regions = ASSIGN(Netlist, LA)               // assign netlist to layout area
    while (regions != Ø)                        // while regions still not placed
      region = FIRST_ELEMENT(regions)           // first element in regions
      REMOVE(regions, region)                   // remove first element of regions
      if (region contains more than cells_min cells)
        (sr1, sr2) = BISECT(region)             // divide region into two subregions
                                                // sr1 and sr2, obtaining the sub-
                                                // netlists and sub-areas
        ADD_TO_END(regions, sr1)                // add sr1 to the end of regions
        ADD_TO_END(regions, sr2)                // add sr2 to the end of regions
      else
        PLACE(region)                           // place region
        ADD(P, region)                          // add region to P
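The control flow of the algorithm can be sketched in Python. Note the assumptions: `bisect` here is a hypothetical placeholder that simply halves the cell list and the area with alternating cut directions, whereas a real min-cut placer would choose the split with KL or FM to minimize cut nets. Areas are `(x0, y0, x1, y1)` tuples, and "placing" a region means putting its cells at the region center.

```python
from collections import deque


def bisect(cells, area, vertical):
    """Placeholder split: halves the cell list and the area (no cut minimization)."""
    x0, y0, x1, y1 = area
    mid = len(cells) // 2
    if vertical:
        xm = (x0 + x1) / 2
        return (cells[:mid], (x0, y0, xm, y1)), (cells[mid:], (xm, y0, x1, y1))
    ym = (y0 + y1) / 2
    return (cells[:mid], (x0, y0, x1, ym)), (cells[mid:], (x0, ym, x1, y1))


def min_cut_placement(cells, layout_area, cells_min=1):
    placement = {}
    # Each queue entry: (cells of the region, its area, next cut direction).
    regions = deque([(list(cells), layout_area, True)])
    while regions:
        region_cells, area, vertical = regions.popleft()
        if len(region_cells) > cells_min:
            sr1, sr2 = bisect(region_cells, area, vertical)
            regions.append((*sr1, not vertical))
            regions.append((*sr2, not vertical))
        else:
            # Stand-in for PLACE(region): put every cell at the region center.
            x0, y0, x1, y1 = area
            for cell in region_cells:
                placement[cell] = ((x0 + x1) / 2, (y0 + y1) / 2)
    return placement


print(min_cut_placement(["a", "b", "c", "d"], (0, 0, 4, 4)))
```

With four cells on a 4 x 4 area, each cell ends up at the center of one quadrant.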








#### 4.3.1 Min-Cut Placement – Terminal Propagation



#### Terminal Propagation

- External connections are represented by artificial connection points on the cutline
- Dummy nodes in hypergraphs




#### 4.3.1 Min-Cut Placement

- Advantages:
  - Reasonably fast
  - Objective function can be adjusted, e.g., to perform timing-driven placement
  - Hierarchical strategy applicable to large circuits
- Disadvantages:
  - Randomized, chaotic algorithms: small changes in input lead to large changes in output
  - Optimizing one cutline at a time may result in routing congestion elsewhere

#### 4.3.2 Analytic Placement – Quadratic Placement

- Objective function is quadratic: the sum of (weighted) squared Euclidean distances represents the placement objective function

$$L(P) = \frac{1}{2} \sum_{i,j=1}^{n} c_{ij} \left( (x_i - x_j)^2 + (y_i - y_j)^2 \right)$$

where *n* is the total number of cells, and $c_{ij}$ (also written c(i,j)) is the connection cost between cells *i* and *j*.

- Only two-point connections are modeled
- Minimize the objective function by setting its derivative to zero, which reduces to solving a system of linear equations


#### 4.3.2 Analytic Placement – Quadratic Placement

- Similar to the least-mean-square method (root mean square)
- Build an error function with analytic form: $E(a,b) = \sum (a \cdot x_i + b - y_i)^2$

[Figure: scatter plot with fitted line y = 1.0001x - 0.3943, r<sup>2</sup> = 0.69\*\*, RMSE = 375.5; both axes range from 3000 to 7000]


• Each dimension can be considered independently:

$$L_x(P) = \sum_{i=1, j=1}^n c(i, j)(x_i - x_j)^2 \qquad L_y(P) = \sum_{i=1, j=1}^n c(i, j)(y_i - y_j)^2$$

- Convex quadratic optimization problem: any local minimum solution is also a global minimum
- Optimal x- and y-coordinates can be found by setting the partial derivatives of L<sub>x</sub>(P) and L<sub>y</sub>(P) to zero


#### 4.3.2 Analytic Placement – Quadratic Placement


$$L_x(P) = \sum_{i=1, j=1}^n c(i, j)(x_i - x_j)^2 \qquad L_y(P) = \sum_{i=1, j=1}^n c(i, j)(y_i - y_j)^2$$

$$\frac{\partial L_x(P)}{\partial X} = AX - b_x = 0 \qquad \frac{\partial L_y(P)}{\partial Y} = AY - b_y = 0$$

where *A* is a matrix with A[i][j] = -c(i,j) when $i \neq j$, and A[i][i] = the sum of incident connection weights of cell *i*. *X* is a vector of all the *x*-coordinates of the non-fixed cells, and $b_x$ is a vector with $b_x[i]$ = the sum of the *x*-coordinates of all fixed cells attached to *i*, each multiplied by its connection weight.

*Y* is a vector of all the *y*-coordinates of the non-fixed cells, and $b_y$ is a vector with $b_y[i]$ = the sum of the *y*-coordinates of all fixed cells attached to *i*, each multiplied by its connection weight.




- This is a system of linear equations, for which iterative numerical methods can be used to find a solution
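As one possible illustration of an iterative method, the sketch below solves $AX = b_x$ with Gauss-Seidel iteration. The choice of Gauss-Seidel, the function name, and the tiny one-dimensional instance (two movable cells in a chain between fixed pads at x = 0 and x = 3, all weights 1) are assumptions made for this example; production placers typically use more sophisticated solvers.

```python
def gauss_seidel(A, b, iterations=100):
    """Iteratively solve A x = b, starting from x = 0."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        for i in range(n):
            # Use the most recent values of all other unknowns.
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - s) / A[i][i]
    return x


# Hypothetical chain: pad (x=0) - cell 1 - cell 2 - pad (x=3), all weights 1.
A = [[2.0, -1.0],    # A[i][i] = sum of incident weights, A[i][j] = -c(i,j)
     [-1.0, 2.0]]
b_x = [0.0, 3.0]     # weighted x-coordinates of the fixed pads attached to each cell
x = gauss_seidel(A, b_x)
print([round(v, 3) for v in x])  # [1.0, 2.0]: cells spread evenly between the pads
```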




- Second stage of quadratic placers: cells are spread out to remove overlaps
- Methods:
  - Adding fake nets that pull cells away from dense regions toward anchors
  - Geometric sorting and scaling
  - Repulsion forces, etc.






#### 4.3.2 Analytic Placement – Quadratic Placement

- Advantages:
  - Captures the placement problem concisely in mathematical terms
  - Leverages efficient algorithms from numerical analysis and available software
  - Can be applied to large circuits without netlist clustering (flat)
  - Stability: small changes in the input do not lead to large changes in the output
- Disadvantages:
  - Connections to fixed objects are necessary: I/O pads, pins of fixed macros, etc.

#### 4.3.2 Analytic Placement – Force-directed Placement

- Cells and wires are modeled using the mechanical analogy of a mass-spring system, i.e., masses connected to Hooke's-Law springs



- Attraction force between cells is directly proportional to their distance
- Cells will eventually settle in a force equilibrium → minimized wirelength


#### 4.3.2 Analytic Placement – Force-directed Placement

• Given two connected cells *a* and *b*, the attraction force  $\overrightarrow{F_{ab}}$  exerted on *a* by *b* is

$$\overrightarrow{F_{ab}} = c(a,b) \cdot (\overrightarrow{b} - \overrightarrow{a})$$

where

- c(a,b) is the connection weight (priority) between cells *a* and *b*, and
- $(\overrightarrow{b} - \overrightarrow{a})$ is the vector difference of the positions of *a* and *b* in the Euclidean plane

• The sum of forces exerted on a cell *i* connected to other cells 1... *j* is

$$\overrightarrow{F_i} = \sum_{c(i,j)\neq 0} \overrightarrow{F_{ij}}$$

• Zero-force target (ZFT): position that minimizes this sum of forces





#### 4.3.2 Analytic Placement – Force-directed Placement

#### Example: ZFT position

Given:

- Circuit with NAND gate 1 and four I/O pads on a 3 x 3 grid
- Pad positions: In1 (2,2), In2 (0,2), In3 (0,0), Out (2,0)
- Weighted connections: c(a,In1) = 8, c(a,In2) = 10, c(a,In3) = 2, c(a,Out) = 2

Task: find the ZFT position of cell a






#### 4.3.2 Analytic Placement – Force-directed Placement

#### Example: ZFT position

Solution:

$$x_a^0 = \frac{\sum_{c(a,j)\neq 0} c(a,j) \cdot x_j^0}{\sum_{c(a,j)\neq 0} c(a,j)} = \frac{c(a,In1) \cdot x_{In1} + c(a,In2) \cdot x_{In2} + c(a,In3) \cdot x_{In3} + c(a,Out) \cdot x_{Out}}{c(a,In1) + c(a,In2) + c(a,In3) + c(a,Out)} = \frac{8 \cdot 2 + 10 \cdot 0 + 2 \cdot 0 + 2 \cdot 2}{8 + 10 + 2 + 2} = \frac{20}{22} \approx 0.9$$

$$y_a^0 = \frac{\sum_{c(a,j)\neq 0} c(a,j) \cdot y_j^0}{\sum_{c(a,j)\neq 0} c(a,j)} = \frac{c(a,In1) \cdot y_{In1} + c(a,In2) \cdot y_{In2} + c(a,In3) \cdot y_{In3} + c(a,Out) \cdot y_{Out}}{c(a,In1) + c(a,In2) + c(a,In3) + c(a,Out)} = \frac{8 \cdot 2 + 10 \cdot 2 + 2 \cdot 0 + 2 \cdot 0}{8 + 10 + 2 + 2} = \frac{36}{22} \approx 1.6$$

Rounded to the nearest grid location, the ZFT position of cell *a* is (1,2).
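The ZFT position is the connection-weighted average of the neighbor positions, which can be checked with a short sketch. The function name `zft_position` and the list-of-pairs input format are assumptions; the weights and pad positions are those of the example.

```python
def zft_position(connections):
    """Weighted average of neighbor positions; connections = [(weight, (x, y)), ...]."""
    total = sum(w for w, _ in connections)
    x = sum(w * p[0] for w, p in connections) / total
    y = sum(w * p[1] for w, p in connections) / total
    return x, y


# (weight, pad position) for In1, In2, In3 and Out.
conns = [(8, (2, 2)), (10, (0, 2)), (2, (0, 0)), (2, (2, 0))]
x, y = zft_position(conns)
print(round(x, 2), round(y, 2))  # 0.91 1.64
print(round(x), round(y))        # 1 2 (nearest grid location)
```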






#### 4.3.2 Analytic Placement – Force-directed Placement

**Input:** set of all cells *V*

**Output:** placement *P*

    P = PLACE(V)                            // arbitrary initial placement
    loc = LOCATIONS(P)                      // set coordinates for each cell in P
    foreach (cell c ∈ V)
      status[c] = UNMOVED
    while (!ALL_MOVED(V) && !STOP())        // continue until all cells have been
                                            // moved or some stopping
                                            // criterion is reached
      c = MAX_DEGREE(V, status)             // unmoved cell that has largest
                                            // number of connections
      ZFT_pos = ZFT_POSITION(c)             // ZFT position of c
      if (loc[ZFT_pos] == Ø)                // if position is unoccupied,
        loc[ZFT_pos] = c                    // move c to its ZFT position
      else
        RELOCATE(c, loc)                    // use methods discussed next
      status[c] = MOVED                     // mark c as moved

#### 4.3.2 Analytic Placement – Force-directed Placement

Finding a valid location for a cell with an occupied ZFT position (*p:* incoming cell, *q*: cell in *p*'s ZFT position)

- If possible, move *p* to a cell position close to *q*.
- Chain move: cell *p* is moved to cell *q*'s location.
  - Cell q, in turn, is shifted to the next position. If a cell r is occupying this space, cell r is shifted to the next position.
  - This continues until all affected cells are placed.
- Compute the cost difference if *p* and *q* were to be swapped.
   If the total cost reduces, i.e., the weighted connection length *L*(*P*) is smaller, then swap *p* and *q*.


#### 4.3.2 Analytic Placement – Force-directed Placement (Example)

Given:

| Nets               | Weight       |
|--------------------|--------------|
| $N_1 = (b_1, b_3)$ | $c(N_1) = 2$ |
| $N_2 = (b_2, b_3)$ | $c(N_2) = 1$ |








#### 4.3.2 Analytic Placement – Force-directed Placement

- Advantages:
  - Conceptually simple, easy to implement
  - Primarily intended for global placement, but can also be adapted to detailed placement
- Disadvantages:
  - Does not scale to large placement instances
  - Is not very effective in spreading cells in densest regions
  - Poor trade-off between solution quality and runtime
- In practice, force-directed placement (FDP) is extended by specialized techniques for cell spreading
  - This facilitates scalability and makes FDP competitive




#### 4.3.3 Simulated Annealing – Algorithm

**Input:** set of all cells *V*

**Output:** placement *P*

```
T = T_0                               // set initial temperature
P = PLACE(V)                          // arbitrary initial placement
while (T > T_min)
  while (!STOP())                     // not yet in equilibrium at T
    new_P = PERTURB(P)
    Δcost = COST(new_P) - COST(P)
    if (Δcost < 0)                    // cost improvement
      P = new_P                       // accept new placement
    else                              // no cost improvement
      r = RANDOM(0,1)                 // random number [0,1)
      if (r < e^(-Δcost/T))           // probabilistically accept
        P = new_P
  T = α · T                           // reduce T, 0 < α < 1
```
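A runnable sketch of the same loop, under several assumptions that are not part of the slides: cells are placed in a single row of unit slots, the cost is the total length of two-pin nets, `PERTURB` swaps two random cells, and a fixed number of moves per temperature stands in for `STOP()`. All parameter values are illustrative.

```python
import math
import random


def cost(order, nets):
    """Total wirelength of two-pin nets for cells placed in a single row."""
    slot = {cell: i for i, cell in enumerate(order)}
    return sum(abs(slot[a] - slot[b]) for a, b in nets)


def anneal(cells, nets, t0=10.0, t_min=0.01, alpha=0.9, moves_per_t=50, seed=0):
    rng = random.Random(seed)
    p = list(cells)                        # arbitrary initial placement
    t = t0                                 # initial temperature
    while t > t_min:
        for _ in range(moves_per_t):       # fixed move count stands in for STOP()
            i, j = rng.sample(range(len(p)), 2)
            new_p = p[:]
            new_p[i], new_p[j] = new_p[j], new_p[i]   # PERTURB: swap two cells
            delta = cost(new_p, nets) - cost(p, nets)
            # Always accept improvements; accept other moves with prob. e^(-delta/T).
            if delta < 0 or rng.random() < math.exp(-delta / t):
                p = new_p
        t *= alpha                         # reduce T, 0 < alpha < 1
    return p


cells = ["a", "b", "c", "d", "e"]
nets = [("a", "e"), ("e", "b"), ("b", "d"), ("d", "c")]
result = anneal(cells, nets)
print(result, cost(result, nets))
```

For this chain of nets, the best possible cost is 4 (for example, the order a, e, b, d, c); the annealer is not guaranteed to reach it in a given run.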


#### 4.3.3 Simulated Annealing

- Advantages:
  - Can find global optimum (given sufficient time)
  - Well-suited for detailed placement
- Disadvantages:
  - Very slow
  - To achieve high-quality implementation, laborious parameter tuning is necessary
  - Randomized, chaotic algorithms: small changes in the input lead to large changes in the output
- Practical applications of SA:
  - Very small placement instances with complicated constraints
  - Detailed placement, where SA can be applied in small windows (not common anymore)
  - FPGA layout, where complicated constraints are becoming a norm

#### 4.3.4 Modern Placement Algorithms



- Predominantly analytic algorithms
- Solve two challenges: interconnect minimization and cell overlap removal (spreading)
- Two families:









#### 4.4 Legalization and Detailed Placement

- Global placement must be legalized
  - Cell locations typically do not align with power rails
  - Small cell overlaps due to incremental changes, such as cell resizing or buffer insertion
- Legalization seeks to find legal, non-overlapping placements for all placeable modules
- Legalization can be improved by detailed placement techniques, such as
  - Swapping neighboring cells to reduce wirelength
  - Sliding cells to unused space
- Software implementations of legalization and detailed placement are often bundled
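To make the idea of legalization concrete, here is a deliberately simple greedy sketch for a single row. It is an assumption-laden illustration rather than an algorithm from the slides: cells are processed from left to right, snapped to the nearest site, and pushed right when they would overlap the previous cell. Cell widths are assumed to be multiples of the site width, and row boundaries are ignored.

```python
def legalize_row(cells, site_width=1):
    """Greedy left-to-right legalization of one row.

    cells maps a cell name to (x, width); returns a map from name to legal x.
    """
    legal = {}
    next_free = 0
    for name, (x, width) in sorted(cells.items(), key=lambda kv: kv[1][0]):
        snapped = round(x / site_width) * site_width   # align to a legal site
        legal[name] = max(snapped, next_free)          # push right to avoid overlap
        next_free = legal[name] + width
    return legal


# Hypothetical global-placement result with off-grid, overlapping cells.
cells = {"a": (0.3, 2), "b": (1.6, 1), "c": (2.2, 2)}
print(legalize_row(cells))  # {'a': 0, 'b': 2, 'c': 3}
```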




#### Summary of Chapter 4 – Problem Formulation and Objectives

- Row-based standard-cell placement
  - Cell heights are typically fixed, to fit in rows (but some cells may have double and quadruple heights)
  - Legal cell sites facilitate the alignment of routing tracks, connection to power and ground rails
- Wirelength as a key metric of interconnect
  - Bounding box half-perimeter (HPWL)
  - Cliques and stars
  - RMSTs and RSMTs
- Objectives: wirelength, routing congestion, circuit delay
  - Algorithm development is usually driven by wirelength
  - The basic framework is implemented, evaluated and made competitive on standard benchmarks
  - Additional objectives are added to an operational framework


#### Summary of Chapter 4 – Global Placement

- Combinatorial optimization techniques: min-cut and simulated annealing
  - Can perform both global and detailed placement
  - Reasonably good at small to medium scales
  - SA is very slow, but can handle a greater variety of constraints
  - Randomized and chaotic algorithms: small changes at the input can lead to large changes at the output
- Analytic techniques: force-directed placement and non-convex optimization
  - Primarily used for global placement
  - Unrivaled for large netlists in speed and solution quality
  - Capture the placement problem by mathematical optimization
  - Use efficient numerical analysis algorithms
  - Ensure stability: small changes at the input can cause only small changes at the output
  - Example: a modern, competitive analytic global placer takes 20 minutes for global placement of a netlist with 2.1M cells (single thread, 3.2 GHz Intel CPU)

#### Summary of Chapter 4 – Legalization and Detailed Placement

- Legalization ensures that design rules & constraints are satisfied
  - All cells are in rows
  - Cells align with routing tracks
  - Cells connect to power & ground rails
  - Additional constraints are often considered, e.g., maximum cell density
- Detailed placement reduces interconnect, while preserving legality
  - Swapping neighboring cells, rotating groups of three
  - Optimal branch-and-bound on small groups of cells
  - Sliding cells along their rows
  - Other local changes
- Extensions to optimize routed wirelength, routing congestion and circuit timing
- Relatively straightforward algorithms, but high-quality, fast implementation is important
- Most relevant after analytic global placement, but are also used after min-cut placement
- Rule of thumb: 50% runtime is spent in global placement, 50% in detailed placement <sup>[1]</sup>
