TealSemi GmbH email: tealsemi@gmail.com HRB: 254216 (Munich) VAT: DE328043994

## TealSemi

TealSemi GmbH offers help with front-end RTL design at various stages:

- Contribution to RTL design
- Optimizing data path
- Reviews, suggestions for optimizations
- Synthesis
- ... and more ...

Below are a few RTL code examples from recent RTL optimizations. All examples (including a basic test bench) can be downloaded and simulated. Each downloaded package includes a script that runs the simulation with Icarus as the simulation tool. The examples can easily be ported to any other Verilog simulator. For examples of how to use Xcelium or Questa, see the comments in the script run_sim, which is part of each downloaded package.

1.1. Sign Extensions (example 32x16 bit integer multiplier with 32 bit result)

1.2. Mixture of Signed and Unsigned Operations

1.3. Ordering of Multipliers and Adders

1.4. Rounding (e.g. for Q1.31)

1.5. Multiply-Add Operation for Q1.31

1.6. Multiply-Add Operation with Saturation for (Q1.31)

1.7. Pipelined Accumulator (Signed)

1.8. Signed Fixed-Point Dot-Product (Q1.31)

2.1. Transformations to Optimize Datapath

2.2. Integer (16 bits) Multiply Add Sub Function

2.3. Implementation of Function abs(a-b) on a Single Adder

3.1. Logic, Arithmetic Shift Right and Left and Rotate Right and Left

3.2. SIN Generation for 18 bits (EXPERIMENTAL)

1.1. Sign Extensions (example 32x16 bit integer multiplier with 32 bit result) |
---|

RTL Code | Improved PPA |
---|---|
`module test_must ( /* Signed integer multiplication executed on an unsigned multiplier */ endmodule` | `module test_ppa ( wire signed [47:0] xxl; /* Signed integer multiplication executed on a signed multiplier */ endmodule` |

Both sides are formally equivalent, but the code on the right side synthesizes to a smaller area and a higher speed. Most linting tools will issue a warning for the improved code (on the right side) because the bit widths of the operands a and b are not equal. This warning needs to be waived. In terms of area, the difference can be significant for multipliers but is less significant for adders. The same applies to unsigned operations.
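
The equivalence claim can be checked with a small bit-accurate model. The Python sketch below is not the downloadable RTL. It assumes that the left side sign-extends `b` to 32 bits and uses an unsigned 32x32 multiplier, while the right side multiplies the signed 32 bit and 16 bit operands directly. The function names are hypothetical.

```python
MASK32 = (1 << 32) - 1


def to_unsigned(value, width):
    """Two's-complement bit pattern of a signed integer at the given width."""
    return value & ((1 << width) - 1)


def mul_sign_extended_unsigned(a, b):
    """Assumed left side: b is sign-extended to 32 bits, then unsigned 32x32."""
    ua = to_unsigned(a, 32)
    ub = to_unsigned(b, 32)  # 32 bit pattern of the sign-extended 16 bit b
    return (ua * ub) & MASK32  # only the low 32 bits are kept


def mul_signed_native(a, b):
    """Assumed right side: signed 32x16 product, low 32 bits kept."""
    return to_unsigned(a * b, 32)
```

Both functions return the same 32 bit pattern for every signed 32 bit `a` and signed 16 bit `b`, because the low 32 bits of a two's-complement product do not depend on how far the operands were extended. The narrower operand on the right side simply produces fewer partial products.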

1.2. Mixture of Signed and Unsigned Operations |
---|

RTL Code | Improved PPA |
---|---|
`module test_must ( wire [16:0] ua; /* Arithmetic mapped to unsigned multiplier */ endmodule` | `module test_ppa ( wire signed [32:0] xxl; /* Signed arithmetic mapped to signed multiplier */ endmodule` |

Both codes are formally equivalent. It is best to use signed arithmetic if at least one term of the arithmetic is signed (including the result). All unsigned terms then need to be cast to signed terms. An unsigned multiplier has about the same size as a signed multiplier. The code on the right side is smaller and has less logic depth.
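
Why the cast matters can be illustrated with a Python model of the Verilog evaluation rules: if any operand of an expression is unsigned, the whole expression is evaluated as unsigned and a signed operand is zero-extended instead of sign-extended. The sketch below assumes a signed 16 bit `a` and an unsigned 16 bit `ub`. The function names and the 32 bit result width are hypothetical and not taken from the downloadable example.

```python
def to_signed(bits, width):
    """Interpret a bit pattern of the given width as a two's-complement value."""
    bits &= (1 << width) - 1
    return bits - (1 << width) if bits >> (width - 1) else bits


def mul_mixed_unsigned_context(a, ub, out_width=32):
    """Unsigned evaluation: the 16 bit pattern of a is zero-extended."""
    a_bits = a & 0xFFFF  # sign information of a is lost here
    return (a_bits * ub) & ((1 << out_width) - 1)


def mul_mixed_signed_cast(a, ub, out_width=32):
    """ub is cast to signed by prepending a zero bit (17 bit signed operand)."""
    ub_signed = to_signed(ub & 0xFFFF, 17)  # MSB is 0, so the value is unchanged
    return to_signed(a * ub_signed, out_width)
```

For `a = -2` and `ub = 3`, the unsigned evaluation yields `0x2FFFA`, while the version with the cast yields the intended `-6`.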

1.3. Ordering of Multipliers and Adders |
---|

RTL Code | Improved PPA |
---|---|
`module test_must ( /* Addition is followed by multiplier */ endmodule` | `module test_ppa ( /* Addition is merged into the multiplier (single CSA tree) */ endmodule` |

Both codes are formally equivalent. The RTL on the right side has fewer levels of logic and about the same area, because an addition followed by a multiplication (as coded on the left side) cannot be merged into a single Carry-Save-Adder.

1.4. Rounding (e.g. for Q1.31) |
---|

RTL Code | Improved PPA |
---|---|
`module test_must ( localparam signed [1:0] RND = 2'sd1; wire signed [63:0] xxl0; /* Rounding is done on an adder with minimum bit width */ endmodule` | `module test_ppa ( localparam signed [32:0] RND = 33'sh0_8000_0000; wire signed [63:0] xxl; /* Rounding is merged into the multiplier (single CSA tree) */ endmodule` |

Both codes are formally equivalent. In the RTL on the left side the bit width of the adder is minimized, but the improved RTL on the right side has fewer levels of logic and less area because it can be mapped onto a single Carry-Save-Adder.
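
The two rounding styles can be compared with a short Python model. It is only a sketch: the number of dropped bits (`frac_bits`) and the function names are assumptions and do not necessarily match the bit positions used in the downloadable RTL.

```python
def round_narrow(p, frac_bits=31):
    """Truncate first, then add 1 on a narrow adder and drop the guard bit."""
    return ((p >> (frac_bits - 1)) + 1) >> 1


def round_wide(p, frac_bits=31):
    """Add the rounding constant at full width, then truncate once."""
    rnd = 1 << (frac_bits - 1)  # half an LSB of the result
    return (p + rnd) >> frac_bits
```

Both functions return the same value for every integer product `p`, including negative ones, because Python's `>>` on integers is an arithmetic shift. The wide form has the shape `a * b + constant`, which is why it can be absorbed into the partial-product tree of the multiplier.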

1.5. Multiply-Add Operation for Q1.31 |
---|

RTL Code | Improved PPA |
---|---|
`module test_must ( localparam signed [1:0] RND = 2'sd1; wire signed [63:0] xxl0; /* MAC operation and rounding is done on a separate adder with minimum bit width */ endmodule` | `module test_ppa ( localparam signed [32:0] RND = 33'sh0_8000_0000; wire signed [64:0] xxl; /* MAC operation and rounding are mapped onto a single CSA tree */ endmodule` |

Both codes are formally equivalent. As seen in the example above for rounding, the RTL on the right side can be mapped to a single Carry-Save-Adder and significantly improves PPA.

1.6. Multiply-Add Operation with Saturation for (Q1.31) |
---|

RTL Code | Improved PPA |
---|---|
`module test_must ( localparam signed [32:0] RND = 33'sh0_8000_0000; wire signed [64:0] xxl; /* Multiply-Add followed by saturation logic */ endmodule` | `module test_ppa ( localparam signed [64:0] MAX_VAL = 65'sh0_7fff_ffff_ffff_ffff; wire signed [64:0] xxl; /* Control for saturation is mapped into the CSA tree */ endmodule` |

The code on both sides is formally equivalent. In many cases of fixed-point arithmetic, saturation logic needs to be added to protect against an overflow of the Multiply-Add operation. If the overflow condition uses a magnitude comparator (see right side) over the full width (see term xxl), then the control for the multiplexer of the saturation logic will be merged into the Carry-Save-Adder. This results in fewer levels of logic. For an ultra low-speed implementation this might cause a slight increase in area.
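
That the two overflow conditions agree can be checked with a small Python model of a 65 bit signed value. `MAX_VAL` follows the constant shown on the right side. `MIN_VAL` and the function names are assumptions made for this sketch.

```python
MAX_VAL = (1 << 63) - 1   # 65'sh0_7fff_ffff_ffff_ffff
MIN_VAL = -(1 << 63)      # assumed lower bound of the 64 bit result


def overflow_by_msbs(v):
    """Overflow when the two MSBs (bits 64 and 63) of the 65 bit value differ."""
    bits = v & ((1 << 65) - 1)
    return ((bits >> 64) & 1) != ((bits >> 63) & 1)


def overflow_by_compare(v):
    """Overflow detected with full-width magnitude comparisons."""
    return v > MAX_VAL or v < MIN_VAL


def saturate(v):
    """Clip a 65 bit signed value to the 64 bit signed range."""
    if v > MAX_VAL:
        return MAX_VAL
    if v < MIN_VAL:
        return MIN_VAL
    return v
```

Both predicates give the same answer for every value in the 65 bit signed range. The difference lies only in the structure that synthesis sees: the MSB test depends on the finished sum, whereas the comparison is itself an arithmetic operation on the same operands.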

1.7. Pipelined Accumulator (Signed) |
---|

RTL Code | Improved PPA |
---|---|
`module test_must ( /* register stage: accumulator */ always @( posedge clk or negedge rst_n ) begin endmodule` | `module test_ppa ( integer i; wire signed [63:0] aa; /* sign-extend input a */ assign aa = $signed(a); /* Register stage: output of 3-2 compressor (only one single full-adder stage) */ always @( posedge clk or negedge rst_n ) begin /* 32 bit adder is outside critical loop and can be pipelined if needed */ assign x = $signed(s) + $signed(c[63:0]); endmodule` |

The code above demonstrates how a compressor (Wallace tree) can be used if timing cannot be closed in an accumulator stage. On the left side the input "a" is accumulated every clock cycle into register x (output). If timing cannot be met, one way to add pipeline stages and still accumulate every clock cycle is to register the CSA stage and move the final adder outside the accumulator stage. Once the adder is outside, it can be pipelined, which increases the latency without breaking the capability of single-cycle accumulation. The RTL on the right side needs only a single full-adder stage in the feedback loop, so the accumulation can run at extremely high speed. This comes at the cost of twice as many registers to keep the accumulated value. The more time-consuming operation is outside the accumulator loop and can easily be pipelined as needed. The example above is a very simple demonstration of how to split a CSA tree. It could also involve more complex arithmetic including multipliers. An example of a use case would be a noise-shaper.
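
The behaviour of the carry-save accumulator can be sketched as a cycle-based Python model. It assumes a 64 bit accumulator fed by a signed 32 bit input, in line with the `aa`, `s` and `c` signals visible above. The class and method names are hypothetical.

```python
WIDTH = 64
MASK = (1 << WIDTH) - 1


def sign_extend(a):
    """64 bit pattern of a signed input value (models aa = $signed(a))."""
    return a & MASK


class PlainAccumulator:
    """Reference: full adder chain inside the feedback loop."""

    def __init__(self):
        self.x = 0

    def clock(self, a):
        self.x = (self.x + sign_extend(a)) & MASK


class CarrySaveAccumulator:
    """Only a 3-2 compressor sits in the feedback loop; s and c are registers."""

    def __init__(self):
        self.s = 0
        self.c = 0

    def clock(self, a):
        aa, s, c = sign_extend(a), self.s, self.c
        self.s = aa ^ s ^ c                                      # sum bits
        self.c = (((aa & s) | (aa & c) | (s & c)) << 1) & MASK   # carry bits

    def value(self):
        """Final adder, outside the loop; this is the part that can be pipelined."""
        return (self.s + self.c) & MASK
```

After any sequence of `clock()` calls, `value()` of the carry-save version equals `x` of the plain version. The per-cycle work in the loop is one full-adder delay regardless of the word width.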

1.8. Signed Fixed-Point Dot-Product (Q1.31) |
---|

RTL Code | Improved PPA |
---|---|
`module test_must ( localparam signed [31:0] MAX_VAL32 = 32'sh7fff_ffff; wire signed [63:0] t0; /* Calculate intermediate terms */ /* final result with saturation using MSBs for saturation logic */ endmodule` | `module test_ppa ( localparam signed [31:0] MAX_VAL32 = 32'sh7fff_ffff; wire signed [64:0] xxl; /* Full width dot-product and rounding */ /* final result with saturation done on full width of dot-product */ endmodule` |

This example shows a signed fixed-point dot-product with input and output values in the range of ±1.0 (Q1.31). The implementation on the right side results in better performance, power and area, and the result is more accurate. The outputs of both implementations can differ by ±1 LSB. The difference is due to rounding of the intermediate terms (t2 and t3). In many cases there will be a register stage for t2 and t3 to break the critical path and meet timing requirements. The logic depth for the complete dot-product on the right side is actually about the same as the logic depth for the intermediate terms t2 and t3.
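
The ±1 LSB difference can be reproduced with a simplified Python model. The sketch assumes that the left side rounds each product to 31 fractional bits before adding, and that the right side adds the full-width products and rounds once. Saturation is left out, and the function names are hypothetical.

```python
def round_q31(p):
    """Round a full-width product to 31 fractional bits (round half up)."""
    return (p + (1 << 30)) >> 31


def dot_rounded_terms(a0, b0, a1, b1):
    """Assumed left side: each product is rounded, then the terms are added."""
    return round_q31(a0 * b0) + round_q31(a1 * b1)


def dot_full_width(a0, b0, a1, b1):
    """Assumed right side: full-width sum of products, rounded once."""
    return round_q31(a0 * b0 + a1 * b1)
```

With all four inputs set to `1 << 15`, each product is exactly half an LSB of the result. Rounding the terms separately gives 2, rounding the full-width sum gives 1, which is the closer approximation of the exact value of one LSB.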

The RTL on the right side implements the following improvements to reduce logic depth:

- For good PPA both multipliers and the adders need to be combined into a single Carry-Save-Adder. This could be achieved with a 64 bit adder for xxl on the left side or simply adding both results of the multiplier (t0 and t1) using the full bit width of the result of the multiplier. It is also important to have the correct bit width of the result, even if the application will never overflow and no saturation is needed.
- Rounding needs to be done by including the rounding term in full bit width. In the RTL on the left side the rounding term cannot be included into the Carry-Save-Adder.
- Saturation logic (clipping): In many cases the system requires protection against overflow. In this example the result of the dot-product before rounding and saturation would be Q2.63. To be in range, one MSB bit needs to be dropped, and saturation logic is activated when the 2 MSBs are different. This can be done simply by looking at the MSB once the result of the dot-product is available. The MSB will be the last bit available and would then need buffers to drive the multiplexers to select the values for the saturation logic. On the right side, the saturation logic is done with magnitude comparators of the full result (xxl), which allows merging the magnitude operation into the Carry-Save-Adder. In doing this, an overflow or underflow which will trigger the saturation is detected before the result of the multiplier is available and there is extra time for buffering. This way the penalty for the saturation logic is only a single multiplexer stage.

The dot-product on the right side will have less than 30 levels of logic and in most cases can be implemented without any pipeline registers. Adding pipeline registers on the left side is straightforward, because intermediate terms are available. But it is important to realize that the 32 bit multiplication on the left side has only about 3 levels of logic less than the entire dot-product on the right side. Moving the saturation logic into the next cycle would save one extra level, which might not be sufficient.

These are the options for adding a pipeline stage:

- Adding multiple pipeline stages at the output and allowing re-timing in synthesis. This is not always an option because of potential issues with other tools.
- Using partial product terms to replace the 32x32 bit multiplier with 16x32 bit multipliers in the first stage and having another Carry-Save-Adder in the second pipeline stage.
- Low level implementation with a Booth-Compressor and a Wallace-tree in the first stage and the final adder in the second stage. This would result in a very good implementation but lacks flexibility.

2.1. Transformations to Optimize Datapath |
---|

RTL Code | Improved PPA |
---|---|
`module test_must ( /* Carry-in is prioritized */ /* Input ctl is prioritized and after mapping after the comparison */ /* Adding constant 1 does not impact area or performance */ /* Subtract or a + ~b + 1 */ endmodule` | `module test_ppa ( wire [17:0] x0_xxl; /* Carry-in merged into adder to speed up inputs operands "a" and "b" */ /* Input ctl is merged into the comparator */ /* Using 1's complement */ /* Subtract: Path b is untouched */ endmodule` |

The above RTL can be used for optimizations that merge arithmetic onto a single Carry-Save-Adder. The equations can be very useful in some cases but need to be applied very carefully.

- The equations for x0: Often synthesis results are better when an extra increment (carry-in) is inserted as an additional LSB. Synthesis does not seem to analyse carry-in and tends to prioritize carry-in at the cost of all other inputs of the adder. The code on the right side also allows an extra control by replacing the constant 1'b1 with a control which will then reduce the logic depth by an additional level.
- The equation for x1: This is similar to x0 above. Merging control into the data-path can improve results or can allow for further optimizations.
- The equations for x2 and x3: Using 1's complement for the addition allows changing the arithmetic in such a way that more optimization steps are possible. See below for an example where these equations allow for extra optimizations.
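
Two of the identities behind these transformations can be written down as a Python sketch: a carry-in inserted as an additional LSB, and a subtraction expressed with 1's complement so that operand `b` is left untouched. The 16 bit operand width and the function names are assumptions. The exact equations in the downloadable RTL may differ.

```python
def add_with_carry_lsb(a, b, cin):
    """a + b + cin computed as ({a, 1'b1} + {b, cin}) >> 1."""
    wide = ((a << 1) | 1) + ((b << 1) | cin)  # carry-in travels as an extra LSB
    return wide >> 1


def sub_ones_complement(a, b, width=16):
    """a - b computed as ~(~a + b); no inversion and no +1 on the b path."""
    mask = (1 << width) - 1
    return ~((~a & mask) + b) & mask
```

In the first function the extra LSB pair generates a carry into bit 1 exactly when `cin` is 1, so the adder needs no dedicated carry-in port. The second function returns `(a - b)` modulo `2**width`.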

2.2. Integer (16 bits) Multiply Add Sub Function |
---|

RTL Code | Improved PPA |
---|---|
`module test_must ( wire signed [31:0] xxl; /* Using if-statement to implement mul addsub function */ endmodule` | `module test_ppa ( wire signed [31:0] xxl; /* Implementation of a mul-addsub function on an mac unit. */ endmodule` |

Both codes are formally equivalent. The example shows how the slightly less critical input c can be manipulated to map the function c ± a * b onto a multiply-add unit. In this example attention needs to be paid to sign extension, because unsigned arithmetic is used to implement signed arithmetic.
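
One way to obtain c ± a * b from a single multiply-add is to invert `c` before the addition and invert the result afterwards whenever a subtraction is requested. The Python sketch below models this idea for a 32 bit result. It is an illustration of the 1's complement identity and not necessarily the exact coding of the downloadable RTL. The `sub` control and the function names are hypothetical.

```python
def mul_addsub_if(a, b, c, sub, width=32):
    """Reference: select between c - a*b and c + a*b."""
    mask = (1 << width) - 1
    p = a * b
    return (c - p if sub else c + p) & mask


def mul_addsub_mac(a, b, c, sub, width=32):
    """Single multiply-add: c and the result are conditionally inverted."""
    mask = (1 << width) - 1
    flip = mask if sub else 0
    c_in = (c & mask) ^ flip         # ~c when subtracting
    acc = (c_in + a * b) & mask      # one multiply-add operation
    return acc ^ flip                # ~(~c + a*b) == c - a*b
```

Both functions return the same 32 bit pattern for signed 16 bit `a` and `b` and signed 32 bit `c`. The multiplier inputs are not touched, and only the less critical `c` path and the output see an extra EXOR stage.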

2.3. Implementation of Function abs(a-b) on a Single Adder |
---|

RTL Code | Improved PPA |
---|---|
`module test_must ( /* abs(a-b): subtract larger value from the smaller (unsigned) */ endmodule` | `module test_ppa ( wire signed [33:0] xxl; /* abs(a-b): On a single data-path using an adder of twice the width. */ /* extraction result from xxl. */ endmodule` |

The function abs(a-b) can be implemented on a single adder of twice the width using the transformations described above. Doubling the width of an adder increases the logic depth by only one extra AOI21 (or OAI21) gate. In many cases the extra EXOR gate at the output of the adder can be merged into the following logic.
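
A possible construction can be sketched in Python for unsigned operands. The lower half of the wide adder computes `a + ~b + 1`, and its carry-out (set when `a >= b`) becomes the carry-in of the upper half, which computes `a + ~b`. A final conditional inversion plays the role of the EXOR stage. This is an assumed arrangement for illustration. The operand packing in the downloadable RTL may differ, and the function name is hypothetical.

```python
def abs_diff_single_adder(a, b, n=16):
    """abs(a - b) for unsigned n bit operands on one adder of twice the width."""
    mask = (1 << n) - 1
    nb = ~b & mask
    # One wide addition: {a, a} + {~b, ~b} + 1
    wide = ((a << n) | a) + ((nb << n) | nb) + 1
    upper = (wide >> n) & mask        # a + ~b + (1 if a >= b else 0)
    carry = (wide >> (2 * n)) & 1     # 1 when a >= b
    # EXOR stage: keep upper when a >= b, otherwise invert it
    return upper if carry else (~upper & mask)
```

For example, `abs_diff_single_adder(5, 9)` and `abs_diff_single_adder(9, 5)` both return 4.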

3.1. Logic, Arithmetic Shift Right and Left and Rotate Right and Left |
---|

RTL Code | Improved PPA |
---|---|
`module test_must ( /* behavioral code */ wire a_is_zero; assign a_is_zero = a==32'h0; assign zero = x==32'h0; always @(*) begin: pX_OV endmodule` | `module test_ppa ( wire [32:0] x_ror; assign sh_lr = sh[4:0] ^ {5{sh_left}}; always @(*) begin: pMASK // calculating mask for result always @(*) begin: pREV_MASK // bit reverse (no logic): rev_msk[32:0] = msk[0:32]; always @(*) begin: pX_OV assign x = sh_left? x_ror[32:1] & ~msk[32:1] : x_ror[31:0] & msk[31:0] \| s_ext; endmodule` |

This is an example implementation of a logic or arithmetic shift and rotate function (left and right). The RTL on the left side is based on a straightforward behavioural description. The RTL on the right side minimizes the levels of logic by mapping all shift operations onto a single rotate-right operator and calculating a mask in parallel. The result of the shift operation is then simply a logic operation between the rotate result and the mask. The left/right operation uses a 1's complement of the shift amount and is then corrected by one extra shift. In addition to the result of the shift operation, the rotate and the mask allow a simple calculation of flags (zero and overflow) without adding to the logic depth.
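
The rotate-plus-mask idea can be modelled in a few lines of Python. The sketch assumes a 32 bit data word and a 5 bit shift amount and leaves out the flags. The function names and the argument layout are hypothetical and do not mirror the signal names of the RTL.

```python
W = 32
MASK = (1 << W) - 1


def ror(a, r):
    """Rotate a 32 bit pattern right by r bits."""
    r %= W
    return ((a >> r) | (a << (W - r))) & MASK


def shift_via_rotate(a, sh, left, arith=False):
    """Shift by sh (0..31) using one rotate-right and a mask."""
    # Left shift: rotate right by the 1's complement of sh (31 - sh),
    # then correct with one extra bit position.
    amount = (sh ^ 31) + 1 if left else sh
    rotated = ror(a, amount)
    keep = (MASK << sh) & MASK if left else MASK >> sh
    result = rotated & keep
    if arith and not left and (a >> (W - 1)) & 1:
        result |= ~keep & MASK  # sign extension fills the vacated MSBs
    return result
```

The rotate and the mask depend only on the inputs, so both can be computed in parallel. A plain rotate is obtained by skipping the mask.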

3.2. SIN Generation for 18 bits (EXPERIMENTAL) |
---|

DPI Code (C reference) | Synthesizable RTL Code |
---|---|
`module sin_dpi #( /* endmodule` | `module sin_18 ( /* endmodule` |

This code is currently available as a pre-release upon request. It has successfully passed regression tests and functions reliably.
However, some essential steps for area optimization are pending, and a pipeline stage between the look-up table (LUT) stage and the arithmetic could
be beneficial.

The sin-function's computation relies on optimized look-up tables tailored for size efficiency. A full comparison
between the Verilog implementation and a C-reference model is conducted using the DPI interface to make sure the result of the RTL
matches the reference model for each output bit.

Verilator serves as the simulation tool in the provided example.

The input range for the sin function spans 18 bits, precisely from 0.0 (inclusive) to π/2 (exclusive), evenly distributed across
18'h0_0000 to 18'h3_ffff. Correspondingly, the output adheres to a 19-bit unsigned format, UQ1.18. Notably, the maximum achievable
result is 1.0. Consequently, the RTL's output spans between 19'h0_0000 and 19'h4_0000. The precision is maintained,
ensuring bit-accurate results. Every output aligns perfectly with computations done using long double accuracy, following the
formula: x = max_val * sin((a*π) / (2.0*max_val)), where max_val equals 2**18-1.

Despite its functionality and accuracy, it's important to note that optimization steps aimed at enhancing area efficiency are yet to be integrated.