TealSemi GmbH
email: tealsemi@gmail.com
HRB: 254216 (Munich)
VAT: DE328043994

TealSemi GmbH offers help with front-end RTL design at various stages:



Examples of Optimizing RTL Coding for Performance, Power and Area (PPA)

Below are a few RTL code examples from recent RTL optimizations. All examples (including a basic test bench) can be downloaded and simulated. Each downloaded package includes a script for running the simulation with Icarus Verilog. The examples can easily be ported to any other Verilog simulator; for Xcelium or Questa, see the comments in the script run_sim, which is part of each downloaded package.


1.1. Sign Extensions (example 32x16 bit integer multiplier with 32 bit result)
1.2. Mixture of Signed and Unsigned Operations
1.3. Ordering of Multipliers and Adders
1.4. Rounding (e.g. for Q1.31)
1.5. Multiply-Add Operation for Q1.31
1.6. Multiply-Add Operation with Saturation for (Q1.31)
1.7. Pipelined Accumulator (Signed)
1.8. Signed Fixed-Point Dot-Product (Q1.31)
2.1. Transformations to Optimize Datapath
2.2. Integer (16 bits) Multiply Add Sub Function
2.3. Implementation of Function abs(a-b) on a Single Adder
3.1. Logic, Arithmetic Shift Right and Left and Rotate Right and Left
3.2. SIN Generation for 18 bits (EXPERIMENTAL)

1.1. Sign Extensions (example 32x16 bit integer multiplier with 32 bit result)

RTL Code (left side): module test_must

Improved PPA (right side): module test_ppa

module test_must (
input wire [31:0] a,
input wire [15:0] b,
output wire [31:0] x
);

/* Signed integer multiplication executed on an unsigned multiplier */
assign x = a * {{16{b[15]}},b};

endmodule

module test_ppa (
input wire [31:0] a,
input wire [15:0] b,
output wire [31:0] x
);

wire signed [47:0] xxl;

/* Signed integer multiplication executed on a signed multiplier */
assign xxl = $signed(a) * $signed(b);
assign x = xxl[31:0];

endmodule

Both sides are formally equivalent, but the code on the right side synthesizes to a smaller area and a higher speed. Most linting tools will issue a warning for the improved code (on the right side) because the bit widths of the operands a and b are not equal. This warning needs to be waived. In terms of area, the difference can be significant for multipliers but is less significant for adders. The same applies to unsigned operations.
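The equivalence claimed above can be checked with a small bit-accurate model. The sketch below uses Python only as a calculator for the arithmetic; the helper names (sext, x_must, x_ppa, check) are made up for this illustration and are not part of the downloadable package. It models test_must as an unsigned multiply with a hand-made sign extension and test_ppa as a signed 32x16 multiply, both truncated to 32 bits.

```python
import random

MASK32 = (1 << 32) - 1

def sext(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def x_must(a, b):
    # a * {{16{b[15]}}, b}: unsigned multiply, b sign-extended to 32 bits by hand
    b_ext = sext(b, 16) & MASK32
    return (a * b_ext) & MASK32

def x_ppa(a, b):
    # $signed(a) * $signed(b): signed 32x16 multiply, low 32 bits of the product
    return (sext(a, 32) * sext(b, 16)) & MASK32

def check(trials=10000, seed=1):
    rng = random.Random(seed)
    for _ in range(trials):
        a, b = rng.getrandbits(32), rng.getrandbits(16)
        assert x_must(a, b) == x_ppa(a, b)
    return True

check()
```

A random check like this is not a formal proof; it only illustrates why the low 32 bits agree.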

1.2. Mixture of Signed and Unsigned Operations

RTL Code (left side): module test_must

Improved PPA (right side): module test_ppa

module test_must (
input wire signed [15:0] sa,
input wire [15:0] ub,
output wire signed [31:0] x
);

wire [16:0] ua;
wire [32:0] uxxl;
wire signed [32:0] xxl;

/* Arithmetic mapped to unsigned multiplier */
assign ua = sa[15]? -sa : sa;
assign uxxl = ua * ub;
assign xxl = $signed(sa[15]? -uxxl : uxxl);
assign x = $signed(xxl[31:0]);

endmodule

module test_ppa (
input wire signed [15:0] sa,
input wire [15:0] ub,
output wire signed [31:0] x
);

wire signed [32:0] xxl;

/* Signed arithmetic mapped to signed multiplier */
assign xxl = sa * $signed({1'b0,ub});
assign x = $signed(xxl[31:0]);

endmodule

Both versions are formally equivalent. It is best to use signed arithmetic if any term of the expression (including the result) is signed. All unsigned terms then need to be cast to signed terms. An unsigned multiplier has about the same size as a signed multiplier. The code on the right side is smaller and has less logic depth.
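As a rough illustration of why the two versions agree, the following Python sketch models both data paths on raw bit patterns. The function names are hypothetical, and the model assumes the width rules visible in the RTL (17-bit magnitude, 33-bit product, 32-bit result).

```python
import random

MASK32 = (1 << 32) - 1

def sext(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def x_must(sa, ub):
    # magnitude of sa, unsigned multiply, then conditional negation of the product
    s = sext(sa, 16)
    ua = -s if s < 0 else s
    uxxl = ua * ub
    xxl = -uxxl if s < 0 else uxxl
    return xxl & MASK32

def x_ppa(sa, ub):
    # sa * $signed({1'b0, ub}): the extra zero bit keeps ub non-negative when cast
    return (sext(sa, 16) * sext(ub, 17)) & MASK32

def check(trials=10000, seed=2):
    rng = random.Random(seed)
    for _ in range(trials):
        sa, ub = rng.getrandbits(16), rng.getrandbits(16)
        assert x_must(sa, ub) == x_ppa(sa, ub)
    return True

check()
```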

1.3. Ordering of Multipliers and Adders

RTL Code (left side): module test_must

Improved PPA (right side): module test_ppa

module test_must (
input wire [31:0] a,
input wire [31:0] b,
output wire [63:0] x
);

/* Addition is followed by multiplier */
assign x = a * (b + 1'b1);

endmodule

module test_ppa (
input wire [31:0] a,
input wire [31:0] b,
output wire [63:0] x
);

/* Addition is merged into the multiplier (single CSA tree) */
assign x = a * b + a;

endmodule

Both versions are formally equivalent. The RTL on the right side has fewer levels of logic and about the same area, because an addition followed by a multiplication (as coded on the left side) cannot be merged into a single Carry-Save-Adder.

1.4. Rounding (e.g. for Q1.31)

RTL Code (left side): module test_must

Improved PPA (right side): module test_ppa

module test_must (
input wire signed [31:0] a,
input wire signed [31:0] b,
output wire signed [31:0] x
);

localparam signed [1:0] RND = 2'sd1;

wire signed [63:0] xxl0;
wire signed [32:0] xxl1;

/* Rounding is done on an adder with minimum bit width */
assign xxl0 = a * b;
assign xxl1 = $signed(xxl0[63:31]) + RND;
assign x = $signed(xxl1[32:1]);

endmodule

module test_ppa (
input wire signed [31:0] a,
input wire signed [31:0] b,
output wire signed [31:0] x
);

localparam signed [32:0] RND = 33'sh0_8000_0000;

wire signed [63:0] xxl;

/* Rounding is merged into the multiplier (single CSA tree) */
assign xxl = a * b + RND;
assign x = $signed(xxl[63:32]);

endmodule

Both versions are formally equivalent. In the RTL on the left side the bit width of the adder is minimized, but the improved RTL on the right side has fewer levels of logic and less area because it can be mapped onto a single Carry-Save-Adder.
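The two rounding schemes can be compared numerically. In this Python sketch (function names are illustrative only) a and b are signed integers in the 32-bit range, and Python's arithmetic right shift stands in for the signed bit selects of the RTL.

```python
import random

def round_must(a, b):
    # keep one guard bit (p[63:31]), add 1, then drop the guard bit
    p = a * b
    return ((p >> 31) + 1) >> 1

def round_ppa(a, b):
    # add the rounding constant 0x8000_0000 at full width, then take p[63:32]
    return (a * b + (1 << 31)) >> 32

def check(trials=10000, seed=3):
    rng = random.Random(seed)
    for _ in range(trials):
        a = rng.randrange(-2**31, 2**31)
        b = rng.randrange(-2**31, 2**31)
        assert round_must(a, b) == round_ppa(a, b)
    return True

check()
```

Writing the product as p = q * 2**32 + r shows why: both forms return q plus the top bit of r.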

1.5. Multiply-Add Operation for Q1.31

RTL Code (left side): module test_must

Improved PPA (right side): module test_ppa

module test_must (
input wire signed [31:0] a,
input wire signed [31:0] b,
input wire signed [31:0] c,
output wire signed [32:0] x
);

localparam signed [1:0] RND = 2'sd1;

wire signed [63:0] xxl0;
wire signed [33:0] xxl1;

/* MAC operation and rounding is done on a separate adder with minimum bit width */
assign xxl0 = a * b;
assign xxl1 = $signed(xxl0[63:31]) + $signed({c, RND[0]});
assign x = $signed(xxl1[33:1]);

endmodule

module test_ppa (
input wire signed [31:0] a,
input wire signed [31:0] b,
input wire signed [31:0] c,
output wire signed [32:0] x
);

localparam signed [32:0] RND = 33'sh0_8000_0000;

wire signed [64:0] xxl;

/* MAC operation and rounding are mapped onto a single CSA tree */
assign xxl = a * b + $signed({c, RND[31:0]});
assign x = $signed(xxl[64:32]);

endmodule

Both versions are formally equivalent. As in the rounding example above, the RTL on the right side can be mapped onto a single Carry-Save-Adder, which significantly improves PPA.

1.6. Multiply-Add Operation with Saturation for (Q1.31)

RTL Code (left side): module test_must

Improved PPA (right side): module test_ppa

module test_must (
input wire signed [31:0] a,
input wire signed [31:0] b,
input wire signed [31:0] c,
output wire signed [31:0] x
);

localparam signed [32:0] RND = 33'sh0_8000_0000;

wire signed [64:0] xxl;

/* Multiply-Add followed by saturation logic */
assign xxl = a * b + $signed({c, RND[31:0]});
assign x =
(xxl[64:63]==2'b10)? 32'sh8000_0000 :
(xxl[64:63]==2'b01)? 32'sh7fff_ffff :
$signed(xxl[63:32]);

endmodule

module test_ppa (
input wire signed [31:0] a,
input wire signed [31:0] b,
input wire signed [31:0] c,
output wire signed [31:0] x
);

localparam signed [64:0] MAX_VAL = 65'sh0_7fff_ffff_ffff_ffff;
localparam signed [64:0] MIN_VAL = 65'sh1_8000_0000_0000_0000;
localparam signed [32:0] RND = 33'sh0_8000_0000;

wire signed [64:0] xxl;

/* Control for saturation is mapped into the CSA tree */
assign xxl = a * b + $signed({c, RND[31:0]});
assign x =
(xxl<=MIN_VAL)? 32'sh8000_0000 :
(xxl> MAX_VAL)? 32'sh7fff_ffff :
$signed(xxl[63:32]);

endmodule

The code on both sides is formally equivalent. In many cases of fixed-point arithmetic, saturation logic needs to be added to protect against overflow of the Multiply-Add operation. If the overflow condition uses a magnitude comparator (see right side) over the full width (see term xxl), the control for the multiplexer of the saturation logic will be merged into the Carry-Save-Adder. This results in fewer levels of logic. For an ultra low-speed implementation this might cause a slight increase in area.
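A small Python model (hypothetical names, not the author's test bench) shows that testing the two MSBs and comparing the full 65-bit sum select the same output, including the corner case where xxl equals MIN_VAL exactly.

```python
import random

def sext(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def mac(a, b, c):
    # a*b + {c, RND[31:0]}: c sits in the upper 32 bits, 0x8000_0000 below it
    return a * b + (c << 32) + (1 << 31)

def sat_msb(xxl):
    # left side: decode the two MSBs xxl[64:63]
    top2 = (xxl >> 63) & 0b11
    if top2 == 0b10:
        return -(1 << 31)
    if top2 == 0b01:
        return (1 << 31) - 1
    return sext(xxl >> 32, 32)

def sat_cmp(xxl):
    # right side: magnitude comparison on the full-width sum
    if xxl <= -(1 << 63):
        return -(1 << 31)
    if xxl > (1 << 63) - 1:
        return (1 << 31) - 1
    return sext(xxl >> 32, 32)

def check(trials=10000, seed=4):
    rng = random.Random(seed)
    corners = [-2**31, -2**31 + 1, -1, 0, 1, 2**31 - 1]
    for _ in range(trials):
        a, b, c = (rng.choice(corners) if rng.random() < 0.3
                   else rng.randrange(-2**31, 2**31) for _ in range(3))
        assert sat_msb(mac(a, b, c)) == sat_cmp(mac(a, b, c))
    return True

check()
```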

1.7. Pipelined Accumulator (Signed)

RTL Code (left side): module test_must

Improved PPA (right side): module test_ppa

module test_must (
input wire clk,
input wire rst_n,
input wire signed [31:0] a,
output reg signed [63:0] x
);

/* register stage: accumulator */

always @( posedge clk or negedge rst_n ) begin
if( !rst_n ) begin
x <= 64'sh0;
end else begin
x <= a + x;
end
end

endmodule

module test_ppa (
input wire clk,
input wire rst_n,
input wire signed [31:0] a,
output wire signed [63:0] x
);

integer i;

wire signed [63:0] aa;
wire [63:1] ux;
reg [63:0] s;
reg [64:0] c;

/* sign-extend input a */

assign aa = $signed(a);

/* Register stage: output of 3-2 compressor (only one single full-adder stage) */

always @( posedge clk or negedge rst_n ) begin
if( !rst_n ) begin
s <= 64'h0;
c <= 65'h0;
end else begin
for( i=63; i>=0; i=i-1 ) begin
{c[i+1],s[i]} <= aa[i] + s[i] +c[i];
end
c[0] <= 1'b0;
end
end

/* 64 bit adder is outside the critical loop and can be pipelined if needed */

assign x = $signed(s) + $signed(c[63:0]);

endmodule

The code above demonstrates how a compressor (Wallace tree) can be used if timing cannot be closed in an accumulator stage. On the left side the input "a" is accumulated every clock cycle into register x (output). If timing cannot be met, one way to add pipeline stages and still accumulate every clock cycle is to register the CSA stage and move the final adder outside the accumulator loop. Once the adder is outside the loop, it can be pipelined, which increases the latency without breaking single-cycle accumulation. The RTL on the right side needs only a single full-adder stage in the feedback loop, so the accumulation can run at extremely high speed. This comes at the cost of twice as many registers to hold the accumulated value. The example is a very simple demonstration of how to split a CSA tree. It could also involve more complex arithmetic, including multipliers. An example use case would be a noise shaper.
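The carry-save idea can be sketched in a few lines of Python. This is only a behavioural model under the assumption that each loop iteration corresponds to one clock edge; csa_step and accumulate are illustrative names. The invariant is that s + c (modulo 2**64) always equals the running sum, even though no carry ever ripples inside the loop.

```python
import random

MASK64 = (1 << 64) - 1

def csa_step(aa, s, c):
    """One clock of the carry-save accumulator: a 3:2 compressor per bit."""
    s_next = aa ^ s ^ c                                        # sum bit of each full adder
    c_next = (((aa & s) | (aa & c) | (s & c)) << 1) & MASK64   # carries move up one bit, c[0] = 0
    return s_next, c_next

def accumulate(samples):
    s = c = 0
    total = 0
    for a in samples:
        s, c = csa_step(a & MASK64, s, c)      # a & MASK64 sign-extends a to 64 bits
        total = (total + a) & MASK64
        assert (s + c) & MASK64 == total       # final adder sits outside the loop
    return s, c, total

rng = random.Random(5)
accumulate(rng.randrange(-2**31, 2**31) for _ in range(1000))
```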

1.8. Signed Fixed-Point Dot-Product (Q1.31)

RTL Code (left side): module test_must

Improved PPA (right side): module test_ppa

module test_must (
input wire signed [31:0] a,
input wire signed [31:0] b,
input wire signed [31:0] c,
input wire signed [31:0] d,
output wire signed [31:0] x
);

localparam signed [31:0] MAX_VAL32 = 32'sh7fff_ffff;
localparam signed [31:0] MIN_VAL32 = 32'sh8000_0000;
localparam signed [ 1:0] RND = 2'sh1;

wire signed [63:0] t0;
wire signed [63:0] t1;
wire signed [32:0] t2;
wire signed [32:0] t3;
wire signed [32:0] xxl;

/* Calculate intermediate terms */
assign t0 = a * b;
assign t1 = c * d;
assign t2 =$signed(t0[63:31]) + RND;
assign t3 =$signed(t1[63:31]) + RND;
assign xxl = $signed(t2[32:1]) + $signed(t3[32:1]);

/* final result with saturation using MSBs for saturation logic */
assign x =
(xxl[32:31]==2'b10)? MIN_VAL32 :
(xxl[32:31]==2'b01)? MAX_VAL32 :
$signed(xxl[31:0]);

endmodule

module test_ppa (
input wire signed [31:0] a,
input wire signed [31:0] b,
input wire signed [31:0] c,
input wire signed [31:0] d,
output wire signed [31:0] x
);

localparam signed [31:0] MAX_VAL32 = 32'sh7fff_ffff;
localparam signed [31:0] MIN_VAL32 = 32'sh8000_0000;
localparam signed [64:0] MAX_VAL65 = 65'sh0_7fff_ffff_ffff_ffff;
localparam signed [64:0] MIN_VAL65 = 65'sh1_8000_0000_0000_0000;
localparam signed [32:0] RND = 33'sh0_8000_0000;

wire signed [64:0] xxl;

/* Full width dot-product and rounding */
assign xxl= a * b + c * d + RND;

/* final result with saturation done on full width of dot-product */
assign x =
(xxl<=MIN_VAL65)? MIN_VAL32 :
(xxl> MAX_VAL65)? MAX_VAL32 :
$signed(xxl[63:32]);

endmodule

This example shows a signed fixed-point dot-product with input and output values in the range of ±1.0 (Q1.31). The implementation on the right side results in better performance, power and area, and its result is more accurate. The outputs of the two implementations can differ by ±1 LSB. The difference is due to the rounding of the intermediate terms (t2 and t3). In many cases there will be a register stage for t2 and t3 to break the critical path and meet timing requirements. The logic depth of the complete dot-product on the right side is actually about the same as the logic depth of the intermediate terms t2 and t3.
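The ±1 LSB difference can be reproduced with a short Python model. The names dot_must and dot_ppa are made up here; inputs are signed integers in the 32-bit range, and the saturation is modelled as a clamp, which is what the MSB decode on the left side amounts to.

```python
import random

def sat32(v):
    return max(-(1 << 31), min((1 << 31) - 1, v))

def dot_must(a, b, c, d):
    # each product is rounded to 32 bits first, then the two results are added
    t2 = ((a * b >> 31) + 1) >> 1
    t3 = ((c * d >> 31) + 1) >> 1
    return sat32(t2 + t3)

def dot_ppa(a, b, c, d):
    # one full-width sum, rounded once
    xxl = a * b + c * d + (1 << 31)
    if xxl <= -(1 << 63):
        return -(1 << 31)
    if xxl > (1 << 63) - 1:
        return (1 << 31) - 1
    return xxl >> 32

def max_abs_diff(trials=10000, seed=6):
    rng = random.Random(seed)
    worst = 0
    for _ in range(trials):
        v = [rng.randrange(-2**31, 2**31) for _ in range(4)]
        worst = max(worst, abs(dot_must(*v) - dot_ppa(*v)))
    return worst

assert max_abs_diff() <= 1
```

For example, with all four inputs equal to 2**15 each product is a quarter of an LSB: rounding them separately gives 0, while rounding the half-LSB sum once gives 1.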

The RTL on the right side implements the following improvements to reduce logic depth:

  1. For good PPA both multipliers and the adders need to be combined into a single Carry-Save-Adder. This could be achieved with a 64 bit adder for xxl on the left side or simply adding both results of the multiplier (t0 and t1) using the full bit width of the result of the multiplier. It is also important to have the correct bit width of the result, even if the application will never overflow and no saturation is needed.
  2. Rounding needs to be done by including the rounding term in full bit width. In the RTL on the left side the rounding term cannot be included into the Carry-Save-Adder.
  3. Saturation logic (clipping): In many cases the system requires protection against overflow. In this example the result of the dot-product before rounding and saturation would be Q2.63. To be in range, one MSB bit needs to be dropped, and saturation logic is activated when the 2 MSBs are different. This can be done simply by looking at the MSB once the result of the dot-product is available. The MSB will be the last bit available and would then need buffers to drive the multiplexers to select the values for the saturation logic. On the right side, the saturation logic is done with magnitude comparators of the full result (xxl), which allows merging the magnitude operation into the Carry-Save-Adder. In doing this, an overflow or underflow which will trigger the saturation is detected before the result of the multiplier is available and there is extra time for buffering. This way the penalty for the saturation logic is only a single multiplexer stage.

The dot-product on the right side will have fewer than 30 levels of logic and in most cases can be implemented without any pipeline registers. Adding pipeline registers on the left side is straightforward, because intermediate terms are available. But it is important to realize that the 32 bit multiplication on the left side has only about 3 fewer levels of logic than the entire dot-product on the right side. Moving the saturation logic into the next cycle would save one extra level, which might not be sufficient.

These are the options for adding a pipeline stage:

  1. Adding multiple pipeline stages at the output and allowing re-timing in synthesis. This is not always an option because of potential issues with other tools.
  2. Using partial product terms to replace the 32x32 bit multiplier with 16x32 bit multipliers in the first stage and having another Carry-Save-Adder in the second pipeline stage.
  3. Low level implementation with a Booth-Compressor and a Wallace-tree in the first stage and the final adder in the second stage. This would result in a very good implementation but lacks flexibility.

2.1. Transformations to Optimize Datapath

RTL Code (left side): module test_must

Improved PPA (right side): module test_ppa

module test_must (
input wire [15:0] a,
input wire [15:0] b,
input wire cin,
input wire ctl,
output wire [16:0] x0,
output wire x1,
output wire [16:0] x2,
output wire [16:0] x3
);

/* Carry-in is prioritized */
assign x0 = a + b + cin;

/* Input ctl is prioritized: after mapping it is applied after the comparison */
assign x1 = ctl? a==b : 1'b0;

/* Adding constant 1 does not impact area or performance */
assign x2 = a + b + 1'b1;

/* Subtract or a + ~b + 1 */
assign x3 = a - b;

endmodule

module test_ppa (
input wire [15:0] a,
input wire [15:0] b,
input wire cin,
input wire ctl,
output wire [16:0] x0,
output wire x1,
output wire [16:0] x2,
output wire [16:0] x3
);

wire [17:0] x0_xxl;

/* Carry-in merged into adder to speed up input operands "a" and "b" */
assign x0_xxl = {a,1'b1} + {b,cin};
assign x0 = x0_xxl[17:1];

/* Input ctl is merged into the comparator */
assign x1 = {1'b1,a} == {ctl,b};

/* Using 1's complement */
assign x2 = ~(~{1'b0,a} + ~{1'b0,b});

/* Subtract: Path b is untouched */
assign x3 = ~(~{1'b0,a} + {1'b0,b});

endmodule

The RTL above can be used for optimizations that merge arithmetic onto a single Carry-Save-Adder. These transformations can be very useful in some cases, but they need to be applied very carefully.
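Because these identities are easy to get wrong, it may help to check them numerically before using them. The Python sketch below (illustrative names; 16-bit operands and 17-bit results as in the RTL) compares each transformed expression with its plain form.

```python
import random

M17 = (1 << 17) - 1

def x0_ppa(a, b, cin):
    # {a,1'b1} + {b,cin}, then drop bit 0: the carry-in rides in an extra LSB column
    return (((a << 1) | 1) + ((b << 1) | cin)) >> 1

def x1_ppa(a, b, ctl):
    # {1'b1,a} == {ctl,b}: ctl becomes one more bit of the comparator
    return int(((1 << 16) | a) == ((ctl << 16) | b))

def x2_ppa(a, b):
    # ~(~a + ~b) on 17 bits equals a + b + 1
    return ~((~a & M17) + (~b & M17)) & M17

def x3_ppa(a, b):
    # ~(~a + b) on 17 bits equals a - b, with path b left untouched
    return ~((~a & M17) + b) & M17

def check(trials=10000, seed=7):
    rng = random.Random(seed)
    for _ in range(trials):
        a, b = rng.getrandbits(16), rng.getrandbits(16)
        cin, ctl = rng.getrandbits(1), rng.getrandbits(1)
        assert x0_ppa(a, b, cin) == a + b + cin
        assert x1_ppa(a, b, ctl) == int(bool(ctl) and a == b)
        assert x2_ppa(a, b) == (a + b + 1) & M17
        assert x3_ppa(a, b) == (a - b) & M17
    return True

check()
```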

2.2. Integer (16 bits) Multiply Add Sub Function

RTL Code (left side): module test_must

Improved PPA (right side): module test_ppa

module test_must (
input wire signed [15:0] a,
input wire signed [15:0] b,
input wire signed [15:0] c,
input wire ctl_add,
output wire signed [15:0] x
);

wire signed [31:0] xxl;

/* Using if-statement to implement mul addsub function */
assign xxl = ctl_add? c + a * b : c - a * b;
assign x = $signed(xxl[15:0]);

endmodule

module test_ppa (
input wire signed [15:0] a,
input wire signed [15:0] b,
input wire signed [15:0] c,
input wire ctl_add,
output wire signed [15:0] x
);

wire signed [31:0] xxl;

/* Implementation of a mul-addsub function on a MAC unit. */
assign xxl = (({16{!ctl_add}} ^ c) + $unsigned(a) * $unsigned(b));
assign x = $signed({16{!ctl_add}} ^ xxl[15:0]);

endmodule

Both versions are formally equivalent. The example shows how the slightly less critical input c can be manipulated to map the function c ± a * b onto a multiply-add unit. Attention needs to be paid to sign extension, because unsigned arithmetic is used to implement signed arithmetic.
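The trick rests on the identity c - p = ~(~c + p) in two's complement. A minimal Python sketch of both versions, with hypothetical function names and raw 16-bit patterns as inputs, looks like this:

```python
import random

M16 = 0xFFFF

def sext(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def x_must(a, b, c, ctl_add):
    # behavioural form: select between c + a*b and c - a*b
    p = sext(a, 16) * sext(b, 16)
    cs = sext(c, 16)
    return (cs + p if ctl_add else cs - p) & M16

def x_ppa(a, b, c, ctl_add):
    # invert c before the MAC and invert the result again when subtracting
    inv = 0 if ctl_add else M16
    xxl = (inv ^ c) + a * b          # unsigned multiply-add on the raw patterns
    return (inv ^ xxl) & M16

def check(trials=10000, seed=8):
    rng = random.Random(seed)
    for _ in range(trials):
        a, b, c = (rng.getrandbits(16) for _ in range(3))
        for ctl_add in (0, 1):
            assert x_must(a, b, c, ctl_add) == x_ppa(a, b, c, ctl_add)
    return True

check()
```

The unsigned product is sufficient here only because the output keeps the low 16 bits, which are identical for signed and unsigned multiplication.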

2.3. Implementation of Function abs(a-b) on a Single Adder

RTL Code (left side): module test_must

Improved PPA (right side): module test_ppa

module test_must (
input wire [15:0] a,
input wire [15:0] b,
output wire [15:0] x,
output wire flag  // asserted when a > b
);

/* abs(a-b): subtract larger value from the smaller (unsigned) */
assign flag = a > b;
assign x = flag? a - b : b - a;

endmodule

module test_ppa (
input wire [15:0] a,
input wire [15:0] b,
output wire [15:0] x,
output wire flag  // asserted when a > b
);

wire signed [33:0] xxl;

/* abs(a-b): On a single data-path using an adder of twice the width. */
assign xxl = {a,1'b1,a} + {~b,1'b0,~b};

/* Extract the result from xxl. */
assign flag = !xxl[16];  // carry bit
assign x[0] = !xxl[0];  // x[0] can be simplified
assign x[15:1] = xxl[32:18] ^ {15{!flag}};

endmodule

The function abs(a-b) can be implemented on a single adder of twice the width using the transformation shown above. Doubling the width of an adder increases the logic depth by only one extra AOI21 (or OAI21) gate. In many cases the extra EXOR gate at the output of the adder can be merged into the following logic.
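The double-width adder can be traced in Python. In this sketch (absdiff_ppa is an illustrative name) the lower half computes a + ~b, its carry-out tells whether a > b, and the separator bit passes that carry into the upper half so that it yields either a - b or ~(b - a).

```python
import random

M16 = 0xFFFF

def absdiff_ppa(a, b):
    nb = ~b & M16
    # {a, 1'b1, a} + {~b, 1'b0, ~b}
    xxl = ((a << 17) | (1 << 16) | a) + ((nb << 17) | nb)
    flag = ((xxl >> 16) & 1) ^ 1        # separator bit is cleared by the lower carry
    x0 = (xxl & 1) ^ 1                  # bit 0 taken from the lower half
    hi = (xxl >> 18) & 0x7FFF           # xxl[32:18]
    if not flag:
        hi ^= 0x7FFF                    # undo the 1's complement when a <= b
    return (hi << 1) | x0, flag

def check(trials=10000, seed=9):
    rng = random.Random(seed)
    for _ in range(trials):
        a, b = rng.getrandbits(16), rng.getrandbits(16)
        assert absdiff_ppa(a, b) == (abs(a - b), int(a > b))
    return True

check()
```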

3.1. Logic, Arithmetic Shift Right and Left and Rotate Right and Left

RTL Code (left side): module test_must

Improved PPA (right side): module test_ppa

module test_must (
input wire [31:0] a,
input wire [ 5:0] sh,
input wire sh_left,  // left or right shift
input wire sh_arith,  // arithmetic or logic shift
input wire sh_rot,  // rotate or shift
output reg [31:0] x,
output wire zero,
output reg ov  // negated except for overflow when shift left
);

/* behavioral code */

wire a_is_zero;

assign a_is_zero = a==32'h0;

assign zero = x==32'h0;

always @(*) begin: pX_OV
case( {sh_rot, sh_arith, sh_left} )
3'b000: begin: pLSHR
x = a >> sh;
ov = 1'b0;
end
3'b001: begin: pLSHL
reg [31:0] t;
{t,x} = a << sh;
ov = (t!=32'h0 || sh[5]) && !a_is_zero;
end
3'b010: begin: pASHR
x = $unsigned($signed(a) >>> sh);
ov = 1'b0;
end
3'b011: begin: pASHL
reg signed [63:0] t;
t = $signed(a);
t = t <<< sh;
x = t[31:0];
ov = (t[63:31] != {33{a[31]}} || sh[5]) && !a_is_zero;
end
3'b100,
3'b110: begin: pROR
x = {a,a} >> sh[4:0];
ov = 1'b0;
end
default: begin: pROL
reg [31:0] t;
{x,t} = {a,a} << sh[4:0];
ov = 1'b0;
end
endcase
end

endmodule

module test_ppa (
input wire [31:0] a,
input wire [ 5:0] sh,
input wire sh_left,  // left or right shift
input wire sh_arith,  // arithmetic or logic shift
input wire sh_rot,  // rotate or shift
output wire [31:0] x,
output wire zero,
output reg ov  // negated except for overflow when shift left
);

wire [32:0] x_ror;
reg [32:0] msk;
reg [32:0] rev_msk;
wire a_is_zero;
wire [31:0] s_ext;
wire [ 4:0] sh_lr;

assign sh_lr = sh[4:0] ^ {5{sh_left}};
assign x_ror = {a,a} >> sh_lr;
assign s_ext = {32{a[31] && sh_arith && !sh_rot}} & ~msk;
assign a_is_zero = a==32'h0;

always @(*) begin: pMASK  // calculating mask for result
if( sh_rot ) begin  // no mask for rotation
msk = {33{!sh_left}};
end else begin
msk = sh[5]? {33{sh_left}} : 33'h0_ffff_ffff >> sh_lr;
end
end

always @(*) begin: pREV_MASK  // bit reverse (no logic): rev_msk[32:0] = msk[0:32];
integer i;
for ( i=0 ; i <= 32 ; i=i+1 ) begin
rev_msk[i] = msk[32-i];  // blocking assignment for combinational logic
end
end

always @(*) begin: pX_OV
case( {sh_rot, sh_arith, sh_left} )
3'b001: begin: pLSHL
ov = ({sh[5], a&rev_msk[31:0]} != 33'h0) && !a_is_zero;
end
3'b011: begin: pASHL
ov = ({sh[5], a&rev_msk[32:1]} != {1'b0, {32{a[31]}}&rev_msk[32:1]}) && !a_is_zero;
end
default: begin: pDEFAULT
ov = 1'b0;
end
endcase
end

assign x = sh_left? x_ror[32:1] & ~msk[32:1] : x_ror[31:0] & msk[31:0] | s_ext;
assign zero = sh_left? (a & ~rev_msk[31:0]) == 32'h0 : {s_ext[31], a&rev_msk[32:1]} == 33'h0;

endmodule

This is an example implementation of logic and arithmetic shift and rotate functions (left and right). The RTL on the left side is based on a straightforward behavioural description. The RTL on the right side minimizes the levels of logic by mapping all shift operations onto a single rotate-right operator and calculating a mask in parallel. The result of the shift operation is then simply a logic operation between the rotate result and the mask. A left operation uses the 1's complement of the shift amount and is then corrected by one extra shift. In addition to the result of the shift operation, the rotate and mask allow a simple calculation of the flags (zero and overflow) without adding to the logic depth.
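The rotate-plus-mask idea is easiest to see for the logical shifts alone. The Python sketch below is a simplified model that assumes a shift amount of 0..31 and leaves out the arithmetic shift, rotation, sh[5] and the flags; shift_via_rotate is a made-up name.

```python
import random

M32 = 0xFFFFFFFF

def shift_via_rotate(a, sh, sh_left):
    """Logical shift by sh (0..31) using one rotate-right plus a mask."""
    sh_lr = sh ^ (31 if sh_left else 0)                   # 1's complement of sh for left shifts
    x_ror = (((a << 32) | a) >> sh_lr) & ((1 << 33) - 1)  # 33-bit slice of {a,a} >> sh_lr
    msk = M32 >> sh_lr                                    # 33'h0_ffff_ffff >> sh_lr
    if sh_left:
        # bits [32:1] give the extra shift that turns 31-sh into 32-sh
        return (x_ror >> 1) & ~(msk >> 1) & M32
    return x_ror & msk

def check(trials=2000, seed=10):
    rng = random.Random(seed)
    for _ in range(trials):
        a = rng.getrandbits(32)
        for sh in range(32):
            assert shift_via_rotate(a, sh, True) == (a << sh) & M32
            assert shift_via_rotate(a, sh, False) == a >> sh
    return True

check()
```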

3.2. SIN Generation for 18 bits (EXPERIMENTAL)

DPI Code (C reference)

Synthesizable RTL Code

module sin_dpi #(
parameter integer WI = 18
) (
input wire clk,
input wire rst_n,
input wire [WI-1:0] a,
output wire [WI:0] x
);

/*
long long Usin_dpi::sin( long long a ){
long double angle;
long double dy;
long long max_val = 1LL << WI;
long double max_angle = std::numbers::pi / (2.0L * (long double)max_val);
angle = (long double)a * max_angle;
dy = max_val * std::sin( angle );
return (long long)round(dy);
} */

endmodule

module sin_18 (
input wire clk,
input wire rst_n,
input wire [17:0] a,
output wire [18:0] x
);

/*
Sin calculation using LUTs followed by arithmetic
... to see code, please download ...
*/

endmodule

This code is currently available as a pre-release upon request. It has passed regression tests and functions reliably. However, some essential steps for area optimization are still pending, and a pipeline stage between the look-up table (LUT) stage and the arithmetic could be beneficial.
The computation of the sin function relies on look-up tables optimized for size. A full comparison between the Verilog implementation and a C reference model is run through the DPI interface to make sure the result of the RTL matches the reference model for each output bit.
Verilator serves as the simulation tool in the provided example.
The input of the sin function is 18 bits wide and covers the range from 0.0 (inclusive) to π/2 (exclusive), evenly distributed across 18'h0_0000 to 18'h3_ffff. The output is a 19-bit unsigned value in UQ1.18 format. Because the maximum result is 1.0, the RTL output ranges from 19'h0_0000 to 19'h4_0000. The results are bit-accurate: every output matches the value computed with long double accuracy using the formula x = round(max_val * sin((a*π) / (2.0*max_val))), where max_val equals 2**18 (1 << WI in the reference code).
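For a quick look at the expected values, the reference formula can be evaluated in Python. This is only an approximation of the DPI model: it uses double precision instead of long double, so borderline rounding cases could in principle differ, and sin_ref is a name chosen for this sketch.

```python
import math

WI = 18
MAX_VAL = 1 << WI   # 2**18, as in the reference code (1LL << WI)

def sin_ref(a):
    """Expected output for input code a in [0, 2**WI)."""
    angle = a * math.pi / (2.0 * MAX_VAL)
    # floor(x + 0.5) rounds half up, matching C round() for non-negative values
    return int(math.floor(MAX_VAL * math.sin(angle) + 0.5))

print(hex(sin_ref(0)), hex(sin_ref(1 << 17)), hex(sin_ref(MAX_VAL - 1)))
```

The largest input code already rounds to 19'h4_0000, which is why the output needs 19 bits.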


