TealSemi

TealSemi GmbH email: tealsemi@gmail.com HRB: 254216 (Munich) VAT: DE328043994

TealSemi GmbH offers help with front-end RTL design at various stages:

Contribution to RTL design
Optimizing data path
Reviews, suggestions for optimizations
Synthesis
... and more ...

Examples of Optimizing RTL Coding for Performance, Power and Area (PPA)

Below are several optimized RTL code examples, each accompanied by a basic testbench. The downloadable package includes:

Simulation-ready code: All examples can be directly simulated using the provided testbenches.
Icarus Verilog script: A preconfigured script run_sim for immediate simulation in Icarus.
Multi-simulator support: While the examples are preconfigured for Icarus, they can be easily ported to other Verilog simulators (e.g., Xcelium or Questa). See the comments in the run_sim script for guidance on adapting it to different tools.

Table of Contents

1. Fixed-Point and Integer Arithmetic

1.1. Sign Extensions (e.g., 32x16-bit Integer Multiplier with 32-bit Result)

1.2. Mixture of Signed and Unsigned Operations

1.3. Ordering of Multipliers and Adders

1.4. Rounding (Q1.31 Case Study)

1.5. Multiply-Add Operation for Q1.31

1.6. Multiply-Add Operation with Saturation (Q1.31)

1.7. Pipelined Accumulator (Signed)

1.8. Signed Fixed-Point Dot-Product (Q1.31)

2. Datapath Optimization

2.1. Transformations for Datapath Optimization

2.2. Integer (16-bit) Multiply-Add-Sub Function

2.3. Implementing abs(a-b) on a Single Adder

3. Specialized Operations

3.1. Logic, Arithmetic Shifts, and Rotations

3.2. SIN Generation for 18-bit Systems (Experimental)

3.3. Multiplier for Complex Numbers

1.1. Sign Extensions (e.g., 32x16-bit Integer Multiplier with 32-bit Result)

RTL Code

Improved PPA

module test_must (
input wire [31:0] a,
input wire [15:0] b,
output wire [31:0] x
);

/* Signed integer multiplication executed on an unsigned multiplier */
assign x = a * {{16{b[15]}},b};

endmodule

module test_ppa (
input wire [31:0] a,
input wire [15:0] b,
output wire [31:0] x
);

wire signed [47:0] xxl;

/* Signed integer multiplication executed on a signed multiplier */
assign xxl = $signed(a) * $signed(b);
assign x = xxl[31:0];

endmodule

Both implementations are formally equivalent, but the right-side code typically synthesizes to a smaller area and a higher operating frequency. Note that most linting tools will flag the optimized version (right) because the operands a and b differ in bit width. In SpyGlass, for example, rules such as W116/W164a/W164b (bit-width mismatches) and W224 (multi-bit logical operations) are likely to fire; these warnings should be waived during verification. The area saving is most significant for multipliers and less so for adders. The same principle applies to unsigned operations.
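The equivalence claim can be spot-checked in simulation. The testbench below is a minimal sketch and is not taken from the downloadable package; the module name tb_sign_ext and the internal net names are hypothetical. It evaluates both expressions on a few corner cases and on random operands, and counts mismatches.

```verilog
`timescale 1ns/1ps
// Hypothetical self-checking testbench (not part of the TealSemi package).
module tb_sign_ext;
  reg  [31:0] a;
  reg  [15:0] b;

  // Left-side expression: manual sign extension, unsigned 32x32 multiply.
  wire [31:0] x_must = a * {{16{b[15]}}, b};

  // Right-side expression: signed 32x16 multiply, lower 32 bits kept.
  wire signed [47:0] xxl   = $signed(a) * $signed(b);
  wire        [31:0] x_ppa = xxl[31:0];

  integer i, errors;

  // Let the continuous assignments settle, then compare both results.
  task check;
    begin
      #1;
      if (x_must !== x_ppa) begin
        errors = errors + 1;
        $display("MISMATCH a=%h b=%h must=%h ppa=%h", a, b, x_must, x_ppa);
      end
    end
  endtask

  initial begin
    errors = 0;
    a = 32'h8000_0000; b = 16'h8000; check;   // most negative operands
    a = 32'hffff_ffff; b = 16'hffff; check;   // -1 * -1
    a = 32'h7fff_ffff; b = 16'h7fff; check;   // most positive operands
    for (i = 0; i < 1000; i = i + 1) begin
      a = $random; b = $random; check;
    end
    if (errors == 0) $display("PASS");
    else             $display("FAIL: %0d mismatches", errors);
    $finish;
  end
endmodule
```

A random spot check of this kind does not replace formal equivalence checking; it is only a quick sanity test before synthesis.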

1.2. Mixture of Signed and Unsigned Operations

RTL Code

Improved PPA

module test_must (
input wire signed [15:0] sa,
input wire [15:0] ub,
output wire signed [31:0] x
);

wire [16:0] ua;
wire [32:0] uxxl;
wire signed [32:0] xxl;

/* Arithmetic mapped to unsigned multiplier */
assign ua = sa[15]? -sa : sa;
assign uxxl = ua * ub;
assign xxl = $signed(sa[15]? -uxxl : uxxl);
assign x = $signed(xxl[31:0]);

endmodule

module test_ppa (
input wire signed [15:0] sa,
input wire [15:0] ub,
output wire signed [31:0] x
);

wire signed [32:0] xxl;

/* Signed arithmetic mapped to signed multiplier */
assign xxl = sa * $signed({1'b0,ub});
assign x = $signed(xxl[31:0]);

endmodule

Both versions of the code are formally equivalent. When performing arithmetic in Verilog, it is best practice to use signed arithmetic if any operand (including the result) is signed. In such cases, all unsigned operands should be explicitly cast to signed using $signed() to ensure correct and predictable behavior.

In terms of hardware, signed and unsigned multipliers of the same bit-width have similar area and performance characteristics. However, the code on the right is more concise and typically synthesizes to a smaller circuit with less logic depth, since it leverages Verilog’s built-in signed arithmetic and reduces the need for manual sign extension or additional logic.

1.3. Ordering of Multipliers and Adders

RTL Code

Improved PPA

module test_must (
input wire [31:0] a,
input wire [31:0] b,
output wire [63:0] x
);

/* Addition is followed by multiplier */
assign x = a * (b + 1'b1);

endmodule

module test_ppa (
input wire [31:0] a,
input wire [31:0] b,
output wire [63:0] x
);

/* Addition is merged into the multiplier (single CSA tree) */
assign x = a * b + a;

endmodule

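Both codes are formally equivalent. The RTL on the right reduces logic depth while maintaining similar area. On the left, the adder has to complete its carry propagation before the multiplier can start, so the two sequential operations cannot be merged into one structure. On the right, the extra operand a is just one more row in the multiplier's partial-product array, so the whole expression maps onto a single Carry-Save Adder (CSA) tree, which speeds up multi-operand addition by deferring carry propagation to one final adder.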

1.4. Rounding (Q1.31 Case Study)

RTL Code

Improved PPA

module test_must (
input wire signed [31:0] a,
input wire signed [31:0] b,
output wire signed [31:0] x
);

localparam signed [1:0] RND = 2'sd1;

wire signed [63:0] xxl0;
wire signed [32:0] xxl1;

/* Rounding is done on an adder with minimum bit width */
assign xxl0 = a * b;
assign xxl1 = $signed(xxl0[63:31]) + RND;
assign x = $signed(xxl1[32:1]);

endmodule

module test_ppa (
input wire signed [31:0] a,
input wire signed [31:0] b,
output wire signed [31:0] x
);

localparam signed [32:0] RND = 33'sh0_8000_0000;

wire signed [63:0] xxl;

/* Rounding is merged into the multiplier (single CSA tree) */
assign xxl = a * b + RND;
assign x = $signed(xxl[63:32]);

endmodule

The two implementations produce numerically equivalent results. The left RTL code minimizes the adder bit width by truncating before rounding, which introduces additional logic levels. In contrast, the optimized RTL on the right combines the multiplier and rounding operations, allowing the design to be mapped onto a single Carry-Save Adder. This results in reduced logic depth and overall area savings.
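The numerical equivalence of the two rounding styles can be checked with a small testbench. This is a sketch under the assumption that both expressions are copied unchanged from the modules above; the module name tb_round_q31 and the net names are hypothetical. The corner cases include exact ties (product equal to ±2^31), where a rounding mismatch would show up first.

```verilog
`timescale 1ns/1ps
// Hypothetical testbench comparing both rounding styles.
module tb_round_q31;
  reg signed [31:0] a, b;

  // Left: truncate to 33 bits, add 1 at the new LSB, then drop that LSB.
  wire signed [63:0] p      = a * b;
  wire signed [32:0] r1     = $signed(p[63:31]) + 2'sd1;
  wire signed [31:0] x_must = $signed(r1[32:1]);

  // Right: add 2^31 at full width, then keep bits [63:32].
  wire signed [63:0] r2    = a * b + 33'sh0_8000_0000;
  wire signed [31:0] x_ppa = $signed(r2[63:32]);

  integer i, errors;

  task check;
    begin
      #1;
      if (x_must !== x_ppa) begin
        errors = errors + 1;
        $display("MISMATCH a=%h b=%h must=%h ppa=%h", a, b, x_must, x_ppa);
      end
    end
  endtask

  initial begin
    errors = 0;
    a = 32'sh8000_0000; b = 32'sh8000_0000; check;  // largest product, 2^62
    a = 32'sd65536;     b = 32'sd32768;     check;  // product = +2^31 (tie)
    a = -32'sd65536;    b = 32'sd32768;     check;  // product = -2^31 (tie)
    for (i = 0; i < 1000; i = i + 1) begin
      a = $random; b = $random; check;
    end
    if (errors == 0) $display("PASS");
    else             $display("FAIL: %0d mismatches", errors);
    $finish;
  end
endmodule
```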

1.5. Multiply-Add Operation for Q1.31

RTL Code

Improved PPA

module test_must (
input wire signed [31:0] a,
input wire signed [31:0] b,
input wire signed [31:0] c,
output wire signed [32:0] x
);

localparam signed [1:0] RND = 2'sd1;

wire signed [63:0] xxl0;
wire signed [33:0] xxl1;

/* MAC operation and rounding is done on a separate adder with minimum bit width */
assign xxl0 = a * b;
assign xxl1 = $signed(xxl0[63:31]) + $signed({c, RND[0]});
assign x = $signed(xxl1[33:1]);

endmodule

module test_ppa (
input wire signed [31:0] a,
input wire signed [31:0] b,
input wire signed [31:0] c,
output wire signed [32:0] x
);

localparam signed [32:0] RND = 33'sh0_8000_0000;

wire signed [64:0] xxl;

/* MAC operation and rounding are mapped onto a single CSA tree */
assign xxl = a * b + $signed({c, RND[31:0]});
assign x = $signed(xxl[64:32]);

endmodule

Both implementations produce formally equivalent results. As demonstrated in the rounding optimization example, the right-side RTL structure enables synthesis to a single Carry-Save Adder (CSA) by maintaining un-truncated intermediate values. This approach eliminates intermediate carry propagation stages, reducing critical path delay (performance), minimizing logic levels (power), and enabling more compact physical implementation (area) - collectively improving PPA (Power-Performance-Area) metrics.

1.6. Multiply-Add Operation with Saturation (Q1.31)

RTL Code

Improved PPA

module test_must (
input wire signed [31:0] a,
input wire signed [31:0] b,
input wire signed [31:0] c,
output wire signed [31:0] x
);

localparam signed [32:0] RND = 33'sh0_8000_0000;

wire signed [64:0] xxl;

/* Multiply-Add followed by saturation logic */
assign xxl = a * b + $signed({c, RND[31:0]});
assign x =
(xxl[64:63]==2'b10)? 32'sh8000_0000 :
(xxl[64:63]==2'b01)? 32'sh7fff_ffff :
$signed(xxl[63:32]);

endmodule

module test_ppa (
input wire signed [31:0] a,
input wire signed [31:0] b,
input wire signed [31:0] c,
output wire signed [31:0] x
);

localparam signed [64:0] MAX_VAL = 65'sh0_7fff_ffff_ffff_ffff;
localparam signed [64:0] MIN_VAL = 65'sh1_8000_0000_0000_0000;
localparam signed [32:0] RND = 33'sh0_8000_0000;

wire signed [64:0] xxl;

/* Control for saturation is mapped into the CSA tree */
assign xxl = a * b + $signed({c, RND[31:0]});
assign x =
(xxl<=MIN_VAL)? 32'sh8000_0000 :
(xxl> MAX_VAL)? 32'sh7fff_ffff :
$signed(xxl[63:32]);

endmodule

While formally equivalent, the right-side implementation detects saturation through full-width magnitude comparisons on xxl instead of decoding its two MSBs. Because the comparison is expressed on the same full-width term as the multiply-add, the comparator logic can be merged into the Carry-Save Adder structure rather than waiting for the final sum bits. The resulting reduction in logic levels improves timing, though implementations with very relaxed timing constraints may incur a minor area penalty from the wider comparators.
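The two saturation styles can be compared directly in simulation. The following testbench is a sketch; tb_sat and the net names x_msb and x_cmp are hypothetical. The corner cases are chosen so that both the positive and the negative saturation branch are exercised, which random stimulus alone rarely achieves.

```verilog
`timescale 1ns/1ps
// Hypothetical testbench: MSB-decoded versus comparison-based saturation.
module tb_sat;
  localparam signed [64:0] MAX_VAL = 65'sh0_7fff_ffff_ffff_ffff;
  localparam signed [64:0] MIN_VAL = 65'sh1_8000_0000_0000_0000;

  reg signed [31:0] a, b, c;

  // Shared multiply-add with rounding constant 2^31, as in both modules.
  wire signed [64:0] xxl = a * b + $signed({c, 32'h8000_0000});

  // Left: decode the two MSBs of the 65-bit result.
  wire signed [31:0] x_msb =
    (xxl[64:63] == 2'b10) ? 32'sh8000_0000 :
    (xxl[64:63] == 2'b01) ? 32'sh7fff_ffff :
    $signed(xxl[63:32]);

  // Right: full-width signed comparisons.
  wire signed [31:0] x_cmp =
    (xxl <= MIN_VAL) ? 32'sh8000_0000 :
    (xxl >  MAX_VAL) ? 32'sh7fff_ffff :
    $signed(xxl[63:32]);

  integer i, errors;

  task check;
    begin
      #1;
      if (x_msb !== x_cmp) begin
        errors = errors + 1;
        $display("MISMATCH a=%h b=%h c=%h msb=%h cmp=%h", a, b, c, x_msb, x_cmp);
      end
    end
  endtask

  initial begin
    errors = 0;
    // Positive overflow: 2^62 plus a large positive c.
    a = 32'sh8000_0000; b = 32'sh8000_0000; c = 32'sh7fff_ffff; check;
    // Negative overflow: large negative product plus most negative c.
    a = 32'sh8000_0000; b = 32'sh7fff_ffff; c = 32'sh8000_0000; check;
    // Boundary: c alone at the most negative value, product zero.
    a = 32'sh0;         b = 32'sh0;         c = 32'sh8000_0000; check;
    for (i = 0; i < 1000; i = i + 1) begin
      a = $random; b = $random; c = $random; check;
    end
    if (errors == 0) $display("PASS");
    else             $display("FAIL: %0d mismatches", errors);
    $finish;
  end
endmodule
```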

1.7. Pipelined Accumulator (Signed)

RTL Code

Improved PPA

module test_must (
input wire clk,
input wire rst_n,
input wire signed [31:0] a,
output reg signed [63:0] x
);

/* register stage: accumulator */

always @( posedge clk or negedge rst_n ) begin
if( !rst_n ) begin
x <= 64'sh0;
end else begin
x <= a + x;
end
end

endmodule

module test_ppa (
input wire clk,
input wire rst_n,
input wire signed [31:0] a,
output wire signed [63:0] x
);

integer i;

wire signed [63:0] aa;
wire [63:1] ux;
reg [63:0] s;
reg [64:0] c;

/* sign-extend input a */

assign aa = $signed(a);

/* Register stage: output of 3-2 compressor (only one single full-adder stage) */

always @( posedge clk or negedge rst_n ) begin
if( !rst_n ) begin
s <= 64'h0;
c <= 65'h0;
end else begin
for( i=63; i>=0; i=i-1 ) begin
{c[i+1],s[i]} <= aa[i] + s[i] +c[i];
end
c[0] <= 1'b0;
end
end

/* 64-bit adder is outside the critical loop and can be pipelined if needed */

assign x = $signed(s) + $signed(c[63:0]);

endmodule

The code demonstrates a carry-save accumulator that keeps only a 3:2 compressor in the feedback loop to ease timing closure. The base implementation (left) accumulates input a into register x each cycle. For timing-critical cases, the following approach can be used:

Pipeline Strategically: Register 3:2 compression outputs and relocate the final CPA (Carry Propagate Adder) outside the loop, enabling multi-cycle addition without accumulation stalls.
Maintain Throughput: The revised RTL (right) preserves single-cycle accumulation using carry-save registers, keeping only a 3:2 compressor in the critical path.
Enable CPA Pipelining: External CPA stages can now be freely pipelined, trading latency for frequency while maintaining accumulation rate.

Architectural Tradeoffs:

Requires 2N registers (N-bit sum + N-bit carry)
Achieves significantly higher clock frequencies

Particularly effective for noise-shapers needing:

Cycle-exact accumulation for error feedback
High-speed coefficient multiplication
Scalable bit-widths through carry-save chaining
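To see that the carry-save registers track the plain accumulator cycle by cycle, both can be run side by side. The testbench below is a sketch; tb_csa_acc and its signal names are hypothetical, and the carry-save loop is copied from the right-side module rather than instantiated, so that the block is self-contained.

```verilog
`timescale 1ns/1ps
// Hypothetical testbench: plain accumulator versus carry-save accumulator.
module tb_csa_acc;
  reg               clk, rst_n;
  reg signed [31:0] a;
  reg signed [63:0] x_ref;   // plain accumulator (left-side style)
  reg        [63:0] s;       // carry-save sum register
  reg        [64:0] c;       // carry-save carry register

  wire signed [63:0] aa    = a;                              // sign-extended input
  wire signed [63:0] x_csa = $signed(s) + $signed(c[63:0]);  // final CPA

  integer i, n, errors;

  always #5 clk = ~clk;

  always @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      x_ref <= 64'sh0;
      s     <= 64'h0;
      c     <= 65'h0;
    end else begin
      x_ref <= a + x_ref;
      // One full-adder (3:2 compressor) per bit; no carry ripple in the loop.
      for (i = 0; i < 64; i = i + 1)
        {c[i+1], s[i]} <= aa[i] + s[i] + c[i];
      c[0] <= 1'b0;
    end
  end

  initial begin
    clk = 1'b0; rst_n = 1'b0; a = 32'sh0; errors = 0;
    #12 rst_n = 1'b1;
    for (n = 0; n < 200; n = n + 1) begin
      @(negedge clk);
      if (x_ref !== x_csa) begin
        errors = errors + 1;
        $display("MISMATCH cycle=%0d ref=%h csa=%h", n, x_ref, x_csa);
      end
      a = $random;
    end
    if (errors == 0) $display("PASS");
    else             $display("FAIL: %0d mismatches", errors);
    $finish;
  end
endmodule
```

The comparison is made on the falling clock edge so that both register sets have settled after the preceding rising edge.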

1.8. Signed Fixed-Point Dot-Product (Q1.31)

RTL Code

Improved PPA

module test_must (
input wire signed [31:0] a,
input wire signed [31:0] b,
input wire signed [31:0] c,
input wire signed [31:0] d,
output wire signed [31:0] x
);

localparam signed [31:0] MAX_VAL32 = 32'sh7fff_ffff;
localparam signed [31:0] MIN_VAL32 = 32'sh8000_0000;
localparam signed [ 1:0] RND = 2'sh1;

wire signed [63:0] t0;
wire signed [63:0] t1;
wire signed [32:0] t2;
wire signed [32:0] t3;
wire signed [32:0] xxl;

/* Calculate intermediate terms */
assign t0 = a * b;
assign t1 = c * d;
assign t2 =$signed(t0[63:31]) + RND;
assign t3 =$signed(t1[63:31]) + RND;
assign xxl = $signed(t2[32:1]) + $signed(t3[32:1]);

/* final result with saturation using MSBs for saturation logic */
assign x =
(xxl[32:31]==2'b10)? MIN_VAL32 :
(xxl[32:31]==2'b01)? MAX_VAL32 :
$signed(xxl[31:0]);

endmodule

module test_ppa (
input wire signed [31:0] a,
input wire signed [31:0] b,
input wire signed [31:0] c,
input wire signed [31:0] d,
output wire signed [31:0] x
);

localparam signed [31:0] MAX_VAL32 = 32'sh7fff_ffff;
localparam signed [31:0] MIN_VAL32 = 32'sh8000_0000;
localparam signed [64:0] MAX_VAL65 = 65'sh0_7fff_ffff_ffff_ffff;
localparam signed [64:0] MIN_VAL65 = 65'sh1_8000_0000_0000_0000;
localparam signed [32:0] RND = 33'sh0_8000_0000;

wire signed [64:0] xxl;

/* Full width dot-product and rounding */
assign xxl= a * b + c * d + RND;

/* final result with saturation done on full width of dot-product */
assign x =
(xxl<=MIN_VAL65)? MIN_VAL32 :
(xxl> MAX_VAL65)? MAX_VAL32 :
$signed(xxl[63:32]);

endmodule

This example shows a signed fixed-point dot-product with input and output values in the range of ±1.0 (Q1.31). The implementation on the right side results in better performance, power, and area characteristics. The two implementations are not bit-exact: results can differ by ±1 LSB, because the left side rounds each product separately (t2 and t3), whereas the right side rounds once on the full-width sum.
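The ±1 LSB difference is easy to reproduce with two hand-picked operand sets. The sketch below uses the hypothetical module name tb_dot_lsb and omits the saturation stage, since neither case overflows. In the first case each product is exactly 2^31 (a tie), in the second each product is 2^30 (a quarter of an output LSB).

```verilog
`timescale 1ns/1ps
// Hypothetical demo: per-product rounding versus single full-width rounding.
module tb_dot_lsb;
  reg signed [31:0] a, b, c, d;

  // Left-side style: round each product, then add.
  wire signed [63:0] t0     = a * b;
  wire signed [63:0] t1     = c * d;
  wire signed [32:0] t2     = $signed(t0[63:31]) + 2'sh1;
  wire signed [32:0] t3     = $signed(t1[63:31]) + 2'sh1;
  wire signed [32:0] x_must = $signed(t2[32:1]) + $signed(t3[32:1]);

  // Right-side style: add at full width, round once.
  wire signed [64:0] xxl   = a * b + c * d + 33'sh0_8000_0000;
  wire signed [31:0] x_ppa = $signed(xxl[63:32]);

  initial begin
    // Each product = 2^16 * 2^15 = 2^31: both halves round up on the left.
    a = 32'sd65536; b = 32'sd32768; c = a; d = b; #1;
    $display("ties:     must=%0d ppa=%0d", x_must, x_ppa);  // must=2 ppa=1

    // Each product = 2^15 * 2^15 = 2^30: both halves round down on the left.
    a = 32'sd32768; b = 32'sd32768; c = a; d = b; #1;
    $display("quarters: must=%0d ppa=%0d", x_must, x_ppa);  // must=0 ppa=1
    $finish;
  end
endmodule
```

The first case shows the left side one LSB above the right side, the second one LSB below.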

The RTL on the right side implements the following improvements to reduce logic depth:

Unified arithmetic path: Combines multipliers and adders into a single Carry-Save-Adder structure. This maintains full bit-width processing (critical for Q2.63 intermediate results) while reducing combinatorial delay compared to separate arithmetic blocks.
Integrated rounding: Incorporates rounding terms directly into the CSA structure, avoiding the left-side's discrete rounding stage that cannot leverage CSA optimization.
Early saturation detection: Implements magnitude comparison concurrently with arithmetic operations. This anticipates overflow conditions during computation, reducing saturation logic to a single multiplexer stage. The approach aligns with efficient saturation methods where MSB analysis (for Q2.63→Q1.31 conversion) occurs in parallel with final addition.

The right-side implementation achieves <30 logic levels through:

Carry-save optimization of partial products
Concurrent overflow detection
Elimination of intermediate rounding buffers

Pipelining either implementation requires careful consideration. Options to consider are:

Retiming-friendly architecture: Right-side's unified structure better supports automated retiming during synthesis compared to left-side's fragmented arithmetic blocks.
Bit-sliced multipliers: While applicable to both designs, right-side's architecture more readily accommodates 16x32 bit multiplier segmentation without pipeline imbalance.
Booth-Wallace optimization: Both implementations can benefit, but right-side's consolidated data path reduces inter-stage buffering requirements.

2.1. Transformations for Datapath Optimization

RTL Code

Improved PPA

module test_must (
input wire [15:0] a,
input wire [15:0] b,
input wire cin,
input wire ctl,
output wire [16:0] x0,
output wire x1,
output wire [16:0] x2,
output wire [16:0] x3
);

/* Carry-in is prioritized */
assign x0 = a + b + cin;

/* Input ctl is prioritized and, after mapping, applied after the comparison */
assign x1 = ctl? a==b : 1'b0;

/* Adding constant 1 does not impact area or performance */
assign x2 = a + b + 1'b1;

/* Subtract or a + ~b + 1 */
assign x3 = a - b;

endmodule

module test_ppa (
input wire [15:0] a,
input wire [15:0] b,
input wire cin,
input wire ctl,
output wire [16:0] x0,
output wire x1,
output wire [16:0] x2,
output wire [16:0] x3
);

wire [17:0] x0_xxl;

/* Carry-in merged into adder to speed up input operands "a" and "b" */
assign x0_xxl = {a,1'b1} + {b,cin};
assign x0 = x0_xxl[17:1];

/* Input ctl is merged into the comparator */
assign x1 = {1'b1,a} == {ctl,b};

/* Using 1's complement */
assign x2 = ~(~{1'b0,a} + ~{1'b0,b});

/* Subtract: Path b is untouched */
assign x3 = ~(~{1'b0,a} + {1'b0,b});

endmodule

The above RTL demonstrates various optimization techniques for merging arithmetic operations into a single Carry-Save Adder. While these equations can be highly beneficial, they require careful implementation.

x0 equation: Synthesis often produces better results when inserting an explicit increment (carry-in) as an additional LSB. Synthesis tools typically don't fully analyze carry-in signals, potentially prioritizing them at the expense of other adder inputs. The right-side code enables additional control by replacing the constant 1'b1 with a control signal, reducing logic depth by one level.
x1 equation: Similar to x0 optimization, merging control signals into the datapath can enhance synthesis results and enable subsequent optimization opportunities.
x2 and x3 equations: Using 1's complement arithmetic enables arithmetic transformations that unlock additional optimization potential. This approach allows more flexible logic restructuring, as demonstrated in the following examples.
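The four transformations above rest on simple identities, for example ~(~A + ~B) = A + B + 1 and ~(~A + B) = A - B in modular arithmetic. The following testbench is a sketch that checks them on random operands; tb_datapath and the net names are hypothetical. Every fourth iteration forces b = a so that the comparator path of x1 is exercised as well.

```verilog
`timescale 1ns/1ps
// Hypothetical testbench for the x0..x3 datapath transformations.
module tb_datapath;
  reg [15:0] a, b;
  reg        cin, ctl;

  // Left-side (reference) expressions.
  wire [16:0] x0_must = a + b + cin;
  wire        x1_must = ctl ? (a == b) : 1'b0;
  wire [16:0] x2_must = a + b + 1'b1;
  wire [16:0] x3_must = a - b;

  // Right-side (transformed) expressions.
  wire [17:0] x0_xxl = {a, 1'b1} + {b, cin};
  wire [16:0] x0_ppa = x0_xxl[17:1];
  wire        x1_ppa = {1'b1, a} == {ctl, b};
  wire [16:0] x2_ppa = ~(~{1'b0, a} + ~{1'b0, b});
  wire [16:0] x3_ppa = ~(~{1'b0, a} + {1'b0, b});

  integer i, errors;

  initial begin
    errors = 0;
    for (i = 0; i < 2000; i = i + 1) begin
      a   = $random;
      b   = (i % 4 == 0) ? a : $random;   // force a == b now and then
      cin = $random;
      ctl = $random;
      #1;
      if ({x0_must, x1_must, x2_must, x3_must} !==
          {x0_ppa,  x1_ppa,  x2_ppa,  x3_ppa}) begin
        errors = errors + 1;
        $display("MISMATCH a=%h b=%h cin=%b ctl=%b", a, b, cin, ctl);
      end
    end
    if (errors == 0) $display("PASS");
    else             $display("FAIL: %0d mismatches", errors);
    $finish;
  end
endmodule
```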

2.2. Integer (16-bit) Multiply-Add-Sub Function

RTL Code

Improved PPA

module test_must (
input wire signed [15:0] a,
input wire signed [15:0] b,
input wire signed [15:0] c,
input wire ctl_add,
output wire signed [15:0] x
);

wire signed [31:0] xxl;

/* Using a conditional operator to implement the mul-addsub function */
assign xxl = ctl_add? c + a * b : c - a * b;
assign x = $signed(xxl[15:0]);

endmodule

module test_ppa (
input wire signed [15:0] a,
input wire signed [15:0] b,
input wire signed [15:0] c,
input wire ctl_add,
output wire signed [15:0] x
);

wire signed [31:0] xxl;

/* Implementation of a mul-addsub function on a MAC unit. */
assign xxl = (({16{!ctl_add}} ^ c) + $unsigned(a) * $unsigned(b));
assign x = $signed({16{!ctl_add}} ^ xxl[15:0]);

endmodule

Both implementations can be formally verified as equivalent through equivalence checking. The example demonstrates how the less critical input c can be combined with a multiply-add unit implementing the function c ± a * b using a single CSA-tree structure. Special attention is required for sign extension when implementing signed arithmetic using unsigned operators, to ensure correct handling of negative values.
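The subtraction is obtained from the identity c - a*b = ~(~c + a*b) on the 16 result bits. A quick simulation check could look as follows; this is a sketch, with tb_mul_addsub and the net names being hypothetical.

```verilog
`timescale 1ns/1ps
// Hypothetical testbench for the mul-addsub transformation.
module tb_mul_addsub;
  reg signed [15:0] a, b, c;
  reg               ctl_add;

  // Left: two separate expressions selected by ctl_add.
  wire signed [31:0] r_must = ctl_add ? c + a * b : c - a * b;
  wire        [15:0] x_must = r_must[15:0];

  // Right: c is conditionally inverted before and after a single MAC.
  wire [31:0] r_ppa = ({16{!ctl_add}} ^ c) + $unsigned(a) * $unsigned(b);
  wire [15:0] x_ppa = {16{!ctl_add}} ^ r_ppa[15:0];

  integer i, errors;

  initial begin
    errors = 0;
    for (i = 0; i < 2000; i = i + 1) begin
      a = $random; b = $random; c = $random; ctl_add = $random;
      #1;
      if (x_must !== x_ppa) begin
        errors = errors + 1;
        $display("MISMATCH a=%h b=%h c=%h add=%b must=%h ppa=%h",
                 a, b, c, ctl_add, x_must, x_ppa);
      end
    end
    if (errors == 0) $display("PASS");
    else             $display("FAIL: %0d mismatches", errors);
    $finish;
  end
endmodule
```

Only the lower 16 bits are compared, because the upper bits of the unsigned product differ from the signed one and are discarded by both modules.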

2.3. Implementing abs(a-b) on a Single Adder

RTL Code

Improved PPA

module test_must (
input wire [15:0] a,
input wire [15:0] b,
output wire [15:0] x,
output wire flag // asserted when a > b
);

/* abs(a-b): subtract larger value from the smaller (unsigned) */
assign flag = a > b;
assign x = flag? a - b : b - a;

endmodule

module test_ppa (
input wire [15:0] a,
input wire [15:0] b,
output wire [15:0] x,
output wire flag // asserted when a > b
);

wire signed [33:0] xxl;

/* abs(a-b): On a single data-path using an adder of twice the width. */
assign xxl = {a,1'b1,a} + {~b,1'b0,~b};

/* extraction result from xxl. */
assign flag = !xxl[16]; // carry bit
assign x[0] = !xxl[0]; // x[0] can be simplified
assign x[15:1] = xxl[32:18] ^ {15{!flag}};

endmodule

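The function abs(a-b) can be implemented on a single adder of twice the width, using the transformations described above: xxl = {a,1'b1,a} + {~b,1'b0,~b}. The lower half computes a + ~b; its carry tells whether a > b and, at the same time, supplies the +1 that turns the upper half into a - b. When no carry occurs, the upper half holds a + ~b = ~(b - a), which the output XOR inverts. Doubling the width of an adder increases the logic depth by only one extra AOI21 (or OAI21) gate level. In many cases, the extra XOR gate at the output of the adder can be merged into the following logic.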
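The extraction of flag and x from the double-width sum can be verified against the straightforward description. The testbench below is a sketch; tb_absdiff and the net names are hypothetical. The case a == b is forced regularly, since it is the boundary of the flag.

```verilog
`timescale 1ns/1ps
// Hypothetical testbench for abs(a-b) on a single double-width adder.
module tb_absdiff;
  reg [15:0] a, b;

  // Left: compare, then subtract the smaller from the larger value.
  wire        flag_must = a > b;
  wire [15:0] x_must    = flag_must ? a - b : b - a;

  // Right: one wide addition; bit 16 separates the two halves.
  wire [33:0] xxl      = {a, 1'b1, a} + {~b, 1'b0, ~b};
  wire        flag_ppa = !xxl[16];
  wire [15:0] x_ppa    = {xxl[32:18] ^ {15{!flag_ppa}}, !xxl[0]};

  integer i, errors;

  initial begin
    errors = 0;
    for (i = 0; i < 2000; i = i + 1) begin
      a = $random;
      b = (i % 8 == 0) ? a : $random;   // include a == b regularly
      #1;
      if ({flag_must, x_must} !== {flag_ppa, x_ppa}) begin
        errors = errors + 1;
        $display("MISMATCH a=%h b=%h must=%b/%h ppa=%b/%h",
                 a, b, flag_must, x_must, flag_ppa, x_ppa);
      end
    end
    if (errors == 0) $display("PASS");
    else             $display("FAIL: %0d mismatches", errors);
    $finish;
  end
endmodule
```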

3.1. Logic, Arithmetic Shifts, and Rotations

RTL Code

Improved PPA

module test_must (
input wire [31:0] a,
input wire [ 5:0] sh,
input wire sh_left, // left or right shift
input wire sh_arith, // arithmetic or logic shift
input wire sh_rot, // rotate or shift
output reg [31:0] x,
output wire zero,
output reg ov // negated except for overflow when shift left
);

/* behavioral code */

wire a_is_zero;

assign a_is_zero = a==32'h0;

assign zero = x==32'h0;

always @(*) begin: pX_OV
case( {sh_rot, sh_arith, sh_left} )
3'b000: begin: pLSHR
x = a >> sh;
ov = 1'b0;
end
3'b001: begin: pLSHL
reg [31:0] t;
{t,x} = a << sh;
ov = (t!=32'h0 || sh[5]) && !a_is_zero;
end
3'b010: begin: pASHR
x = $unsigned($signed(a) >>> sh);
ov = 1'b0;
end
3'b011: begin: pASHL
reg signed [63:0] t;
t = $signed(a);
t = t <<< sh;
x = t[31:0];
ov = (t[63:31] != {33{a[31]}} || sh[5]) && !a_is_zero;
end
3'b100,
3'b110: begin: pROR
x = {a,a} >> sh[4:0];
ov = 1'b0;
end
default: begin: pROL
reg [31:0] t;
{x,t} = {a,a} << sh[4:0];
ov = 1'b0;
end
endcase
end

endmodule

module test_ppa (
input wire [31:0] a,
input wire [ 5:0] sh,
input wire sh_left, // left or right shift
input wire sh_arith, // arithmetic or logic shift
input wire sh_rot, // rotate or shift
output wire [31:0] x,
output wire zero,
output reg ov // negated except for overflow when shift left
);

wire [32:0] x_ror;
reg [32:0] msk;
reg [32:0] rev_msk;
wire a_is_zero;
wire [31:0] s_ext;
wire [ 4:0] sh_lr;

assign sh_lr = sh[4:0] ^ {5{sh_left}};
assign x_ror = {a,a} >> sh_lr;
assign s_ext = {32{a[31] && sh_arith && !sh_rot}} & ~msk;
assign a_is_zero = a==32'h0;

always @(*) begin: pMASK // calculating mask for result
if( sh_rot ) begin // no mask for rotation
msk = {33{!sh_left}};
end else begin
msk = sh[5]? {33{sh_left}} : 33'h0_ffff_ffff >> sh_lr;
end
end

always @(*) begin: pREV_MASK // bit reverse (wiring only, no logic): rev_msk[32:0] = msk[0:32];
integer i;
for ( i=0 ; i <= 32 ; i=i+1 ) begin
rev_msk[i] = msk[32-i]; // blocking assignment in combinational logic
end
end

always @(*) begin: pX_OV
case( {sh_rot, sh_arith, sh_left} )
3'b001: begin: pLSHL
ov = ({sh[5], a&rev_msk[31:0]} != 33'h0) && !a_is_zero;
end
3'b011: begin: pASHL
ov = ({sh[5], a&rev_msk[32:1]} != {1'b0, {32{a[31]}}&rev_msk[32:1]}) && !a_is_zero;
end
default: begin: pDEFAULT
ov = 1'b0;
end
endcase
end

assign x = sh_left? (x_ror[32:1] & ~msk[32:1]) : ((x_ror[31:0] & msk[31:0]) | s_ext);
assign zero = sh_left? (a & ~rev_msk[31:0]) == 32'h0 : {s_ext[31], a&rev_msk[32:1]} == 33'h0;

endmodule

This example shows a reference RTL implementation of logical and arithmetic shifts and of rotations, each to the left or to the right. The RTL on the left is a direct behavioral description. The optimized implementation on the right maps every operation onto a single rotate-right operator combined with a dynamically generated mask: the result is the rotated value ANDed with the mask (or with its complement for left shifts), and for arithmetic right shifts the sign extension is ORed in. A left shift by s is handled by inverting the 5-bit shift amount (1's complement, giving 31 - s) and reading the rotated value one bit position higher (x_ror[32:1]), which amounts to a rotate right by 32 - s. In the code, zero and ov are computed from a and the bit-reversed mask rather than from the shifted result, so flag generation does not add to the logic depth.
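The rotate-plus-mask idea can be checked with a small software model. The sketch below is not the RTL from this page: the names `ror32`, `Op`, `shift_unified` and `check_shifter` are made up for illustration, and the left direction is modeled directly as a rotate right by (32 - s) mod 32 instead of the inverted shift amount plus one-bit offset used in test_ppa. Only the result x is modeled, not the zero and ov flags.

```cpp
#include <cassert>
#include <cstdint>

// Rotate a 32-bit word right by r (taken modulo 32).
static uint32_t ror32(uint32_t a, unsigned r) {
    r &= 31u;
    return r ? (a >> r) | (a << (32u - r)) : a;
}

// Hypothetical operation names, mirroring {sh_rot, sh_arith, sh_left}.
enum class Op { LSHR, LSHL, ASHR, ASHL, ROR, ROL };

// One rotate-right plus a mask; sh is 0..63 like the 6-bit RTL port.
uint32_t shift_unified(uint32_t a, unsigned sh, Op op) {
    const bool left = (op == Op::LSHL || op == Op::ASHL || op == Op::ROL);
    const bool rot  = (op == Op::ROR  || op == Op::ROL);
    const unsigned s = sh & 31u;

    // A left move by s is a rotate right by (32 - s) mod 32.
    const uint32_t r = ror32(a, left ? (32u - s) & 31u : s);
    if (rot) return r;                      // rotations need no mask

    // Mask of the bit positions that keep rotated data.
    uint32_t keep;
    if (sh & 32u)  keep = 0u;               // shift by 32 or more: nothing kept
    else if (left) keep = 0xFFFFFFFFu << s;
    else           keep = 0xFFFFFFFFu >> s;

    uint32_t x = r & keep;
    // Arithmetic right shift: fill the vacated bits with the sign.
    if (op == Op::ASHR && (a & 0x80000000u)) x |= ~keep;
    return x;
}

// Self-check against the plain C++ shift operators for amounts 0..31.
void check_shifter(uint32_t a) {
    for (unsigned s = 0; s < 32; ++s) {
        assert(shift_unified(a, s, Op::LSHR) == (a >> s));
        assert(shift_unified(a, s, Op::LSHL) == (a << s));
    }
}
```

Calling `check_shifter` with a few patterns such as 0x80000001 or 0x12345678 is a quick way to confirm that the mask selects the right bits before looking at the flag logic.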

3.2. SIN Generation for 18-bit Systems (Experimental)


DPI Code (C reference)

Synthesizable RTL Code

module sin_dpi #(
parameter integer WI = 18
) (
input wire clk,
input wire rst_n,
input wire [WI-1:0] a,
output wire [WI:0] x
);

/*
long long Usin_dpi::sin( long long a ){
long double angle;
long double dy;
long long max_val = 1LL << WI;
long double max_angle = std::numbers::pi / (2.0L * (long double)max_val);
angle = (long double)a * max_angle;
dy = max_val * std::sin( angle );
return (long long)round(dy);
} */

endmodule

module sin_18 (
input wire clk,
input wire rst_n,
input wire [17:0] a,
output wire [18:0] x
);

/*
Sin calculation using LUTs followed by arithmetic
... to see code, please download ...
*/

endmodule

The sine computation relies on look-up tables optimized for size. A full comparison between the Verilog implementation and a C reference model is performed through the DPI interface to ensure bit-accurate results; the provided example uses Verilator as the simulator. The 18-bit input covers the range from 0.0 (inclusive) to π/2 (exclusive), mapped to 18'h0_0000 to 18'h3_ffff. The output is a 19-bit unsigned UQ1.18 value (19'h0_0000 to 19'h4_0000), where the maximum 19'h4_0000 represents exactly 1.0. The reference result is x = round(max_val * sin((a*π)/(2.0*max_val))) with max_val = 2¹⁸, evaluated in long double precision. Further optimization for area has not yet been implemented.
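The input and output mapping described above can be tried out with a stand-alone version of the reference formula. This is a minimal sketch assuming WI = 18 and max_val = 2¹⁸; the function names `sin_ref18` and `check_sin_ref` are hypothetical and are not part of the downloadable package, and π is obtained from `std::acos(-1.0L)` so that the code does not depend on C++20.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Reference value for an 18-bit input a in [0, 2^18 - 1]:
// round(2^18 * sin(a * pi / 2^19)), computed in long double.
int64_t sin_ref18(int64_t a) {
    constexpr int WI = 18;
    const long double max_val = static_cast<long double>(1LL << WI);
    const long double pi = std::acos(-1.0L);
    // a = 2^18 would correspond to pi/2; the input range stops one step short.
    const long double angle = static_cast<long double>(a) * pi / (2.0L * max_val);
    return static_cast<int64_t>(std::llround(max_val * std::sin(angle)));
}

// End points of the range: 0 maps to 0, and the largest input
// rounds up to 2^18 (19'h4_0000), which represents 1.0.
void check_sin_ref() {
    assert(sin_ref18(0) == 0);
    assert(sin_ref18((1LL << 18) - 1) == (1LL << 18));
}
```

The second assertion shows why the output needs 19 bits although sin(x) stays below 1.0 over the input range: the value for 18'h3_ffff is so close to 1.0 that it rounds to 19'h4_0000.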

3.3. Multiplier for Complex Numbers


RTL Code using 4 Multipliers
(Resources: 4 parallel multipliers + 2 adders)

Alternative Implementation using 3 Multipliers
(Resources: 3 multipliers + 5 adders with data dependencies)

module test_must (
input wire signed [15:0] a, // input term1 (real part)
input wire signed [15:0] b, // input term1 (imaginary part)
input wire signed [15:0] c, // input term2 (real part)
input wire signed [15:0] d, // input term2 (imaginary part)
output wire signed [32:0] x, // result of product (real part)
output wire signed [32:0] y // result of product (imaginary part)
);

/* behavioral code: (a+bi) * (c+di) = (ac−bd) + (ad+bc)i */

assign x = a * c - b * d;
assign y = a * d + b * c;

endmodule

module test_ppa (
input wire signed [15:0] a, // input term1 (real part)
input wire signed [15:0] b, // input term1 (imaginary part)
input wire signed [15:0] c, // input term2 (real part)
input wire signed [15:0] d, // input term2 (imaginary part)
output wire signed [32:0] x, // result of product (real part)
output wire signed [32:0] y // result of product (imaginary part)
);

// alternative code (Gauss method) using 3 multiplications with extra additions
// t1 = a * c
// t2 = b * d
// t3 = (a+b) * (c+d)
// x = t1 - t2 (real)
// y = t3 - t1 - t2 (imaginary)

wire signed [31:0] t1;
wire signed [31:0] t2;
wire signed [33:0] t3;
wire signed [16:0] s1;
wire signed [16:0] s2;
wire signed [34:0] y_xxl;

assign t1 = a * c;
assign t2 = b * d;
assign s1 = a + b;
assign s2 = c + d;

/* optional pipeline stage */

assign t3 = s1 * s2; // third product, as named in the comment above
assign x = t1 - t2;
assign y_xxl = t3 - t1 - t2;
assign y = $signed(y_xxl[32:0]);

endmodule

The alternative complex multiplier (Gauss method) needs only 3 real multipliers but 5 adders instead of 2. Because two of the additions feed the third multiplier and two subtractions follow it, the data path is longer than in the 4-multiplier version and may need additional pipeline stages to reach the same clock frequency.
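The equivalence of the two forms, and the 33-bit result width, can be confirmed with a short software model. The sketch below is an illustration only: `Cplx`, `cmul4`, `cmul3`, `fits33` and `check_cmul` are hypothetical names, and 64-bit integers stand in for the sized wires of the RTL.

```cpp
#include <cassert>
#include <cstdint>

struct Cplx { int64_t re, im; };

// Direct form: 4 multiplications, 2 additions.
Cplx cmul4(int16_t a, int16_t b, int16_t c, int16_t d) {
    return { int64_t(a) * c - int64_t(b) * d,
             int64_t(a) * d + int64_t(b) * c };
}

// Gauss form: 3 multiplications, 5 additions.
Cplx cmul3(int16_t a, int16_t b, int16_t c, int16_t d) {
    const int64_t t1 = int64_t(a) * c;
    const int64_t t2 = int64_t(b) * d;
    const int64_t t3 = (int64_t(a) + b) * (int64_t(c) + d); // 17 x 17 bit
    return { t1 - t2, t3 - t1 - t2 };
}

// True if v is representable as a 33-bit signed value.
bool fits33(int64_t v) { return v >= -(1LL << 32) && v < (1LL << 32); }

// Compare both forms for one operand set and check the result width.
void check_cmul(int16_t a, int16_t b, int16_t c, int16_t d) {
    const Cplx p = cmul4(a, b, c, d);
    const Cplx q = cmul3(a, b, c, d);
    assert(p.re == q.re && p.im == q.im);
    assert(fits33(p.re) && fits33(p.im));
}
```

For example, (1+2i)(3+4i) gives -5+10i in both forms. With all four operands at -32768 the imaginary part is 2³¹, which no longer fits a 32-bit signed value and shows why x and y are declared 33 bits wide.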
