TealSemi GmbH, email: tealsemi@gmail.com, HRB: 254216 (Munich), VAT: DE328043994
TealSemi GmbH offers help with front-end RTL design at various stages:
Below are several optimized RTL code examples, each accompanied by a basic testbench. The downloadable package includes a run_sim script for immediate simulation in Icarus. See the run_sim script for guidance on adapting it to different tools.

1.1. Sign Extensions (e.g., 32x16-bit Integer Multiplier with 32-bit Result)

| RTL Code | Improved PPA |
|---|---|
| module test_must ( /* Signed integer multiplication executed on an unsigned multiplier */ endmodule | module test_ppa ( wire signed [47:0] xxl; /* Signed integer multiplication executed on a signed multiplier */ endmodule |

While both implementations are formally equivalent, the right-side code yields better synthesis results: a smaller area and a higher operating frequency. Note that most linting tools (in particular SpyGlass rules such as W116/W164a/W164b for bit-width mismatches and W224 for multi-bit logical operations) will flag the optimized version (right) because of the operand bit-width mismatch between a and b. These warnings should be waived during verification. The area optimization is particularly effective for multipliers and less significant for adders. The same principles apply to unsigned operations.
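The difference between the two styles can be sketched as follows. This is a minimal illustration, not the original code: the module names, the port names a, b and x, and the 32x16-bit port widths are assumptions taken from the section title, and only the 48-bit wire xxl appears in the original.

```verilog
// Sketch only: module and port names are hypothetical.

// Style 1: operand b is sign-extended by hand and the product is
// computed on an unsigned 32x32 multiplier. The low 32 bits of the
// unsigned product equal the low 32 bits of the signed product.
module mul_manual_ext (
    input  wire [31:0] a,
    input  wire [15:0] b,
    output wire [31:0] x
);
    wire [31:0] b_ext;
    assign b_ext = {{16{b[15]}}, b};
    assign x     = a * b_ext;
endmodule

// Style 2: both operands are declared signed, so the tool sees a
// 32x16 signed multiplier and does the sign extension itself.
module mul_signed (
    input  wire signed [31:0] a,
    input  wire signed [15:0] b,
    output wire signed [31:0] x
);
    wire signed [47:0] xxl;
    assign xxl = a * b;        // operand widths differ (32 vs 16 bits)
    assign x   = xxl[31:0];    // keep the lower 32 bits
endmodule
```

In the second module the multiplier is 32x16 rather than 32x32, which is where an area saving would be expected. The width mismatch in `a * b` is also the kind of construct the lint rules mentioned above tend to report.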

1.2. Mixture of Signed and Unsigned Operations

| RTL Code | Improved PPA |
|---|---|
| module test_must ( wire [16:0] ua; /* Arithmetic mapped to unsigned multiplier */ endmodule | module test_ppa ( wire signed [32:0] xxl; /* Signed arithmetic mapped to signed multiplier */ endmodule |

Both versions of the code are formally equivalent. When performing arithmetic in Verilog, it is best practice to use signed arithmetic if any operand (including the result) is signed. In such cases, all unsigned operands should be explicitly cast to signed using $signed() to ensure correct and predictable behavior.
In terms of hardware, signed and unsigned multipliers of the same bit-width have similar area and performance characteristics. However, the code on the right is more concise and typically synthesizes to a smaller circuit with less logic depth, since it leverages Verilog’s built-in signed arithmetic and reduces the need for manual sign extension or additional logic.
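A minimal sketch of the casting rule, assuming a signed 16-bit operand a multiplied by an unsigned 16-bit operand b. The module name, port names, and widths are assumptions chosen to match the 33-bit result width seen in the original.

```verilog
// Sketch only: names and widths are hypothetical.
module mul_mixed (
    input  wire signed [15:0] a,   // signed operand
    input  wire        [15:0] b,   // unsigned operand
    output wire signed [32:0] x
);
    // A leading zero turns b into a non-negative 17-bit value before
    // the cast, so $signed() cannot misread b[15] as a sign bit.
    // Writing a * b directly would make the whole expression unsigned
    // and zero-extend a, which is wrong for negative a.
    assign x = a * $signed({1'b0, b});
endmodule
```

Because every operand in the expression is signed, Verilog sign-extends both sides to the 33-bit result width before multiplying.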

1.3. Ordering of Multipliers and Adders

| RTL Code | Improved PPA |
|---|---|
| module test_must ( /* Addition is followed by multiplier */ endmodule | module test_ppa ( /* Addition is merged into the multiplier (single CSA tree) */ endmodule |

Both versions are formally equivalent. The RTL on the right reduces logic depth while maintaining similar area. In the left-side version the addition must complete before the multiplication can start, and such sequential arithmetic operations cannot be merged into a single optimized structure such as a Carry-Save Adder (CSA) tree, which accelerates multi-operand addition by deferring carry propagation to one final adder.

1.4. Rounding (Q1.31 Case Study)

| RTL Code | Improved PPA |
|---|---|
| module test_must ( localparam signed [1:0] RND = 2'sd1; wire signed [63:0] xxl0; /* Rounding is done on an adder with minimum bit width */ endmodule | module test_ppa ( localparam signed [32:0] RND = 33'sh0_8000_0000; wire signed [63:0] xxl; /* Rounding is merged into the multiplier (single CSA tree) */ endmodule |

The two implementations produce numerically equivalent results. The left RTL code minimizes the adder bit width by truncating before rounding, which introduces additional logic levels. In contrast, the optimized RTL on the right combines the multiplier and rounding operations, allowing the design to be mapped onto a single Carry-Save Adder. This results in reduced logic depth and overall area savings.
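The two rounding styles can be sketched as below. The constants RND and the wires xxl0 and xxl follow the original; the module names, the port names, the helper wire xr, and the result slice [63:32] are assumptions. The slice is inferred from the position of the rounding bit in RND, and the original code may align the Q1.31 result differently.

```verilog
// Sketch only: module names, ports, xr and the output slice are assumed.

// Truncate first, then round on a narrow 33-bit adder.
module round_narrow (
    input  wire signed [31:0] a,
    input  wire signed [31:0] b,
    output wire signed [31:0] x
);
    localparam signed [1:0] RND = 2'sd1;
    wire signed [63:0] xxl0;
    wire signed [32:0] xr;
    assign xxl0 = a * b;
    assign xr   = $signed(xxl0[63:31]) + RND;  // one guard bit kept
    assign x    = xr[32:1];                    // drop the guard bit
endmodule

// Add the rounding constant at full width, then truncate.
module round_merged (
    input  wire signed [31:0] a,
    input  wire signed [31:0] b,
    output wire signed [31:0] x
);
    localparam signed [32:0] RND = 33'sh0_8000_0000;
    wire signed [63:0] xxl;
    assign xxl = a * b + RND;   // one sum: partial products plus constant
    assign x   = xxl[63:32];
endmodule
```

Both modules compute floor((a*b + 2^31) / 2^32). In the second form the rounding constant is just one more addend next to the partial products, so no separate carry-propagate adder is needed after the multiplier.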

1.5. Multiply-Add Operation for Q1.31

| RTL Code | Improved PPA |
|---|---|
| module test_must ( localparam signed [1:0] RND = 2'sd1; wire signed [63:0] xxl0; /* MAC operation and rounding is done on a separate adder with minimum bit width */ endmodule | module test_ppa ( localparam signed [32:0] RND = 33'sh0_8000_0000; wire signed [64:0] xxl; /* MAC operation and rounding are mapped onto a single CSA tree */ endmodule |

Both implementations produce formally equivalent results. As demonstrated in the rounding optimization example, the right-side RTL structure enables synthesis to a single Carry-Save Adder (CSA) by maintaining un-truncated intermediate values. This approach eliminates intermediate carry propagation stages, reducing critical path delay (performance), minimizing logic levels (power), and enabling more compact physical implementation (area) - collectively improving PPA (Power-Performance-Area) metrics.

1.6. Multiply-Add Operation with Saturation (Q1.31)

| RTL Code | Improved PPA |
|---|---|
| module test_must ( localparam signed [32:0] RND = 33'sh0_8000_0000; wire signed [64:0] xxl; /* Multiply-Add followed by saturation logic */ endmodule | module test_ppa ( localparam signed [64:0] MAX_VAL = 65'sh0_7fff_ffff_ffff_ffff; wire signed [64:0] xxl; /* Control for saturation is mapped into the CSA tree */ endmodule |

While formally equivalent, the right-side implementation demonstrates superior saturation handling through full-width magnitude comparison (xxl term). This method's complete overflow detection enables tighter integration with the Carry-Save Adder architecture, as the comparator logic merges naturally into the CSA's parallel processing structure. The resultant reduction in logic levels improves timing predictability, though ultra low-speed implementations may incur minor area penalties from the wider comparator's physical mapping.
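A multiply-add with rounding and full-width saturation might look like the sketch below. RND, MAX_VAL and the 65-bit xxl are taken from the original. The module name, the ports, the constant MIN_VAL, the helper c_al, and in particular the alignment of the addend c to bits [63:32] are assumptions made for illustration.

```verilog
// Sketch only: ports, MIN_VAL, c_al and the alignment of c are assumed.
module mac_sat (
    input  wire signed [31:0] a,
    input  wire signed [31:0] b,
    input  wire signed [31:0] c,
    output wire signed [31:0] x
);
    localparam signed [32:0] RND     = 33'sh0_8000_0000;
    localparam signed [64:0] MAX_VAL = 65'sh0_7fff_ffff_ffff_ffff;
    localparam signed [64:0] MIN_VAL = 65'sh1_8000_0000_0000_0000; // -2^63

    wire signed [63:0] c_al;
    wire signed [64:0] xxl;

    assign c_al = {c, 32'h0};        // addend aligned to the output slice
    assign xxl  = a * b + c_al + RND; // product, addend, rounding in one sum

    // Saturation is decided by comparing the full-width sum against the
    // limits instead of inspecting a few MSBs after the addition.
    assign x = (xxl > MAX_VAL) ? 32'sh7fff_ffff :
               (xxl < MIN_VAL) ? 32'sh8000_0000 :
                                 xxl[63:32];
endmodule
```

One way to read the original claim: a comparison such as `xxl > MAX_VAL` only needs the sign of a difference of the same addends, so a synthesis tool can evaluate it alongside the sum rather than after it.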

1.7. Pipelined Accumulator (Signed)

| RTL Code | Improved PPA |
|---|---|
| module test_must ( /* register stage: accumulator */ always @( posedge clk or negedge rst_n ) begin endmodule | module test_ppa ( integer i; wire signed [63:0] aa; /* sign-extend input a */ assign aa = $signed(a); /* Register stage: output of 3-2 compressor (only one single full-adder stage) */ always @( posedge clk or negedge rst_n ) begin /* 32 bit adder is outside critical loop and can be pipelined if needed */ assign x = $signed(s) + $signed(c[63:0]); endmodule |

The code demonstrates a carry-save accumulator using compression trees for timing closure. The base implementation (left) accumulates input "a" into register x each cycle. For timing-critical cases the following approach can be used:
Architectural Tradeoffs:
Particularly effective for noise-shapers needing:
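The carry-save accumulator described in this section can be sketched as follows. The signals aa, s, c and x and the final addition follow the fragments visible in the original. The module name, the port widths, and the use of vector operators in place of a per-bit loop are assumptions.

```verilog
// Sketch only: module name, port widths and coding style are assumed.
module acc_cs (
    input  wire               clk,
    input  wire               rst_n,
    input  wire        [31:0] a,
    output wire signed [63:0] x
);
    wire signed [63:0] aa;
    wire        [63:0] maj;  // per-bit carry of the 3:2 compressor
    reg         [63:0] s;    // sum vector
    reg         [63:0] c;    // carry vector, already shifted left by one

    assign aa  = $signed(a);                     // sign-extend input a
    assign maj = (aa & s) | (aa & c) | (s & c);

    // Register stage: the feedback loop contains a single full-adder
    // level per bit and no carry propagation.
    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            s <= 64'h0;
            c <= 64'h0;
        end else begin
            s <= aa ^ s ^ c;
            c <= {maj[62:0], 1'b0};
        end
    end

    // The carry-propagate adder sits outside the loop and can be
    // pipelined if needed.
    assign x = $signed(s) + $signed(c[63:0]);
endmodule
```

After every clock edge s + c equals the previous s + c plus the sign-extended input (modulo 2^64), so x carries the same running sum as a conventional accumulator.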

1.8. Signed Fixed-Point Dot-Product (Q1.31)

| RTL Code | Improved PPA |
|---|---|
| module test_must ( localparam signed [31:0] MAX_VAL32 = 32'sh7fff_ffff; wire signed [63:0] t0; /* Calculate intermediate terms */ /* final result with saturation using MSBs for saturation logic */ endmodule | module test_ppa ( localparam signed [31:0] MAX_VAL32 = 32'sh7fff_ffff; wire signed [64:0] xxl; /* Full width dot-product and rounding */ /* final result with saturation done on full width of dot-product */ endmodule |

This example shows a signed fixed-point dot-product with inputs and output values in the range of ±1.0 Q(1.31). The implementation on the right side results in better performance, power, and area characteristics. While both implementations can produce results differing by ±1 LSB due to intermediate rounding (t2 and t3), the right-side implementation achieves equivalent logic depth through architectural optimizations.
The RTL on the right side implements the following improvements to reduce logic depth:
The right-side implementation achieves <30 logic levels through:
Pipeline strategies for both implementations require careful consideration. Implementation options to consider are:

2.1. Transformations for Datapath Optimization

| RTL Code | Improved PPA |
|---|---|
| module test_must ( /* Carry-in is prioritized */ /* Input ctl is prioritized and mapped after the comparison */ /* Adding constant 1 does not impact area or performance */ /* Subtract or a + ~b + 1 */ endmodule | module test_ppa ( wire [17:0] x0_xxl; /* Carry-in merged into adder to speed up input operands "a" and "b" */ /* Input ctl is merged into the comparator */ /* Using 1's complement */ /* Subtract: Path b is untouched */ endmodule |

The above RTL demonstrates various optimization techniques for merging arithmetic operations into a single Carry-Save Adder. While these equations can be highly beneficial, they require careful implementation.

x0 equation: Synthesis often produces better results when an explicit increment (carry-in) is inserted as an additional LSB. Synthesis tools typically don't fully analyze carry-in signals, potentially prioritizing them at the expense of other adder inputs. The right-side code enables additional control by replacing the constant 1'b1 with a control signal, reducing logic depth by one level.

x1 equation: Similar to the x0 optimization, merging control signals into the datapath can enhance synthesis results and enable subsequent optimization opportunities.

x2 and x3 equations: Using 1's complement arithmetic enables arithmetic transformations that unlock additional optimization potential. This approach allows more flexible logic restructuring, as demonstrated in the following examples.
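The carry-in idea behind the x0 equation can be sketched as below. The 18-bit wire x0_xxl matches the original; the module name, the 16-bit operand width, and the port names are assumptions.

```verilog
// Sketch only: module name, ports and operand width are assumed.
module add_cin (
    input  wire [15:0] a,
    input  wire [15:0] b,
    input  wire        cin,
    output wire [16:0] x0    // a + b + cin, including carry-out
);
    wire [17:0] x0_xxl;

    // Both operands get one extra LSB. The extra bit position adds
    // 1 + cin, which produces a carry into bit 1 exactly when cin is 1.
    // The carry-in is thus an ordinary adder input, not a separate
    // increment after the addition.
    assign x0_xxl = {a, 1'b1} + {b, cin};
    assign x0     = x0_xxl[17:1];   // discard the helper LSB
endmodule
```

The same pattern covers subtraction with 1's complement: `{a, 1'b1} + {~b, 1'b1}` yields a + ~b + 1 in the upper bits, which is a - b.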

2.2. Integer (16-bit) Multiply-Add-Sub Function

| RTL Code | Improved PPA |
|---|---|
| module test_must ( wire signed [31:0] xxl; /* Using if-statement to implement mul addsub function */ endmodule | module test_ppa ( wire signed [31:0] xxl; /* Implementation of a mul-addsub function on an mac unit. */ endmodule |

Both implementations can be formally verified as equivalent through equivalence checking. The example demonstrates how the less critical input c can be combined with a multiply-add unit implementing the function c ± a * b using a single CSA-tree structure. Special attention is required for sign extension when implementing signed arithmetic using unsigned operators, to ensure correct handling of negative values.
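One possible formulation of c ± a * b as a single sum of products is sketched below. It uses signed operators and the identity -a = ~a + 1, so it is not necessarily the formulation used in the original code. The module name, the ports, the control input sub, and the helpers a_x and b_s are assumptions; only the 32-bit xxl comes from the original.

```verilog
// Sketch only: module name, ports, sub, a_x and b_s are assumed.
module mul_addsub (
    input  wire signed [15:0] a,
    input  wire signed [15:0] b,
    input  wire signed [31:0] c,
    input  wire               sub,  // 1: c - a*b, 0: c + a*b
    output wire signed [31:0] x
);
    wire signed [15:0] a_x;
    wire signed [15:0] b_s;
    wire signed [31:0] xxl;

    // -a*b = (~a + 1)*b = (~a)*b + b, so the subtraction becomes a
    // conditional inversion of a plus a conditional extra addend b.
    assign a_x = a ^ {16{sub}};
    assign b_s = b & {16{sub}};

    assign xxl = c + a_x * b + b_s;  // one multi-operand sum
    assign x   = xxl;
endmodule
```

The if-statement of the reference style is replaced by an XOR and an AND on the operands in front of the multiplier, and everything after that is a single sum.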

2.3. Implementing abs(a-b) on a Single Adder

| RTL Code | Improved PPA |
|---|---|
| module test_must ( /* abs(a-b): subtract larger value from the smaller (unsigned) */ endmodule | module test_ppa ( wire signed [33:0] xxl; /* abs(a-b): On a single data-path using an adder of twice the width. */ /* extraction result from xxl. */ endmodule |

The function abs(a-b) can be implemented on a single adder of twice the width using the transformation described above: xxl = {a,1'b1,a} + {~b,1'b0,~b}. Doubling the width of an adder increases the logic depth by only one extra AOI21 (or OAI21) gate level. In many cases, the extra XOR gate at the output of the adder can be merged into the following logic.
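A sketch of the double-width adder for unsigned 16-bit operands follows. The expression for xxl is the one given in the text; the module name, the ports, the unsigned declaration of xxl, and the way the result is extracted are assumptions.

```verilog
// Sketch only: module name, ports and result extraction are assumed.
module abs_diff (
    input  wire [15:0] a,
    input  wire [15:0] b,
    output wire [15:0] x     // |a - b| for unsigned a, b
);
    wire [33:0] xxl;

    // Lower half computes a + ~b = a - b - 1 and carries out when a > b.
    // The middle bit pair (1, 0) forwards that carry into the upper half:
    //   a >  b : upper half = a + ~b + 1 = a - b
    //   a <= b : upper half = a + ~b     = ~(b - a)
    assign xxl = {a, 1'b1, a} + {~b, 1'b0, ~b};

    // xxl[33] is the carry-out of the upper half and is 1 when a > b.
    // One XOR level undoes the inversion in the a <= b case.
    assign x = xxl[32:17] ^ {16{~xxl[33]}};
endmodule
```

For example, a = 3 and b = 5 give an upper half of 16'hFFFD with xxl[33] = 0, and inverting it yields 2.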

3.1. Logic, Arithmetic Shifts, and Rotations

| RTL Code | Improved PPA |
|---|---|
| module test_must ( /* behavioral code */ wire a_is_zero; assign a_is_zero = a==32'h0; assign zero = x==32'h0; always @(*) begin: pX_OV endmodule | module test_ppa ( wire [32:0] x_ror; assign sh_lr = sh[4:0] ^ {5{sh_left}}; always @(*) begin: pMASK // calculating mask for result always @(*) begin: pREV_MASK // bit reverse (no logic): rev_msk[32:0] = msk[0:32]; always @(*) begin: pX_OV assign x = sh_left? x_ror[32:1] & ~msk[32:1] : x_ror[31:0] & msk[31:0] \| s_ext; endmodule |

This example demonstrates a reference RTL implementation of logical/arithmetic shifts and rotations (left/right). The left-side RTL uses a direct behavioral description, while the optimized implementation on the right employs a unified approach: all shift operations are mapped to a single rotate-right operator paired with a dynamically generated mask. The final result is derived through a logic operation between the rotated value and the mask. To handle left/right directionality, the design uses a 1's complement inversion followed by a corrective shift. Beyond computational efficiency, this structure enables streamlined flag generation (e.g., zero, overflow) without increasing logic depth, as both the rotated value and mask inherently contain flag-relevant information.
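The rotate-plus-mask idea can be sketched for logical shifts only. The names sh, sh_left, sh_lr, x_ror and msk follow the fragments in the original. The module name, the helpers dbl and ror, and the simplified mask generation are assumptions, and arithmetic shifts, rotations, and flags are left out.

```verilog
// Sketch only: logical shifts, simplified mask, assumed helper names.
module shift_rot (
    input  wire [31:0] a,
    input  wire [4:0]  sh,
    input  wire        sh_left,
    output wire [31:0] x
);
    wire [4:0]  sh_lr;
    wire [63:0] dbl;
    wire [31:0] ror;
    wire [32:0] x_ror;
    wire [31:0] msk;

    // 1's complement of the shift amount for left shifts: 31 - sh.
    assign sh_lr = sh ^ {5{sh_left}};

    // Single rotate-right by sh_lr.
    assign dbl   = {a, a} >> sh_lr;
    assign ror   = dbl[31:0];

    // One extra bit makes "rotate right by one more" a pure bit select,
    // which corrects 31 - sh into the required 32 - sh.
    assign x_ror = {ror[0], ror};

    // Mask of the bit positions that survive the shift.
    assign msk = sh_left ? (32'hFFFF_FFFF << sh) : (32'hFFFF_FFFF >> sh);

    assign x = (sh_left ? x_ror[32:1] : x_ror[31:0]) & msk;
endmodule
```

The operand a passes through one rotator for both directions, while the mask depends only on sh and sh_left and can be computed in parallel.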

3.2. SIN Generation for 18-bit Systems (Experimental)

| DPI Code (C reference) | Synthesizable RTL Code |
|---|---|
| module sin_dpi #( /* endmodule | module sin_18 ( /* endmodule |

The sin-function's computation relies on optimized look-up tables tailored for size efficiency. A full comparison between the Verilog implementation and a C-reference model is conducted using the DPI interface to ensure bit-accurate matching of results.
Verilator serves as the simulation tool in the provided example.
The input range spans 18 bits (0.0 inclusive to π/2 exclusive), mapped to 18'h0_0000 to 18'h3_ffff. The output uses 19-bit unsigned UQ1.18 format (19'h0_0000 to 19'h4_0000), where the maximum value 19'h4_0000 represents 1.0 exactly. Precision is maintained through the formula:
x = max_val * sin((a*π)/(2.0*max_val))
where max_val = 2^18 - 1, verified against long double precision computations.
Optimization steps for area efficiency remain pending implementation.

3.3. Multiplier for Complex Numbers

| RTL Code using 4 Multipliers | Alternative Implementation using 3 Multipliers |
|---|---|
| module test_must ( /* behavioral code: (a+bi) * (c+di) = (ac−bd) + (ad+bc)i */ assign x = a * c - b * d; endmodule | module test_ppa ( // alternative code (Gauss method) using 3 multiplications with extra additions wire signed [31:0] t1; assign t1 = a * c; /* optional pipeline stage */ assign x = t1 - t2; endmodule |

The alternative complex multiplication implementation (Gauss method) uses only 3 real multipliers but requires 5 additions. This increased computational dependency may require additional pipeline stages to maintain clock frequency compared to the 4-multiplier approach.
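The three-multiplier form can be sketched as follows, assuming 16-bit signed inputs. The wires t1 and t2 and the real output x appear in the original; the module name, the wire t3, the imaginary output y, and all widths other than that of t1 are assumptions.

```verilog
// Sketch only: module name, t3, y and most widths are assumed.
module cmul_gauss (
    input  wire signed [15:0] a,  // real part of first operand
    input  wire signed [15:0] b,  // imaginary part of first operand
    input  wire signed [15:0] c,  // real part of second operand
    input  wire signed [15:0] d,  // imaginary part of second operand
    output wire signed [32:0] x,  // real part:      ac - bd
    output wire signed [32:0] y   // imaginary part: ad + bc
);
    wire signed [31:0] t1;
    wire signed [31:0] t2;
    wire signed [33:0] t3;

    assign t1 = a * c;
    assign t2 = b * d;
    assign t3 = (a + b) * (c + d);  // = ac + ad + bc + bd

    assign x = t1 - t2;
    assign y = t3 - t1 - t2;        // removes ac and bd again
endmodule
```

The five additions are a + b, c + d, t1 - t2, and the two subtractions in y. The pre-additions sit in front of the third multiplier, which is the dependency that may call for the extra pipeline stage mentioned above.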