TealSemi GmbH email: tealsemi@gmail.com HRB: 254216 (Munich) VAT: DE328043994 |
TealSemi GmbH offers help with front-end RTL design at various stages.
Below are a few RTL code examples from recent RTL optimizations. All examples (including a basic test bench) can be downloaded and simulated. Each downloaded package includes a script for running the simulation with Icarus Verilog. The examples can easily be ported to any other Verilog simulator. For examples of how to use Xcelium or Questa, see the comments in the script run_sim, which is part of each downloaded package.
1.1. Sign Extensions (example 32x16 bit integer multiplier with 32 bit result)
1.2. Mixture of Signed and Unsigned Operations
1.3. Ordering of Multipliers and Adders
1.4. Rounding (e.g. for Q1.31)
1.5. Multiply-Add Operation for Q1.31
1.6. Multiply-Add Operation with Saturation for (Q1.31)
1.7. Pipelined Accumulator (Signed)
1.8. Signed Fixed-Point Dot-Product (Q1.31)
2.1. Transformations to Optimize Datapath
2.2. Integer (16 bits) Multiply Add Sub Function
2.3. Implementation of Function abs(a-b) on a Single Adder
3.1. Logic, Arithmetic Shift Right and Left and Rotate Right and Left
3.2. SIN Generation for 18 bits (EXPERIMENTAL)
1.1. Sign Extensions (example 32x16 bit integer multiplier with 32 bit result) |
---|
RTL Code |
Improved PPA |
---|---|
module test_must ( /* Signed integer multiplication executed on an unsigned multiplier */ endmodule |
module test_ppa ( wire signed [47:0] xxl; /* Signed integer multiplication executed on a signed multiplier */ endmodule |
Both sides are formally equivalent, but the code on the right side synthesizes to a smaller area and higher speed. Most linting tools will issue a warning for the improved code (on the right side) because the bit widths of the operands a and b are not equal. This warning needs to be waived. In terms of area, the difference can be significant for multipliers but is less significant for adders. The same applies to unsigned operations.
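The two coding styles can be sketched as follows. This is a minimal illustration, not the downloadable example: the module names `sext_manual` and `sext_native`, the port names and the manual sign extension on the left-hand style are assumptions. Only the 32x16 operand widths, the 32-bit result and the 48-bit term `xxl` come from the text above.

```verilog
// Left-side style (assumed): b is sign-extended by hand to the width of a,
// so the synthesis tool sees a 32x32 multiplication.
module sext_manual (
  input  wire signed [31:0] a,
  input  wire signed [15:0] b,
  output wire signed [31:0] x
);
  wire [31:0] b_ext = {{16{b[15]}}, b};  // manual sign extension (unsigned vector)
  assign x = a * b_ext;                  // unsigned 32x32 multiply, low 32 bits kept
endmodule

// Right-side style (assumed): operands keep their natural widths and types,
// so the tool can build a signed 32x16 multiplier.
module sext_native (
  input  wire signed [31:0] a,
  input  wire signed [15:0] b,
  output wire signed [31:0] x
);
  wire signed [47:0] xxl = a * b;        // full signed 48-bit product
  assign x = xxl[31:0];                  // same low 32 bits as sext_manual
endmodule
```

Both modules return the product modulo 2^32, which is why they can be formally equivalent although the multipliers differ in size.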
1.2. Mixture of Signed and Unsigned Operations |
---|
RTL Code |
Improved PPA |
---|---|
module test_must (
wire [16:0] ua; /* Arithmetic mapped to unsigned multiplier */ endmodule |
module test_ppa ( wire signed [32:0] xxl;
/* Signed arithmetic mapped to signed multiplier */ endmodule |
Both codes are formally equivalent. It is best to use signed arithmetic if any term of the arithmetic (including the result) is signed. All unsigned terms then need to be cast to signed terms. An unsigned multiplier has about the same size as a signed multiplier. The code on the right side is smaller and has less logic depth.
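A minimal sketch of the cast, assuming a 16-bit unsigned operand `ua`, a 16-bit signed operand `b` and a 33-bit result. The module names and these widths are hypothetical and chosen only so that the product fits the 33-bit term `xxl`. The key step is prepending a zero bit before `$signed()`, so the unsigned value stays non-negative after the cast.

```verilog
// Unsigned style (assumed): both operands are extended by hand to the result
// width and multiplied as unsigned vectors.
module mix_unsigned (
  input  wire        [15:0] ua,
  input  wire signed [15:0] b,
  output wire signed [32:0] x
);
  wire [32:0] ua_ext = {17'h0, ua};        // zero extension of the unsigned term
  wire [32:0] b_ext  = {{17{b[15]}}, b};   // manual sign extension of the signed term
  assign x = ua_ext * b_ext;               // unsigned 33x33 multiply, low 33 bits kept
endmodule

// Signed style (assumed): the unsigned term is cast to a 17-bit signed value,
// so the whole expression is signed and maps to a 17x16 signed multiplier.
module mix_signed (
  input  wire        [15:0] ua,
  input  wire signed [15:0] b,
  output wire signed [32:0] x
);
  wire signed [32:0] xxl = $signed({1'b0, ua}) * b;
  assign x = xxl;
endmodule
```

Writing `ua * b` directly would make the whole expression unsigned in Verilog, so `b` would be zero-extended and negative values of `b` would give a wrong result.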
1.3. Ordering of Multipliers and Adders |
---|
RTL Code |
Improved PPA |
---|---|
module test_must ( /* Addition is followed by multiplier */ endmodule |
module test_ppa ( /* Addition is merged into the multiplier (single CSA tree) */ endmodule |
Both codes are formally equivalent. The RTL on the right side has fewer levels of logic and about the same area, because an addition followed by a multiplication (as coded on the left side) cannot be merged into a single Carry-Save-Adder.
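As an illustration, assume the left side has the form (a + b) * c and the right side is its expanded form a * c + b * c. The module names, the 16-bit operands and the 33-bit result below are hypothetical; the original example may use a different expression.

```verilog
// Addition followed by a multiplier: the adder needs carry propagation
// before the multiplication can start.
module add_then_mul (
  input  wire signed [15:0] a, b, c,
  output wire signed [32:0] x
);
  assign x = (a + b) * c;
endmodule

// Sum of products: all partial products can be reduced in one carry-save
// tree with a single carry-propagate adder at the end.
module mul_then_add (
  input  wire signed [15:0] a, b, c,
  output wire signed [32:0] x
);
  assign x = a * c + b * c;
endmodule
```

With a 33-bit result no intermediate value overflows, so both expressions produce the same value for every input.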
1.4. Rounding (e.g. for Q1.31) |
---|
RTL Code |
Improved PPA |
---|---|
module test_must ( localparam signed [1:0] RND = 2'sd1; wire signed [63:0] xxl0; /* Rounding is done on an adder with minimum bit width */ endmodule |
module test_ppa ( localparam signed [32:0] RND = 33'sh0_8000_0000; wire signed [63:0] xxl; /* Rounding is merged into the multiplier (single CSA tree) */ endmodule |
Both codes are formally equivalent. In the RTL on the left side the bit width of the adder is minimized, but the improved RTL on the right side has fewer levels of logic and less area because it can be mapped onto a single Carry-Save-Adder.
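The following sketch shows one way the two rounding styles can be written so that they match. The constants `RND` and the 64-bit product terms are taken from the fragments above; the module names, the 32-bit operands and the output slice `[63:32]` are assumptions.

```verilog
// Rounding on a separate adder of minimum width (33 bits).
module round_min_adder (
  input  wire signed [31:0] a,
  input  wire signed [31:0] b,
  output wire signed [31:0] x
);
  localparam signed [1:0] RND = 2'sd1;
  wire signed [63:0] xxl0 = a * b;                         // full product
  wire signed [32:0] r    = $signed(xxl0[63:31]) + RND;    // add 1 at the guard bit
  assign x = r[32:1];                                      // drop the guard bit
endmodule

// Rounding constant added at full width, inside the multiplier expression.
module round_merged (
  input  wire signed [31:0] a,
  input  wire signed [31:0] b,
  output wire signed [31:0] x
);
  localparam signed [32:0] RND = 33'sh0_8000_0000;         // 2**31, half an output LSB
  wire signed [63:0] xxl = a * b + RND;                    // one expression, one CSA tree
  assign x = xxl[63:32];
endmodule
```

Both modules compute floor((a*b + 2**31) / 2**32). The second form gives the synthesis tool a single arithmetic expression, so the constant can become one more row in the partial-product reduction.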
1.5. Multiply-Add Operation for Q1.31 |
---|
RTL Code |
Improved PPA |
---|---|
module test_must ( localparam signed [1:0] RND = 2'sd1; wire signed [63:0] xxl0; /* MAC operation and rounding is done on a separate adder with minimum bit width */ endmodule |
module test_ppa ( localparam signed [32:0] RND = 33'sh0_8000_0000; wire signed [64:0] xxl; /* MAC operation and rounding are mapped onto a single CSA tree */ endmodule |
Both codes are formally equivalent. As seen in the example above for rounding, the RTL on the right side can be mapped to a single Carry-Save-Adder and significantly improves PPA.
1.6. Multiply-Add Operation with Saturation for (Q1.31) |
---|
RTL Code |
Improved PPA |
---|---|
module test_must ( localparam signed [32:0] RND = 33'sh0_8000_0000; wire signed [64:0] xxl; /* Multiply-Add followed by saturation logic */ endmodule |
module test_ppa (
localparam signed [64:0] MAX_VAL = 65'sh0_7fff_ffff_ffff_ffff; wire signed [64:0] xxl; /* Control for saturation is mapped into the CSA tree */ endmodule |
The code on both sides is formally equivalent. In many cases of fixed-point arithmetic, saturation logic needs to be added to protect against overflow of the Multiply-Add operation. If the overflow condition uses a magnitude comparator (see right side) on the full width (see term xxl), then the control for the multiplexer of the saturation logic will be merged into the Carry-Save-Adder. This results in fewer levels of logic. For an ultra low-speed implementation this might cause a slight increase in area.
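A minimal sketch of the two saturation styles. The constants `RND` and `MAX_VAL` and the 65-bit term `xxl` come from the fragments above. Everything else is an assumption: the module names, the alignment of the addend `c` (shifted left by 32 bits), the constant `MIN_VAL`, the MSB-based overflow check on the left-hand style and the output slice `[63:32]`.

```verilog
// Saturation decided from the two MSBs after the multiply-add has finished.
module mac_sat_msb (
  input  wire signed [31:0] a, b, c,
  output wire signed [31:0] x
);
  localparam signed [32:0] RND = 33'sh0_8000_0000;
  wire signed [64:0] xxl = a * b + $signed({c, 32'h0}) + RND;
  wire ovf_pos = ~xxl[64] &  xxl[63];      // result >= 2**63
  wire ovf_neg =  xxl[64] & ~xxl[63];      // result <  -2**63
  assign x = ovf_pos ? 32'h7fff_ffff :
             ovf_neg ? 32'h8000_0000 : xxl[63:32];
endmodule

// Saturation decided by full-width magnitude comparators on xxl.
module mac_sat_cmp (
  input  wire signed [31:0] a, b, c,
  output wire signed [31:0] x
);
  localparam signed [32:0] RND     = 33'sh0_8000_0000;
  localparam signed [64:0] MAX_VAL = 65'sh0_7fff_ffff_ffff_ffff;  //  2**63 - 1
  localparam signed [64:0] MIN_VAL = 65'sh1_8000_0000_0000_0000;  // -2**63 (assumed)
  wire signed [64:0] xxl = a * b + $signed({c, 32'h0}) + RND;
  assign x = (xxl > MAX_VAL) ? 32'h7fff_ffff :
             (xxl < MIN_VAL) ? 32'h8000_0000 : xxl[63:32];
endmodule
```

Since `xxl` always fits into 65 signed bits under these assumptions, the comparator conditions and the MSB conditions select the same branch for every input.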
1.7. Pipelined Accumulator (Signed) |
---|
RTL Code |
Improved PPA |
---|---|
module test_must ( /* register stage: accumulator */ always @( posedge clk or negedge rst_n ) begin endmodule |
module test_ppa ( integer i; wire signed [63:0] aa; /* sign-extend input a */ assign aa = $signed(a); /* Register stage: output of 3-2 compressor (only one single full-adder stage) */ always @( posedge clk or negedge rst_n ) begin /* 32 bit adder is outside critical loop and can be pipelined if needed */ assign x = $signed(s) + $signed(c[63:0]); endmodule |
The code above demonstrates how a compressor (Wallace tree) can be used if timing cannot be closed in an accumulator stage. On the left side the input "a" is accumulated every clock cycle into register x (output). If timing cannot be met, one way to add pipeline stages and still accumulate every clock cycle is to register the CSA stage and move the final adder outside the accumulator stage. Once the adder is outside, it can be pipelined, which increases the latency without breaking the capability of single-cycle accumulation. The RTL on the right side needs only a single full-adder stage in the feedback loop, so the accumulation can run at extremely high speed. This comes at the cost of twice as many registers to hold the accumulated value. The more time-consuming operation is outside the accumulator loop and can easily be pipelined as needed. The example above is a very simple demonstration of how to split a CSA tree. It could also involve more complex arithmetic including multipliers. An example of a use case would be a noise shaper.
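The carry-save accumulator described above can be sketched as follows. This is a simplified illustration, not the downloadable code: the module name `acc_csa`, the 64-bit carry register and the intermediate names `sum` and `maj` are assumptions.

```verilog
module acc_csa (
  input  wire               clk,
  input  wire               rst_n,
  input  wire signed [31:0] a,
  output wire signed [63:0] x
);
  reg  [63:0] s, c;                                  // accumulator in redundant (sum, carry) form
  wire [63:0] aa  = {{32{a[31]}}, a};                // sign-extend input a
  wire [63:0] sum = s ^ c ^ aa;                      // 3:2 compressor, sum bits
  wire [63:0] maj = (s & c) | (s & aa) | (c & aa);   // 3:2 compressor, carry bits

  // Feedback loop: one full-adder delay per bit, no carry propagation.
  always @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      s <= 64'h0;
      c <= 64'h0;
    end else begin
      s <= sum;
      c <= {maj[62:0], 1'b0};                        // carries have twice the weight
    end
  end

  // The carry-propagate adder sits outside the loop and may be pipelined.
  assign x = s + c;
endmodule
```

After every clock edge, s + c equals the previous s + c plus the sign-extended input modulo 2^64, so `x` matches a conventional accumulator while the loop itself contains only the compressor.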
1.8. Signed Fixed-Point Dot-Product (Q1.31) |
---|
RTL Code |
Improved PPA |
---|---|
module test_must (
localparam signed [31:0] MAX_VAL32 = 32'sh7fff_ffff; wire signed [63:0] t0; /* Calculate intermediate terms */ /* final result with saturation using MSBs for saturation logic */ endmodule |
module test_ppa (
localparam signed [31:0] MAX_VAL32 = 32'sh7fff_ffff; wire signed [64:0] xxl; /* Full width dot-product and rounding */ /* final result with saturation done on full width of dot-product */ endmodule |
This example shows a signed fixed-point dot-product with input and output values in the range of ±1.0 (Q1.31). The implementation on the right side results in better performance, power and area, and the result is more accurate. The outputs of the two implementations can differ by ±1 LSB. The difference is due to rounding of the intermediate terms (t2 and t3). In many cases there will be a register stage for t2 and t3 to break the critical path and meet timing requirements. The logic depth for the complete dot-product on the right side is actually about the same as the logic depth for the intermediate terms t2 and t3.
The RTL on the right side implements the following improvements to reduce logic depth:
The dot-product on the right side will have fewer than 30 levels of logic and in most cases can be implemented without any pipeline registers. Adding pipeline registers on the left side is straightforward, because intermediate terms are available. But it is important to realize that the 32 bit multiplication on the left side has only about 3 levels of logic fewer than the entire dot-product on the right side. Moving the saturation logic into the next cycle would save one extra level, which might not be sufficient.
These are the options for adding a pipeline stage:
2.1. Transformations to Optimize Datapath |
---|
RTL Code |
Improved PPA |
---|---|
module test_must ( /* Carry-in is prioritized */ /* Input ctl is prioritized and after mapping after the comparison */ /* Adding constant 1 does not impact area or performance */ /* Subtract or a + ~b + 1 */ endmodule |
module test_ppa ( wire [17:0] x0_xxl; /* Carry-in merged into adder to speed up input operands "a" and "b" */ /* Input ctl is merged into the comparator */ /* Using 1's complement */ /* Subtract: Path b is untouched */ endmodule |
The above RTL transformations can be used to merge arithmetic onto a single Carry-Save-Adder. The equations can be very useful in some cases but need to be applied very carefully.
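Some of the identities named in the comments above can be checked with a small self-checking test bench. This is a sketch under assumptions: the 16-bit operand width, the module name `tb_identities` and the exact form of each identity are illustrative and may differ from the downloadable example. Only the name `x0_xxl` and its 18-bit width appear in the original fragment.

```verilog
module tb_identities;
  reg  [15:0] a, b;
  reg         cin;
  integer     i, errors;

  // Carry-in merged into the adder: append 1 to a and cin to b, then drop the LSB.
  wire [17:0] x0_xxl = {1'b0, a, 1'b1} + {1'b0, b, cin};
  wire [16:0] x0     = x0_xxl[17:1];
  wire [16:0] x0_ref = a + b + cin;

  // 1's complement: a - b - 1 == a + ~b
  wire [15:0] x1     = a + ~b;
  wire [15:0] x1_ref = a - b - 16'd1;

  // Subtract as a + ~b + 1
  wire [15:0] x2     = a + ~b + 16'd1;
  wire [15:0] x2_ref = a - b;

  // Subtract with path b untouched: a - b == ~(~a + b)
  wire [15:0] x3     = ~(~a + b);

  initial begin
    errors = 0;
    for (i = 0; i < 10000; i = i + 1) begin
      a = $random; b = $random; cin = $random;
      #1;
      if (x0 !== x0_ref) errors = errors + 1;
      if (x1 !== x1_ref) errors = errors + 1;
      if (x2 !== x2_ref) errors = errors + 1;
      if (x3 !== x2_ref) errors = errors + 1;
    end
    $display("identity check finished, mismatches: %0d", errors);
    $finish;
  end
endmodule
```

All four identities rely on two's complement wrap-around, which is why each intermediate wire is declared with an explicit width. Evaluating `~b` in a wider context would invert the extension bits as well and break the equality.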
2.2. Integer (16 bits) Multiply Add Sub Function |
---|
RTL Code |
Improved PPA |
---|---|
module test_must ( wire signed [31:0] xxl; /* Using if-statement to implement mul addsub function */ endmodule |
module test_ppa ( wire signed [31:0] xxl; /* Implementation of a mul-addsub function on an mac unit. */ endmodule |
Both codes are formally equivalent. The example shows how the slightly less critical input c can be manipulated to implement the function c ± a * b on a multiply-add unit. Attention needs to be paid to sign extension in this example, because unsigned arithmetic is used to implement signed arithmetic.
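One way to realize this is the identity c - a*b = ~(~c + a*b), which inverts only the input c and the result. The sketch below assumes 16-bit signed operands `a` and `b`, a 32-bit addend `c` and a control input `sub`. The module names, the port list and the choice of this identity are assumptions, not necessarily what the downloadable example does.

```verilog
// Behavioral style: if-statement selects between add and subtract.
module mul_addsub_if (
  input  wire signed [15:0] a, b,
  input  wire signed [31:0] c,
  input  wire               sub,     // 1: c - a*b, 0: c + a*b
  output reg  signed [31:0] x
);
  always @(*) begin
    if (sub) x = c - a * b;
    else     x = c + a * b;
  end
endmodule

// MAC style: c and the result are conditionally inverted, the multiplier
// path is untouched and feeds a plain multiply-add.
module mul_addsub_mac (
  input  wire signed [15:0] a, b,
  input  wire signed [31:0] c,
  input  wire               sub,
  output wire signed [31:0] x
);
  wire        [31:0] c_inv = c ^ {32{sub}};          // ~c when subtracting
  // $signed() keeps the expression signed so that a and b are sign-extended.
  wire signed [31:0] xxl   = $signed(c_inv) + a * b;
  assign x = xxl ^ {32{sub}};                        // ~(~c + a*b) == c - a*b
endmodule
```

Without the `$signed()` cast the XOR result would make the sum unsigned, and `a` and `b` would be zero-extended before the multiplication, which is the sign-extension pitfall mentioned above.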
2.3. Implementation of Function abs(a-b) on a Single Adder |
---|
RTL Code |
Improved PPA |
---|---|
module test_must ( /* abs(a-b): subtract smaller value from the larger (unsigned) */ endmodule |
module test_ppa ( wire signed [33:0] xxl; /* abs(a-b): On a single data-path using an adder of twice the width. */ /* extract result from xxl. */ endmodule |
The function abs(a-b) can be implemented on a single adder of twice the width using the transformations described above. Doubling the width of an adder increases the logic depth only by one extra AOI21 (or OAI21) gate. In many cases the extra EXOR gate at the output of the adder can be merged into the following logic.
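A plausible way to obtain this structure is an end-around-carry subtraction: the lower half of a double-width adder only produces the carry that the upper half needs, and an XOR stage corrects the result when a is not larger than b. The sketch below assumes 16-bit unsigned operands. The module names and the 33-bit width of `xxl` are assumptions, and the original example uses a 34-bit signed `xxl`, so its exact formulation may differ.

```verilog
// Reference: compare, then subtract the smaller value from the larger.
module absdiff_ref (
  input  wire [15:0] a, b,
  output wire [15:0] x
);
  assign x = (a > b) ? a - b : b - a;
endmodule

// Single double-width adder: both halves compute a + ~b. The carry out of
// the lower half (set when a > b) becomes the carry-in of the upper half.
module absdiff_single_adder (
  input  wire [15:0] a, b,
  output wire [15:0] x
);
  wire [32:0] xxl = {a, a} + {~b, ~b};
  // xxl[32] = 1: upper half holds a - b.
  // xxl[32] = 0: upper half holds a - b - 1, and inverting it gives b - a.
  assign x = xxl[31:16] ^ {16{~xxl[32]}};
endmodule

module tb_absdiff;
  reg  [15:0] a, b;
  wire [15:0] x_ref, x_dut;
  integer     i, errors;

  absdiff_ref          u_ref (.a(a), .b(b), .x(x_ref));
  absdiff_single_adder u_dut (.a(a), .b(b), .x(x_dut));

  initial begin
    errors = 0;
    for (i = 0; i < 10000; i = i + 1) begin
      a = $random;
      b = (i % 16 == 0) ? a : $random;   // force some a == b cases
      #1;
      if (x_dut !== x_ref) errors = errors + 1;
    end
    $display("abs(a-b) check finished, mismatches: %0d", errors);
    $finish;
  end
endmodule
```

The XOR with the inverted carry-out is the extra EXOR stage at the adder output that the paragraph above refers to.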
3.1. Logic, Arithmetic Shift Right and Left and Rotate Right and Left |
---|
RTL Code |
Improved PPA |
---|---|
module test_must ( /* behavioral code */ wire a_is_zero; assign a_is_zero = a==32'h0; assign zero = x==32'h0; always @(*) begin: pX_OV endmodule |
module test_ppa ( wire [32:0] x_ror; assign sh_lr = sh[4:0] ^ {5{sh_left}}; always @(*) begin: pMASK // calculating mask for result always @(*) begin: pREV_MASK // bit reverse (no logic): rev_msk[32:0] = msk[0:32]; always @(*) begin: pX_OV assign x = sh_left? x_ror[32:1] & ~msk[32:1] : x_ror[31:0] & msk[31:0] | s_ext; endmodule |
This is an example implementation of the logic or arithmetic shift and rotate functions (left and right). The RTL on the left side is based on a straightforward behavioural description. The RTL on the right side minimizes the levels of logic by mapping all shift operations onto a single rotate-right operator and calculating a mask in parallel. The result of the shift operation is then simply a logic operation between the rotate result and the mask. The left/right selection uses a 1's complement of the shift amount, which is then corrected by one extra shift. In addition to the result of the shift operation, the rotate result and the mask allow a simple calculation of the flags (zero and overflow) without adding to the logic depth.
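The rotate-and-mask idea can be sketched in a reduced form that covers only logical left, logical right and arithmetic right shifts, without the rotate outputs and flags. The names `sh`, `sh_left`, `sh_lr`, `x_ror`, `msk` and `s_ext` follow the fragment above. The module names, the input `arith`, the 32-bit vectors (the original uses 33 bits) and the helper `x_rol` are assumptions.

```verilog
// Behavioral reference.
module shift_ref (
  input  wire [31:0] a,
  input  wire [4:0]  sh,
  input  wire        sh_left,   // 1: shift left, 0: shift right
  input  wire        arith,     // 1: arithmetic (right shifts only)
  output wire [31:0] x
);
  wire signed [31:0] a_s = a;
  wire signed [31:0] asr = a_s >>> sh;
  wire        [31:0] lsr = a >> sh;
  wire        [31:0] lsl = a << sh;
  assign x = sh_left ? lsl : (arith ? asr : lsr);
endmodule

// One rotate-right operator plus a mask computed in parallel.
module shift_ror_mask (
  input  wire [31:0] a,
  input  wire [4:0]  sh,
  input  wire        sh_left,
  input  wire        arith,
  output wire [31:0] x
);
  wire [4:0]  sh_lr = sh ^ {5{sh_left}};           // sh, or 31 - sh for left shifts
  wire [63:0] dbl   = {a, a} >> sh_lr;
  wire [31:0] x_ror = dbl[31:0];                   // a rotated right by sh_lr
  wire [31:0] x_rol = {x_ror[0], x_ror[31:1]};     // one extra step: rotate left by sh
  wire [31:0] msk   = 32'hffff_ffff >> sh_lr;      // bits that survive a right shift
  wire [31:0] s_ext = {32{arith & a[31]}} & ~msk;  // sign fill for arithmetic right
  assign x = sh_left ? (x_rol & ~{1'b0, msk[31:1]})
                     : ((x_ror & msk) | s_ext);
endmodule

module tb_shift;
  reg  [31:0] a;
  reg  [4:0]  sh;
  reg         sh_left, arith;
  wire [31:0] x_ref, x_dut;
  integer     i, errors;

  shift_ref      u_ref (.a(a), .sh(sh), .sh_left(sh_left), .arith(arith), .x(x_ref));
  shift_ror_mask u_dut (.a(a), .sh(sh), .sh_left(sh_left), .arith(arith), .x(x_dut));

  initial begin
    errors = 0;
    for (i = 0; i < 10000; i = i + 1) begin
      a = $random; sh = $random; sh_left = $random; arith = $random;
      #1;
      if (x_dut !== x_ref) errors = errors + 1;
    end
    $display("shift check finished, mismatches: %0d", errors);
    $finish;
  end
endmodule
```

For a left shift the rotate amount is 31 - sh (the 1's complement of `sh`), and the extra one-bit rotate in `x_rol` completes it to 32 - sh, which equals a rotate left by `sh`. The mask is shifted by the same one bit before it is inverted.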
3.2. SIN Generation for 18 bits (EXPERIMENTAL) |
---|
DPI Code (C reference) |
Synthesizable RTL Code |
---|---|
module sin_dpi #( /* endmodule |
module sin_18 ( /* endmodule |
This code is currently available as a pre-release upon request. It has successfully passed regression tests and functions reliably.
However, some essential steps for area optimization are pending, and a pipeline stage between the look-up table (LUT) stage and the arithmetic could
be beneficial.
The sin-function's computation relies on optimized look-up tables tailored for size efficiency. A full comparison
between the Verilog implementation and a C-reference model is conducted using the DPI interface to make sure the result of the RTL
matches the reference model for each output bit.
Verilator serves as the simulation tool in the provided example.
The input range for the sin function spans 18 bits, precisely from 0.0 (inclusive) to π/2 (exclusive), evenly distributed across
18'h0_0000 to 18'h3_ffff. Correspondingly, the output adheres to a 19-bit unsigned format, UQ1.18. Notably, the maximum achievable
result is 1.0. Consequently, the RTL's output spans between 19'h0_0000 and 19'h4_0000. The precision is maintained,
ensuring bit-accurate results. Every output aligns perfectly with computations done using long double accuracy, following the
formula: x = max_val * sin((a*π) / (2.0*max_val)), where max_val equals 2**18-1.
Despite its functionality and accuracy, it's important to note that optimization steps aimed at enhancing area efficiency are yet to be integrated.