On The State of Digital Logic Design
Imagine if every time you wanted to deploy the most basic of websites you had to:
- hire a computer building expert to build a custom PC
- hire an expert in networking to get your PC online
- hire an expert in installing Linux onto the PC
- hire an expert in installing Python
- hire an expert in installing Flask
That would be utter insanity. Deploying a website is commonly known to be a one-person task. No wonder so many software startups find success with relative ease in products and services that can use websites as a vehicle.
The above process is not so foreign to designing complete digital systems on chips. It’s not that it’s impossible to make the process better, but few have tried.
The Common Purely Logical Chip Components
- PCI-e controllers
- USB controllers
- SATA controllers
- DRAM controllers
- AXI Crossbars
The Common Purely Logical Primitives
- Async FIFOs
- floating point units
- pipelined multipliers
- Re-order buffers
- Hazard detectors
- Mesh Networks
Right now, if you wish to build a competitive SoC, you’ll likely want it to at least have the controller-PHY pairs for PCI-e, SATA, USB, and DRAM. You’ll likely be acquiring these design pairs from several different silicon IP companies such as Synopsys, Cadence, Rambus, Corigine, or ARM in a long, drawn-out sales process, if they even decide to talk to you.
This feels a lot like the ridiculous process for building a website described above.
What I Wish Existed
Controllers can almost always be designed and reasoned about in a purely digital domain. What I wish for is a digital design language (RTL? HDL?) that lets you create designs with the ease most software developers experience when building websites, phone applications, and desktop applications, or even when programming IoT devices and heavy machinery. Consider this simple snippet for deploying a Flask website:
```python
from flask import Flask

app = Flask(__name__)


@app.route('/')
def index():
    ...
```
What if for digital logic design there existed the following?
```
import cpu, axi_bar, ddr3, xhci

my_cpu = cpu()
x_bar = axi_bar()
ram_controller = ddr3.controller()
usb = xhci.usb()

my_cpu <> x_bar
x_bar <> ram_controller
x_bar <> usb
```
Today’s top digital design languages, namely Verilog and VHDL, aren’t even remotely capable of supporting libraries (in the modern sense), let alone the abstractions needed to intelligently parameterize components on instantiation and connection.
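To make "parameterize components on connection" a little more concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical: `Port`, `connect`, and the `data_width` field are names invented for illustration, and `connect` merely stands in for the imaginary `<>` operator. It only shows the idea that one side of a link can leave a parameter open and inherit it from its peer.

```python
class Port:
    """One side of a bus link (hypothetical); data_width is None until resolved."""

    def __init__(self, owner, data_width=None):
        self.owner = owner
        self.data_width = data_width
        self.peer = None


def connect(a, b):
    """Hypothetical stand-in for `<>`: link two ports and agree on a data width."""
    # Collect the widths that have actually been specified on either side.
    widths = {w for w in (a.data_width, b.data_width) if w is not None}
    if len(widths) > 1:
        raise ValueError(
            f"width mismatch between {a.owner} and {b.owner}: {sorted(widths)}"
        )
    # If exactly one side declared a width, the other side inherits it.
    resolved = widths.pop() if widths else None
    a.data_width = b.data_width = resolved
    a.peer, b.peer = b, a
    return resolved


cpu_bus = Port("my_cpu", data_width=64)
xbar_in = Port("x_bar.in")  # width left open, filled in on connection
connect(cpu_bus, xbar_in)
```

A real design language would have to resolve far more than a single width (protocol variants, clock domains, address maps), but the shape of the problem is the same: the library component is written once and specialized at the point of connection.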
LiteX actually comes incredibly close to providing the aforementioned experience. LiteX’s biggest flaw is its polyglot stack. LiteX can require interfacing with Verilator and Xilinx tools, and the mechanisms it provides for such interfacing were designed as an afterthought. LiteX’s controllers are also tied to certain PHYs in a proprietary way, making LiteX hardly future-proof. I imagine it’s possible to design controllers such that they are portable amongst PHYs, but this is no easy feat, and really, LiteX deserves credit for even achieving its level of functionality. LiteX’s other flaw is that it is based on Migen.
SpinalHDL also does a remarkable job of delivering the aforementioned experience and even comes with a DDR2/3 controller as well as a USB controller. Unfortunately, Spinal requires the Scala Build Tool (SBT), which is remarkably bad at performing offline builds. Scala also runs on the JVM, which ranges from 200MB to 1GB in size and warms up slowly when running unit tests in containers.
There is, however, far more out there in the RTL world than SpinalHDL or LiteX. Below, I collect my thoughts on the minimum requirements for a good RTL.
Requirements for a Good RTL
- only two primitives: wires that function as direct connections and registers that introduce a propagation delay
- support for general purpose programming
- first class support for multiple clocks
- easy to write testbenches
- low-latency simulation engines for rapid iteration on smaller modules
- first-class support for formal-verification
- speedy RTL builds
- easy-to-bring-up build environment
- expressible in an unambiguous RTL format
- low learning curve
- core lib composed of parameterizable floating-point units (FPUs), multipliers, FIFOs, AFIFOs, rdy-valid pipelines, arbiters, and crossbars
- extended library comprised of USB, PCI-e, ethernet, and DRAM controllers
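The first requirement above (wires plus registers and nothing else) is small enough to model in a few lines of Python. The sketch below is only an illustration under my own assumptions: `Reg` and `delay_line` are hypothetical names, a "wire" is simply a Python expression evaluated within the cycle, and a register is a value that becomes visible one clock tick after it is written.

```python
class Reg:
    """A register: a value written to `d` becomes visible on `q` after tick()."""

    def __init__(self, reset=0):
        self.q = reset  # current, visible value
        self.d = reset  # next value, latched on tick()

    def tick(self):
        self.q = self.d


def delay_line(samples, stages=2):
    """Push samples through a chain of registers and record the chain's output."""
    regs = [Reg() for _ in range(stages)]
    out = []
    for x in samples:
        out.append(regs[-1].q)
        # Wires: compute every next value from the *current* register outputs...
        regs[0].d = x
        for prev, cur in zip(regs, regs[1:]):
            cur.d = prev.q
        # ...then clock all registers together.
        for r in regs:
            r.tick()
    return out


print(delay_line([1, 2, 3, 4], stages=2))  # prints [0, 0, 1, 2]
```

The two-phase loop (evaluate all wires, then tick all registers) is what keeps the model deterministic: no register ever observes a value that was written in the same cycle.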
Here I compare what I consider the major RTLs. I leave Bluespec off this list since I think it has a needlessly steep learning curve.
| Language | First Appeared | Core Libs (AFIFO, Pipelined Multiplier, FPUs, etc.) | Extended Libs (AXI, USB Controller, DRAM Controller, etc.) | Semantically Clear | Meta Programming | Good Test Bench Tooling | Notable Strengths | Notable Weaknesses |
|---|---|---|---|---|---|---|---|---|
| SystemVerilog | 2002 | No | No | No | No | Presumably better than Verilog | | |
| MyHDL | 2003 | No | No | Poor v*HDL conversion semantics | Yes | Somewhat | | |
| Chisel | 2010 | Minimal | TileLink provides interconnect and caches, but in practice isn’t usable outside of SiFive. The TileLink codebase is spaghetti. | Chisel seems to have inconsistent rules around accessing types during simulation. | Yes | Can’t inspect memory in simulation, can’t write enums to waveforms | Fast co-simulation | Requires JVM/SBT |
| migen | 2011 | Somewhat | Yes | Yes | Yes | Yes | Clear integer arithmetic rules. | No hierarchical Verilog emission. |
| nMigen | 2018 | Somewhat | Not yet | Yes | Yes | Yes | Clear integer arithmetic rules. | |
Overall Thoughts on RTLs
I consider anything on the above list from before 2010 simply a non-starter.
One phrase I would use to describe Chisel is “just gets in your way”. SpinalHDL is, in my opinion, the most productive RTL. Reading through the SpinalHDL source code and design, you can tell it was put together by a passionate individual who understood what they were doing. nMigen comes in right after SpinalHDL as a close second.
nMigen is also tightly integrated with Yosys and nextpnr (an open-source synthesizer and place-and-route tool, respectively), allowing nMigen to automagically program a design directly into an FPGA. No intermediate Verilog or VHDL code is ever generated in the process.
Below is one such snippet for programming an FPGA in nMigen:
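```python
from nmigen import *
from nmigen_boards.tinyfpga_bx import *


class Blinky(Elaboratable):
    def elaborate(self, platform):
        # Ask the board definition for its first user LED.
        user_led = platform.request("led", 0)
        counter = Signal(23)
        m = Module()
        # Registered logic: increment the counter on every clock edge.
        m.d.sync += counter.eq(counter + 1)
        # Combinational logic: drive the LED from the counter's top bit.
        m.d.comb += user_led.o.eq(counter[-1])
        return m


if __name__ == "__main__":
    platform = TinyFPGABXPlatform()
    # Synthesize, place and route, then flash the board.
    platform.build(Blinky(), do_program=True)
```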
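For a sense of what the Blinky design does once it is on the board, the arithmetic can be checked in plain Python without any FPGA tooling. This is a software model, not nMigen code: `blink_period_seconds` and `msb_trace` are hypothetical helper names, and the 16 MHz clock is my assumption about the TinyFPGA BX oscillator.

```python
def blink_period_seconds(counter_bits, clk_hz):
    """Full on/off period of an LED driven by the counter's most significant bit."""
    # The MSB completes one full cycle every 2**counter_bits clock ticks.
    return (1 << counter_bits) / clk_hz


def msb_trace(counter_bits, cycles):
    """Model the wrapping counter and return its MSB on each clock cycle."""
    mask = (1 << counter_bits) - 1
    counter, trace = 0, []
    for _ in range(cycles):
        trace.append(counter >> (counter_bits - 1))
        counter = (counter + 1) & mask
    return trace


# Assumed clock: 16 MHz. A 23-bit counter then blinks roughly twice a second.
print(blink_period_seconds(23, 16_000_000))
# A 2-bit counter makes the pattern easy to see: off, off, on, on, ...
print(msb_trace(2, 8))
```

With those assumptions the LED period comes out at about half a second, which is why 23 bits is a convenient counter width for a visible blink.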
Unfortunately, none of the above RTLs meet my requirements for a good RTL. SpinalHDL, as close as it gets, is still tied to SBT and the JVM, resulting in a clunky build/dev environment.
The solution is clear: I must design a new RTL that has a lightweight build environment, a strong core library, and a robust extended library replete with all the controllers you could ever ask for.
This RTL will provide a polished experience. In fact, I plan to call this RTL Polished.
Issues With Verilog
You could probably stop reading here, but maybe you’re curious about all of Verilog’s issues. Thanks to WhiteQuark for helping me compile this list.
- Most of its constructs are not synthesizable.
- A composition of synthesizable constructs may not be synthesizable.
- Simulation semantics are inherently and deliberately nondeterministic.
- "Improved" SystemVerilog features still have severe defects, e.g. `always_comb` is supposed to fix the problem of `always @*` not triggering at time 0 (which caused a sim/synth mismatch), and it does that, but introduces the problem of missed triggers in `always_comb begin a = b; b = c; end` (where `a` will end up wrong in simulation).
- SystemVerilog doesn’t even *try* to define which constructs are synthesizable in the first place. So, `always_sync`? Those are 100% implementation-defined and essentially non-portable.
- Even though Verilog coding styles that avoid problems with e.g. blocking/nonblocking assignments exist, they have many edge cases and cannot be applied mechanically. E.g. clock gating circuits must use blocking assignment in an `always` block.
- Basic arithmetic has extremely surprising behavior, in particular around integer promotion. Signed + unsigned gives unsigned, and the width of an expression depends not only on the expression but also on the context in which it is used.
- SystemVerilog is a massive standard (which essentially no vendor implements in full), and it offers no way to subset it to be able to claim compliance meaningfully.
- Using memories in a portable way requires relying on inference, which cannot happen either on syntax level (or you would restrict coding style too much), or on netlist level (or you would miscompile some inputs). E.g. a synchronous, transparent read port is expressed using an idiom that combines an asynchronous read port with registered address. However these have different semantics. If you actually need an asynchronous read port, but your netlist happens to drive it with a register (which may be on a completely different level of hierarchy) then you will get a miscompilation.
- Conflation of `'x` meaning "timing violation", `'x` meaning "uninitialized register/memory" and `'x` meaning "this value left open for optimization" means that perfectly correct (even formally verified) modules can be miscompiled (and produce seemingly impossible results like `a && !a`) if they are fed a `'x` through the ports. (There is no way to avoid this with commercial synthesizers.)
- "Structural Verilog" doesn’t exist but many tools claim to generate or consume it. They are not compatible with each other.
- `generate` is both highly complicated to implement (which means it is often not supported well), and restricted in the amount of logic it can produce, meaning people resort to preprocessing with Perl anyway.
- Even though lots of tools generate Verilog, there is no standard way to serialize location info beyond crude preprocessor directives, meaning all that generated Verilog is extremely hard to debug.
- API for interacting with the outside world is extremely painful: you have a choice between crude and non-portable stdio bindings, and DPI-C, which is unsafe and crashes a lot.
- Verilog is a language for simulating concurrent logic yet it has no first-class concept of a clock, nor any way to detect race conditions. (Inherent nondeterminism means something like TSAN is not generally viable.)
- The standard waveform dump format is extremely limited. E.g. no way to determine the sign of a signal, or symbolize enums.
- No standard library, or portable way to mark clock domain crossing.
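The integer promotion point in the list above is easier to appreciate with numbers. The sketch below is a simplified Python model of how Verilog sizes and signs an addition, written under my own assumptions: `verilog_add` and its helpers are hypothetical names, the expression is treated as signed only if both operands are signed, and its width is taken as the widest of the operands and the surrounding context. It is not a full implementation of the standard's rules.

```python
def to_unsigned(value, width):
    """Wrap a Python int into an unsigned bit pattern of the given width."""
    return value & ((1 << width) - 1)


def extend(bits, from_width, to_width, signed):
    """Zero-extend, or sign-extend when the whole expression is signed."""
    if signed and bits >> (from_width - 1):
        bits |= ((1 << to_width) - 1) & ~((1 << from_width) - 1)
    return bits


def verilog_add(a, a_width, a_signed, b, b_width, b_signed, context_width):
    """Model `a + b`, returning the bit pattern at the expression's own width."""
    width = max(a_width, b_width, context_width)
    signed = a_signed and b_signed  # one unsigned operand makes it all unsigned
    lhs = extend(to_unsigned(a, a_width), a_width, width, signed)
    rhs = extend(to_unsigned(b, b_width), b_width, width, signed)
    return to_unsigned(lhs + rhs, width)


# 4-bit signed -1 plus 4-bit 1, assigned to an 8-bit target:
print(verilog_add(-1, 4, True, 1, 4, True, 8))   # both signed    -> 0
print(verilog_add(-1, 4, True, 1, 4, False, 8))  # one unsigned   -> 16

# (a + b) >> 1 with 4-bit a = 15, b = 1: the carry survives only if the
# context (the assignment target) is wider than the operands.
print(verilog_add(15, 4, False, 1, 4, False, 4) >> 1)  # 4-bit target -> 0
print(verilog_add(15, 4, False, 1, 4, False, 5) >> 1)  # 5-bit target -> 8
```

The same source expression yields different values depending on a declaration that may live far away from it, which is exactly what makes these rules hard to reason about locally.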