Can LLMs Generate FPGA-Ready Verilog?
FPGABench tests generated hardware designs beyond simulation, measuring whether they can actually synthesise, meet physical FPGA constraints, and optimise for a design objective.
Feasibility vs. Design Quality Frontier
Visualizing model capability. The Pareto frontier connects the models that no other model beats on both axes at once: functional feasibility and compilation/PPA (power, performance, area) quality.
FPGABench Leaderboard
Sort by columns to find the most cost-effective, high-quality, or functionally correct models.
| Rank | Model Name | Developer | License | Feasible@1 | MOQ@1 | Details |
|---|---|---|---|---|---|---|
What is FPGABench?
General code-generation benchmarks (e.g., HumanEval) verify only functional execution, whereas FPGA Verilog development is tightly bound to hardware constraints. A design that runs correctly in simulation can still fail synthesis, exceed resource budgets, or run too slowly on a real chip.
FPGABench introduces 52 diverse, hand-crafted hardware design problems ranging from basic arithmetic blocks (multipliers, ALUs) to complex sequential circuits (FIFO controllers, SPI masters, AES pipelines). It automatically synthesizes generated code to evaluate post-routing hardware metrics.
Key Metrics Defined
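A candidate design $d$ is feasible for problem $p$ iff it is functionally correct, synthesisable, and meets all constraints. $\text{Feasible@}K$ is then the fraction of problems where model $m$ produces a feasible design in $K$ attempts. Writing $d_{m,p,k}$ for the $k$-th attempt of model $m$ on problem $p$, and $P$ for the benchmark set:

$$\text{Feasible@}K(m) = \frac{1}{|P|} \sum_{p \in P} \mathbb{1}\left[\exists\, k \le K : d_{m,p,k} \text{ is feasible for } p\right]$$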
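The feasibility test and the resulting fraction can be sketched in a few lines of Python. This is a minimal illustration and not the benchmark's own code: the `Attempt` record, the function names, and the demo data are all hypothetical.

```python
from typing import Dict, List, NamedTuple


class Attempt(NamedTuple):
    """Outcome of one generated design (hypothetical record layout)."""
    functionally_correct: bool
    synthesisable: bool
    meets_constraints: bool


def is_feasible(attempt: Attempt) -> bool:
    # Feasible iff all three conditions hold at once.
    return (attempt.functionally_correct
            and attempt.synthesisable
            and attempt.meets_constraints)


def feasible_at_k(attempts: Dict[str, List[Attempt]], k: int) -> float:
    """Fraction of problems with at least one feasible design in the first k attempts."""
    if not attempts:
        raise ValueError("need at least one problem")
    solved = sum(
        any(is_feasible(a) for a in tries[:k])
        for tries in attempts.values()
    )
    return solved / len(attempts)


# Made-up data: two problems, two attempts each.
demo = {
    "adder": [Attempt(True, True, False), Attempt(True, True, True)],
    "fifo": [Attempt(False, True, False), Attempt(True, False, False)],
}
feasible_at_1 = feasible_at_k(demo, 1)
feasible_at_2 = feasible_at_k(demo, 2)
```

In this made-up data, no problem is solved on the first attempt and only `adder` is solved within two, so the score rises with $K$ but never exceeds the share of problems the model can solve at all.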
$\text{Feasible@}K$ measures whether a model can produce a usable design, but not how well that design optimises the target hardware. To quantify synthesis quality among feasible designs, we define the Maximum Objective Quality (MOQ).
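For each problem $p$, let $f_p^*$ denote the objective value achieved by the reference implementation, and $f_p(d)$ the value achieved by a feasible generated design $d$. For an objective that is minimised, such as LUT usage, the normalised quality $q_p(d)$ is the ratio:

$$q_p(d) = \frac{f_p^*}{f_p(d)}$$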
Here, $q_p(d) = 1$ means the generated design matches the reference, $q_p(d) > 1$ means it improves on the reference, and $q_p(d) < 1$ indicates underperformance.
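Across a benchmark set $P$, $\text{MOQ@}K$ is defined as the geometric mean, over the problems with at least one feasible attempt, of the best feasible objective ratio found within the first $K$ attempts. Let $D_{p,K}$ be the set of feasible designs among the first $K$ attempts on problem $p$, and $P_K \subseteq P$ the set of problems for which $D_{p,K}$ is non-empty:

$$\text{MOQ@}K = \left( \prod_{p \in P_K} \max_{d \in D_{p,K}} q_p(d) \right)^{1/|P_K|}$$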
The geometric mean is the natural aggregator for ratio-valued scores: a design with ratio $2$ (twice as good as the reference) and one with ratio $0.5$ (half as good) correctly average to $1.0$ ($\sqrt{2 \times 0.5} = 1.0$), whereas the arithmetic mean would report a spurious net gain at $1.25$. $\text{MOQ@}K$ is conditional on feasibility and must be interpreted alongside $\text{Feasible@}K$.
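A small Python sketch makes the aggregation concrete. It rests on two assumptions that are not confirmed by the benchmark's code: an infeasible attempt is recorded as `None`, and a problem with no feasible attempt is left out of the mean (one reading of "conditional on feasibility"). The function names and the demo ratios are illustrative only.

```python
import math
from typing import Dict, List, Optional


def geometric_mean(values: List[float]) -> float:
    # exp(mean(log(v))) equals the n-th root of the product for positive values.
    return math.exp(sum(math.log(v) for v in values) / len(values))


def moq_at_k(ratios: Dict[str, List[Optional[float]]], k: int) -> Optional[float]:
    """Geometric mean of the best feasible quality ratio per problem.

    ratios maps a problem name to per-attempt quality ratios, with None
    marking an infeasible attempt (assumed convention).
    """
    best_per_problem = []
    for tries in ratios.values():
        feasible = [q for q in tries[:k] if q is not None]
        if feasible:  # skip problems with no feasible attempt (assumption)
            best_per_problem.append(max(feasible))
    return geometric_mean(best_per_problem) if best_per_problem else None


# The 2x / 0.5x example from the text.
geo = geometric_mean([2.0, 0.5])
arith = (2.0 + 0.5) / 2

# Made-up ratios for three problems, three attempts each.
demo = {
    "median_filter": [None, 0.8, 1.25],
    "fifo": [None, None, None],
    "alu": [0.5, 2.0, None],
}
moq_at_1 = moq_at_k(demo, 1)
moq_at_3 = moq_at_k(demo, 3)
```

Because `fifo` never yields a feasible design, it does not lower `moq_at_3` here. That is why the score only makes sense next to $\text{Feasible@}K$.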
Table 4.3: Scoring examples for the 7-Sample Median Denoising Filter
A detailed illustration of candidate evaluation for a specific hardware design target under Feasible@K and MOQ@K metrics.
| Candidate ID | Functionally Correct | Synthesizable | Meets Constraints | LUT Usage | Feasible | Quality ratio ($q_p(d)$) |
|---|---|---|---|---|---|---|
| Candidate 1 (Optimal) | ✅ Yes | ✅ Yes | ✅ Yes | 80 (Reference: 100) | Yes | 1.25 (Improves PPA) |
| Candidate 2 (Suboptimal) | ✅ Yes | ✅ Yes | ✅ Yes | 120 (Reference: 100) | Yes | 0.83 (Underperforms) |
| Candidate 3 (Fails Constraint) | ✅ Yes | ✅ Yes | ❌ No (Timing Violation) | 70 (Reference: 100) | No | — (Infeasible) |
| Candidate 4 (Fails Function) | ❌ No | ✅ Yes | — | — | No | — (Infeasible) |
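The rows of the table can be reproduced with a short Python sketch. It assumes the objective is LUT usage (lower is better), so the quality ratio is the reference LUT count divided by the candidate's. The `Candidate` class and function name are hypothetical, and the "—" entries for Candidate 4 are encoded as `False` and `None`.

```python
from dataclasses import dataclass
from typing import Optional

REFERENCE_LUTS = 100  # reference implementation's LUT usage, from the table


@dataclass
class Candidate:
    name: str
    functionally_correct: bool
    synthesisable: bool
    meets_constraints: bool
    luts: Optional[int]  # None when no LUT count was reported


def quality_ratio(c: Candidate, reference_luts: int = REFERENCE_LUTS) -> Optional[float]:
    """Return reference/candidate LUT ratio for feasible designs, else None."""
    feasible = c.functionally_correct and c.synthesisable and c.meets_constraints
    if not feasible or c.luts is None:
        return None
    return reference_luts / c.luts


candidates = [
    Candidate("Candidate 1", True, True, True, 80),
    Candidate("Candidate 2", True, True, True, 120),
    Candidate("Candidate 3", True, True, False, 70),   # timing violation
    Candidate("Candidate 4", False, True, False, None),  # constraints not evaluated
]

scores = [quality_ratio(c) for c in candidates]
# Best feasible ratio across the four attempts, as used by MOQ@K.
best_ratio = max(q for q in scores if q is not None)
```

Candidate 3 uses the fewest LUTs but contributes nothing because it is infeasible, so the best ratio for this problem comes from Candidate 1.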
Submit Your Model Results
We encourage researchers and developers to run FPGABench on new models. To add your model to the official leaderboard, please follow the steps below:
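1. Clone the repository:
   git clone https://github.com/fpgabench/fpgabench.git
2. Run the evaluation with your model and EDA tool:
   python run_eval.py --model your-model-name --eda_tool vivado
3. Submit the results file, results/your_model.json.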