Can LLMs Generate FPGA-Ready Verilog?

FPGABench tests generated hardware designs beyond simulation, measuring whether they can actually synthesise, meet physical FPGA constraints, and optimise for a design objective.

Total Problems: 52
Top Feasible@1: 82.7%
Top MOQ@1 Quality: 1.12
Evaluated Models: 13

Feasibility vs. Design Quality Frontier

Visualizing model capability. The Pareto frontier connects the models that no other model outperforms on both functional feasibility (Feasible@1) and PPA quality (MOQ@1) at once.

Model Performance Pareto Frontier
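As a rough illustration of how such a frontier can be extracted from per-model scores, the sketch below keeps every model that is not dominated on both axes. The `Score` type, the function name, and the sample values are hypothetical and are not taken from the leaderboard or the FPGABench code.

```python
from typing import NamedTuple


class Score(NamedTuple):
    model: str
    feasible: float  # Feasible@1, higher is better
    moq: float       # MOQ@1, higher is better


def pareto_frontier(scores: list[Score]) -> list[Score]:
    """Return the scores that no other score dominates, sorted by feasibility."""

    def dominates(a: Score, b: Score) -> bool:
        # a dominates b if it is at least as good on both axes
        # and strictly better on at least one of them.
        return (
            a.feasible >= b.feasible
            and a.moq >= b.moq
            and (a.feasible > b.feasible or a.moq > b.moq)
        )

    frontier = [s for s in scores if not any(dominates(o, s) for o in scores)]
    return sorted(frontier, key=lambda s: s.feasible)


# Placeholder values for illustration only, not leaderboard results.
example = [
    Score("model-a", 0.80, 0.95),
    Score("model-b", 0.60, 1.10),
    Score("model-c", 0.55, 0.90),
]
frontier_names = [s.model for s in pareto_frontier(example)]
```

In this made-up example `model-c` drops out because `model-a` is better on both axes, while `model-a` and `model-b` each win on one axis and therefore both stay on the frontier.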

FPGABench Leaderboard

Sort by columns to find the most cost-effective, high-quality, or functionally correct models.

Rank Model Name Developer License Feasible@1 MOQ@1 Details

What is FPGABench?

General code generation benchmarks (e.g., HumanEval) verify only functional execution. FPGA Verilog development, by contrast, is tightly bound to hardware constraints: a design that runs correctly in simulation can still fail synthesis, exceed resource budgets, or run too slowly on a real chip.

FPGABench introduces 52 diverse, hand-crafted hardware design problems ranging from basic arithmetic blocks (multipliers, ALUs) to complex sequential circuits (FIFO controllers, SPI masters, AES pipelines). It automatically synthesizes generated code to evaluate post-routing hardware metrics.

Key Metrics Defined

Feasible@K Functional, Synthesis & Constraint Satisfaction

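A candidate design $d$ is feasible for problem $p$ iff it is functionally correct, synthesisable, and meets all $N_c$ constraints $c_p(1), \dots, c_p(N_c)$ of that problem. Writing $d^{(i)}$ for the $i$-th design that model $m$ generates for $p$, $\text{Feasible@}K$ is the fraction of problems for which at least one of the first $K$ attempts is feasible: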

$$\text{feasible}(d, p) = \mathbf{1}[\text{correct}(d, p)] \cdot \mathbf{1}[\text{synth}(d)] \cdot \prod_{j=1}^{N_c} \mathbf{1}[d \text{ meets constraint } c_p(j)]$$
$$\text{Feasible@}K = \frac{1}{|P|} \sum_{p \in P} \max_{1 \le i \le K} \text{feasible}(d^{(i)}, p)$$
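The two formulas can be read as a short computation. The following is a minimal sketch, assuming each attempt has already been reduced to three pieces of information; the `Attempt` record and the function names are hypothetical and do not describe the FPGABench implementation.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    correct: bool                 # passes functional verification
    synthesizable: bool           # synthesis completes
    constraints_met: list[bool]   # one flag per constraint c_p(j)


def feasible(a: Attempt) -> bool:
    # Product of indicators: every condition must hold.
    # all([]) is True, matching an empty product of 1 when there are no constraints.
    return a.correct and a.synthesizable and all(a.constraints_met)


def feasible_at_k(attempts_by_problem: dict[str, list[Attempt]], k: int) -> float:
    """Fraction of problems with at least one feasible design in the first k attempts."""
    solved = sum(
        any(feasible(a) for a in attempts[:k])
        for attempts in attempts_by_problem.values()
    )
    return solved / len(attempts_by_problem)


# Illustrative data: p1 succeeds on its second attempt, p2 never does.
ok = Attempt(True, True, [True, True])
timing_fail = Attempt(True, True, [True, False])
wrong = Attempt(False, True, [True, True])
runs = {"p1": [timing_fail, ok], "p2": [wrong, timing_fail]}

at_1 = feasible_at_k(runs, 1)
at_2 = feasible_at_k(runs, 2)
```

With this data `at_1` is 0.0 and `at_2` is 0.5, which shows how extra attempts can only raise the score.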

MOQ@K (Maximum Objective Quality) PPA Optimization Index

$\text{Feasible@}K$ measures whether a model can produce a usable design, but not how well that design optimises the target hardware. To quantify synthesis quality among feasible designs, we define the Maximum Objective Quality (MOQ).

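For each problem $p$, let $f_p^*$ denote the objective value achieved by the reference implementation and $f(d)$ the objective value of a feasible generated design $d$. The direction $\text{dir}_p$ records whether the objective is to be maximised ($\uparrow$, e.g. $F_{\text{max}}$) or minimised ($\downarrow$, e.g. LUT count). The normalised quality $q_p(d)$ is then: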

$$q_p(d) = \begin{cases} \frac{f(d)}{f_p^*} & \text{if } \text{dir}_p = \;\uparrow (F_{\text{max}}) \\ \frac{f_p^*}{f(d)} & \text{if } \text{dir}_p = \;\downarrow (\text{LUTs}) \end{cases}$$

Here, $q_p(d) = 1$ means the generated design matches the reference, $q_p(d) > 1$ means it improves on the reference, and $q_p(d) < 1$ indicates underperformance.
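The case split amounts to always putting the "better" quantity in the numerator. A small sketch, with a hypothetical function name and a boolean flag standing in for $\text{dir}_p$:

```python
def quality_ratio(value: float, reference: float, higher_is_better: bool) -> float:
    """Normalised quality q_p(d): above 1 beats the reference, below 1 trails it."""
    if value <= 0 or reference <= 0:
        raise ValueError("objective values must be positive")
    # Maximised objectives (e.g. Fmax): candidate / reference.
    # Minimised objectives (e.g. LUTs): reference / candidate.
    return value / reference if higher_is_better else reference / value


# Illustrative numbers: fewer LUTs than the reference, but a lower clock frequency.
lut_ratio = quality_ratio(80, 100, higher_is_better=False)
fmax_ratio = quality_ratio(150.0, 200.0, higher_is_better=True)
```

Here `lut_ratio` is 1.25 (an improvement) and `fmax_ratio` is 0.75 (an underperformance), so both directions end up on the same "higher is better" scale.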

Across a benchmark set $P$, $\text{MOQ@}K$ is defined as the geometric mean, over all problems, of the best feasible objective ratio found within the first $K$ attempts:

$$\text{MOQ@}K = \left( \prod_{p \in P} \max_{\substack{1 \le i \le K \\ \text{feasible}(d^{(i)}, p) = 1}} q_p(d^{(i)}) \right)^{\frac{1}{|P|}}$$

The geometric mean is the natural aggregator for ratio-valued scores: a design $2\times$ better and one $0.5\times$ worse correctly average to $1.0$ ($\sqrt{2 \times 0.5} = 1.0$), whereas the arithmetic mean would incorrectly report a net gain of $1.25$. $\text{MOQ@}K$ is conditional on feasibility and must be interpreted alongside $\text{Feasible@}K$.
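A sketch of the aggregation, assuming the per-attempt quality ratios are already available. The function name and input layout are hypothetical. The formula does not say how a problem with no feasible attempt is handled, so this sketch skips such problems, which is an assumption rather than the benchmark's stated rule.

```python
import math
from typing import Optional


def moq_at_k(ratios_by_problem: dict[str, list[Optional[float]]], k: int) -> float:
    """Geometric mean of the best feasible quality ratio within the first k attempts.

    Each list holds one entry per attempt: q_p(d) for a feasible design, None otherwise.
    """
    best = []
    for ratios in ratios_by_problem.values():
        feasible_ratios = [q for q in ratios[:k] if q is not None]
        if feasible_ratios:  # assumption: problems without a feasible attempt are skipped
            best.append(max(feasible_ratios))
    if not best:
        raise ValueError("no problem has a feasible design within k attempts")
    # Averaging logarithms is equivalent to taking the n-th root of the product.
    return math.exp(sum(math.log(q) for q in best) / len(best))


# The 2x / 0.5x example from the text.
ratios = {"p1": [2.0], "p2": [0.5]}
geometric = moq_at_k(ratios, k=1)
arithmetic = (2.0 + 0.5) / 2
```

`geometric` comes out at 1.0 while `arithmetic` is 1.25, reproducing the comparison made in the paragraph above.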

Table 4.3: Scoring examples for the 7-Sample Median Denoising Filter

A detailed illustration of candidate evaluation for a specific hardware design target under Feasible@K and MOQ@K metrics.

| Candidate ID | Functionally Correct | Synthesizable | Meets Constraints | LUT Usage | Feasible | Quality ratio ($q_p(d)$) |
| --- | --- | --- | --- | --- | --- | --- |
| Candidate 1 (Optimal) | ✅ Yes | ✅ Yes | ✅ Yes | 80 (Reference: 100) | Yes | 1.25 (Improves PPA) |
| Candidate 2 (Suboptimal) | ✅ Yes | ✅ Yes | ✅ Yes | 120 (Reference: 100) | Yes | 0.83 (Underperforms) |
| Candidate 3 (Fails Constraint) | ✅ Yes | ✅ Yes | ❌ No (Timing Violation) | 70 (Reference: 100) | No | — (Infeasible) |
| Candidate 4 (Fails Function) | ❌ No | ✅ Yes | | | No | — (Infeasible) |
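The scoring of these four candidates can be replayed in a few lines. This is only a sketch of the arithmetic: the tuple layout and helper name are hypothetical, and the cells that the table leaves empty for Candidate 4 are represented as `None`.

```python
REFERENCE_LUTS = 100

# (name, functionally correct, synthesizable, meets constraints, LUT usage)
candidates = [
    ("Candidate 1", True, True, True, 80),
    ("Candidate 2", True, True, True, 120),
    ("Candidate 3", True, True, False, 70),    # timing violation
    ("Candidate 4", False, True, None, None),  # cells not given in the table
]


def score(correct, synth, meets, luts):
    """Return q_p(d) for a feasible candidate, or None if it is infeasible."""
    if not (correct and synth and meets):
        return None
    # LUT count is minimised, so the ratio is reference / candidate.
    return REFERENCE_LUTS / luts


scores = {name: score(*fields) for name, *fields in candidates}
best = max(q for q in scores.values() if q is not None)
```

Candidate 3 uses the fewest LUTs but contributes nothing because it violates timing, so the best feasible ratio for this problem is Candidate 1's 1.25.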

Submit Your Model Results

We encourage researchers and developers to run FPGABench on new models. To add your model to the official leaderboard, please follow the steps below:

1. Clone the FPGABench repository: git clone https://github.com/fpgabench/fpgabench.git
2. Run the generation and synthesis suite: python run_eval.py --model your-model-name --eda_tool vivado
3. Submit a pull request: upload your generated JSON result artifact under results/your_model.json.