Can LLMs Generate FPGA-Ready Verilog?

FPGABench tests generated hardware designs beyond simulation, measuring whether they can actually synthesise, meet physical FPGA constraints, and optimise for a design objective.


Total Problems 52

Top Feasible@1 82.7%

Top MOQ@1 Quality 1.12

Evaluated Models 13

Feasibility vs. Design Quality Frontier

Visualizing model capability: the Pareto frontier connects the models that no other model beats on both functional feasibility and compilation/PPA (power, performance, area) quality.

Model Performance Pareto Frontier
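As a rough illustration of how such a frontier can be computed, the sketch below keeps every model that is not dominated by another model. It assumes both axes are higher-is-better scores (for example Feasible@1 and MOQ@1); the model names and numbers are placeholders made up for the example, not leaderboard results.

```python
from typing import Dict, List, Tuple


def pareto_front(points: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the names of non-dominated models, sorted by the first score.

    A model is dominated if some other model is at least as good on both
    scores and strictly better on at least one of them.
    """
    front = []
    for name, (feas, qual) in points.items():
        dominated = any(
            f2 >= feas and q2 >= qual and (f2 > feas or q2 > qual)
            for other, (f2, q2) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front, key=lambda n: points[n][0])


# Placeholder (feasibility, quality) pairs, invented for illustration only.
models = {
    "model_a": (0.80, 0.90),
    "model_b": (0.60, 1.10),
    "model_c": (0.55, 0.95),  # dominated by model_d
    "model_d": (0.70, 1.00),
}

print(pareto_front(models))  # ['model_b', 'model_d', 'model_a']
```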

FPGABench Leaderboard

Sort by columns to find the most cost-effective, high-quality, or functionally correct models.

Rank	Model Name	Developer	License	Feasible@1	MOQ@1	Details

What is FPGABench?

General code generation benchmarks (e.g., HumanEval) verify only functional execution, but FPGA Verilog development is tightly bound to hardware constraints. A design that runs correctly in simulation can still fail synthesis, exceed resource budgets, or run too slowly on a real chip.

FPGABench introduces 52 diverse, hand-crafted hardware design problems ranging from basic arithmetic blocks (multipliers, ALUs) to complex sequential circuits (FIFO controllers, SPI masters, AES pipelines). It automatically synthesises each generated design and evaluates its post-routing hardware metrics.

Key Metrics Defined

Feasible@K: Functional, Synthesis & Constraint Satisfaction

A candidate design $d$ is feasible for problem $p$ iff it is functionally correct, synthesisable, and meets all $N_c$ constraints $c_p(j)$ of that problem. $\text{Feasible@}K$ is then the fraction of problems for which at least one of model $m$'s first $K$ attempts $d^{(1)}, \dots, d^{(K)}$ is feasible:

$$\text{feasible}(d, p) = \mathbf{1}[\text{correct}(d, p)] \cdot \mathbf{1}[\text{synth}(d)] \cdot \prod_{j=1}^{N_c} \mathbf{1}[d \text{ meets constraint } c_p(j)]$$

$$\text{Feasible@}K = \frac{1}{|P|} \sum_{p \in P} \max_{1 \le i \le K} \text{feasible}(d^{(i)}, p)$$
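The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's implementation: the `Attempt` record, its field names, and the toy data are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Attempt:
    """One generated design d^(i) (hypothetical record layout)."""
    correct: bool                 # passes the functional tests
    synth: bool                   # synthesises without errors
    constraints_met: List[bool]   # one flag per constraint c_p(j)


def feasible(a: Attempt) -> int:
    """Product of indicators: 1 only if every condition holds."""
    return int(a.correct and a.synth and all(a.constraints_met))


def feasible_at_k(attempts: Dict[str, List[Attempt]], k: int) -> float:
    """Fraction of problems with a feasible design among the first k attempts."""
    hits = [
        max((feasible(a) for a in tries[:k]), default=0)
        for tries in attempts.values()
    ]
    return sum(hits) / len(hits)


# Toy data, invented for illustration only.
attempts = {
    "fifo": [Attempt(True, True, [True, False]), Attempt(True, True, [True, True])],
    "alu": [Attempt(False, True, [True]), Attempt(True, False, [True])],
}

print(feasible_at_k(attempts, 1))  # 0.0
print(feasible_at_k(attempts, 2))  # 0.5
```

With one attempt neither problem is solved; with two, the second `fifo` attempt meets every condition, so one of the two problems counts as feasible.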

MOQ@K (Maximum Objective Quality): PPA Optimization Index

$\text{Feasible@}K$ measures whether a model can produce a usable design, but not how well that design optimises the target hardware. To quantify synthesis quality among feasible designs, we define the Maximum Objective Quality (MOQ).

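For each problem $p$, let $f(d)$ denote the objective value of a design $d$, $f_p^*$ the objective value achieved by the reference implementation, and $\text{dir}_p$ the optimisation direction ($\uparrow$ for objectives to maximise, such as $F_{\text{max}}$; $\downarrow$ for objectives to minimise, such as LUT count). For a feasible generated design $d$, define its normalised quality $q_p(d)$: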

$$q_p(d) = \begin{cases} \frac{f(d)}{f_p^*} & \text{if } \text{dir}_p = \;\uparrow (F_{\text{max}}) \\ \frac{f_p^*}{f(d)} & \text{if } \text{dir}_p = \;\downarrow (\text{LUTs}) \end{cases}$$

Here, $q_p(d) = 1$ means the generated design matches the reference, $q_p(d) > 1$ means it improves on the reference, and $q_p(d) < 1$ indicates underperformance.
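A minimal sketch of the normalised quality ratio follows. The function name, the `"max"`/`"min"` direction labels, and the example F_max values are assumptions chosen for illustration.

```python
def quality(f_d: float, f_ref: float, direction: str) -> float:
    """Normalised quality q_p(d) of a feasible design against the reference.

    direction is "max" for higher-is-better objectives (e.g. F_max)
    and "min" for lower-is-better objectives (e.g. LUT count).
    """
    if f_d <= 0 or f_ref <= 0:
        raise ValueError("objective values must be positive")
    if direction == "max":
        return f_d / f_ref
    if direction == "min":
        return f_ref / f_d
    raise ValueError(f"unknown direction: {direction!r}")


# Hypothetical F_max example: 250 MHz against a 200 MHz reference.
print(quality(250.0, 200.0, "max"))  # 1.25, better than the reference
# Hypothetical LUT example: 200 LUTs against a 100-LUT reference.
print(quality(200.0, 100.0, "min"))  # 0.5, worse than the reference
```

Inverting the ratio for minimised objectives keeps the reading uniform: values above 1 always mean the generated design improves on the reference.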

Across a benchmark set $P$, $\text{MOQ@}K$ is defined as the geometric mean, over all problems, of the best feasible objective ratio found within the first $K$ attempts:

$$\text{MOQ@}K = \left( \prod_{p \in P} \max_{\substack{1 \le i \le K \\ \text{feasible}(d^{(i)}, p) = 1}} q_p(d^{(i)}) \right)^{\frac{1}{|P|}}$$

The geometric mean is the natural aggregator for ratio-valued scores: a design at $2\times$ the reference quality and one at $0.5\times$ average to $1.0$ ($\sqrt{2 \times 0.5} = 1.0$), whereas the arithmetic mean would report a misleading net gain of $1.25$. $\text{MOQ@}K$ is conditional on feasibility and must be interpreted alongside $\text{Feasible@}K$.
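The aggregation can be sketched as below. The input layout (a ratio per attempt, or `None` for an infeasible attempt) is an assumption. The formula leaves the inner maximum undefined when a problem has no feasible attempt, so this sketch assumes such problems are skipped and the mean is taken over the remaining ones; the benchmark's actual handling may differ.

```python
from statistics import geometric_mean
from typing import Dict, List, Optional


def moq_at_k(ratios: Dict[str, List[Optional[float]]], k: int) -> Optional[float]:
    """Geometric mean of the best feasible ratio per problem within k attempts.

    ratios[p][i] holds q_p(d^(i)) for a feasible attempt and None otherwise.
    Assumption: problems with no feasible attempt are skipped.
    """
    best = []
    for tries in ratios.values():
        feasible_ratios = [q for q in tries[:k] if q is not None]
        if feasible_ratios:
            best.append(max(feasible_ratios))
    return geometric_mean(best) if best else None


# The two-design example from the text: ratios of 2.0 and 0.5.
example = {"p1": [2.0], "p2": [0.5]}
print(moq_at_k(example, 1))   # approximately 1.0
print((2.0 + 0.5) / 2)        # 1.25, the misleading arithmetic mean

# An infeasible first attempt is ignored once a later attempt is feasible.
print(moq_at_k({"p1": [None, 1.2]}, 1))  # None: nothing feasible at K=1
```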

Table 4.3: Scoring examples for the 7-Sample Median Denoising Filter

A detailed illustration of candidate evaluation for a specific hardware design target under Feasible@K and MOQ@K metrics.

Candidate ID	Functionally Correct	Synthesizable	Meets Constraints	LUT Usage	Feasible	Quality ratio ($q_p(d)$)
Candidate 1 (Optimal)	✅ Yes	✅ Yes	✅ Yes	80 (Reference: 100)	Yes	1.25 (Improves PPA)
Candidate 2 (Suboptimal)	✅ Yes	✅ Yes	✅ Yes	120 (Reference: 100)	Yes	0.83 (Underperforms)
Candidate 3 (Fails Constraint)	✅ Yes	✅ Yes	❌ No (Timing Violation)	70 (Reference: 100)	No	— (Infeasible)
Candidate 4 (Fails Function)	❌ No	✅ Yes	—	—	No	— (Infeasible)
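The rows of the table can be reproduced with a short script. This is a sketch only: the `Candidate` record is hypothetical, and Candidate 4's constraint flag (shown as a dash in the table) is treated as unmet here.

```python
from typing import List, NamedTuple, Optional


class Candidate(NamedTuple):
    name: str
    correct: bool
    synth: bool
    meets_constraints: bool
    luts: Optional[int]  # None where the table reports no LUT count


REFERENCE_LUTS = 100  # reference LUT usage from the table

ROWS: List[Candidate] = [
    Candidate("Candidate 1", True, True, True, 80),
    Candidate("Candidate 2", True, True, True, 120),
    Candidate("Candidate 3", True, True, False, 70),     # timing violation
    Candidate("Candidate 4", False, True, False, None),  # fails function
]


def score(c: Candidate) -> Optional[float]:
    """Return q_p(d) for a feasible candidate, or None if it is infeasible."""
    if not (c.correct and c.synth and c.meets_constraints) or c.luts is None:
        return None
    # LUT count is minimised, so the ratio is reference / candidate.
    return REFERENCE_LUTS / c.luts


scores = {c.name: score(c) for c in ROWS}
best = max(q for q in scores.values() if q is not None)

for name, q in scores.items():
    print(name, "infeasible" if q is None else f"{q:.2f}")
print(f"best feasible ratio: {best:.2f}")  # 1.25
```

Candidate 3 uses the fewest LUTs, but because it violates timing it contributes nothing to the quality score; only feasible designs are compared against the reference.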

Submit Your Model Results

We encourage researchers and developers to run FPGABench on new models. To add your model to the official leaderboard, please follow the steps below:

1. Clone the FPGABench Repository: git clone https://github.com/fpgabench/fpgabench.git

2. Run generation & synthesis suite: python run_eval.py --model your-model-name --eda_tool vivado

3. Submit a Pull Request: Upload your generated JSON result artifact under results/your_model.json.
