Self-hosted benchmarking studio for comparing language models with configurable judges, blind evaluation, and real-time results.
Manual testing is slow, evaluation is biased, and your data shouldn't leave your infrastructure.
Copy-pasting prompts between different LLM platforms, manually comparing responses, and trying to remember which model performed better wastes hours of valuable development time.
When you know which model generated which response, unconscious bias affects your judgment. Your team's preferences and expectations skew results, leading to poor model selection decisions.
Sending proprietary prompts and evaluation data to third-party benchmarking services creates compliance risks. Legal teams need assurance that sensitive data stays on your infrastructure.
Run BeLLMark on your own infrastructure. Your prompts, API keys, and evaluation results never leave your servers. Supports compliance workflows for regulated industries including healthcare and finance.
Responses are shuffled as A/B/C before judging. Neither you nor your judges know which model generated which response until after scoring, eliminating unconscious bias.
Use any LLM as a judge with custom evaluation criteria. Compare GPT-4, Claude, Gemini, and local models side-by-side with AI-generated or custom scoring rubrics.
BeLLMark doesn't just compare models — it gives you a defensible evaluation process.
Responses are shuffled and assigned blind labels (A, B, C). Judges evaluate without knowing which model produced which response. Mapping is revealed only after scoring is complete.
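The blinding step is simple to reason about. Here is a minimal, illustrative sketch of the idea (not BeLLMark's actual code): responses are shuffled, assigned letter labels, and the label-to-model mapping is held back until scoring is done.

```python
import random

def blind_shuffle(responses: dict[str, str]) -> tuple[dict[str, str], dict[str, str]]:
    """Assign blind labels (A, B, C, ...) to model responses in random order."""
    models = list(responses)
    random.shuffle(models)
    labels = [chr(ord("A") + i) for i in range(len(models))]
    blinded = {label: responses[model] for label, model in zip(labels, models)}
    mapping = dict(zip(labels, models))  # withheld until all scores are recorded
    return blinded, mapping

responses = {
    "gpt-4": "Response one...",
    "claude": "Response two...",
    "gemini": "Response three...",
}
blinded, mapping = blind_shuffle(responses)
# Judges only ever see blinded["A"], blinded["B"], blinded["C"].
```

The model names and response strings above are placeholders; the point is that judges receive only the `blinded` dict, while `mapping` is revealed after scoring completes.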
AI generates evaluation criteria from your use case description. You review, edit, and approve the rubric before any benchmark runs. Your criteria, your standards — not a black box.
Every judge score includes written reasoning. Expand any result to read exactly why a judge scored a response the way it did. Use multi-judge mode for confidence through agreement.
Each response is scored on a 1–10 scale per criterion, with defined anchors: 1–3 (poor), 4–6 (acceptable), 7–10 (good to excellent). You define the criteria that matter for your use case — accuracy, completeness, tone, or anything else.
Scores are aggregated using arithmetic mean across judges and criteria, with Wilson Score confidence intervals on win rates and bootstrap CI on overall scores. Inter-rater reliability is measured with Cohen’s/Fleiss’ Kappa. Pairwise comparisons use Wilcoxon signed-rank tests with Holm–Bonferroni correction to prevent false positives. Length bias and position bias are detected automatically.
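To make the win-rate intervals concrete, here is the standard Wilson score interval the description refers to, sketched in plain Python (a textbook formula, not BeLLMark's internal implementation): with a small sample, a 70% raw win rate still leaves a wide band of uncertainty.

```python
from math import sqrt

def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score confidence interval (95% by default) for a win rate."""
    if n == 0:
        return (0.0, 1.0)  # no data: the rate could be anything
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (center - half, center + half)

# A model that wins 14 of 20 blind comparisons has a raw win rate of 0.70,
# but the 95% interval spans roughly 0.48 to 0.85.
low, high = wilson_interval(14, 20)
```

Unlike the naive normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly for small n, which is why it is a common choice for win rates.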
Add API keys for OpenAI, Anthropic, Google, or local LM Studio endpoints.
Write custom prompts or use AI to generate domain-specific test cases.
Models generate responses in parallel, then judges evaluate them blindly.
View charts, ELO rankings, statistical significance tests, and bias analysis. Export to HTML, JSON, CSV, PPTX, or PDF.
Responses are shuffled before judging, and the label-to-model mapping is revealed only after scoring. Eliminates bias toward favored models and keeps evaluation objective.
Use one or more language models as judges with customizable criteria. Choose separate scoring or direct comparison modes.
Let an LLM design evaluation rubrics for your specific use case, or write custom scoring criteria from scratch.
OpenAI, Anthropic, Google, Grok, DeepSeek, GLM, Kimi, Mistral, and local LM Studio models. Add your own providers easily.
Bootstrap CI, Wilcoxon significance tests, Cohen’s d effect sizes, ELO ratings, bias detection, and judge calibration — all built into the dashboard.
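For readers unfamiliar with ELO in this context: each pairwise judge verdict updates two models' ratings so that upsets move ratings more than expected wins. A minimal sketch of the standard ELO update (illustrative only; K-factor and starting rating are assumptions, not BeLLMark's documented defaults):

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Update two ELO ratings after one pairwise comparison.
    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta  # zero-sum: B loses what A gains

# Equal ratings: a win moves each side by exactly K/2 = 16 points.
a, b = elo_update(1000.0, 1000.0, 1.0)
```

Because the update is zero-sum, total rating is conserved across the model pool, and rankings converge as more pairwise verdicts accumulate.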
Export to HTML reports, JSON, CSV, consulting-grade PPTX, or PDF — all formats include statistical summaries and confidence intervals.
Export your benchmarks as professional HTML reports, consulting-grade PPTX presentations, JSON, CSV, or PDF — with full judge reasoning, blind mapping reveal, and data-driven recommendations.
Evaluate LLM accuracy on legal reasoning, contract analysis, and regulatory interpretation without sending client data to external benchmarking services.
Provide clients with objective, data-driven model recommendations backed by systematic benchmarking on their specific use cases.
Make informed decisions about which LLM to use in production by testing on real prompts before committing to API contracts.
Run on your own servers; your data never leaves them. Supports compliance workflows through self-hosting and zero telemetry.
One-time purchase, lifetime license. No recurring fees, no per-seat charges, no usage limits.
Test local LM Studio models alongside cloud providers. Compare cost vs. quality tradeoffs systematically.
Clean web interface, no technical configuration needed. Non-technical stakeholders can run benchmarks independently.
Run your first blind evaluation in 30 minutes — no license required. BeLLMark is free for non-commercial use. When you're ready for production, the commercial license is a one-time $799 purchase ($499 introductory during the first 60 days).
One license covers unlimited users within a single legal entity (corporation, LLC, nonprofit, etc.). If you have multiple subsidiaries or separate legal entities, each needs its own license. Freelancers and sole proprietors need one license for their business use.
BeLLMark is source-available under the PolyForm Noncommercial 1.0.0 license. You can view, modify, and use the code for free for personal, educational, and non-commercial purposes. Commercial use requires a paid license ($799 one-time per legal entity, $499 introductory during the first 60 days). See our licensing terms for what the commercial license covers.
No programming knowledge required! BeLLMark has a clean web interface. If you can use a web browser and have API keys for LLM providers (like OpenAI or Anthropic), you can run benchmarks. Installation requires basic command-line familiarity (Docker or Python/Node.js).
Three installation options: (1) Docker Compose (recommended, one command), (2) Manual setup with Python backend + Node.js frontend, or (3) Production build served from a single backend process. Full instructions in the GitHub repository. Typical setup time: 5-10 minutes.
BeLLMark supports OpenAI (GPT-4, GPT-5, o1), Anthropic (Claude Opus/Sonnet/Haiku), Google (Gemini 2.5/3), Grok, DeepSeek, GLM, Kimi, Mistral, and local LM Studio models. The architecture is modular—adding new providers is straightforward by implementing the OpenAI-compatible endpoint pattern.
All updates are free for life — including future major versions. Pull the latest code from GitHub whenever a new version is released. No subscription fees, no forced upgrade cycles, no license keys to manage. Your commercial license covers all future features and improvements.
Everything runs on your infrastructure. API keys are encrypted at rest in your local SQLite database. Prompts, responses, and evaluation results stay on your server. BeLLMark sends zero telemetry and makes no outbound calls except to the LLM providers you configure. This architecture supports your compliance goals for frameworks like GDPR and HIPAA — actual certifications depend on your infrastructure setup and LLM provider agreements. Contact us for framework-specific guidance.
BeLLMark sends each model's response to a judge LLM along with your evaluation criteria. The judge scores each response on a 1-10 scale per criterion, providing written reasoning for each score. Responses are presented with blind labels (A, B, C) so the judge doesn't know which model produced which response. For validation, use multi-judge mode (multiple LLMs evaluate independently) and check agreement. Multi-judge mode and statistical calibration ensure reliable, reproducible results.
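Conceptually, the judge sees something like the prompt assembled below. This is a hypothetical sketch to show the shape of the flow (the wording, field names, and helper are illustrative assumptions, not BeLLMark's actual prompt template):

```python
def build_judge_prompt(question: str, blinded: dict[str, str], criteria: list[str]) -> str:
    """Assemble a blind judging prompt: the judge sees labels, never model names."""
    parts = [
        f"Question: {question}",
        "Score each response 1-10 on every criterion and explain your reasoning.",
        "Criteria: " + ", ".join(criteria),
    ]
    for label, text in sorted(blinded.items()):
        parts.append(f"Response {label}:\n{text}")
    return "\n\n".join(parts)

prompt = build_judge_prompt(
    "Summarize the indemnification clause.",
    {"A": "...", "B": "..."},
    ["accuracy", "completeness", "tone"],
)
```

The essential property is visible in the output: the prompt contains blind labels and your criteria, but no model names, so the judge cannot favor a provider.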
Not yet as a built-in feature, but BeLLMark's results are fully exportable (HTML, JSON, CSV, PPTX, PDF) for human review. The recommended workflow: run LLM-as-judge for initial screening, then export the top candidates' responses for human evaluation. Native human evaluation workflows with integrated scoring are on our roadmap.
BeLLMark automatically retries failed API calls up to 3 times with progressive backoff (2s, 5s, 10s delays). It checkpoints before phase transitions (generation → judging) so partial progress is preserved. If an API call fails after all retries, the specific failure is logged and a manual retry button appears in the progress view. Other models and questions continue processing normally.
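The retry schedule described above can be sketched as follows (a simplified illustration of the pattern, not BeLLMark's source): one initial attempt plus one retry per delay, with the final failure re-raised so it can surface in the progress view.

```python
import time

DELAYS = (2, 5, 10)  # seconds between retries, per the schedule above

def call_with_retries(call, delays=DELAYS):
    """Run call(), retrying on failure with progressive backoff.
    One initial attempt plus one retry per delay; re-raises the last error."""
    for delay in (*delays, None):
        try:
            return call()
        except Exception:
            if delay is None:
                raise  # all retries exhausted; surface for manual retry
            time.sleep(delay)
```

With three delays this yields four total attempts; swapping in shorter delays (or zeros in tests) changes timing without changing the control flow.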
BeLLMark currently runs as a single-user application. For team use, we recommend deploying behind your existing authentication (VPN, reverse proxy with SSO, or network-level access control). Multi-user support with role-based access and team workspaces is on our roadmap. All benchmark data is stored in a single SQLite database that can be shared across the team.