The Gold Standard for Self-Hosted LLM Evaluation

Self-hosted benchmarking studio for comparing language models with configurable judges, blind evaluation, and real-time results.

BeLLMark results dashboard with charts and statistical analysis

Choosing the wrong LLM costs time and money

Manual testing is slow, evaluation is biased, and your data shouldn't leave your infrastructure.

⏱️

Manual Testing is Slow

Copy-pasting prompts between different LLM platforms, manually comparing responses, and trying to remember which model performed better wastes hours of valuable development time.

🎭

Evaluation Bias

When you know which model generated which response, unconscious bias affects your judgment. Your team's preferences and expectations skew results, leading to poor model selection decisions.

🔒

Data Privacy Concerns

Sending proprietary prompts and evaluation data to third-party benchmarking services creates compliance risks. Legal teams need assurance that sensitive data stays on your infrastructure.

BeLLMark solves this in three ways

🏠

Self-Hosted Privacy

Run BeLLMark on your own infrastructure. Your prompts, API keys, and evaluation results never leave your servers. Supports compliance workflows for regulated industries including healthcare and finance.

👁️

Blind Evaluation

Responses are shuffled as A/B/C before judging. Neither you nor your judges know which model generated which response until after scoring, eliminating unconscious bias.

⚙️

Configurable Judges

Use any LLM as a judge with custom evaluation criteria. Compare GPT-4, Claude, Gemini, and local models side-by-side with AI-generated or custom scoring rubrics.

Evaluation you can defend

BeLLMark doesn't just compare models — it gives you a defensible evaluation process.

🎲

Blind by Default

Responses are shuffled and assigned blind labels (A, B, C). Judges evaluate without knowing which model produced which response. Mapping is revealed only after scoring is complete.

📋

Transparent Rubrics

AI generates evaluation criteria from your use case description. You review, edit, and approve the rubric before any benchmark runs. Your criteria, your standards — not a black box.

🔍

Auditable Reasoning

Every judge score includes written reasoning. Expand any result to read exactly why a judge scored a response the way it did. Use multi-judge mode for confidence through agreement.

How Scoring Works

Each response is scored on a 1–10 scale per criterion, with defined anchors: 1–3 (poor), 4–6 (acceptable), 7–10 (good to excellent). You define the criteria that matter for your use case — accuracy, completeness, tone, or anything else.

Scores are aggregated using arithmetic mean across judges and criteria, with Wilson Score confidence intervals on win rates and bootstrap CI on overall scores. Inter-rater reliability is measured with Cohen’s/Fleiss’ Kappa. Pairwise comparisons use Wilcoxon signed-rank tests with Holm–Bonferroni correction to prevent false positives. Length bias and position bias are detected automatically.
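As an illustration of the statistics involved (a sketch, not BeLLMark's exact code), a Wilson score interval on a win rate can be computed in a few lines:

```python
import math

def wilson_interval(wins: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a win rate (z=1.96)."""
    if total == 0:
        return (0.0, 1.0)
    p = wins / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return (max(0.0, center - margin), min(1.0, center + margin))

lo, hi = wilson_interval(wins=14, total=20)
print(f"observed win rate 70%, 95% CI: [{lo:.2f}, {hi:.2f}]")
```

Unlike a naive normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly at small sample sizes, which matters when a benchmark run involves only a few dozen comparisons per model pair.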

Learn more in our FAQ →

How It Works

1

Configure Models

Add API keys for OpenAI, Anthropic, Google, or local LM Studio endpoints.

2

Define Questions

Write custom prompts or use AI to generate domain-specific test cases.

3

Run Benchmark

Models generate responses in parallel, then judges evaluate them blindly.

4

Analyze Results

View charts, Elo rankings, statistical significance tests, and bias analysis. Export to HTML, JSON, CSV, PPTX, or PDF.
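For readers curious what Elo rankings involve: each pairwise judge verdict updates two ratings using the standard Elo formula. This is a generic sketch (the K-factor and starting ratings here are conventional defaults, not necessarily BeLLMark's):

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update after one pairwise comparison.

    score_a is 1.0 if A wins, 0.5 for a tie, 0.0 if A loses.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two models start at 1500; A wins one comparison
a, b = elo_update(1500, 1500, score_a=1.0)
```

Because each update is zero-sum and depends on the rating gap, an upset win against a highly rated model moves the rankings more than a win over a weak one.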

Everything you need for systematic LLM evaluation

🎲

Blind A/B/C Testing

Responses are shuffled before judging, and mappings are revealed only after scoring. Eliminates model-identity bias and keeps evaluation objective.

⚖️

LLM-as-Judge

Use one or more language models as judges with customizable criteria. Choose separate scoring or direct comparison modes.

🤖

AI Criteria Generation

Let an LLM design evaluation rubrics for your specific use case, or write custom scoring criteria from scratch.

🌐

9 LLM Providers

OpenAI, Anthropic, Google, Grok, DeepSeek, GLM, Kimi, Mistral, and local LM Studio models. Add your own providers easily.

📊

Statistical Analysis

Bootstrap CI, Wilcoxon significance tests, Cohen’s d effect sizes, Elo ratings, bias detection, and judge calibration — all built into the dashboard.

📑

Rich Exports

Export to HTML reports, JSON, CSV, consulting-grade PPTX, or PDF — all formats include statistical summaries and confidence intervals.

See BeLLMark in action

BeLLMark model configuration page
Configure model presets with encrypted API keys
BeLLMark Elo leaderboard ranking 45 models
Elo leaderboard tracking 45 models across all benchmark runs
BeLLMark results with charts and analysis
Interactive results with charts and full response details

Share results with stakeholders

Export your benchmarks as professional HTML reports, consulting-grade PPTX presentations, JSON, CSV, or PDF — with full judge reasoning, blind mapping reveal, and data-driven recommendations.

Built for teams who need to make informed AI decisions

⚖️

Compliance & Legal

Evaluate LLM accuracy on legal reasoning, contract analysis, and regulatory interpretation without sending client data to external benchmarking services.

  • Test contract summarization accuracy
  • Compare legal reasoning capabilities
  • Validate compliance advisory quality
  • Keep sensitive data on-premises
💼

AI Consultants

Provide clients with objective, data-driven model recommendations backed by systematic benchmarking on their specific use cases.

  • Generate client-specific test cases
  • Deliver professional HTML reports
  • Compare cost vs. performance tradeoffs
  • Justify model selection decisions
🛠️

Engineering Teams

Make informed decisions about which LLM to use in production by testing on real prompts before committing to API contracts.

  • Test local vs. cloud model quality
  • Validate prompt engineering changes
  • Compare reasoning model performance
  • A/B test prompt templates

Why teams choose BeLLMark

🏠

Self-Hosted

Run on your own infrastructure; your data never leaves it. Supports compliance workflows through self-hosting and zero telemetry.

💰

No Subscription

One-time purchase, lifetime license. No recurring fees, no per-seat charges, no usage limits.

🖥️

Local Models

Test local LM Studio models alongside cloud providers. Compare cost vs. quality tradeoffs systematically.

Accessible

Clean web interface, no technical configuration needed. Non-technical stakeholders can run benchmarks independently.

Simple, transparent pricing

Commercial License
BeLLMark Business
$799
one-time, per legal entity ($499 introductory — first 60 days)
  • Unlimited users within your organization
  • Self-hosted on your infrastructure
  • All current and future features
  • 9 LLM provider integrations
  • Blind A/B/C evaluation
  • AI criteria generation
  • Real-time progress tracking
  • HTML/JSON/CSV exports
  • Email support
  • Free updates for life

Be the first to know when BeLLMark launches commercially.

Try Before You Buy

Run your first blind evaluation in 30 minutes — no license required. BeLLMark is free for non-commercial use. When you're ready for production, the commercial license is a one-time $799 purchase ($499 introductory during the first 60 days).

Join the Waitlist →
Free for personal, educational, and non-commercial use under PolyForm Noncommercial 1.0.0 license. Commercial license is for production and revenue-generating use.

Frequently Asked Questions

What does "per legal entity" mean?

One license covers unlimited users within a single legal entity (corporation, LLC, nonprofit, etc.). If you have multiple subsidiaries or separate legal entities, each needs its own license. Freelancers and sole proprietors need one license for their business use.

Is BeLLMark open source?

BeLLMark is source-available under the PolyForm Noncommercial 1.0.0 license. You can view, modify, and use the code for free for personal, educational, and non-commercial purposes. Commercial use requires a paid license ($799 one-time per legal entity, $499 introductory during the first 60 days). See our licensing terms for what the commercial license covers.

Do I need technical skills to use BeLLMark?

No programming knowledge required! BeLLMark has a clean web interface. If you can use a web browser and have API keys for LLM providers (like OpenAI or Anthropic), you can run benchmarks. Installation requires basic command-line familiarity (Docker or Python/Node.js).

How do I install BeLLMark?

Three installation options: (1) Docker Compose (recommended, one command), (2) Manual setup with Python backend + Node.js frontend, or (3) Production build served from a single backend process. Full instructions in the GitHub repository. Typical setup time: 5-10 minutes.
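For option (1), the flow looks roughly like this (the repository URL and port are placeholders; follow the actual instructions in the GitHub repository):

```shell
# Docker Compose install sketch -- URL and port are illustrative
git clone https://github.com/<your-org>/bellmark.git
cd bellmark
docker compose up -d        # builds and starts backend + frontend
# then open the web UI in your browser, e.g. http://localhost:3000
```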

What LLM providers are supported?

BeLLMark supports OpenAI (GPT-4, GPT-5, o1), Anthropic (Claude Opus/Sonnet/Haiku), Google (Gemini 2.5/3), Grok, DeepSeek, GLM, Kimi, Mistral, and local LM Studio models. The architecture is modular: new providers can be added by implementing the OpenAI-compatible endpoint pattern.
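To make the "OpenAI-compatible endpoint pattern" concrete, here is a hedged sketch (function names, URL, and model name are illustrative, not BeLLMark internals) of why one client implementation covers many providers: they all accept the same `/chat/completions` request shape.

```python
import json
from urllib import request

def build_chat_request(base_url: str, model: str, prompt: str, api_key: str = "") -> request.Request:
    """Build an OpenAI-style chat completion request for any compatible provider."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
    )

def chat_completion(base_url: str, model: str, prompt: str, api_key: str = "") -> str:
    """Call the endpoint and return the assistant's reply text."""
    with request.urlopen(build_chat_request(base_url, model, prompt, api_key)) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# The same code path works for a local LM Studio server or a cloud provider:
# chat_completion("http://localhost:1234/v1", "local-model", "Hello")
```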

How do updates work?

All updates are free for life — including future major versions. Pull the latest code from GitHub whenever a new version is released. No subscription fees, no forced upgrade cycles, no license keys to manage. Your commercial license covers all future features and improvements.

What about my API keys and data privacy?

Everything runs on your infrastructure. API keys are encrypted at rest in your local SQLite database. Prompts, responses, and evaluation results stay on your server. BeLLMark sends zero telemetry and makes no outbound calls except to the LLM providers you configure. This architecture supports your compliance goals for frameworks like GDPR and HIPAA — actual certifications depend on your infrastructure setup and LLM provider agreements. Contact us for framework-specific guidance.

How does LLM-as-judge work and how do you validate it?

BeLLMark sends each model's response to a judge LLM along with your evaluation criteria. The judge scores each response on a 1-10 scale per criterion, providing written reasoning for each score. Responses are presented with blind labels (A, B, C) so the judge doesn't know which model produced which response. For validation, use multi-judge mode (multiple LLMs evaluate independently) and check inter-rater agreement; combined with the built-in calibration metrics, this makes results more reliable and reproducible.
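The blind-labeling mechanics can be pictured like this (a simplified illustration; the function and prompt wording are hypothetical, not BeLLMark's internals):

```python
import random

def build_judge_prompt(question: str, responses: dict[str, str],
                       criteria: list[str]) -> tuple[str, dict[str, str]]:
    """Shuffle model responses behind blind labels and build a judge prompt.

    Returns the prompt plus the label -> model mapping, which is kept
    hidden from the judge and revealed only after scoring.
    """
    models = list(responses)
    random.shuffle(models)
    labels = [chr(ord("A") + i) for i in range(len(models))]
    mapping = dict(zip(labels, models))
    blocks = "\n\n".join(f"Response {lbl}:\n{responses[mapping[lbl]]}" for lbl in labels)
    rubric = "\n".join(f"- {c}: score 1-10, with written reasoning" for c in criteria)
    prompt = (f"Question:\n{question}\n\n{blocks}\n\n"
              f"Score each response on these criteria:\n{rubric}")
    return prompt, mapping
```

Because the judge sees only "Response A" and "Response B", its scores cannot be influenced by brand reputation; the mapping is consulted only when results are displayed.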

Can we use human raters alongside LLM judges?

Not yet as a built-in feature, but BeLLMark's results are fully exportable (HTML, JSON, CSV, PPTX, PDF) for human review. The recommended workflow: run LLM-as-judge for initial screening, then export the top candidates' responses for human evaluation. Native human evaluation workflows with integrated scoring are on our roadmap.

How do you handle rate limits, failures, and retries?

BeLLMark automatically retries failed API calls up to 3 times with progressive backoff (2s, 5s, 10s delays). It checkpoints before phase transitions (generation → judging) so partial progress is preserved. If an API call fails after all retries, the specific failure is logged and a manual retry button appears in the progress view. Other models and questions continue processing normally.
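The retry behavior described above amounts to a standard backoff loop; here is a minimal sketch (not the actual implementation) using the same 2s/5s/10s schedule:

```python
import time

def call_with_retries(fn, delays=(2, 5, 10)):
    """Call fn(), retrying up to len(delays) times with progressive backoff.

    Re-raises the last error if every attempt fails, so the failure can be
    logged and surfaced for manual retry while other work continues.
    """
    last_err = None
    for delay in (0, *delays):  # first attempt immediately, then backoff
        if delay:
            time.sleep(delay)
        try:
            return fn()
        except Exception as err:  # in practice, catch provider-specific errors
            last_err = err
    raise last_err
```

Checkpointing between phases then only needs to record which (model, question) pairs completed; anything that exhausted its retries is re-queued rather than re-running the whole benchmark.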

Do you support role-based access or multiple workspaces?

BeLLMark currently runs as a single-user application. For team use, we recommend deploying behind your existing authentication (VPN, reverse proxy with SSO, or network-level access control). Multi-user support with role-based access and team workspaces is on our roadmap. All benchmark data is stored in a single SQLite database that can be shared across the team.

Ready to evaluate LLMs the right way?