Self-hosted benchmarking studio for comparing language models with configurable judges, blind evaluation, and real-time results.
Manual testing is slow, evaluation is biased, and your data shouldn't leave your infrastructure.
Copy-pasting prompts between different LLM platforms, manually comparing responses, and trying to remember which model performed better wastes hours of valuable development time.
When you know which model generated which response, unconscious bias affects your judgment. Your team's preferences and expectations skew results, leading to poor model selection decisions.
Sending proprietary prompts and evaluation data to third-party benchmarking services creates compliance risks. Legal teams need assurance that sensitive data stays on your infrastructure.
Run BeLLMark on your own infrastructure. Your prompts, API keys, and evaluation results never leave your servers. Supports compliance workflows for regulated industries including healthcare and finance.
Responses are shuffled as A/B/C before judging. Neither you nor your judges know which model generated which response until after scoring, eliminating unconscious bias.
Use any LLM as a judge with custom evaluation criteria. Compare GPT-4, Claude, Gemini, and local models side-by-side with AI-generated or custom scoring rubrics.
BeLLMark doesn't just compare models — it gives you a defensible evaluation process.
Responses are shuffled and assigned blind labels (A, B, C). Judges evaluate without knowing which model produced which response. Mapping is revealed only after scoring is complete.
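The blinding step is simple to reason about. Here is a minimal, illustrative sketch of the idea (not BeLLMark's actual code): responses are shuffled, assigned letter labels, and the label-to-model mapping is held back until scoring is done.

```python
import random

def blind_shuffle(responses: dict[str, str]) -> tuple[dict[str, str], dict[str, str]]:
    """Assign blind labels (A, B, C, ...) to model responses in random order."""
    models = list(responses)
    random.shuffle(models)
    labels = [chr(ord("A") + i) for i in range(len(models))]
    blinded = {label: responses[model] for label, model in zip(labels, models)}
    mapping = dict(zip(labels, models))  # withheld until all scores are recorded
    return blinded, mapping

responses = {
    "gpt-4": "Response one...",
    "claude": "Response two...",
    "gemini": "Response three...",
}
blinded, mapping = blind_shuffle(responses)
# Judges only ever see blinded["A"], blinded["B"], blinded["C"].
```

The model names and response strings above are placeholders; the point is that judges receive only the `blinded` dict, while `mapping` is revealed after scoring completes.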
AI generates evaluation criteria from your use case description. You review, edit, and approve the rubric before any benchmark runs. Your criteria, your standards — not a black box.
Every judge score includes written reasoning. Expand any result to read exactly why a judge scored a response the way it did. Use multi-judge mode for confidence through agreement.
Each response is scored on a 1–10 scale per criterion, with defined anchors: 1–3 (poor), 4–6 (acceptable), 7–10 (good to excellent). You define the criteria that matter for your use case — accuracy, completeness, tone, or anything else.
Scores are aggregated using arithmetic mean across judges and criteria, with Wilson Score confidence intervals on win rates and bootstrap CI on overall scores. Inter-rater reliability is measured with Cohen’s/Fleiss’ Kappa. Pairwise comparisons use Wilcoxon signed-rank tests with Holm–Bonferroni correction to prevent false positives. Length bias and position bias are detected automatically.
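To make the win-rate intervals concrete, here is the standard Wilson score interval the description refers to, sketched in plain Python (a textbook formula, not BeLLMark's internal implementation): with a small sample, a 70% raw win rate still leaves a wide band of uncertainty.

```python
from math import sqrt

def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score confidence interval (95% by default) for a win rate."""
    if n == 0:
        return (0.0, 1.0)  # no data: the rate could be anything
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (center - half, center + half)

# A model that wins 14 of 20 blind comparisons has a raw win rate of 0.70,
# but the 95% interval spans roughly 0.48 to 0.85.
low, high = wilson_interval(14, 20)
```

Unlike the naive normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly for small n, which is why it is a common choice for win rates.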
Add API keys for OpenAI, Anthropic, Google, or local LM Studio endpoints.
Write custom prompts or use AI to generate domain-specific test cases.
Models generate responses in parallel, then judges evaluate them blindly.
View charts, ELO rankings, statistical significance tests, and bias analysis. Export to HTML, JSON, CSV, PPTX, or PDF.
Responses are shuffled before judging, and the label-to-model mapping is revealed only after scoring. Eliminates bias toward favored models and keeps evaluation objective.
Use one or more language models as judges with customizable criteria. Choose separate scoring or direct comparison modes.
Let an LLM design evaluation rubrics for your specific use case, or write custom scoring criteria from scratch.
OpenAI, Anthropic, Google, Grok, DeepSeek, GLM, Kimi, Mistral, and local LM Studio models. Add your own providers easily.
Bootstrap CI, Wilcoxon significance tests, Cohen’s d effect sizes, ELO ratings, bias detection, and judge calibration — all built into the dashboard.
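For readers unfamiliar with ELO in this context: each pairwise judge verdict updates two models' ratings so that upsets move ratings more than expected wins. A minimal sketch of the standard ELO update (illustrative only; K-factor and starting rating are assumptions, not BeLLMark's documented defaults):

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Update two ELO ratings after one pairwise comparison.
    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta  # zero-sum: B loses what A gains

# Equal ratings: a win moves each side by exactly K/2 = 16 points.
a, b = elo_update(1000.0, 1000.0, 1.0)
```

Because the update is zero-sum, total rating is conserved across the model pool, and rankings converge as more pairwise verdicts accumulate.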
Export to HTML reports, JSON, CSV, consulting-grade PPTX, or PDF — all formats include statistical summaries and confidence intervals.
Export your benchmarks as professional HTML reports, consulting-grade PPTX presentations, JSON, CSV, or PDF — with full judge reasoning, blind mapping reveal, and data-driven recommendations.
Evaluate LLM accuracy on legal reasoning, contract analysis, and regulatory interpretation without sending client data to external benchmarking services.
Provide clients with objective, data-driven model recommendations backed by systematic benchmarking on their specific use cases.
Make informed decisions about which LLM to use in production by testing on real prompts before committing to API contracts.
Run on your own servers; your data never leaves them. Supports compliance workflows through self-hosting and zero telemetry.
One-time purchase, lifetime license. No recurring fees, no per-seat charges, no usage limits.
Test local LM Studio models alongside cloud providers. Compare cost vs. quality tradeoffs systematically.
Clean web interface, no technical configuration needed. Non-technical stakeholders can run benchmarks independently.
Run your first blind evaluation in 30 minutes — no license required. BeLLMark is free for non-commercial use. When you're ready for production, the commercial license is a one-time $799 purchase ($499 introductory during the first 60 days).
One license covers unlimited users within a single legal entity (corporation, LLC, nonprofit, etc.). If you have multiple subsidiaries or separate legal entities, each needs its own license. Freelancers and sole proprietors need one license for their business use.
BeLLMark is source-available under the PolyForm Noncommercial 1.0.0 license. You can view, modify, and use the code for free for personal, educational, and non-commercial purposes. Commercial use requires a paid license ($799 one-time per legal entity, $499 introductory during the first 60 days). See our licensing terms for what the commercial license covers.
No programming knowledge required! BeLLMark has a clean web interface. If you can use a web browser and have API keys for LLM providers (like OpenAI or Anthropic), you can run benchmarks. Installation requires basic command-line familiarity (Docker or Python/Node.js).
Three installation options: (1) Docker Compose (recommended, one command), (2) Manual setup with Python backend + Node.js frontend, or (3) Production build served from a single backend process. Full instructions in the GitHub repository. Typical setup time: 5-10 minutes.
BeLLMark supports OpenAI (GPT-4, GPT-5, o1), Anthropic (Claude Opus/Sonnet/Haiku), Google (Gemini 2.5/3), Grok, DeepSeek, GLM, Kimi, Mistral, and local LM Studio models. The architecture is modular—adding new providers is straightforward by implementing the OpenAI-compatible endpoint pattern.
All updates are free for life — including future major versions. Pull the latest code from GitHub whenever a new version is released. No subscription fees, no forced upgrade cycles, no license keys to manage. Your commercial license covers all future features and improvements.
Everything runs on your infrastructure. API keys are encrypted at rest in your local SQLite database. Prompts, responses, and evaluation results stay on your server. BeLLMark sends zero telemetry and makes no outbound calls except to the LLM providers you configure. This architecture supports your compliance goals for frameworks like GDPR and HIPAA — actual certifications depend on your infrastructure setup and LLM provider agreements. Contact us for framework-specific guidance.
BeLLMark sends each model's response to a judge LLM along with your evaluation criteria. The judge scores each response on a 1-10 scale per criterion, providing written reasoning for each score. Responses are presented with blind labels (A, B, C) so the judge doesn't know which model produced which response. For validation, use multi-judge mode (multiple LLMs evaluate independently) and check agreement. Multi-judge mode and statistical calibration ensure reliable, reproducible results.
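Conceptually, the judge sees something like the prompt assembled below. This is a hypothetical sketch to show the shape of the flow (the wording, field names, and helper are illustrative assumptions, not BeLLMark's actual prompt template):

```python
def build_judge_prompt(question: str, blinded: dict[str, str], criteria: list[str]) -> str:
    """Assemble a blind judging prompt: the judge sees labels, never model names."""
    parts = [
        f"Question: {question}",
        "Score each response 1-10 on every criterion and explain your reasoning.",
        "Criteria: " + ", ".join(criteria),
    ]
    for label, text in sorted(blinded.items()):
        parts.append(f"Response {label}:\n{text}")
    return "\n\n".join(parts)

prompt = build_judge_prompt(
    "Summarize the indemnification clause.",
    {"A": "...", "B": "..."},
    ["accuracy", "completeness", "tone"],
)
```

The essential property is visible in the output: the prompt contains blind labels and your criteria, but no model names, so the judge cannot favor a provider.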
Not yet as a built-in feature, but BeLLMark's results are fully exportable (HTML, JSON, CSV, PPTX, PDF) for human review. The recommended workflow: run LLM-as-judge for initial screening, then export the top candidates' responses for human evaluation. Native human evaluation workflows with integrated scoring are on our roadmap.
BeLLMark automatically retries failed API calls up to 3 times with progressive backoff (2s, 5s, 10s delays). It checkpoints before phase transitions (generation → judging) so partial progress is preserved. If an API call fails after all retries, the specific failure is logged and a manual retry button appears in the progress view. Other models and questions continue processing normally.
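The retry schedule described above can be sketched as follows (a simplified illustration of the pattern, not BeLLMark's source): one initial attempt plus one retry per delay, with the final failure re-raised so it can surface in the progress view.

```python
import time

DELAYS = (2, 5, 10)  # seconds between retries, per the schedule above

def call_with_retries(call, delays=DELAYS):
    """Run call(), retrying on failure with progressive backoff.
    One initial attempt plus one retry per delay; re-raises the last error."""
    for delay in (*delays, None):
        try:
            return call()
        except Exception:
            if delay is None:
                raise  # all retries exhausted; surface for manual retry
            time.sleep(delay)
```

With three delays this yields four total attempts; swapping in shorter delays (or zeros in tests) changes timing without changing the control flow.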
BeLLMark currently runs as a single-user application. For team use, we recommend deploying behind your existing authentication (VPN, reverse proxy with SSO, or network-level access control). Multi-user support with role-based access and team workspaces is on our roadmap. All benchmark data is stored in a single SQLite database that can be shared across the team.