EvalAssist is an application that simplifies using large language models as evaluators (LLM-as-a-Judge) of the output of other large language models. It supports users in iteratively refining evaluation criteria through a web-based user experience.
Key Features
- Flexible evaluation methods for direct assessment and pairwise comparison (see the sketch after this list)
- AI-assisted criteria refinement with synthetic data generation
- Built-in trustworthiness metrics, including positional bias analysis (see the second sketch below)
- Scalable toolkit built on the Unitxt evaluation library
- Integration with diverse LLM judges, including IBM Granite Guardian, Llama 3, Mixtral, and GPT-4
- Test case catalog with support for community contributions
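
To make the two evaluation modes concrete, here is a minimal, library-agnostic Python sketch. The `JudgeModel` callable and the prompt wording are hypothetical placeholders introduced for illustration; they are not EvalAssist's or Unitxt's actual API, which handles prompting and parsing internally.

```python
from typing import Callable

# Hypothetical stand-in for a call to an LLM judge (e.g., Granite Guardian,
# Llama 3, Mixtral, or GPT-4): takes a prompt, returns the judge's answer.
JudgeModel = Callable[[str], str]


def direct_assessment(judge: JudgeModel, criterion: str, output: str) -> str:
    """Direct assessment: the judge rates a single output against a criterion."""
    prompt = (
        f"Criterion: {criterion}\n"
        f"Output to evaluate:\n{output}\n"
        "Does the output satisfy the criterion? Answer 'Yes' or 'No'."
    )
    return judge(prompt)


def pairwise_comparison(
    judge: JudgeModel, criterion: str, output_a: str, output_b: str
) -> str:
    """Pairwise comparison: the judge picks the better of two outputs."""
    prompt = (
        f"Criterion: {criterion}\n"
        f"Output A:\n{output_a}\n"
        f"Output B:\n{output_b}\n"
        "Which output better satisfies the criterion? Answer 'A' or 'B'."
    )
    return judge(prompt)
```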
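
Positional bias is the tendency of a pairwise judge to favor an output because of where it appears in the prompt rather than its content. One common check, continuing the hypothetical sketch above and reusing its `pairwise_comparison` helper, is to run the comparison twice with the candidate order swapped and flag disagreements:

```python
def positional_bias_check(
    judge: JudgeModel, criterion: str, output_a: str, output_b: str
) -> dict:
    """Run the pairwise comparison in both orders; inconsistent verdicts
    suggest the judge is sensitive to position rather than content."""
    first = pairwise_comparison(judge, criterion, output_a, output_b)
    # Swap the candidates, then map the winner label back to the
    # original A/B naming so the two verdicts are comparable.
    second = pairwise_comparison(judge, criterion, output_b, output_a)
    swapped_winner = {"A": "B", "B": "A"}.get(second, second)
    return {"winner": first, "consistent": first == swapped_winner}
```

If `consistent` is `False`, the verdict depends on presentation order and should be treated with caution.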