Flexible evaluation methods supporting both direct assessment and pairwise comparison
AI-assisted criteria refinement with synthetic data generation
Built-in trustworthiness metrics including positional bias analysis
Scalable toolkit built on the Unitxt evaluation library
Integration with diverse LLM judges including IBM Granite Guardian, Llama 3, Mixtral, and GPT-4
Test case catalog with community contribution support
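To illustrate the first and third items, here is a minimal sketch of a pairwise comparison that doubles as a positional-bias check: the same pair of responses is judged twice with the candidate order swapped, and an inconsistent verdict flags the judge as position-biased. The `pairwise_with_bias_check` helper and the toy judges are hypothetical illustrations of the technique, not EvalAssist or Unitxt APIs.

```python
from typing import Callable, Optional, Tuple

# A judge sees a criterion and two candidate responses and answers
# "A" (first slot wins) or "B" (second slot wins).
Judge = Callable[[str, str, str], str]

def pairwise_with_bias_check(judge: Judge, criterion: str,
                             resp_a: str, resp_b: str) -> Tuple[Optional[str], bool]:
    """Run the comparison twice with the candidate order swapped.

    Returns (winning response or None, consistency flag). An inconsistent
    verdict across the two orderings indicates positional bias.
    """
    first = judge(criterion, resp_a, resp_b)   # original order
    second = judge(criterion, resp_b, resp_a)  # same pair, slots swapped
    # Map the positional labels back to the underlying responses.
    winner_first = resp_a if first == "A" else resp_b
    winner_second = resp_b if second == "A" else resp_a
    consistent = winner_first == winner_second
    return (winner_first if consistent else None), consistent

# Toy deterministic judges standing in for a real LLM (hypothetical):
def length_judge(criterion: str, a: str, b: str) -> str:
    return "A" if len(a) >= len(b) else "B"    # position-independent

def first_slot_judge(criterion: str, a: str, b: str) -> str:
    return "A"                                 # always prefers the first slot

winner, ok = pairwise_with_bias_check(
    length_judge, "completeness", "short answer", "a much more detailed answer")
print(winner, ok)        # the longer response wins consistently

_, biased_ok = pairwise_with_bias_check(
    first_slot_judge, "completeness", "short answer", "a much more detailed answer")
print(biased_ok)         # the order-swap exposes the positional bias
```

A real trustworthiness metric would aggregate this consistency rate over many test cases rather than a single pair, but the swap-and-compare core is the same.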