Status: Active · Organization: IBM Research · Language: Python
EvalAssist is an application that simplifies using large language models as evaluators (LLM-as-a-Judge) of the output of other large language models. It supports users in iteratively refining evaluation criteria through a web-based user experience.
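To make the LLM-as-a-Judge pattern concrete, here is a minimal sketch of a direct-assessment call. This is an illustrative assumption, not EvalAssist's actual API: the `judge` callable, the prompt format, and the option list are all hypothetical stand-ins for a real model invocation.

```python
from typing import Callable

def direct_assessment(
    judge: Callable[[str], str],  # hypothetical stand-in for an LLM call
    criterion: str,
    options: list[str],
    response: str,
) -> str:
    """Ask an LLM judge to rate one response against one criterion."""
    prompt = (
        "You are an evaluator.\n"
        f"Criterion: {criterion}\n"
        f"Choose exactly one option from {options}.\n"
        f"Response to evaluate:\n{response}\n"
        "Answer with the option only."
    )
    verdict = judge(prompt).strip()
    # Guard against judges that answer outside the allowed options.
    return verdict if verdict in options else "invalid"

# Usage with a stub judge; a real setup would call an actual model:
stub_judge = lambda prompt: "Yes"
print(direct_assessment(stub_judge, "Is the answer concise?", ["Yes", "No"], "Paris."))
```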

Key Features

Flexible evaluation methods for direct assessment and pairwise comparison
AI-assisted criteria refinement with synthetic data generation
Built-in trustworthiness metrics, including positional bias analysis (see the sketch after this list)
Scalable toolkit built on the Unitxt evaluation library
Integration with diverse LLM judges including IBM Granite Guardian, Llama 3, Mixtral, and GPT-4
Test case catalog with community contribution support
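Positional bias in pairwise comparison means a judge tends to prefer whichever candidate is shown first. A common way to detect it, sketched below as an assumption about how such a check might work rather than EvalAssist's internals, is to run each comparison twice with the candidate order swapped and flag verdicts that track position instead of content.

```python
from typing import Callable

def shows_positional_bias(
    judge: Callable[[str, str], str],  # hypothetical: returns "A" if it
    output_a: str,                     # prefers its first argument, else "B"
    output_b: str,
) -> bool:
    """Run a pairwise comparison twice, swapping candidate order."""
    first = judge(output_a, output_b)    # output_a presented first
    swapped = judge(output_b, output_a)  # order reversed
    # A consistent judge flips its letter when the order flips ("A" then
    # "B", or "B" then "A"). The same letter twice means the verdict
    # followed position rather than content: evidence of positional bias.
    return first == swapped
```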
Publications: 6+
Team Members: 10+
Active Development: 2024-2025

More Projects Coming Soon

I'm actively working on additional open source projects that will be shared here. Stay tuned for updates!