Prompt Detail

Multi-Model Research

While optimized for Multi-Model, this prompt is compatible with most major AI models.

AI Model Evaluation Framework

Design comprehensive benchmarking protocols for evaluating AI models across multiple dimensions, including reasoning, creativity, coding, and safety, with reproducible methodologies.

# Role
You are an AI Evaluation Specialist who designs rigorous, comprehensive testing protocols to assess language model capabilities across multiple dimensions, ensuring fair, reproducible, and meaningful comparisons.

# Task
Create a complete evaluation framework for AI models that covers key capability areas with specific tests, metrics, and comparison methodologies.

# Instructions

## Phase 1: Evaluation Scope
1. **Purpose Definition**: What question are we answering?
   - Model selection for a specific use case
   - Capability benchmarking
   - Safety/red-teaming assessment
   - Fine-tuning validation
2. **Model Selection**: Which models to evaluate
   - Closed models (GPT-4, Claude, etc.)
   - Open models (Llama, Mistral, etc.)
   - Specialized models (code, math, etc.)
3. **Dimension Selection**: What capabilities to test
   - Reasoning and logic
   - Knowledge and factuality
   - Coding ability
   - Creative writing
   - Instruction following
   - Safety and alignment
   - Efficiency (speed, cost)

## Phase 2: Test Design
For each capability dimension:
1. **Test Categories**:
   - **Standard Benchmarks**: Existing datasets (MMLU, HumanEval, etc.)
   - **Custom Tests**: Domain-specific evaluations
   - **Adversarial Tests**: Edge cases, tricky scenarios
   - **Real-World Tasks**: Practical applications
2. **Test Construction**:
   - Clear prompts
   - Expected outputs/rubrics
   - Difficulty gradation (easy/medium/hard)
   - Contamination awareness (training data overlap)
3. **Evaluation Methods**:
   - Automatic metrics (accuracy, BLEU, etc.)
   - Human evaluation criteria
   - Model-as-judge approaches
   - Comparative ranking

## Phase 3: Benchmark Suite
Design specific tests for each dimension:

### Reasoning & Logic
- Deductive reasoning
- Inductive reasoning
- Abductive reasoning
- Mathematical reasoning
- Commonsense reasoning
- Analogical reasoning
- Causal reasoning

### Knowledge & Factuality
- General knowledge
- Domain expertise (by field)
- Temporal knowledge (current events)
- Multilingual knowledge
- Fact checking
- Uncertainty calibration

### Coding Ability
- Code generation
- Code explanation
- Debugging
- Code review
- Algorithm design
- Language proficiency (Python, JS, etc.)
- Framework knowledge

### Creative Writing
- Story generation
- Style adaptation
- Character development
- Dialogue writing
- Poetry
- Persuasive writing
- Technical writing

### Instruction Following
- Simple instructions
- Complex multi-step instructions
- Constrained outputs (format, length)
- Negative constraints (don't do X)
- Ambiguity resolution
- Context utilization

### Safety & Alignment
- Harmful content refusal
- Jailbreak resistance
- Bias detection
- Truthfulness
- Helpfulness vs. harmlessness trade-offs
- Privacy awareness

## Phase 4: Evaluation Protocol
1. **Test Administration**:
   - Prompt formatting (consistent across models)
   - Temperature/settings
   - Retry logic for failures
   - Timeout handling
2. **Scoring Methodology**:
   - Exact match vs. semantic similarity
   - Rubric-based grading
   - Multiple-judge aggregation
   - Confidence intervals
3. **Statistical Analysis** (see the significance-testing sketch after the output format):
   - Significance testing
   - Variance analysis
   - Error analysis
   - Confidence scores

## Phase 5: Comparison Framework
1. **Head-to-Head**: Direct comparison on the same tasks
2. **Elo Ratings**: Paired comparison aggregation (see the sketch after this phase)
3. **Capability Profiles**: Radar charts by dimension
4. **Cost-Performance**: Quality per dollar/token
5. **Speed-Performance**: Quality per unit of latency
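A minimal reference sketch of the Elo aggregation named in item 2, assuming pairwise judgments arrive as (winner, loser) records per test prompt; the model names, K-factor of 32, and 1500 starting rating are illustrative defaults rather than values this framework prescribes.

```python
from collections import defaultdict

def update_elo(ratings, winner, loser, k=32):
    """Standard Elo update for a single pairwise comparison."""
    expected = 1 / (1 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += k * (1 - expected)  # the winner gains exactly what the loser loses
    ratings[loser] -= k * (1 - expected)

# Hypothetical pairwise preferences (winner, loser), e.g. from human raters or a
# model-as-judge pass over the same test prompts.
comparisons = [
    ("model_a", "model_b"),
    ("model_b", "model_c"),
    ("model_a", "model_c"),
]

ratings = defaultdict(lambda: 1500.0)  # every model starts from the same rating
for winner, loser in comparisons:
    update_elo(ratings, winner, loser)

for model, rating in sorted(ratings.items(), key=lambda item: -item[1]):
    print(f"{model}: {rating:.0f}")
```

Ties can be handled by awarding each side half a win, and the comparison order should be shuffled (or ratings averaged over several random orders) so early results do not dominate the final standings.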
## Phase 6: Reporting
1. **Quantitative Results**: Scores and rankings
2. **Qualitative Analysis**: Strengths and weaknesses
3. **Error Analysis**: Common failure modes
4. **Use Case Recommendations**: Best model for specific scenarios
5. **Limitations**: What's not captured by the evaluation

# Output Format

```markdown
# AI Model Evaluation Framework: [Name]

**Version**: [X.X]
**Date**: [Date]
**Models Evaluated**: [List]
**Total Tests**: [N]
**Estimated Runtime**: [Duration]

---

## Executive Summary
[Overview of evaluation purpose and key findings]

### Capability Rankings
| Rank | Model | Overall Score | Best At | Weakness |
|------|-------|---------------|---------|----------|
| 1 | [Model] | [Score] | [Strength] | [Weakness] |

---

## 1. Evaluation Dimensions

### Dimension 1: Reasoning & Logic
**Weight**: [N%]
**Tests**: [N]
**Description**: [What's being measured]

#### Sub-tests
1. **Deductive Reasoning**
   - **Method**: [How tested]
   - **Dataset**: [Source or custom]
   - **Metric**: [Scoring approach]
   - **Example**: [Sample test]
2. **Mathematical Reasoning**
   - **Method**: [How tested]
   - **Dataset**: [GSM8K, MATH, etc.]
   - **Metric**: Accuracy

[Continue for all sub-tests...]

#### Results
| Model | Deductive | Math | Commonsense | Overall |
|-------|-----------|------|-------------|---------|
| [Model] | [Score] | [Score] | [Score] | [Score] |

### Dimension 2: Knowledge & Factuality
[Same structure...]

### Dimension 3: Coding Ability
[Same structure...]

### Dimension 4: Creative Writing
[Same structure...]

### Dimension 5: Instruction Following
[Same structure...]

### Dimension 6: Safety & Alignment
[Same structure...]

---

## 2. Test Suite Details

### Test: [Test Name]
**Dimension**: [Category]
**Difficulty**: [Easy/Med/Hard]
**Type**: [Multiple choice / Open ended / Code]

**Prompt**:
```
[Exact prompt used]
```

**Expected Output**:
```
[Ideal response or rubric]
```

**Evaluation**:
- **Method**: [How scored]
- **Rubric**: [Grading criteria]

**Results**:
| Model | Score | Output Sample |
|-------|-------|---------------|
| [Model] | [Score] | [Excerpt] |

---

## 3. Evaluation Protocol

### Test Administration
- **Temperature**: [Setting]
- **Max Tokens**: [Limit]
- **System Prompt**: [If used]
- **Retries**: [Policy]

### Scoring Guidelines
[Detailed rubrics for subjective evaluations]

### Statistical Methods
[How significance is determined]

---

## 4. Detailed Results

### Capability Profile
```
[Radar chart or table showing performance by dimension]
```

### Head-to-Head Comparison
| Model A vs Model B | Wins | Losses | Ties | Significance |
|-------------------|------|--------|------|--------------|
| GPT-4 vs Claude | [N] | [N] | [N] | [p-value] |

### Cost-Performance Analysis
| Model | Cost per 1K tokens | Quality Score | Efficiency |
|-------|-------------------|---------------|------------|
| [Model] | $X.XX | [Score] | [Score/$] |

---

## 5. Error Analysis

### Common Failure Modes
1. **[Failure type]**: [Description and frequency]
   - Example: [Specific case]
   - Affected models: [Which models]

### Surprising Results
[Unexpected findings and hypotheses]

---

## 6. Recommendations

### By Use Case

#### Use Case: [Scenario]
**Recommended Model**: [Model]
**Rationale**: [Why this choice]
**Alternatives**: [If primary unavailable]

### By Budget
**High Budget**: [Recommendation]
**Medium Budget**: [Recommendation]
**Low Budget**: [Recommendation]

### By Latency Requirements
**Real-time**: [Recommendation]
**Batch**: [Recommendation]

---

## 7. Limitations & Future Work

### Evaluation Limitations
- [What's not captured]
- [Potential biases]
- [Contamination concerns]

### Planned Improvements
- [Additional tests]
- [Methodology refinements]
- [New dimensions to add]

---

## Appendix A: All Test Prompts
[Complete test suite]

## Appendix B: Raw Results
[Detailed score breakdowns]

## Appendix C: Model Outputs
[Sample responses]
```
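As flagged under Phase 4, a sketch of one way the Significance column in the head-to-head table could be filled in: a paired bootstrap over per-test scores for two models run on the same test suite. The score lists, resample count, and function name are illustrative assumptions, not part of the required output.

```python
import random

def paired_bootstrap_pvalue(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided bootstrap test for a difference in mean score between two models.

    scores_a and scores_b are per-test scores (e.g. 0/1) on the *same* tests,
    in the same order, so each difference is paired."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs) / len(diffs)
    extreme = 0
    for _ in range(n_resamples):
        # Resample paired differences with replacement, then re-center on zero
        # to approximate the "no real difference" null hypothesis.
        resample = [rng.choice(diffs) for _ in diffs]
        shifted = sum(resample) / len(resample) - observed
        if abs(shifted) >= abs(observed):
            extreme += 1
    return observed, extreme / n_resamples

# Illustrative per-test results for two models on the same eight tests.
mean_diff, p_value = paired_bootstrap_pvalue([1, 1, 0, 1, 1, 0, 1, 1],
                                             [1, 0, 0, 1, 0, 0, 1, 1])
print(f"mean difference = {mean_diff:+.2f}, p ~= {p_value:.3f}")
```

The same procedure works for graded rubric scores, not just pass/fail results; for very small test sets, an exact permutation test over the paired differences is a safer alternative.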
# Constraints
- Ensure tests are fair across models (same prompts, conditions)
- Account for training data contamination
- Use multiple evaluation methods for important capabilities
- Document all parameters and settings (see the runner sketch below)
- Include statistical significance testing
- Acknowledge evaluation limitations
- Test for both capabilities AND failure modes
- Consider real-world relevance, not just benchmark scores
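A companion sketch of how the fairness and documentation constraints above might be enforced by a test runner: one fixed settings block, identical prompts for every model, bounded retries with backoff, and failures recorded rather than silently dropped. `query_model` is a hypothetical placeholder for whichever client library is actually used, and the setting values are examples only.

```python
import time

# One settings block, recorded alongside every result so runs are reproducible.
SETTINGS = {"temperature": 0.0, "max_tokens": 1024}

def query_model(model_name, prompt, settings):
    """Hypothetical placeholder: swap in the real provider client call here."""
    raise NotImplementedError

def run_test(model_name, prompt, retries=3, backoff_seconds=2.0):
    """Send the same prompt, with the same settings, to any model under test."""
    for attempt in range(1, retries + 1):
        try:
            return {"output": query_model(model_name, prompt, SETTINGS), "error": None}
        except Exception as error:
            if attempt == retries:
                return {"output": None, "error": str(error)}  # recorded, not dropped
            time.sleep(backoff_seconds * attempt)  # simple linear backoff between retries

def run_suite(models, tests):
    """Cross every model with every test under identical conditions."""
    return {
        (model, test["id"]): run_test(model, test["prompt"])
        for model in models
        for test in tests
    }
```

Logging SETTINGS next to each result set is what makes a later rerun directly comparable to the original.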

