LLM Test Containers¶
ASQI Engineer provides several pre-built test containers specifically designed for comprehensive LLM system evaluation. Each container implements industry-standard testing frameworks and provides structured evaluation metrics.
Mock Tester¶
Purpose: Development and validation testing with configurable simulation.
Framework: Custom lightweight testing framework
Location: test_containers/mock_tester/
System Requirements¶
System Under Test: llm_api (required) - The LLM system being tested
Input Parameters¶
delay_seconds (integer, optional): Seconds to sleep, simulating processing work
Output Metrics¶
success (boolean): Whether test execution completed successfully
score (float): Mock test score (0.0 to 1.0)
delay_used (integer): Actual delay in seconds used
base_url (string): API endpoint that was accessed
model (string): Model name that was tested
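For orientation, a result matching these metrics might look like the following (the values are illustrative, and the exact JSON shape emitted by the container is an assumption):

```python
# Illustrative mock_tester result (hypothetical values; field names
# follow the Output Metrics list above).
result = {
    "success": True,
    "score": 0.95,
    "delay_used": 2,
    "base_url": "http://localhost:4000/v1",
    "model": "gpt-4o",
}

# Basic sanity checks against the documented types and ranges.
assert isinstance(result["success"], bool)
assert 0.0 <= result["score"] <= 1.0
assert result["delay_used"] >= 0
```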
Example Configuration¶
test_suite:
  - name: "basic_compatibility_check"
    image: "my-registry/mock_tester:latest"
    systems_under_test: ["my_llm_service"]
    params:
      delay_seconds: 2
Build Instructions¶
cd test_containers/mock_tester
docker build -t my-registry/mock_tester:latest .
Garak Security Tester¶
Purpose: Comprehensive LLM security vulnerability assessment.
Framework: Garak - Industry-standard LLM security testing
Location: test_containers/garak/
System Requirements¶
System Under Test: llm_api (required) - The LLM system being tested for security vulnerabilities
Input Parameters¶
probes (list, optional): List of specific garak probes to execute
Available probe categories: encoding, promptinject, dan, malwaregen, misleading, xss, continuation, av_spam_scanning, donotanswer, and 20+ more
Example probes: ["promptinject", "encoding.InjectHex", "dan.DAN_Jailbreak"]
generations (integer, optional): Number of generations per probe (default: 1)
parallel_attempts (integer, optional): Parallel probe execution (default: 8)
Output Metrics¶
success (boolean): Whether garak execution completed successfully
score (float): Overall pass rate across all probes (0.0 to 1.0)
vulnerabilities_found (integer): Total number of failed attempts (potential vulnerabilities)
total_attempts (integer): Total number of probe attempts made
probes_used (list): List of probes that were executed
generations (integer): Number of generations per probe used
sut_type (string): Type of system under test
probe_results (object): Detailed results by probe and detector with individual scores
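The counters above imply a simple relationship between the overall score and the failure counts. A minimal sketch, assuming the score is just the fraction of probe attempts that did not fail (the metric descriptions suggest this, but the exact formula is an assumption):

```python
def garak_pass_rate(total_attempts: int, vulnerabilities_found: int) -> float:
    """Approximate the overall `score` as the fraction of probe attempts
    that passed. Assumption: score = (attempts - failures) / attempts."""
    if total_attempts == 0:
        return 0.0
    return (total_attempts - vulnerabilities_found) / total_attempts

# e.g. 200 attempts with 14 potential vulnerabilities found
print(garak_pass_rate(200, 14))  # 0.93
```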
Key Security Probes¶
Prompt Injection:
promptinject.HijackHateHumans: Tests for harmful instruction hijacking
promptinject.HijackKillHumans: Tests for violent instruction injection
promptinject.HijackLongPrompt: Long-form prompt injection attacks
Encoding Attacks:
encoding.InjectBase64: Base64-encoded malicious instructions
encoding.InjectHex: Hexadecimal-encoded attacks
encoding.InjectROT13: ROT13-encoded instruction injection
encoding.InjectMorse: Morse code-based encoding attacks
Jailbreak Attempts:
dan.DAN_Jailbreak: Standard DAN (Do Anything Now) jailbreak
dan.AutoDAN: Automated jailbreak generation
dan.ChatGPT_Developer_Mode_v2: Developer mode exploitation
Content Generation:
malwaregen.Payload: Malware code generation attempts
malwaregen.Evasion: Evasion technique generation
misleading.FalseAssertion: False information generation tests
Example Configuration¶
test_suite:
  - name: "comprehensive_security_scan"
    image: "my-registry/garak:latest"
    systems_under_test: ["production_model"]
    params:
      probes: [
        "promptinject",
        "encoding.InjectBase64",
        "encoding.InjectHex",
        "dan.DAN_Jailbreak",
        "dan.AutoDAN",
        "malwaregen.Payload",
        "misleading.FalseAssertion"
      ]
      generations: 20
      parallel_attempts: 6
Build Instructions¶
cd test_containers/garak
docker build -t my-registry/garak:latest .
Environment Requirements¶
export OPENAI_API_KEY="your-api-key"
# Or other provider-specific keys based on your system configuration
DeepTeam Red Team Tester¶
Purpose: Advanced adversarial robustness testing with multi-system orchestration.
Framework: DeepEval DeepTeam - Advanced red teaming library
Location: test_containers/deepteam/
System Requirements¶
System Under Test: llm_api (required) - Primary system being tested
Simulator System: llm_api (optional) - System for generating adversarial attacks and scenarios
Evaluator System: llm_api (optional) - System for evaluating target responses
Input Parameters¶
vulnerabilities (list, optional): Vulnerability configurations; each item has a name and optional types
Available vulnerabilities: bias, pii_leakage, prompt_leakage, toxicity, misinformation, excessive_agency, robustness, competition, intellectual_property, illegal_activity, graphic_content, personal_safety, unauthorized_access, custom
attacks (list, optional): Attack method names
Available attacks: base64, graybox, leetspeak, math_problem, multilingual, prompt_injection, prompt_probing, roleplay, rot13, crescendo_jailbreaking, linear_jailbreaking, tree_jailbreaking, sequential_jailbreak, bad_likert_judge
max_concurrent (integer, optional): Maximum concurrent operations (default: 10)
attacks_per_vulnerability_type (integer, optional): Attacks per vulnerability type (default: 3)
target_purpose (string, optional): Description of the target system’s purpose, for context
Output Metrics¶
success (boolean): Whether the system passed the security threshold (80% pass rate)
pass_rate (float): Proportion of tests that passed (secure responses)
failure_rate (float): Proportion of tests that failed (vulnerable responses)
total_tests (integer): Total number of red team tests performed
total_passing (integer): Number of secure responses
total_failing (integer): Number of vulnerable responses
total_errored (integer): Number of tests with errors
vulnerability_stats (object): Detailed per-vulnerability statistics including pass rates
attack_stats (object): Detailed per-attack method statistics
vulnerabilities_tested (list): Vulnerability types that were tested
attacks_used (list): Attack methods that were used
model_tested (string): Model identifier that was tested
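A sketch of how the summary fields could be derived from the raw counts. This assumes errored tests are excluded from the rates, which the source does not confirm; the 80% threshold comes from the success description above:

```python
def deepteam_summary(total_passing: int, total_failing: int) -> dict:
    """Derive the rate metrics from raw counts (sketch; ignores
    total_errored, which may also contribute to total_tests)."""
    total = total_passing + total_failing
    pass_rate = total_passing / total if total else 0.0
    return {
        "total_tests": total,
        "pass_rate": pass_rate,
        "failure_rate": 1.0 - pass_rate,
        "success": pass_rate >= 0.8,  # 80% security threshold
    }

summary = deepteam_summary(total_passing=90, total_failing=30)
print(summary["pass_rate"], summary["success"])  # 0.75 False
```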
Example Configuration¶
test_suite:
  - name: "advanced_red_team_assessment"
    image: "my-registry/deepteam:latest"
    systems_under_test: ["target_chatbot"]
    systems:
      simulator_system: "gpt4o_attacker"
      evaluator_system: "claude_judge"
    params:
      vulnerabilities:
        - name: "bias"
          types: ["gender", "racial", "political"]
        - name: "toxicity"
        - name: "pii_leakage"
        - name: "prompt_leakage"
      attacks: [
        "prompt_injection",
        "roleplay",
        "crescendo_jailbreaking",
        "linear_jailbreaking",
        "leetspeak"
      ]
      attacks_per_vulnerability_type: 8
      max_concurrent: 6
      target_purpose: "customer service chatbot for financial services"
Build Instructions¶
cd test_containers/deepteam
docker build -t my-registry/deepteam:latest .
Chatbot Simulator¶
Purpose: Multi-turn conversational testing with persona-based simulation and LLM-as-judge evaluation.
Framework: Custom conversation simulation with LLM evaluation
Location: test_containers/chatbot_simulator/
System Requirements¶
System Under Test: llm_api (required) - The chatbot system being tested
Simulator System: llm_api (optional) - LLM for generating personas and conversation scenarios
Evaluator System: llm_api (optional) - LLM for evaluating conversation quality
Input Parameters¶
chatbot_purpose (string, required): Description of the chatbot’s purpose and domain
custom_scenarios (list, optional): List of scenario objects with input and expected_output keys
custom_personas (list, optional): Custom persona names (e.g., ["busy executive", "enthusiastic buyer"])
num_scenarios (integer, optional): Number of conversation scenarios to generate if custom scenarios are not provided
max_turns (integer, optional): Maximum turns per conversation (default: 4)
sycophancy_levels (list, optional): Sycophancy levels to cycle through (default: ["low", "high"])
simulations_per_scenario (integer, optional): Simulation runs per scenario-persona combination (default: 1)
success_threshold (float, optional): Threshold for evaluation success (default: 0.7)
max_concurrent (integer, optional): Maximum concurrent conversation simulations (default: 3)
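If the simulator crosses every scenario with every persona and sycophancy level (an assumption based on the parameter descriptions, not confirmed by the source), the resulting test-case count can be estimated up front, which helps budget API calls:

```python
def expected_test_cases(num_scenarios: int,
                        num_personas: int,
                        sycophancy_levels: list[str],
                        simulations_per_scenario: int = 1) -> int:
    """Estimate total_test_cases, assuming a full cross-product of
    scenarios x personas x sycophancy levels x simulation runs."""
    return (num_scenarios * num_personas
            * len(sycophancy_levels) * simulations_per_scenario)

# 12 scenarios x 4 personas x 3 sycophancy levels x 1 run each
print(expected_test_cases(12, 4, ["low", "medium", "high"]))  # 144
```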
Output Metrics¶
success (boolean): Whether test execution completed successfully
total_test_cases (integer): Total number of conversation test cases generated
average_answer_accuracy (float): Average accuracy score across all conversations (0.0 to 1.0)
average_answer_relevance (float): Average relevance score across all conversations (0.0 to 1.0)
answer_accuracy_pass_rate (float): Percentage of conversations passing the accuracy threshold
answer_relevance_pass_rate (float): Percentage of conversations passing the relevance threshold
by_persona (object): Performance metrics broken down by persona type
by_scenario (object): Performance metrics broken down by test scenario
by_sycophancy (object): Performance metrics broken down by sycophancy level
Example Configuration¶
test_suite:
  - name: "customer_service_conversation_test"
    image: "my-registry/chatbot_simulator:latest"
    systems_under_test: ["customer_service_bot"]
    systems:
      simulator_system: "gpt4o_customer_simulator"
      evaluator_system: "claude_conversation_judge"
    params:
      chatbot_purpose: "customer service for e-commerce platform specializing in electronics"
      custom_scenarios:
        - input: "I want to return a laptop I bought 2 months ago because it's defective"
          expected_output: "Helpful explanation of return policy and steps to process return"
        - input: "My order shipped but tracking shows it's been stuck for a week"
          expected_output: "Empathetic response with concrete steps to investigate and resolve"
      custom_personas: [
        "frustrated customer with urgent need",
        "polite customer seeking information",
        "tech-savvy customer with detailed questions",
        "elderly customer needing extra guidance"
      ]
      num_scenarios: 12
      max_turns: 6
      sycophancy_levels: ["low", "medium", "high"]
      success_threshold: 0.8
      max_concurrent: 4
Build Instructions¶
cd test_containers/chatbot_simulator
docker build -t my-registry/chatbot_simulator:latest .
TrustLLM Tester¶
Purpose: Comprehensive trustworthiness evaluation across 6 dimensions using academic-grade benchmarks.
Framework: TrustLLM - Academic trustworthiness evaluation framework
Location: test_containers/trustllm/
System Requirements¶
System Under Test: llm_api (required) - The LLM system being evaluated for trustworthiness
Input Parameters¶
test_type (string, required): Test dimension to evaluate
Available dimensions: ethics, privacy, fairness, truthfulness, robustness, safety
datasets (list, optional): Specific datasets for the chosen test type (without the .json extension)
Ethics datasets: awareness, explicit_moralchoice, implicit_ETHICS, implicit_SocialChemistry101
Privacy datasets: privacy_awareness_confAIde, privacy_awareness_query, privacy_leakage
Fairness datasets: disparagement, preference, stereotype_agreement, stereotype_query_test, stereotype_recognition
Truthfulness datasets: external, hallucination, golden_advfactuality, internal, sycophancy
Robustness datasets: ood_detection, ood_generalization, AdvGLUE, AdvInstruction
Safety datasets: jailbreak, exaggerated_safety, misuse
max_new_tokens (integer, optional): Maximum tokens in LLM responses (default: 1024)
max_rows (integer, optional): Maximum rows per dataset for faster testing (default: 20)
Output Metrics¶
success (boolean): Whether the TrustLLM evaluation completed successfully
test_type (string): The test dimension that was evaluated
datasets_tested (list): List of dataset names that were actually tested
dataset_results (object): Individual results for each dataset with generation and evaluation results
Example Configuration¶
test_suite:
  - name: "ethics_evaluation"
    image: "my-registry/trustllm:latest"
    systems_under_test: ["target_model"]
    params:
      test_type: "ethics"
      datasets: ["awareness", "explicit_moralchoice"]
      max_new_tokens: 512
      max_rows: 50
  - name: "safety_assessment"
    image: "my-registry/trustllm:latest"
    systems_under_test: ["target_model"]
    params:
      test_type: "safety"
      datasets: ["jailbreak", "misuse"]
      max_rows: 30
  - name: "fairness_evaluation"
    image: "my-registry/trustllm:latest"
    systems_under_test: ["target_model"]
    params:
      test_type: "fairness"
      # Uses all fairness datasets by default
      max_rows: 25
Build Instructions¶
cd test_containers/trustllm
docker build -t my-registry/trustllm:latest .
Computer Vision Test Containers¶
While ASQI’s primary focus is LLM testing, it also includes specialized containers for computer vision evaluation:
Computer Vision Tester¶
Purpose: General computer vision model testing and evaluation.
Location: test_containers/computer_vision/
CV Tester¶
Purpose: Specialized computer vision testing framework with advanced detection capabilities.
Location: test_containers/cv_tester/
Multi-Container Testing Strategies¶
Security-Focused Assessment¶
Combine multiple security testing frameworks for comprehensive coverage:
suite_name: "Complete Security Assessment"
test_suite:
  # Fast baseline security scan
  - name: "baseline_security"
    image: "my-registry/garak:latest"
    systems_under_test: ["target_model"]
    params:
      probes: ["promptinject", "encoding.InjectBase64", "dan.DAN_Jailbreak"]
      generations: 10
      parallel_attempts: 8
  # Comprehensive adversarial testing
  - name: "advanced_red_team"
    image: "my-registry/deepteam:latest"
    systems_under_test: ["target_model"]
    systems:
      simulator_system: "gpt4o_attacker"
      evaluator_system: "claude_security_judge"
    params:
      vulnerabilities:
        - name: "bias"
          types: ["gender", "racial"]
        - name: "toxicity"
        - name: "pii_leakage"
      attacks: ["prompt_injection", "linear_jailbreaking", "roleplay"]
      attacks_per_vulnerability_type: 5
  # Trustworthiness evaluation (test_type takes one dimension; add
  # further entries for "safety" and "fairness" to cover the others)
  - name: "trustworthiness_assessment"
    image: "my-registry/trustllm:latest"
    systems_under_test: ["target_model"]
    params:
      test_type: "truthfulness"
Quality and Performance Testing¶
Evaluate conversational quality and system performance:
suite_name: "Chatbot Quality and Performance"
test_suite:
  # Conversation quality assessment
  - name: "conversation_quality"
    image: "my-registry/chatbot_simulator:latest"
    systems_under_test: ["customer_bot"]
    systems:
      simulator_system: "gpt4o_customer"
      evaluator_system: "claude_judge"
    params:
      chatbot_purpose: "customer support for financial services"
      num_scenarios: 20
      max_turns: 8
      sycophancy_levels: ["low", "high"]
      success_threshold: 0.8
  # Performance and reliability
  - name: "performance_baseline"
    image: "my-registry/mock_tester:latest"
    systems_under_test: ["customer_bot"]
    params:
      delay_seconds: 0  # Test response time
Container Selection Guide¶
Choose the Right Container for Your Use Case¶
For Security Assessment:
Garak: Comprehensive vulnerability scanning with 40+ probes
DeepTeam: Advanced red teaming with multi-system orchestration
Combined: Use both for complete security coverage
For Conversational Quality:
Chatbot Simulator: Multi-turn dialogue testing with persona-based evaluation
TrustLLM: Academic-grade trustworthiness assessment
For Development and Validation:
Mock Tester: Quick compatibility and configuration validation
For Research and Benchmarking:
TrustLLM: Standardized academic benchmarks
DeepTeam: Research-grade adversarial evaluation
Performance Considerations¶
Container Resource Requirements:
Mock Tester: Minimal resources, fast execution
Garak: Medium resources, depends on probe selection and generations
Chatbot Simulator: Medium-high resources, depends on conversation complexity
DeepTeam: High resources, requires multiple LLM API calls
TrustLLM: High resources, comprehensive benchmark evaluation
Optimization Tips:
Start with smaller generations and num_scenarios for development
Use parallel_attempts and max_concurrent to balance speed vs. resource usage
Test with Mock Tester first to validate configuration before running expensive tests
Use the --concurrent-tests CLI option to run multiple containers in parallel
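One way to apply the first two tips is to keep a single production params dict and derive a scaled-down copy for development runs. A sketch only; the knob names are the container parameters documented above, and the helper itself is hypothetical:

```python
def scaled_params(base: dict, factor: float) -> dict:
    """Return a copy of a params dict with the common volume knobs
    scaled down by `factor` (floored at 1). Knob names follow the
    test containers documented above."""
    knobs = ("generations", "num_scenarios",
             "attacks_per_vulnerability_type", "max_rows")
    out = dict(base)
    for k in knobs:
        if k in out:
            out[k] = max(1, int(out[k] * factor))
    return out

# Production garak params, scaled to 10% volume for a dev smoke test
dev = scaled_params({"generations": 20, "parallel_attempts": 6}, 0.1)
print(dev)  # {'generations': 2, 'parallel_attempts': 6}
```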
Environment and API Key Management¶
Required Environment Variables by Container¶
Garak:
# Requires API key for target system
export OPENAI_API_KEY="sk-your-key" # For OpenAI systems
export ANTHROPIC_API_KEY="sk-ant-your-key" # For Anthropic systems
DeepTeam:
# Requires API keys for all three systems (target, simulator, evaluator)
export OPENAI_API_KEY="sk-your-openai-key"
export ANTHROPIC_API_KEY="sk-ant-your-anthropic-key"
Chatbot Simulator:
# Requires API keys for target, simulator, and evaluator systems
export OPENAI_API_KEY="sk-your-openai-key" # For GPT-based simulation
export ANTHROPIC_API_KEY="sk-ant-your-key" # For Claude-based evaluation
LiteLLM Proxy Integration¶
All containers work seamlessly with LiteLLM proxy for unified provider access:
LiteLLM Configuration (config.yaml):
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude-3-5-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY
System Configuration:
systems:
  proxy_target:
    type: "llm_api"
    params:
      base_url: "http://localhost:4000/v1"
      model: "gpt-4o"
      api_key: "sk-1234"  # LiteLLM proxy key
  proxy_evaluator:
    type: "llm_api"
    params:
      base_url: "http://localhost:4000/v1"
      model: "claude-3-5-sonnet"
      api_key: "sk-1234"
This approach centralizes API key management and provides unified access to 100+ LLM providers through a single proxy endpoint.
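To make the centralization point concrete: both systems above resolve to the same connection settings and differ only in model name. The split_params helper below is hypothetical, illustrating how llm_api params could map onto an OpenAI-compatible client (e.g. openai.OpenAI(**conn) followed by chat.completions.create(model=model, ...)):

```python
# The two proxy-backed systems from the configuration above.
systems = {
    "proxy_target": {"base_url": "http://localhost:4000/v1",
                     "model": "gpt-4o", "api_key": "sk-1234"},
    "proxy_evaluator": {"base_url": "http://localhost:4000/v1",
                        "model": "claude-3-5-sonnet", "api_key": "sk-1234"},
}

def split_params(params: dict) -> tuple[dict, str]:
    """Separate connection kwargs (shared via the proxy) from the
    per-request model name. Hypothetical helper for illustration."""
    conn = {"base_url": params["base_url"], "api_key": params["api_key"]}
    return conn, params["model"]

target_conn, target_model = split_params(systems["proxy_target"])
judge_conn, judge_model = split_params(systems["proxy_evaluator"])
print(target_conn == judge_conn, target_model, judge_model)
# True gpt-4o claude-3-5-sonnet
```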