Consistency Check
This page covers the pipeline architecture, language-aware context filter, benchmark results, and PECV-bench evaluation workflow. For instructor-facing usage documentation see Consistency Check.
Pipeline Architecture
HyperionConsistencyCheckService runs a three-phase pipeline:
Phase 1 — Language-Aware Context Filtering
A ProgrammingExercise enters the ConsistencyCheckService. The Context Renderer Service invokes HyperionProgrammingLanguageContextFilterService (Context Filter) to strip build artifacts, IDE configs, binaries, and files over 100 KB. Language-specific exclusions are applied on top. The output is the filtered, line-numbered artifact text.
Phase 2 — Structural and Semantic Analysis (concurrent)
Two prompts run in parallel via Reactor with Spring AI's ChatClient (fork/join bars in the diagram):
- Structural checker (consistency_structural.st): return-type, parameter, constructor, attribute-type, and visibility mismatches
- Semantic checker (consistency_semantic.st): identifier naming inconsistencies
Both outputs merge into a combined Consistency Issues list.
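The fork/join shape of this phase can be illustrated with a short Python sketch. The production code is Java (Reactor with Spring AI's ChatClient); here a thread pool stands in for it, and `run_structural_check`, `run_semantic_check`, and the issue dictionaries are hypothetical placeholders for the two prompt calls and their output.

```python
from concurrent.futures import ThreadPoolExecutor


def run_structural_check(context: str) -> list[dict]:
    # Hypothetical stand-in for the consistency_structural.st prompt call.
    return [{"category": "structural", "description": "return type mismatch"}]


def run_semantic_check(context: str) -> list[dict]:
    # Hypothetical stand-in for the consistency_semantic.st prompt call.
    return [{"category": "semantic", "description": "identifier naming inconsistency"}]


def run_phase_two(context: str) -> list[dict]:
    """Fork both checks, join on their results, and merge them into one list."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        structural = pool.submit(run_structural_check, context)
        semantic = pool.submit(run_semantic_check, context)
        # result() blocks until each check has finished (the join).
        return structural.result() + semantic.result()


combined_issues = run_phase_two("1: public class Lecture { }")
```

Because the two checks do not depend on each other, the phase takes roughly as long as the slower of the two calls rather than their sum.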
Phase 3 — Verification Pass (serial)
The consistency_verification.st prompt receives the combined list. It removes false positives, deduplicates, corrects line ranges and descriptions, and self-verifies, then returns the Verified Consistency Issues. The pass fails gracefully: on error, the pipeline falls back to the pre-verification list.
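The fallback behaviour can be sketched as a thin wrapper around the verification call. This is a Python illustration only, not the Java implementation; `verify_with_fallback`, both verifier functions, and the issue fields are hypothetical.

```python
def verify_with_fallback(combined_issues, verify):
    """Return the verified issues, or the unverified list if verification fails."""
    try:
        return verify(combined_issues)
    except Exception:
        # Graceful degradation: keep the pre-verification list on any error.
        return list(combined_issues)


def deduplicating_verifier(issues):
    # Hypothetical verifier that only removes exact duplicates.
    seen, verified = set(), []
    for issue in issues:
        key = (issue["file"], issue["description"])
        if key not in seen:
            seen.add(key)
            verified.append(issue)
    return verified


def failing_verifier(issues):
    # Simulates a failed LLM call during the verification pass.
    raise TimeoutError("LLM call timed out")


issues = [
    {"file": "Lecture.java", "description": "return type mismatch"},
    {"file": "Lecture.java", "description": "return type mismatch"},
]
verified = verify_with_fallback(issues, deduplicating_verifier)
fallback = verify_with_fallback(issues, failing_verifier)
```

With this shape, a verification failure costs precision (false positives are not removed) but never drops issues found in Phase 2.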
Language-Aware Context Filter
HyperionProgrammingLanguageContextFilterService uses a Strategy pattern — one ExclusionStrategy registered per ProgrammingLanguage.
Global Exclusions
Applied regardless of language: VCS/IDE dirs (.git/, .idea/), build outputs (target/, build/, dist/), dependency dirs (node_modules/, __pycache__/), binary extensions (.class, .jar, .o, .dll, .exe, .png, .zip, …), exercise metadata files.
Per-Language Exclusions
| Language | Additional exclusions |
|---|---|
| Java | Gradle/Maven wrappers (gradlew*, mvnw*), Eclipse files (.settings/, .classpath, .project) |
| Python | Virtual environments (venv/, .venv/, env/), bytecode (*.pyc), packaging metadata (*.egg-info/) |
| C | CMake build dirs (cmake-build-*/), CMakeCache.txt |
| Assembler | Global only |
| Swift | .swiftpm/, Package.resolved |
| SQL | Global only |
After pattern matching, a binary safety net allows only known-safe extensions and exact filenames (Dockerfile, Makefile, Jenkinsfile, …). Files over 100 KB are dropped.
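The order of checks described above (global exclusions, per-language exclusions, allow-list safety net, size limit) can be sketched in Python as follows. The production filter is a Java service, so this is only an illustration: the function name, the reduced pattern sets, the contents of the extension allow-list, and reading "100 KB" as 100 × 1024 bytes are all assumptions.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

MAX_FILE_SIZE_BYTES = 100 * 1024  # assumption: "100 KB" read as 100 KiB

GLOBAL_EXCLUDED_DIRS = {".git", ".idea", "target", "build", "dist",
                        "node_modules", "__pycache__"}

# Per-language directory and file-name patterns (subset of the table above).
LANGUAGE_DIR_PATTERNS = {
    "JAVA": [".settings"],
    "PYTHON": ["venv", ".venv", "env", "*.egg-info"],
    "C": ["cmake-build-*"],
    "SWIFT": [".swiftpm"],
}
LANGUAGE_FILE_PATTERNS = {
    "JAVA": ["gradlew*", "mvnw*", ".classpath", ".project"],
    "PYTHON": ["*.pyc"],
    "C": ["CMakeCache.txt"],
    "SWIFT": ["Package.resolved"],
}

# Illustrative allow-list; the real set of known-safe extensions is not given here.
SAFE_EXTENSIONS = {".java", ".py", ".c", ".h", ".swift", ".sql", ".md", ".txt"}
SAFE_FILENAMES = {"Dockerfile", "Makefile", "Jenkinsfile"}


def keep_file(path: str, size_bytes: int, language: str) -> bool:
    """Return True if the file should be part of the rendered context."""
    p = PurePosixPath(path)
    dirs, name = p.parts[:-1], p.name
    # 1. Global exclusions apply to every language.
    if any(d in GLOBAL_EXCLUDED_DIRS for d in dirs):
        return False
    # 2. Per-language exclusions; languages without an entry are "global only".
    dir_patterns = LANGUAGE_DIR_PATTERNS.get(language, [])
    if any(fnmatchcase(d, pat) for d in dirs for pat in dir_patterns):
        return False
    if any(fnmatchcase(name, pat) for pat in LANGUAGE_FILE_PATTERNS.get(language, [])):
        return False
    # 3. Safety net: only known-safe extensions or exact filenames pass.
    if name not in SAFE_FILENAMES and p.suffix not in SAFE_EXTENSIONS:
        return False
    # 4. Size limit.
    return size_bytes <= MAX_FILE_SIZE_BYTES
```

For example, `keep_file("src/Lecture.java", 2048, "JAVA")` passes all four checks, while `keep_file("gradlew", 500, "JAVA")` is rejected by the Java-specific patterns and `keep_file("assets/logo.png", 10, "JAVA")` by the allow-list.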
Adding a New Language
// HyperionProgrammingLanguageContextFilterService constructor
register(new ExclusionStrategy(ProgrammingLanguage.KOTLIN,
        List.of("glob:**/.gradle/**", "glob:**/build/**")));
Benchmark Results
Results are generated by PECV-bench. All Artemis approaches use azure-openai-gpt-5-mini with reasoning_effort=medium.
V1 (91 variants, Java only, 3 runs per variant)
| Approach | N runs | TP | FP | FN | Precision | Recall | F1 | Span F1 | Avg Time (s) | Avg Cost (€) |
|---|---|---|---|---|---|---|---|---|---|---|
| artemis-HEAD-66c9e6b98d: Baseline (no filter) | 3 | 267 | 214 | 12 | 0.555 | 0.957 | 0.703 | 0.431 | 19.4 | 0.009 |
| artemis-develop-e2ee1d1f1c: Language Filter | 3 | 260 | 169 | 19 | 0.606 | 0.932 | 0.734 | 0.427 | 19.4 | 0.009 |
| artemis-feature-hyperion-unified-consistency-check-32471b9a64: Unified Checker | 3 | 265 | 34 | 14 | 0.886 | 0.950 | 0.917 | 0.445 | 65.1 | 0.006 |
| artemis-feature-hyperion-consistency_check_independent_verification_loop-12d8f2a6bd: Verification Checker (current) | 3 | 259 | 31 | 20 | 0.893 | 0.928 | 0.910 | 0.564 | 35.9 | 0.014 |
| pecv-reference (openai:gpt-5-mini) | 3 | 262 | 197 | 17 | 0.571 | 0.939 | 0.710 | 0.433 | 31.6 | — |
V2 (325 variants, 6 languages, 1 run per variant)
| Approach | N runs | TP | FP | FN | Precision | Recall | F1 | Span F1 | Avg Time (s) | Avg Cost (€) |
|---|---|---|---|---|---|---|---|---|---|---|
| artemis-HEAD-66c9e6b98d: Baseline (no filter) | 1 | 260 | 230 | 71 | 0.531 | 0.785 | 0.633 | 0.625 | 27.3 | 0.012 |
| artemis-develop-e2ee1d1f1c: Language Filter | 1 | 256 | 204 | 75 | 0.557 | 0.773 | 0.647 | 0.616 | 21.4 | 0.012 |
| artemis-feature-hyperion-unified-consistency-check-32471b9a64: Unified Checker | 1 | 250 | 76 | 81 | 0.767 | 0.755 | 0.761 | 0.604 | 67.4 | 0.007 |
| artemis-feature-hyperion-consistency_check_independent_verification_loop-12d8f2a6bd: Verification Checker (current) | 1 | 254 | 56 | 77 | 0.819 | 0.767 | 0.793 | 0.501 | 44.3 | 0.018 |
| pecv-reference (openai:gpt-5-mini) | 1 | 267 | 207 | 64 | 0.563 | 0.807 | 0.663 | 0.630 | 36.7 | — |
The verification checker is the current production approach. On V2 it produces 76% fewer false positives than the baseline (56 vs. 230) while keeping recall (0.767) above the 0.75 threshold.
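The Precision, Recall, and F1 columns are consistent with being computed from the pooled TP/FP/FN counts. The sketch below reproduces the V2 figures for the baseline and the verification checker under that assumption; the function name is illustrative and not part of PECV-bench.

```python
def metrics(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from pooled counts, rounded to three decimals."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 3), round(recall, 3), round(f1, 3)


# V2 counts taken from the table above.
baseline_v2 = metrics(tp=260, fp=230, fn=71)
verification_v2 = metrics(tp=254, fp=56, fn=77)

# Relative reduction in false positives versus the baseline, in percent.
fp_reduction_percent = round((230 - 56) / 230 * 100)
```

Here `baseline_v2` evaluates to `(0.531, 0.785, 0.633)`, `verification_v2` to `(0.819, 0.767, 0.793)`, and `fp_reduction_percent` to `76`, matching the table and the summary figure.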
Benchmarking Workflow
Overview
The workflow runs end-to-end via run_pecv_bench.py and connects Artemis to two external repositories:
- PECV-bench (public): evaluation framework, public exercise dataset, ground-truth annotations, and accumulated approach results.
- PECV-bench-dataset (private): confidential solution repos and test suites. Its directory structure mirrors data/ in PECV-bench; both are merged before variant materialization.
The diagram shows the three swim lanes: Developer triggers the run; Artemis handles course management, exercise import, and consistency check execution; PECV-bench materializes variants, prepares exercise ZIPs, saves results, and generates the report. The create_course flag in config.ini controls whether the workflow creates a new benchmark course or reuses an existing one (diamond in the diagram).
Confidential Exercise Storage
Store confidential artifacts in PECV-bench-dataset at the same relative path they would occupy under data/ in PECV-bench. Paths must match exactly.
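Why the paths must match becomes clear when the merge is viewed as a directory overlay. The sketch below assumes the merge copies both trees into one target directory; the actual mechanism in the workflow may differ, and the `template/` and `solution/` subfolders and the file names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def merge_datasets(public_data: Path, private_data: Path, merged: Path) -> None:
    """Overlay the private tree onto a copy of the public data/ tree."""
    shutil.copytree(public_data, merged, dirs_exist_ok=True)
    shutil.copytree(private_data, merged, dirs_exist_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Same relative exercise path in both repositories (hypothetical layout).
    exercise = Path("ITP2425") / "H01E01-Lectures"
    (root / "public" / exercise / "template").mkdir(parents=True)
    (root / "public" / exercise / "template" / "Lecture.java").write_text("class Lecture {}")
    (root / "private" / exercise / "solution").mkdir(parents=True)
    (root / "private" / exercise / "solution" / "Lecture.java").write_text("class Lecture {}")

    merge_datasets(root / "public", root / "private", root / "merged")
    merged_files = sorted(
        p.relative_to(root / "merged").as_posix()
        for p in (root / "merged").rglob("*")
        if p.is_file()
    )
```

Both files end up under the same exercise folder in the merged tree. A confidential artifact stored under a different relative path would land in a separate folder and never be combined with its public counterpart.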
Dataset Versions
| Version | Exercises | Variants | Languages |
|---|---|---|---|
| V1 | 3 (ITP2425) | 91 | Java |
| V2 | 16 (ERA2021, ISE22, ITP2425, MTG26, IOS26, QCSL25) | 325 | Assembler, C, Java, Python, SQL, Swift |
Prerequisites
- Running Artemis instance with Hyperion and LLM configured.
- Git access to PECV-bench and PECV-bench-dataset.
- Python 3.13 with pip.
- Configured config.ini (see below).
Configuration
supporting_scripts/hyperion/consistency-check-benchmark/config.ini:
[Settings]
admin_user = artemis_admin
admin_password = artemis_admin
max_threads = 5
server_url = http://localhost:8080/api
[PECVCourseSettings]
pecv_bench_dir = pecv-bench
pecv_bench_dataset_dir = pecv-bench-dataset
[PECVExerciseSettings]
dataset_version = V2
[PECVConsistencyCheckSettings]
model_name = azure-openai-gpt-5-mini
model_effort = medium
consistency_check_exercises = {"V2": {"ITP2425": ["H01E01-Lectures", ...]}}
model_name and model_effort must match the model configured in Artemis (spring.ai.azure.openai.chat.options.deployment-name).
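As a quick sanity check before a run, the file can be read with Python's standard configparser. The sketch below assumes consistency_check_exercises is parsed as JSON, which is not confirmed here; the sample lists only the one exercise named above, because the elided entries are unknown and a literal `...` is not valid JSON.

```python
import configparser
import json

# Trimmed sample of config.ini, inlined so the snippet is self-contained.
SAMPLE_CONFIG = """
[Settings]
max_threads = 5
server_url = http://localhost:8080/api

[PECVExerciseSettings]
dataset_version = V2

[PECVConsistencyCheckSettings]
model_name = azure-openai-gpt-5-mini
model_effort = medium
consistency_check_exercises = {"V2": {"ITP2425": ["H01E01-Lectures"]}}
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE_CONFIG)

max_threads = config.getint("Settings", "max_threads")
dataset_version = config.get("PECVExerciseSettings", "dataset_version")

# Assumption: the value is JSON keyed by dataset version, then by course.
exercises = json.loads(
    config.get("PECVConsistencyCheckSettings", "consistency_check_exercises")
)
selected = exercises[dataset_version]
```

If dataset_version has no matching key in consistency_check_exercises, the final lookup raises a KeyError, which surfaces a mismatched configuration before any exercise is imported.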
Running the Workflow
- Navigate to the benchmark folder:
  cd supporting_scripts/hyperion/consistency-check-benchmark
- Create a virtual environment (Python 3.13):
  python3.13 -m venv venv
- Activate:
  - macOS/Linux: source venv/bin/activate
  - Windows: venv\Scripts\activate
- Install dependencies:
  - macOS/Linux: pip install -r requirements.txt
  - Windows:
    python -m pip install --upgrade pip
    python -m pip install -r requirements.txt
    Make patch.exe available (Git for Windows):
    setx PATH "$env:PATH;C:\Program Files\Git\usr\bin"
    Restart PowerShell, then continue.
- Authenticate with GitHub CLI:
  gh auth login
- Run:
  - macOS/Linux: python3 run_pecv_bench.py
  - Windows: python run_pecv_bench.py
The script executes the following steps in sequence:
| Steps | What happens |
|---|---|
| 1–3 | Clone/pull PECV-bench and PECV-bench-dataset; install PECV-bench Python package |
| 4 | Materialize exercise variants (copy base + apply patch) |
| 5–8 | Create session; create or reuse benchmark course on Artemis |
| 9–10 | Build variant ZIPs; import into Artemis via programming-exercise import API |
| 11–12 | Fetch exercise IDs; trigger consistency checks concurrently |
| 13 | Generate aggregated report; create Hyperion source snapshot ZIP |
| 14 | Open PR against artemis-approaches in PECV-bench with results |
Approach Identity
artemis-<branch-slashes-to-dashes-hyphens-to-underscores>-<short-commit>
# e.g. branch feature/hyperion/verify-checker → artemis-feature-hyperion-verify_checker-a1b2c3d
The script replaces / with - and - with _ in the branch name. Use the approach_id printed in the logs; do not reconstruct it manually.
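For reading approach IDs, the naming rule can be sketched as below. This illustrates the rule as stated, not the script's actual code; `build_approach_id` is a hypothetical name, and the short commit hash is passed in rather than derived. The order of the two replacements matters: hyphens must become underscores first, otherwise the hyphens produced from slashes would be converted as well.

```python
def build_approach_id(branch: str, short_commit: str) -> str:
    """Derive an approach ID from a branch name and a short commit hash."""
    # Original hyphens become underscores first, then slashes become hyphens.
    slug = branch.replace("-", "_").replace("/", "-")
    return f"artemis-{slug}-{short_commit}"


approach_id = build_approach_id("feature/hyperion/verify-checker", "a1b2c3d")
```

For the example branch above, `approach_id` is `artemis-feature-hyperion-verify_checker-a1b2c3d`.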
Running Steps Independently
Logs indicate exactly which step failed and which script to rerun.
| Module | Entry point | When to rerun |
|---|---|---|
| exercises.py | python exercises.py | Variant materialization or ZIP creation failed |
| course.py | python course.py | Course creation or exercise ID retrieval failed |
| consistency_check.py | python consistency_check.py | One or more checks failed; most common rerun |
| report.py | python report.py | Report generation failed after checks completed |
| code_snapshot.py | python code_snapshot.py | Snapshot ZIP not created |
| merge_request.py | python merge_request.py | PR creation failed |
When running consistency_check.py standalone, note the printed approach_id and set it manually in report.py, code_snapshot.py, and merge_request.py.
Known Limitations
SQL and Assembler precision is low. Prompts assume OO constructs; SQL and Assembler exercises don't map cleanly onto this ontology.
LLM non-determinism. Results vary between runs. Average at least 3 runs per approach for reliable comparisons.
Course deletion may not complete on first attempt due to network latency. Re-run — the step is idempotent.
Exercise import may fail if local VCS already contains exercises with the same name. Remove all folders inside repos/ and local-vcs-repos/ in the Artemis project root, then re-run.

