Consistency Check

This page covers the pipeline architecture, language-aware context filter, benchmark results, and PECV-bench evaluation workflow. For instructor-facing usage documentation see Consistency Check.

Pipeline Architecture

Consistency Check Pipeline — Activity Diagram

Three-phase pipeline in HyperionConsistencyCheckService:

Phase 1: Language-Aware Context Filtering. A ProgrammingExercise enters the ConsistencyCheckService. The Context Renderer Service invokes HyperionProgrammingLanguageContextFilterService (the Context Filter) to strip build artifacts, IDE configs, binaries, and files over 100 KB, then applies language-specific exclusions on top. The output is the filtered, line-numbered artifact text.

Phase 2: Structural and Semantic Analysis (concurrent). Two prompts run in parallel via Reactor with Spring AI's ChatClient (the fork/join bars in the diagram):

- Structural checker (consistency_structural.st): return-type, parameter, constructor, attribute-type, and visibility mismatches
- Semantic checker (consistency_semantic.st): identifier naming inconsistencies

Both outputs merge into a combined Consistency Issues list.

Phase 3: Verification Pass (serial). consistency_verification.st receives the combined list, removes false positives, deduplicates, corrects line ranges and descriptions, and self-verifies its output. It returns the Verified Consistency Issues. The pass fails gracefully: on error, the pipeline falls back to the pre-verification list.
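The control flow of the three phases can be sketched as follows. This is only an illustration in Python, not the Java/Reactor code in HyperionConsistencyCheckService; the function name and all four parameters are hypothetical stand-ins for the filtered context and the three prompts.

```python
from concurrent.futures import ThreadPoolExecutor


def run_consistency_check(context, structural_check, semantic_check, verify):
    """Run both checkers concurrently, then verify the merged issue list.

    All parameters are hypothetical: `context` stands for the filtered,
    line-numbered artifact text from Phase 1, and the three callables wrap
    the structural, semantic, and verification prompts.
    """
    # Phase 2: fork both checkers, then join on their results.
    with ThreadPoolExecutor(max_workers=2) as pool:
        structural = pool.submit(structural_check, context)
        semantic = pool.submit(semantic_check, context)
        combined = structural.result() + semantic.result()

    # Phase 3: a failed verification pass must not lose the findings,
    # so fall back to the pre-verification list.
    try:
        return verify(context, combined)
    except Exception:
        return combined
```

The fallback in the last step mirrors the graceful-failure behaviour described above: callers always receive an issue list, verified or not.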

Language-Aware Context Filter

HyperionProgrammingLanguageContextFilterService uses a Strategy pattern — one ExclusionStrategy registered per ProgrammingLanguage.

Global Exclusions

Applied regardless of language: VCS/IDE dirs (.git/, .idea/), build outputs (target/, build/, dist/), dependency dirs (node_modules/, __pycache__/), binary extensions (.class, .jar, .o, .dll, .exe, .png, .zip, …), exercise metadata files.

Per-Language Exclusions

Language	Additional exclusions
Java	Gradle/Maven wrappers (`gradlew`, `mvnw`), Eclipse files (`.settings/`, `.classpath`, `.project`)
Python	Virtual environments (`venv/`, `.venv/`, `env/`), bytecode (`.pyc`, `.egg-info/`)
C	CMake build dirs (`cmake-build-*/`), `CMakeCache.txt`
Assembler	Global only
Swift	`.swiftpm/`, `Package.resolved`
SQL	Global only

After pattern matching, a binary safety net allows only known-safe extensions and exact filenames (Dockerfile, Makefile, Jenkinsfile, …). Files over 100 KB are dropped.
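The filtering order described above (exclusion patterns, then the allow-list safety net, then the size limit) might look roughly like this. It is a simplified Python sketch rather than the service's Java implementation: the pattern lists are abbreviated samples, the allow-listed extensions are hypothetical, and "100 KB" is assumed to mean 100 * 1024 bytes.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

MAX_BYTES = 100 * 1024  # assumption: "100 KB" means 100 * 1024 bytes

# Abbreviated samples of the exclusions listed above, matched per path component.
GLOBAL_PATTERNS = [".git", ".idea", "target", "build", "dist",
                   "node_modules", "__pycache__", "*.class", "*.jar", "*.png"]
LANGUAGE_PATTERNS = {
    "JAVA": ["gradlew", "mvnw", ".settings", ".classpath", ".project"],
    "PYTHON": ["venv", ".venv", "env", "*.pyc", "*.egg-info"],
    "C": ["cmake-build-*", "CMakeCache.txt"],
    "SWIFT": [".swiftpm", "Package.resolved"],
}
# Hypothetical allow-list; the real set of safe extensions is not given here.
SAFE_EXTENSIONS = {".java", ".py", ".c", ".h", ".swift", ".sql", ".md", ".txt"}
SAFE_FILENAMES = {"Dockerfile", "Makefile", "Jenkinsfile"}


def keep_file(path: str, size_bytes: int, language: str) -> bool:
    """Return True if the file should be rendered into the LLM context."""
    parts = PurePosixPath(path).parts
    # Languages without an entry (e.g. Assembler, SQL) use global patterns only.
    patterns = GLOBAL_PATTERNS + LANGUAGE_PATTERNS.get(language, [])
    if any(fnmatchcase(part, pat) for part in parts for pat in patterns):
        return False
    # Safety net: only known-safe extensions or exact filenames pass.
    name = parts[-1]
    if PurePosixPath(name).suffix not in SAFE_EXTENSIONS and name not in SAFE_FILENAMES:
        return False
    return size_bytes <= MAX_BYTES
```

For example, `keep_file("src/main/java/App.java", 2048, "JAVA")` passes all three stages, while `keep_file("target/classes/App.class", 10, "JAVA")` is rejected by the global patterns.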

Adding a New Language

```
// HyperionProgrammingLanguageContextFilterService constructor
register(new ExclusionStrategy(ProgrammingLanguage.KOTLIN,
        List.of("glob:**/.gradle/**", "glob:**/build/**")));
```

Benchmark Results

Generated by PECV-bench. All Artemis approaches: azure-openai-gpt-5-mini, reasoning_effort=medium.

V1 (91 variants, Java only, 3 runs per variant)

Approach	N runs	TP	FP	FN	Precision	Recall	F1	Span F1	Avg Time (s)	Avg Cost (€)
`artemis-HEAD-66c9e6b98d` — Baseline (no filter)	3	267	214	12	0.555	0.957	0.703	0.431	19.4	0.009
`artemis-develop-e2ee1d1f1c` — Language Filter	3	260	169	19	0.606	0.932	0.734	0.427	19.4	0.009
`artemis-feature-hyperion-unified-consistency-check-32471b9a64` — Unified Checker	3	265	34	14	0.886	0.950	0.917	0.445	65.1	0.006
`artemis-feature-hyperion-consistency_check_independent_verification_loop-12d8f2a6bd` — Verification Checker (current)	3	259	31	20	0.893	0.928	0.910	0.564	35.9	0.014
`pecv-reference` (`openai:gpt-5-mini`)	3	262	197	17	0.571	0.939	0.710	0.433	31.6	—

V2 (325 variants, 6 languages, 1 run per variant)

Approach	N runs	TP	FP	FN	Precision	Recall	F1	Span F1	Avg Time (s)	Avg Cost (€)
`artemis-HEAD-66c9e6b98d` — Baseline (no filter)	1	260	230	71	0.531	0.785	0.633	0.625	27.3	0.012
`artemis-develop-e2ee1d1f1c` — Language Filter	1	256	204	75	0.557	0.773	0.647	0.616	21.4	0.012
`artemis-feature-hyperion-unified-consistency-check-32471b9a64` — Unified Checker	1	250	76	81	0.767	0.755	0.761	0.604	67.4	0.007
`artemis-feature-hyperion-consistency_check_independent_verification_loop-12d8f2a6bd` — Verification Checker (current)	1	254	56	77	0.819	0.767	0.793	0.501	44.3	0.018
`pecv-reference` (`openai:gpt-5-mini`)	1	267	207	64	0.563	0.807	0.663	0.630	36.7	—

The verification checker is the current production approach. On V2 it produces 76% fewer false positives than the baseline (56 vs. 230) while keeping recall (0.767) above the 0.75 threshold.
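The Precision, Recall, and F1 columns are consistent with the standard definitions computed from TP, FP, and FN, and the false-positive reduction follows directly from the FP column. The snippet below recomputes the V2 figures for the baseline and the verification checker; the function name is hypothetical, and Span F1 is not covered because its definition is not given here.

```python
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1


# Counts taken from the V2 table above.
baseline = prf1(tp=260, fp=230, fn=71)
verified = prf1(tp=254, fp=56, fn=77)
fp_reduction = 1 - 56 / 230

print([round(x, 3) for x in baseline])  # [0.531, 0.785, 0.633]
print([round(x, 3) for x in verified])  # [0.819, 0.767, 0.793]
print(round(fp_reduction * 100))        # 76
```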

Benchmarking Workflow

Overview

The workflow runs end-to-end via run_pecv_bench.py and connects Artemis to two external repositories:

- PECV-bench (public): evaluation framework, public exercise dataset, ground-truth annotations, accumulated approach results.
- PECV-bench-dataset (private): confidential solution repos and test suites. Directory structure mirrors data/ in PECV-bench; both are merged before variant materialization.

The diagram shows the three swim lanes: Developer triggers the run; Artemis handles course management, exercise import, and consistency check execution; PECV-bench materializes variants, prepares exercise ZIPs, saves results, and generates the report. The create_course flag in config.ini controls whether the workflow creates a new benchmark course or reuses an existing one (diamond in the diagram).

Confidential Exercise Storage

Store confidential artifacts in PECV-bench-dataset at the same relative path they would occupy under data/ in PECV-bench. Paths must match exactly.
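Because both trees share the same relative layout, merging them amounts to overlaying one directory onto the other. The sketch below shows one way to do that with the standard library; the function name is hypothetical, and the copy order (private files overlaid last) is an assumption rather than documented behaviour of run_pecv_bench.py.

```python
import shutil
from pathlib import Path


def merge_datasets(public_data: Path, private_data: Path, merged: Path) -> None:
    """Overlay the private dataset onto a copy of the public data/ tree.

    Files land at identical relative paths, which is why the paths in
    PECV-bench-dataset must match those under data/ exactly.
    """
    shutil.copytree(public_data, merged, dirs_exist_ok=True)
    # Assumption: on a name clash, the private file replaces the public one.
    shutil.copytree(private_data, merged, dirs_exist_ok=True)
```

A misplaced private file would not fail the merge; it would simply end up in a directory that variant materialization never reads, so mismatched paths tend to surface later as missing solutions or tests.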

Dataset Versions

Version	Exercises	Variants	Languages
V1	3 (ITP2425)	91	Java
V2	16 (ERA2021, ISE22, ITP2425, MTG26, IOS26, QCSL25)	325	Assembler, C, Java, Python, SQL, Swift

Prerequisites

- Running Artemis instance with Hyperion and LLM configured.
- Git access to PECV-bench and PECV-bench-dataset.
- Python 3.13 with pip.
- Configured config.ini (see below).

Configuration

supporting_scripts/hyperion/consistency-check-benchmark/config.ini:

```
[Settings]
admin_user = artemis_admin
admin_password = artemis_admin
max_threads = 5
server_url = http://localhost:8080/api

[PECVCourseSettings]
pecv_bench_dir = pecv-bench
pecv_bench_dataset_dir = pecv-bench-dataset

[PECVExerciseSettings]
dataset_version = V2

[PECVConsistencyCheckSettings]
model_name = azure-openai-gpt-5-mini
model_effort = medium
consistency_check_exercises = {"V2": {"ITP2425": ["H01E01-Lectures", ...]}}
```

model_name and model_effort must match spring.ai.azure.openai.chat.options.deployment-name in Artemis.
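As a rough illustration of how such a file can be consumed, the snippet below parses a shortened copy of the configuration with the standard library. It assumes the JSON-valued option is decoded with `json.loads`; whether the benchmark scripts do exactly this is not confirmed here. The exercise list is cut down to the single entry shown above, since the elided entries are unknown.

```python
import configparser
import json

# Shortened sample of config.ini; the exercise list contains only the one
# entry visible in the documentation.
SAMPLE = """
[Settings]
max_threads = 5
server_url = http://localhost:8080/api

[PECVExerciseSettings]
dataset_version = V2

[PECVConsistencyCheckSettings]
model_name = azure-openai-gpt-5-mini
model_effort = medium
consistency_check_exercises = {"V2": {"ITP2425": ["H01E01-Lectures"]}}
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE)

version = config["PECVExerciseSettings"]["dataset_version"]
max_threads = config.getint("Settings", "max_threads")

# The option value is a JSON object keyed by dataset version, then by course.
selection = json.loads(config["PECVConsistencyCheckSettings"]["consistency_check_exercises"])
exercises = selection[version]

print(max_threads, exercises)
```

Note that the value must be valid JSON on a single line, so the `...` placeholder from the example above has to be replaced with real exercise names before the file is used.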

Running the Workflow

Navigate to the benchmark folder:

```
cd supporting_scripts/hyperion/consistency-check-benchmark
```

Create a virtual environment (Python 3.13):

```
python3.13 -m venv venv
```

Activate:
- macOS/Linux: source venv/bin/activate
- Windows: venv\Scripts\activate

Install dependencies:
- macOS/Linux: pip install -r requirements.txt
- Windows:
```
python -m pip install --upgrade pip
python -m pip install -r requirements.txt
```
  Make patch.exe available (Git for Windows):
```
setx PATH "$env:PATH;C:\Program Files\Git\usr\bin"
```
  Restart PowerShell, then continue.

Authenticate with GitHub CLI:

```
gh auth login
```

Run:
- macOS/Linux: python3 run_pecv_bench.py
- Windows: python run_pecv_bench.py

The script executes the following steps in sequence:

Steps	What happens
1–3	Clone/pull PECV-bench and PECV-bench-dataset; install PECV-bench Python package
4	Materialize exercise variants (copy base + apply patch)
5–8	Create session; create or reuse benchmark course on Artemis
9–10	Build variant ZIPs; import into Artemis via programming-exercise import API
11–12	Fetch exercise IDs; trigger consistency checks concurrently
13	Generate aggregated report; create Hyperion source snapshot ZIP
14	Open PR against `artemis-approaches` in PECV-bench with results
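Steps 11 and 12 fan the checks out across exercises, bounded by max_threads from config.ini. A minimal sketch of that pattern, assuming a thread pool and a caller-supplied `check` function (both hypothetical; the actual request made to Artemis is not shown here):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_checks(exercise_ids, check, max_threads=5):
    """Call `check(exercise_id)` for every exercise with bounded concurrency.

    Returns (results, failures) so that failed exercises can be retried
    without repeating the successful ones.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        futures = {pool.submit(check, ex_id): ex_id for ex_id in exercise_ids}
        for future in as_completed(futures):
            ex_id = futures[future]
            try:
                results[ex_id] = future.result()
            except Exception as error:
                # Record the failure and keep going; one bad exercise
                # should not abort the whole benchmark run.
                failures[ex_id] = error
    return results, failures
```

Collecting failures separately is a design choice of this sketch: it keeps a single timeout or server error from discarding results that already completed.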

Approach Identity

```
artemis-<branch-slashes-to-dashes-hyphens-to-underscores>-<short-commit>
# e.g. branch feature/hyperion/verify-checker → artemis-feature-hyperion-verify_checker-a1b2c3d
```

Script replaces / → - and - → _ in the branch name. Use the printed approach_id from logs — do not reconstruct manually.
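The naming rule can be reproduced in a few lines, which is useful for understanding an ID seen in the logs. The function below is a hypothetical helper, not the script's own code. The order of the two replacements matters: hyphens have to be rewritten before slashes, otherwise the hyphens produced from slashes would become underscores too.

```python
def approach_id(branch: str, short_commit: str) -> str:
    """Build an approach ID from a branch name and a short commit hash."""
    # Rewrite hyphens first, then slashes; the reverse order would turn
    # every separator into an underscore.
    name = branch.replace("-", "_").replace("/", "-")
    return f"artemis-{name}-{short_commit}"


print(approach_id("feature/hyperion/verify-checker", "a1b2c3d"))
# artemis-feature-hyperion-verify_checker-a1b2c3d
```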

Running Steps Independently

Logs indicate exactly which step failed and which script to rerun.

Module	Entry point	When to rerun
`exercises.py`	`python exercises.py`	Variant materialization or ZIP creation failed
`course.py`	`python course.py`	Course creation or exercise ID retrieval failed
`consistency_check.py`	`python consistency_check.py`	One or more checks failed; most common rerun
`report.py`	`python report.py`	Report generation failed after checks completed
`code_snapshot.py`	`python code_snapshot.py`	Snapshot ZIP not created
`merge_request.py`	`python merge_request.py`	PR creation failed

When running consistency_check.py standalone, note the printed approach_id and set it manually in report.py, code_snapshot.py, and merge_request.py.

Known Limitations

SQL and Assembler precision is low. The prompts assume object-oriented constructs such as constructors, attributes, and visibility; SQL and Assembler exercises don't map cleanly onto this ontology.

LLM non-determinism. Results vary between runs. Average at least 3 runs per approach for reliable comparisons.

Course deletion may not complete on first attempt due to network latency. Re-run — the step is idempotent.

Exercise import may fail if local VCS already contains exercises with the same name. Remove all folders inside repos/ and local-vcs-repos/ in the Artemis project root, then re-run.
