
Consistency Check

This page covers the pipeline architecture, language-aware context filter, benchmark results, and PECV-bench evaluation workflow. For instructor-facing usage documentation see Consistency Check.


Pipeline Architecture

Consistency Check Pipeline — Activity Diagram

HyperionConsistencyCheckService runs a three-phase pipeline:

Phase 1: Language-Aware Context Filtering
A ProgrammingExercise enters the ConsistencyCheckService. The Context Renderer Service invokes HyperionProgrammingLanguageContextFilterService (the Context Filter) to strip build artifacts, IDE configs, binaries, and files over 100 KB, then applies language-specific exclusions on top. The output is the filtered, line-numbered artifact text.

Phase 2: Structural and Semantic Analysis (concurrent)
Two prompts run in parallel via Reactor with Spring AI's ChatClient (the fork/join bars in the diagram):

  • Structural checker (consistency_structural.st): return-type, parameter, constructor, attribute-type, and visibility mismatches
  • Semantic checker (consistency_semantic.st): identifier naming inconsistencies

Both outputs merge into a combined Consistency Issues list.

Phase 3: Verification Pass (serial)
The consistency_verification.st prompt receives the combined list. It removes false positives, deduplicates issues, corrects line ranges and descriptions, and self-verifies its output, then returns the Verified Consistency Issues. The pass fails gracefully: on error, the pipeline falls back to the pre-verification list.
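
The fork/join and fallback behavior of Phases 2 and 3 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the Java/Reactor implementation: run_pipeline and the three callables are hypothetical stand-ins for the prompt calls.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(context, structural_check, semantic_check, verify):
    """Run both checkers concurrently, then verify the merged issue list.

    All four parameters are hypothetical stand-ins: `context` is the
    filtered artifact text, the callables represent the three prompts.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Fork: both checkers start on the same filtered context.
        structural = pool.submit(structural_check, context)
        semantic = pool.submit(semantic_check, context)
        # Join: merge both outputs into one combined issue list.
        combined = structural.result() + semantic.result()
    try:
        # Serial verification pass over the combined list.
        return verify(context, combined)
    except Exception:
        # Graceful fallback: return the pre-verification list.
        return combined
```

If the verification callable raises, the caller still receives the unverified issues instead of an error.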


Language-Aware Context Filter

HyperionProgrammingLanguageContextFilterService uses a Strategy pattern — one ExclusionStrategy registered per ProgrammingLanguage.

Global Exclusions

Applied regardless of language: VCS/IDE dirs (.git/, .idea/), build outputs (target/, build/, dist/), dependency dirs (node_modules/, __pycache__/), binary extensions (.class, .jar, .o, .dll, .exe, .png, .zip, …), exercise metadata files.

Per-Language Exclusions

Language | Additional exclusions
Java | Gradle/Maven wrappers (gradlew*, mvnw*), Eclipse files (.settings/, .classpath, .project)
Python | Virtual environments (venv/, .venv/, env/), bytecode (*.pyc, *.egg-info/)
C | CMake build dirs (cmake-build-*/), CMakeCache.txt
Assembler | Global only
Swift | .swiftpm/, Package.resolved
SQL | Global only

After pattern matching, a binary safety net allows only known-safe extensions and exact filenames (Dockerfile, Makefile, Jenkinsfile, …). Files over 100 KB are dropped.
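
The layered decision (size limit, global exclusions, per-language exclusions, then the allowlist safety net) might look roughly like the sketch below. It is written in Python for illustration only; the real service is Java. keep_file, the pattern subsets, and the SAFE_EXTENSIONS allowlist are assumptions, and 100 KB is taken to mean 100 * 1024 bytes.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

MAX_BYTES = 100 * 1024  # assumption: "100 KB" means 100 * 1024 bytes

# Subset of the global directory exclusions listed above.
GLOBAL_DIRS = {".git", ".idea", "target", "build", "dist", "node_modules", "__pycache__"}

# Illustrative subset of the per-language table (file-name patterns only).
LANGUAGE_PATTERNS = {
    "JAVA": ["gradlew*", "mvnw*", ".classpath", ".project"],
    "PYTHON": ["*.pyc"],
    "C": ["CMakeCache.txt"],
}

# Hypothetical allowlist; the actual safe-extension list is not given here.
SAFE_EXTENSIONS = {".java", ".py", ".c", ".h", ".md", ".txt", ".xml"}
SAFE_FILENAMES = {"Dockerfile", "Makefile", "Jenkinsfile"}


def keep_file(path: str, size: int, language: str) -> bool:
    """Return True if the file should be included in the LLM context."""
    p = PurePosixPath(path)
    if size > MAX_BYTES:
        return False
    # Global exclusions: any excluded directory on the path drops the file.
    if any(part in GLOBAL_DIRS for part in p.parts[:-1]):
        return False
    # Language-specific exclusions, applied on top of the global ones.
    if any(fnmatchcase(p.name, pat) for pat in LANGUAGE_PATTERNS.get(language, [])):
        return False
    # Safety net: only known-safe extensions or exact file names pass.
    return p.suffix in SAFE_EXTENSIONS or p.name in SAFE_FILENAMES
```

Because the allowlist runs last, an unknown binary format is still dropped even when no exclusion pattern matches it.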

Adding a New Language

// In the HyperionProgrammingLanguageContextFilterService constructor:
// register one ExclusionStrategy for the new language with its glob patterns.
register(new ExclusionStrategy(ProgrammingLanguage.KOTLIN,
        List.of("glob:**/.gradle/**", "glob:**/build/**")));

Benchmark Results

Generated by PECV-bench. All Artemis approaches: azure-openai-gpt-5-mini, reasoning_effort=medium.

V1 (91 variants, Java only, 3 runs per variant)

Approach | N runs | TP | FP | FN | Precision | Recall | F1 | Span F1 | Avg Time (s) | Avg Cost (€)
artemis-HEAD-66c9e6b98d: Baseline (no filter) | 3 | 267 | 214 | 12 | 0.555 | 0.957 | 0.703 | 0.431 | 19.4 | 0.009
artemis-develop-e2ee1d1f1c: Language Filter | 3 | 260 | 169 | 19 | 0.606 | 0.932 | 0.734 | 0.427 | 19.4 | 0.009
artemis-feature-hyperion-unified-consistency-check-32471b9a64: Unified Checker | 3 | 265 | 34 | 14 | 0.886 | 0.950 | 0.917 | 0.445 | 65.1 | 0.006
artemis-feature-hyperion-consistency_check_independent_verification_loop-12d8f2a6bd: Verification Checker (current) | 3 | 259 | 31 | 20 | 0.893 | 0.928 | 0.910 | 0.564 | 35.9 | 0.014
pecv-reference (openai:gpt-5-mini) | 3 | 262 | 197 | 17 | 0.571 | 0.939 | 0.710 | 0.433 | 31.6 | -

V2 (325 variants, 6 languages, 1 run per variant)

Approach | N runs | TP | FP | FN | Precision | Recall | F1 | Span F1 | Avg Time (s) | Avg Cost (€)
artemis-HEAD-66c9e6b98d: Baseline (no filter) | 1 | 260 | 230 | 71 | 0.531 | 0.785 | 0.633 | 0.625 | 27.3 | 0.012
artemis-develop-e2ee1d1f1c: Language Filter | 1 | 256 | 204 | 75 | 0.557 | 0.773 | 0.647 | 0.616 | 21.4 | 0.012
artemis-feature-hyperion-unified-consistency-check-32471b9a64: Unified Checker | 1 | 250 | 76 | 81 | 0.767 | 0.755 | 0.761 | 0.604 | 67.4 | 0.007
artemis-feature-hyperion-consistency_check_independent_verification_loop-12d8f2a6bd: Verification Checker (current) | 1 | 254 | 56 | 77 | 0.819 | 0.767 | 0.793 | 0.501 | 44.3 | 0.018
pecv-reference (openai:gpt-5-mini) | 1 | 267 | 207 | 64 | 0.563 | 0.807 | 0.663 | 0.630 | 36.7 | -

The verification checker is the current production approach. On V2 it produces 76% fewer false positives than the baseline (56 vs. 230) while keeping recall (0.767) above the 0.75 threshold.
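
The precision, recall, and F1 columns follow from the TP/FP/FN counts by the standard definitions. The short sketch below recomputes the V2 values for the baseline and the verification checker; the helper name metrics is illustrative and not part of PECV-bench.

```python
def metrics(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) rounded to three decimals."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 3), round(recall, 3), round(f1, 3)


# V2 counts taken from the table above.
baseline = metrics(tp=260, fp=230, fn=71)
verification = metrics(tp=254, fp=56, fn=77)

# Relative reduction in false positives versus the baseline, in percent.
fp_reduction = round((230 - 56) / 230 * 100)
```

Here baseline evaluates to (0.531, 0.785, 0.633), verification to (0.819, 0.767, 0.793), and fp_reduction to 76, matching the reported figures.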


Benchmarking Workflow

Benchmarking Workflow — Activity Diagram

Overview

The workflow runs end-to-end via run_pecv_bench.py and connects Artemis to two external repositories:

  • PECV-bench — public: evaluation framework, public exercise dataset, ground-truth annotations, accumulated approach results.
  • PECV-bench-dataset — private: confidential solution repos and test suites. Directory structure mirrors data/ in PECV-bench; both are merged before variant materialization.

The diagram shows the three swim lanes: Developer triggers the run; Artemis handles course management, exercise import, and consistency check execution; PECV-bench materializes variants, prepares exercise ZIPs, saves results, and generates the report. The create_course flag in config.ini controls whether the workflow creates a new benchmark course or reuses an existing one (diamond in the diagram).

Confidential Exercise Storage

Store confidential artifacts in PECV-bench-dataset at the same relative path they would occupy under data/ in PECV-bench. Paths must match exactly.
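
One way to picture why the paths must match is an overlay copy: the private tree is copied on top of the public one, so files land in the right exercise folder only when the relative paths agree. The sketch below assumes this overlay behavior; merge_datasets is a hypothetical helper, and the workflow's actual merge logic may differ.

```python
import shutil
from pathlib import Path


def merge_datasets(public_data: Path, private_data: Path, merged: Path) -> None:
    """Overlay the private dataset tree onto a copy of the public data/ tree.

    Hypothetical helper: files from `private_data` end up next to the
    public files of the same exercise only if their relative paths match.
    """
    shutil.copytree(public_data, merged, dirs_exist_ok=True)
    # Second copy adds (or overwrites) files at identical relative paths.
    shutil.copytree(private_data, merged, dirs_exist_ok=True)
```

A confidential artifact stored under a mismatched path would end up in a separate folder and be missed during variant materialization.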

Dataset Versions

Version | Exercises | Variants | Languages
V1 | 3 (ITP2425) | 91 | Java
V2 | 16 (ERA2021, ISE22, ITP2425, MTG26, IOS26, QCSL25) | 325 | Assembler, C, Java, Python, SQL, Swift

Prerequisites

  1. Running Artemis instance with Hyperion and LLM configured.
  2. Git access to PECV-bench and PECV-bench-dataset.
  3. Python 3.13 with pip.
  4. Configured config.ini (see below).

Configuration

supporting_scripts/hyperion/consistency-check-benchmark/config.ini:

[Settings]
admin_user = artemis_admin
admin_password = artemis_admin
max_threads = 5
server_url = http://localhost:8080/api

[PECVCourseSettings]
pecv_bench_dir = pecv-bench
pecv_bench_dataset_dir = pecv-bench-dataset

[PECVExerciseSettings]
dataset_version = V2

[PECVConsistencyCheckSettings]
model_name = azure-openai-gpt-5-mini
model_effort = medium
consistency_check_exercises = {"V2": {"ITP2425": ["H01E01-Lectures", ...]}}

model_name and model_effort must match spring.ai.azure.openai.chat.options.deployment-name in Artemis.
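
Note that consistency_check_exercises holds a JSON object inside an INI value, so it needs a second parsing step. The sketch below shows one way to read it with the standard library. It uses an inline sample rather than the real file, and the exercise list is shortened to a single entry because the original list is elided; how run_pecv_bench.py itself loads the file is not shown here.

```python
import configparser
import json

# Inline sample mirroring part of config.ini (exercise list shortened).
SAMPLE = """
[Settings]
max_threads = 5
server_url = http://localhost:8080/api

[PECVExerciseSettings]
dataset_version = V2

[PECVConsistencyCheckSettings]
model_name = azure-openai-gpt-5-mini
model_effort = medium
consistency_check_exercises = {"V2": {"ITP2425": ["H01E01-Lectures"]}}
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE)

version = config["PECVExerciseSettings"]["dataset_version"]
max_threads = config.getint("Settings", "max_threads")

# The value is JSON keyed by dataset version, then by course.
raw = config["PECVConsistencyCheckSettings"]["consistency_check_exercises"]
selected = json.loads(raw)[version]
```

With a real file, config.read("config.ini") would replace read_string.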

Running the Workflow

  1. Navigate to the benchmark folder:
     cd supporting_scripts/hyperion/consistency-check-benchmark
  2. Create a virtual environment (Python 3.13):
     python3.13 -m venv venv
  3. Activate it:

    • macOS/Linux: source venv/bin/activate
    • Windows: venv\Scripts\activate
  4. Install dependencies:

    • macOS/Linux: pip install -r requirements.txt
    • Windows:
      python -m pip install --upgrade pip
      python -m pip install -r requirements.txt
      Make patch.exe available (Git for Windows):
      setx PATH "$env:PATH;C:\Program Files\Git\usr\bin"
      Restart PowerShell, then continue.
  5. Authenticate with the GitHub CLI:
     gh auth login
  6. Run:

    • macOS/Linux: python3 run_pecv_bench.py
    • Windows: python run_pecv_bench.py

The script executes the following steps in sequence:

Steps | What happens
1–3 | Clone/pull PECV-bench and PECV-bench-dataset; install PECV-bench Python package
4 | Materialize exercise variants (copy base + apply patch)
5–8 | Create session; create or reuse benchmark course on Artemis
9–10 | Build variant ZIPs; import into Artemis via programming-exercise import API
11–12 | Fetch exercise IDs; trigger consistency checks concurrently
13 | Generate aggregated report; create Hyperion source snapshot ZIP
14 | Open PR against artemis-approaches in PECV-bench with results

Approach Identity

artemis-<branch-slashes-to-dashes-hyphens-to-underscores>-<short-commit>
# e.g. branch feature/hyperion/verify-checker → artemis-feature-hyperion-verify_checker-a1b2c3d

The script replaces each / with - and each - with _ in the branch name. Use the approach_id printed in the logs; do not reconstruct it manually.
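
For illustration only, the naming rule can be sketched as below. The function name is hypothetical and the script's real implementation may differ, which is why the logged value remains authoritative. The sketch maps both characters in a single pass, so a slash that becomes a dash is not turned into an underscore afterwards.

```python
def approach_id(branch: str, short_commit: str) -> str:
    """Build an approach ID from a branch name and a short commit hash."""
    # One-pass translation: "/" becomes "-", original "-" becomes "_".
    normalized = branch.translate(str.maketrans({"/": "-", "-": "_"}))
    return f"artemis-{normalized}-{short_commit}"


example = approach_id("feature/hyperion/verify-checker", "a1b2c3d")
```

For the branch in the example above, this yields artemis-feature-hyperion-verify_checker-a1b2c3d.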

Running Steps Independently

Logs indicate exactly which step failed and which script to rerun.

Module | Entry point | When to rerun
exercises.py | python exercises.py | Variant materialization or ZIP creation failed
course.py | python course.py | Course creation or exercise ID retrieval failed
consistency_check.py | python consistency_check.py | One or more checks failed; most common rerun
report.py | python report.py | Report generation failed after checks completed
code_snapshot.py | python code_snapshot.py | Snapshot ZIP not created
merge_request.py | python merge_request.py | PR creation failed

When running consistency_check.py standalone, note the printed approach_id and set it manually in report.py, code_snapshot.py, and merge_request.py.


Known Limitations

SQL and Assembler precision is low. Prompts assume OO constructs; SQL and Assembler exercises don't map cleanly onto this ontology.

LLM non-determinism. Results vary between runs. Average at least 3 runs per approach for reliable comparisons.

Course deletion may not complete on first attempt due to network latency. Re-run — the step is idempotent.

Exercise import may fail if local VCS already contains exercises with the same name. Remove all folders inside repos/ and local-vcs-repos/ in the Artemis project root, then re-run.