Project case study

Human-AI Fairness Audit Lab


A public, reproducible audit of a synthetic AI-assisted candidate-review workflow. The lab is designed to show how model metrics, human review behavior, subgroup outcomes, counterfactual sensitivity, escalation rules, and monitoring checks can be evaluated together.

Synthetic, non-hiring demonstration: the lab uses synthetic applicants and simulated model/reviewer decisions. It is an evaluation and research demonstration, not a hiring recommendation system, and its statistical disparities do not establish real-world discrimination, legal compliance, causal effects, or real-world model validity.

Training: 6,000 synthetic cases
Validation: 2,000 synthetic cases
Held-out audit: 2,400 synthetic cases
Monitoring: 800 synthetic cases
Automated tests: 56 in the current release record
Stack: Python, pandas, scikit-learn, Fairlearn, DuckDB, Streamlit, pytest

Problem and intended use

AI-assisted decision workflows can look acceptable when only aggregate model metrics are reviewed. In high-impact contexts, evaluation also needs to examine whether outputs behave consistently across subgroups, whether human review changes or amplifies patterns, whether escalation policies work as intended, and whether the system continues to behave as expected over time.

The lab demonstrates a structured audit approach on synthetic data. Its intended use is methodological: show how a reproducible fairness and workflow evaluation can be organized, documented, tested, and monitored.

System design

Model-level evaluation

Held-out synthetic performance, subgroup metrics, supported intersectional slices, calibration, and threshold behavior.
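
As an illustration of the subgroup metrics and "supported" slices mentioned above, here is a minimal sketch of a max-min true-positive-rate (TPR) gap that ignores slices with too few positive cases. The function names, the `min_positives` parameter, and the toy data are assumptions made for this example; they are not taken from the lab's code, which may well rely on Fairlearn or pandas instead.

```python
from collections import defaultdict


def tpr_by_group(y_true, y_pred, groups, min_positives=30):
    """Return {group: TPR} for groups with at least `min_positives` positive labels.

    `min_positives` is a hypothetical support threshold; smaller slices are dropped.
    """
    true_pos = defaultdict(int)
    positives = defaultdict(int)
    for yt, yp, g in zip(y_true, y_pred, groups):
        if yt == 1:
            positives[g] += 1
            if yp == 1:
                true_pos[g] += 1
    return {g: true_pos[g] / n for g, n in positives.items() if n >= min_positives}


def max_min_gap(rates):
    """Largest minus smallest rate, or None if fewer than two groups are supported."""
    if len(rates) < 2:
        return None
    return max(rates.values()) - min(rates.values())


# Toy data, invented for illustration only.
y_true = [1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "a", "b"]

rates = tpr_by_group(y_true, y_pred, groups, min_positives=2)
gap = max_min_gap(rates)
print(rates, gap)
```

Returning None rather than 0.0 when fewer than two slices are supported keeps "no measurable gap" distinct from "not enough data to measure one".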

Workflow evaluation

Simulated reviewer decisions, human overrides, escalation policy behavior, and end-to-end extraction sensitivity.

Responsible documentation

Dataset/model cards, fairness test plan, audit report, monitoring plan, and decision log.

Selected held-out synthetic findings

These values describe the synthetic audit release and should not be generalized to real hiring or selection settings.

Check | Model v1 | Model v2
Held-out synthetic AUC | 0.736 | 0.746
Direct-input recommendation flips | 7.3% | 0.0%
End-to-end extraction-sensitivity flips | 11.0% | 1.1%
Observed supported max-min TPR gaps | 22.6% | 12.6%
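
One plausible reading of a direct-input recommendation flip is a case whose recommendation changes when only a protected attribute is swapped and everything else is held fixed. The sketch below computes a flip rate under that assumption. The `recommend` callables, the `group` field, the thresholds, and the toy cases are all hypothetical and do not describe the lab's models.

```python
def flip_rate(recommend, cases, attribute, swap):
    """Share of cases whose recommendation changes when `attribute` is swapped.

    `recommend` maps a case dict to a decision; `swap` maps each attribute
    value to its counterfactual value.
    """
    if not cases:
        raise ValueError("flip_rate needs at least one case")
    flips = 0
    for case in cases:
        counterfactual = dict(case)  # copy so the original case is untouched
        counterfactual[attribute] = swap[case[attribute]]
        if recommend(case) != recommend(counterfactual):
            flips += 1
    return flips / len(cases)


# Hypothetical toy models: one reads the group field, one ignores it.
def group_sensitive(case):
    threshold = 0.60 if case["group"] == "a" else 0.65
    return case["score"] >= threshold


def group_invariant(case):
    return case["score"] >= 0.60


cases = [
    {"score": 0.62, "group": "a"},
    {"score": 0.58, "group": "b"},
    {"score": 0.90, "group": "a"},
    {"score": 0.20, "group": "b"},
]
swap = {"a": "b", "b": "a"}

sensitive_rate = flip_rate(group_sensitive, cases, "group", swap)
invariant_rate = flip_rate(group_invariant, cases, "group", swap)
print(sensitive_rate, invariant_rate)
```

A model that never reads the swapped field has a flip rate of zero by construction, which is why a direct-input check can pass while an end-to-end check that passes through an extraction step still finds flips.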

Uncertainty caveat: the bootstrap interval for the v1-minus-v2 TPR-gap change includes zero, so the observed difference should not be presented as a statistically established improvement.
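
To make the caveat concrete, here is a minimal sketch of a paired percentile bootstrap for the change in TPR gap between two models. It assumes cases are resampled with replacement and both models are scored on the same resample; the lab's actual bootstrap procedure, resample count, and interval type are not stated here, so every name and parameter below is illustrative.

```python
import random


def tpr_gap(y_true, y_pred, groups):
    """Max-min TPR across groups, or None if fewer than two groups have positives."""
    true_pos, positives = {}, {}
    for yt, yp, g in zip(y_true, y_pred, groups):
        if yt == 1:
            positives[g] = positives.get(g, 0) + 1
            true_pos[g] = true_pos.get(g, 0) + (1 if yp == 1 else 0)
    rates = [true_pos[g] / positives[g] for g in positives]
    return max(rates) - min(rates) if len(rates) >= 2 else None


def bootstrap_gap_change(y_true, pred_v1, pred_v2, groups,
                         n_boot=2000, alpha=0.05, seed=0):
    """Percentile interval for (v1 gap - v2 gap) under a paired case-level bootstrap."""
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # same resample for both models
        yt = [y_true[i] for i in idx]
        gr = [groups[i] for i in idx]
        gap_v1 = tpr_gap(yt, [pred_v1[i] for i in idx], gr)
        gap_v2 = tpr_gap(yt, [pred_v2[i] for i in idx], gr)
        if gap_v1 is None or gap_v2 is None:
            continue  # skip resamples where a gap is undefined
        diffs.append(gap_v1 - gap_v2)
    if not diffs:
        raise ValueError("no usable bootstrap resamples")
    diffs.sort()
    lo = diffs[int((alpha / 2) * len(diffs))]
    hi = diffs[int((1 - alpha / 2) * len(diffs)) - 1]
    return lo, hi


# Toy data, generated only to exercise the function.
data_rng = random.Random(1)
n_cases = 400
groups = [data_rng.choice("ab") for _ in range(n_cases)]
y_true = [1 if data_rng.random() < 0.5 else 0 for _ in range(n_cases)]
pred_v1 = [1 if data_rng.random() < (0.8 if g == "a" else 0.6) else 0 for g in groups]
pred_v2 = [1 if data_rng.random() < (0.8 if g == "a" else 0.7) else 0 for g in groups]

lo, hi = bootstrap_gap_change(y_true, pred_v1, pred_v2, groups, n_boot=500)
includes_zero = lo <= 0.0 <= hi
print(lo, hi, includes_zero)
```

When `includes_zero` is true, the data are compatible with no change in the gap, which is the situation the caveat describes.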

Architecture

Reproducible audit artifacts

The project pairs an interactive Streamlit app with tested Python audit logic and documentation artifacts. The emphasis is on inspectability: a reader can see what was evaluated, which checks passed, what changed across release records, and where the limits of the synthetic demonstration lie.

  • Python, pandas, scikit-learn, Fairlearn, DuckDB, Streamlit, pytest
  • Direct-input invariance and end-to-end extraction-sensitivity suites
  • Dataset cards, model cards, fairness test plan, audit report, monitoring plan, and decision log
  • Monitoring regression checks for release-to-release review
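
A release-to-release regression check like the last item above could be sketched as a comparison of each tracked metric against the previous release with a per-metric tolerance. The metric names, directions, and tolerances below are hypothetical; only the v1 and v2 values are taken from the findings table, and the lab's real monitoring rules are not shown here.

```python
def regression_failures(baseline, current, rules):
    """Return names of metrics that worsened by more than their tolerance.

    `rules` maps metric name -> (direction, tolerance), where direction is
    "higher_is_better" or "lower_is_better".
    """
    failures = []
    for name, (direction, tolerance) in rules.items():
        delta = current[name] - baseline[name]
        worsened_by = -delta if direction == "higher_is_better" else delta
        if worsened_by > tolerance:
            failures.append(name)
    return failures


# Values from the findings table, expressed as fractions.
v1 = {"auc": 0.736, "direct_flip_rate": 0.073, "e2e_flip_rate": 0.110}
v2 = {"auc": 0.746, "direct_flip_rate": 0.000, "e2e_flip_rate": 0.011}

# Hypothetical tolerances chosen for illustration.
rules = {
    "auc": ("higher_is_better", 0.005),
    "direct_flip_rate": ("lower_is_better", 0.005),
    "e2e_flip_rate": ("lower_is_better", 0.005),
}

forward = regression_failures(v1, v2, rules)   # v1 -> v2
backward = regression_failures(v2, v1, rules)  # v2 -> v1, as if rolled back
print(forward, backward)
```

A check of this shape fits naturally into a pytest suite, where a non-empty failure list fails the release test.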
