Project case study
Human-AI Fairness Audit Lab
A public, reproducible audit of a synthetic AI-assisted candidate-review workflow. The lab is designed to show how model metrics, human review behavior, subgroup outcomes, counterfactual sensitivity, escalation rules, and monitoring checks can be evaluated together.
Synthetic, non-hiring demonstration: the lab uses synthetic applicants and simulated model/reviewer decisions. It is an evaluation and research demonstration, not a hiring recommendation system, and its statistical disparities do not establish real-world discrimination, legal compliance, causal effects, or real-world model validity.
- Training: 6,000 synthetic cases
- Validation: 2,000 synthetic cases
- Held-out audit: 2,400 synthetic cases
- Monitoring: 800 synthetic cases
- Automated tests: 56 in the current release record
- Stack: Python, pandas, scikit-learn, Fairlearn, DuckDB, Streamlit, pytest
Problem and intended use
AI-assisted decision workflows can look acceptable when only aggregate model metrics are reviewed. In high-impact contexts, evaluation also needs to examine whether outputs behave consistently across subgroups, whether human review changes or amplifies patterns, whether escalation policies work as intended, and whether the system continues to behave as expected over time.
The lab demonstrates a structured audit approach on synthetic data. Its intended use is methodological: show how a reproducible fairness and workflow evaluation can be organized, documented, tested, and monitored.
System design
Model-level evaluation
Held-out synthetic performance, subgroup metrics, supported intersectional slices, calibration, and threshold behavior.
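As a rough illustration of a subgroup metric of this kind, the sketch below computes per-group true positive rates and the max-min gap between them. It is a minimal stdlib example, not the lab's implementation: the function names, the toy records, and the reading of "supported" as "at least `min_positives` actual positives in the slice" are all assumptions made for illustration.

```python
from collections import defaultdict


def tpr_by_group(records, min_positives=1):
    """Return {group: TPR} for groups with at least `min_positives` actual positives.

    `records` is an iterable of (group, y_true, y_pred) tuples with 0/1 labels.
    The support rule is an assumption, not the lab's documented criterion.
    """
    true_positives = defaultdict(int)
    positives = defaultdict(int)
    for group, y_true, y_pred in records:
        if y_true == 1:
            positives[group] += 1
            if y_pred == 1:
                true_positives[group] += 1
    return {
        group: true_positives[group] / count
        for group, count in positives.items()
        if count >= min_positives
    }


def max_min_gap(rates):
    """Largest minus smallest rate across the supported groups."""
    values = list(rates.values())
    return max(values) - min(values)


# Hypothetical toy records: (group, true label, model recommendation).
records = [
    ("A", 1, 1), ("A", 1, 1), ("A", 1, 0), ("A", 1, 1), ("A", 0, 0),
    ("B", 1, 1), ("B", 1, 0), ("B", 1, 0), ("B", 1, 1), ("B", 0, 1),
    ("C", 1, 1),  # a single positive: excluded when min_positives=2
]
rates = tpr_by_group(records, min_positives=2)
gap = max_min_gap(rates)
```

Filtering on support before taking the maximum and minimum keeps a very small slice from dominating the gap, which is one plausible reason to report only supported slices.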
Workflow evaluation
Simulated reviewer decisions, human overrides, escalation policy behavior, and end-to-end extraction sensitivity.
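One way to picture a sensitivity check in this layer is a flip rate: perturb an attribute that should not matter, re-score the case, and count how often the thresholded recommendation changes. The sketch below assumes that framing. The scorers, the `name_style` field, the perturbation, and the 0.5 threshold are hypothetical stand-ins rather than the lab's actual suites.

```python
def flip_rate(score_fn, cases, perturb, threshold=0.5):
    """Share of cases whose thresholded recommendation changes after `perturb`."""
    if not cases:
        raise ValueError("cases must not be empty")
    flips = 0
    for case in cases:
        before = score_fn(case) >= threshold
        after = score_fn(perturb(case)) >= threshold
        flips += before != after  # a bool adds as 0 or 1
    return flips / len(cases)


def leaky_score(case):
    """Hypothetical scorer that wrongly rewards one value of an irrelevant field."""
    bonus = 0.2 if case["name_style"] == "x" else 0.0
    return case["skill"] + bonus


def invariant_score(case):
    """Hypothetical scorer that ignores the irrelevant field."""
    return case["skill"]


def swap_name_style(case):
    """Counterfactual edit: toggle the irrelevant field, leave everything else."""
    swapped = dict(case)
    swapped["name_style"] = "y" if case["name_style"] == "x" else "x"
    return swapped


# Hypothetical toy cases.
cases = [
    {"skill": 0.40, "name_style": "x"},
    {"skill": 0.70, "name_style": "x"},
    {"skill": 0.35, "name_style": "y"},
    {"skill": 0.10, "name_style": "y"},
]
leaky_rate = flip_rate(leaky_score, cases, swap_name_style)
invariant_rate = flip_rate(invariant_score, cases, swap_name_style)
```

In this toy setup the scorer that reads the irrelevant field flips on the two cases near the threshold, while the scorer that ignores it cannot flip at all. An end-to-end variant would apply the perturbation to raw input before any extraction step instead of to the structured fields.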
Responsible documentation
Dataset/model cards, fairness test plan, audit report, monitoring plan, and decision log.
Selected held-out synthetic findings
These values describe the synthetic audit release and should not be generalized to real hiring or selection settings.
| Check | Model v1 | Model v2 |
|---|---|---|
| Held-out synthetic AUC | 0.736 | 0.746 |
| Direct-input recommendation flips | 7.3% | 0.0% |
| End-to-end extraction-sensitivity flips | 11.0% | 1.1% |
| Observed max-min TPR gap across supported subgroups | 22.6% | 12.6% |
Uncertainty caveat: the bootstrap interval for the v1-minus-v2 TPR-gap change includes zero, so the observed difference should not be presented as a statistically established improvement.
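The caveat above can be made concrete with a paired percentile bootstrap: resample cases with replacement, recompute each model's max-min TPR gap on the same resample, and look at the distribution of the v1-minus-v2 difference. The sketch below is one possible construction under stated assumptions. The row layout, the function names, the generated toy data, and the choice of a percentile interval are illustrative, and the lab's own procedure may differ.

```python
import random


def tpr_gap(rows, pred_key):
    """Max-min TPR across groups, or None if fewer than two groups have positives."""
    hits, positives = {}, {}
    for row in rows:
        if row["y"] == 1:
            group = row["group"]
            positives[group] = positives.get(group, 0) + 1
            hits[group] = hits.get(group, 0) + row[pred_key]
    rates = [hits[g] / positives[g] for g in positives]
    if len(rates) < 2:
        return None
    return max(rates) - min(rates)


def bootstrap_gap_change(rows, n_boot=2000, alpha=0.05, seed=0):
    """Percentile interval for gap(v1) - gap(v2), resampling whole cases."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        sample = rng.choices(rows, k=len(rows))  # paired: both models share the draw
        gap_v1, gap_v2 = tpr_gap(sample, "v1"), tpr_gap(sample, "v2")
        if gap_v1 is not None and gap_v2 is not None:
            diffs.append(gap_v1 - gap_v2)
    if not diffs:
        raise ValueError("no usable bootstrap resamples")
    diffs.sort()
    lower = diffs[int((alpha / 2) * (len(diffs) - 1))]
    upper = diffs[int((1 - alpha / 2) * (len(diffs) - 1))]
    return lower, upper


# Hypothetical toy data; the rates below are made up and are not the lab's results.
data_rng = random.Random(42)
rows = []
for _ in range(400):
    group = data_rng.choice(["A", "B"])
    y = int(data_rng.random() < 0.5)
    p_v1 = 0.80 if group == "A" else 0.60
    p_v2 = 0.75 if group == "A" else 0.65
    rows.append({
        "group": group,
        "y": y,
        "v1": int(data_rng.random() < p_v1) if y else 0,
        "v2": int(data_rng.random() < p_v2) if y else 0,
    })

lo, hi = bootstrap_gap_change(rows, n_boot=500)
includes_zero = lo <= 0.0 <= hi
```

When `includes_zero` is true, the resampled differences are compatible with no change in the gap, which is the situation the caveat describes. Resampling whole cases keeps the v1 and v2 predictions paired, so variation shared by both models cancels in the difference.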
Architecture
Reproducible audit artifacts
The project pairs an interactive Streamlit app with tested Python audit logic and documentation artifacts. The emphasis is on inspectability: a reader can see what was evaluated, which checks passed, what changed across release records, and where the synthetic demonstration's boundaries begin and end.
- Python, pandas, scikit-learn, Fairlearn, DuckDB, Streamlit, pytest
- Direct-input invariance and end-to-end extraction-sensitivity suites
- Dataset cards, model cards, fairness test plan, audit report, monitoring plan, and decision log
- Monitoring regression checks for release-to-release review
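A release-to-release regression check can be as simple as comparing two release records against per-metric tolerances. The sketch below assumes that shape. The metric keys, the tolerance values, and the `regression_failures` helper are hypothetical, and the records reuse three rows from the findings table above (with percentages written as fractions) purely as sample input.

```python
def regression_failures(baseline, candidate, tolerances):
    """Return the metrics where `candidate` is worse than `baseline` beyond tolerance.

    `tolerances` maps metric name -> (direction, allowed_worsening), where
    direction is "higher_is_better" or "lower_is_better".
    """
    failures = []
    for metric, (direction, allowed) in tolerances.items():
        delta = candidate[metric] - baseline[metric]
        worse_by = -delta if direction == "higher_is_better" else delta
        if worse_by > allowed:
            failures.append(metric)
    return failures


# Sample records built from the findings table; the tolerances are illustrative.
release_v1 = {"auc": 0.736, "direct_flip_rate": 0.073, "tpr_gap": 0.226}
release_v2 = {"auc": 0.746, "direct_flip_rate": 0.000, "tpr_gap": 0.126}
tolerances = {
    "auc": ("higher_is_better", 0.005),
    "direct_flip_rate": ("lower_is_better", 0.01),
    "tpr_gap": ("lower_is_better", 0.02),
}

forward = regression_failures(release_v1, release_v2, tolerances)
backward = regression_failures(release_v2, release_v1, tolerances)
```

Moving from v1 to v2 trips none of these illustrative checks, while the reverse direction trips all three. A check like this only flags point-estimate movement, so it complements interval-based comparisons such as the bootstrap caveat in the findings section and does not replace them.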