Project case study

Human-AI Fairness Audit Lab

A public, reproducible audit of a synthetic, non-hiring AI-assisted candidate-review workflow. The lab shows how model metrics, human review behavior, subgroup outcomes, counterfactual sensitivity, escalation rules, and monitoring checks can be evaluated together.

Synthetic, non-hiring demonstration: the lab uses synthetic applicants and simulated model/reviewer decisions. It is an evaluation and research demonstration, not a hiring recommendation system, and its statistical disparities do not establish real-world discrimination, legal compliance, causal effects, or real-world model validity.

Open Live Demo | View Source

Training: 6,000 synthetic cases
Validation: 2,000 synthetic cases
Held-out audit: 2,400 synthetic cases
Monitoring: 800 synthetic cases
Automated tests: 56 in the current release record
Stack: Python, pandas, scikit-learn, Fairlearn, DuckDB, Streamlit, pytest

Problem and intended use

AI-assisted decision workflows can look acceptable when only aggregate model metrics are reviewed. In high-impact contexts, evaluation also needs to examine whether outputs behave consistently across subgroups, whether human review changes or amplifies patterns in those outputs, whether escalation policies work as intended, and whether the system continues to behave as expected over time.

The lab demonstrates a structured audit approach on synthetic data. Its intended use is methodological: show how a reproducible fairness and workflow evaluation can be organized, documented, tested, and monitored.

System design

Model-level evaluation

Held-out synthetic performance, subgroup metrics, supported intersectional slices, calibration, and threshold behavior.
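
As an illustration of the subgroup metrics named above, the following is a minimal sketch of a per-group true positive rate (TPR) and a max-min gap restricted to groups with enough positive cases. It uses only the Python standard library and is not the lab's own code. The function names, the toy data, and the `min_positives` support threshold are assumptions made for this example. Fairlearn, which the lab lists in its stack, offers a comparable per-group breakdown through `MetricFrame`.

```python
from collections import defaultdict


def tpr_by_group(y_true, y_pred, groups, min_positives=30):
    """Return {group: TPR} for groups with at least `min_positives` positives.

    `min_positives` is a hypothetical support threshold for this sketch.
    """
    positives = defaultdict(int)
    true_positives = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            positives[group] += 1
            true_positives[group] += int(pred == 1)
    return {
        group: true_positives[group] / count
        for group, count in positives.items()
        if count >= min_positives
    }


def max_min_gap(rates):
    """Largest minus smallest rate across the supported groups."""
    if len(rates) < 2:
        return 0.0
    return max(rates.values()) - min(rates.values())


# Toy data (invented for illustration): group C has one positive case,
# so it falls below the support threshold and is left out of the gap.
y_true = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0]
groups = ["A"] * 5 + ["B"] * 5 + ["C"] * 2

rates = tpr_by_group(y_true, y_pred, groups, min_positives=2)
gap = max_min_gap(rates)
print(rates, gap)
```

Filtering by support before taking the max-min difference keeps very small slices from dominating the gap, which is one plausible reading of "supported" in this audit.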

Workflow evaluation

Simulated reviewer decisions, human overrides, escalation policy behavior, and end-to-end extraction sensitivity.
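
A sensitivity check of this kind can be sketched as a flip rate: the share of cases whose recommendation changes when an input that should not matter is perturbed. The example below is a hedged illustration, not the lab's implementation. The scorer `toy_recommend`, the `proxy_flag` field, and the `swap_proxy` perturbation are all hypothetical.

```python
def flip_rate(cases, recommend, perturb):
    """Share of cases whose recommendation changes after perturbation."""
    if not cases:
        return 0.0
    flips = sum(recommend(case) != recommend(perturb(case)) for case in cases)
    return flips / len(cases)


def toy_recommend(case):
    """Hypothetical scorer that (undesirably) reads a proxy field."""
    score = case["years_experience"] + (2 if case["proxy_flag"] else 0)
    return score >= 5


def invariant_recommend(case):
    """Hypothetical scorer that ignores the proxy field."""
    return case["years_experience"] >= 5


def swap_proxy(case):
    """Counterfactual copy of the case with only the proxy field toggled."""
    return {**case, "proxy_flag": not case["proxy_flag"]}


toy_cases = [
    {"years_experience": 2, "proxy_flag": True},
    {"years_experience": 4, "proxy_flag": True},
    {"years_experience": 6, "proxy_flag": False},
    {"years_experience": 4, "proxy_flag": False},
]

sensitive_rate = flip_rate(toy_cases, toy_recommend, swap_proxy)
invariant_rate = flip_rate(toy_cases, invariant_recommend, swap_proxy)
print(sensitive_rate, invariant_rate)
```

Assuming the end-to-end variant perturbs the raw record before extraction, the same function applies if `recommend` wraps extraction and scoring together.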

Responsible documentation

Dataset/model cards, fairness test plan, audit report, monitoring plan, and decision log.

Selected held-out synthetic findings

These values describe the synthetic audit release and should not be generalized to real hiring or selection settings.

Check	Model v1	Model v2
Held-out synthetic AUC	0.736	0.746
Direct-input recommendation flips	7.3%	0.0%
End-to-end extraction-sensitivity flips	11.0%	1.1%
Observed max-min TPR gap across supported groups	22.6%	12.6%

Uncertainty caveat: the bootstrap interval for the v1-minus-v2 TPR-gap change includes zero, so the observed difference should not be presented as a statistically established improvement.
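
To make the caveat concrete, here is a minimal sketch of a paired percentile bootstrap for the v1-minus-v2 change in the max-min TPR gap. It is an assumption about how such an interval could be computed, not a description of the lab's procedure. The function names, resample count, seed, and toy data are invented for the example, and the toy data does not reproduce the audit figures.

```python
import random


def tpr_gap(y_true, y_pred, groups):
    """Max-min TPR across groups that have at least one positive case."""
    positives, hits = {}, {}
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            positives[group] = positives.get(group, 0) + 1
            hits[group] = hits.get(group, 0) + int(pred == 1)
    rates = [hits[g] / positives[g] for g in positives]
    return max(rates) - min(rates) if len(rates) >= 2 else 0.0


def bootstrap_gap_change(y_true, pred_v1, pred_v2, groups,
                         n_boot=1000, alpha=0.05, seed=0):
    """Percentile interval for gap(v1) - gap(v2) under paired resampling."""
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        # Resample case indices once so both models see the same cases.
        idx = [rng.randrange(n) for _ in range(n)]
        truth = [y_true[i] for i in idx]
        group = [groups[i] for i in idx]
        gap_v1 = tpr_gap(truth, [pred_v1[i] for i in idx], group)
        gap_v2 = tpr_gap(truth, [pred_v2[i] for i in idx], group)
        diffs.append(gap_v1 - gap_v2)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Invented toy data: two groups, two noisy classifiers.
data_rng = random.Random(1)
groups = [data_rng.choice("AB") for _ in range(400)]
y_true = [data_rng.randint(0, 1) for _ in range(400)]
pred_v1 = [t if data_rng.random() < 0.70 else 1 - t for t in y_true]
pred_v2 = [t if data_rng.random() < 0.75 else 1 - t for t in y_true]

low, high = bootstrap_gap_change(y_true, pred_v1, pred_v2, groups, n_boot=500)
includes_zero = low <= 0.0 <= high
print(low, high, includes_zero)
```

When `includes_zero` is true, the resampled differences are compatible with no change in the gap, which is the situation the caveat describes.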

Architecture

Reproducible audit artifacts

The project pairs an interactive Streamlit app with tested Python audit logic and documentation artifacts. The emphasis is on inspectability: a reader can see what was evaluated, which checks passed, what changed across release records, and where the limits of the synthetic demonstration lie.

Python, pandas, scikit-learn, Fairlearn, DuckDB, Streamlit, pytest
Direct-input invariance and end-to-end extraction-sensitivity suites
Dataset cards, model cards, fairness test plan, audit report, monitoring plan, and decision log
Monitoring regression checks for release-to-release review
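
A release-to-release regression check of the kind listed above might look like the sketch below. It compares point estimates from two release records against per-metric tolerances. The metric values come from the findings table in this case study. The `RULES` tolerances, the metric key names, and the function name are hypothetical, since the lab's actual thresholds are not stated here.

```python
# Point estimates from the held-out synthetic findings table.
RELEASE_V1 = {"auc": 0.736, "direct_input_flip_rate": 0.073, "tpr_gap": 0.226}
RELEASE_V2 = {"auc": 0.746, "direct_input_flip_rate": 0.000, "tpr_gap": 0.126}

# Hypothetical rules: (direction, allowed worsening before the check fails).
RULES = {
    "auc": ("higher_is_better", 0.005),
    "direct_input_flip_rate": ("lower_is_better", 0.01),
    "tpr_gap": ("lower_is_better", 0.02),
}


def regression_failures(baseline, current, rules):
    """Return the metrics that worsened by more than their tolerance."""
    failures = []
    for metric, (direction, tolerance) in rules.items():
        delta = current[metric] - baseline[metric]
        # Express the change so that a positive value always means "worse".
        worsening = -delta if direction == "higher_is_better" else delta
        if worsening > tolerance:
            failures.append(metric)
    return failures


forward = regression_failures(RELEASE_V1, RELEASE_V2, RULES)
reverse = regression_failures(RELEASE_V2, RELEASE_V1, RULES)
print(forward, reverse)
```

A check like this only compares point estimates. It does not replace the uncertainty analysis, so a passing result should not be read as a statistically established improvement.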
