Corp Disease Identifier
AI-powered disease identification for corporate health screening.
Overview
Corp Disease Identifier began as a research project around occupational health and grew into a production backend that processes anonymised health screening data. It exposes a small, sharp FastAPI surface backed by trained scikit-learn pipelines, with strict input validation and audit-friendly logging.
Highlights
- Trained scikit-learn pipelines with versioned model artefacts
- Pydantic v2 schemas for every request and response
- PII-safe logging with structured JSON and request IDs
- Dockerised inference service deployed behind a reverse proxy
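As an illustration of the PII-safe logging highlight, here is a minimal sketch using only the Python standard library. It is not the project's actual logging code: the logger name `screening`, the `SAFE_FIELDS` allow-list, and the helper names are all assumptions made for the example. The idea is that every log line is one JSON object carrying a request ID, and any extra field that is not explicitly allow-listed is dropped before it reaches the output.

```python
import contextvars
import json
import logging
import uuid

# Hypothetical allow-list: only these extra fields may appear in a log line.
SAFE_FIELDS = {"model_version", "latency_ms", "status"}

# Holds the ID of the request currently being handled.
request_id_var = contextvars.ContextVar("request_id", default="-")


def new_request_id() -> str:
    """Generate a request ID and bind it to the current context."""
    rid = uuid.uuid4().hex
    request_id_var.set(rid)
    return rid


class SafeJsonFormatter(logging.Formatter):
    """Render each record as one JSON object, dropping non-allow-listed extras."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "message": record.getMessage(),
            "request_id": request_id_var.get(),
        }
        # Copy only the extras that are explicitly considered safe.
        for key in SAFE_FIELDS:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry, sort_keys=True)


def build_logger(stream) -> logging.Logger:
    """Return a logger that writes allow-listed JSON lines to `stream`."""
    logger = logging.getLogger("screening")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(SafeJsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

An allow-list only guards the structured extras. The free-text message still has to be written without identifying values, so in practice messages would stay fixed strings such as `"scored"`.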
Why I built it
Corporate wellness programmes generate huge amounts of screening data that nobody has time to read. I wanted a backend that could quietly flag risk patterns so clinicians could focus on the people who actually need attention, not on sifting through spreadsheets.
How it works
Screening payloads hit a single FastAPI endpoint, get validated by Pydantic, then flow into a scikit-learn pipeline loaded once at startup. The model returns a risk score and a small set of contributing features, which the API wraps with an explanation block before responding.
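The flow described above (validate, score with a model loaded once, wrap the result in an explanation block) can be sketched as follows. To keep the example dependency-free, a dataclass stands in for the Pydantic schema and a small linear `StubModel` stands in for the trained scikit-learn pipeline. The field names, weights, baselines, and version string are invented for illustration and say nothing about the real model.

```python
import math
from dataclasses import asdict, dataclass
from functools import lru_cache


@dataclass(frozen=True)
class ScreeningPayload:
    """Stand-in for a Pydantic request schema; field names are hypothetical."""

    age: int
    systolic_bp: float
    bmi: float

    def __post_init__(self) -> None:
        # Reject out-of-range inputs before they reach the model.
        if not 18 <= self.age <= 100:
            raise ValueError("age must be between 18 and 100")
        if not 60 <= self.systolic_bp <= 260:
            raise ValueError("systolic_bp must be between 60 and 260")
        if not 10 <= self.bmi <= 80:
            raise ValueError("bmi must be between 10 and 80")


class StubModel:
    """Stand-in for a trained scikit-learn pipeline (weights are made up)."""

    version = "0.0.0-example"
    weights = {"age": 0.03, "systolic_bp": 0.02, "bmi": 0.05}
    baseline = {"age": 40.0, "systolic_bp": 120.0, "bmi": 25.0}

    def contributions(self, features: dict) -> dict:
        # Per-feature contribution to the linear score, relative to a baseline.
        return {
            name: weight * (features[name] - self.baseline[name])
            for name, weight in self.weights.items()
        }


@lru_cache(maxsize=1)
def get_model() -> StubModel:
    # Built on the first call and reused afterwards, mirroring "loaded once at startup".
    return StubModel()


def score(payload: ScreeningPayload, top_k: int = 2) -> dict:
    """Return a risk score plus the top contributing features."""
    model = get_model()
    contribs = model.contributions(asdict(payload))
    risk = 1.0 / (1.0 + math.exp(-sum(contribs.values())))
    top = sorted(contribs, key=lambda name: abs(contribs[name]), reverse=True)[:top_k]
    return {
        "risk_score": round(risk, 3),
        "model_version": model.version,
        "explanation": {"contributing_features": top},
    }
```

In a FastAPI service, the same `score` function would typically sit behind a single POST route, with the framework performing the validation step and returning a 422 response for bad input instead of raising `ValueError`.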
What I learned
Shipping ML behind a real API forces you to take input validation, versioning and observability seriously. The model is the easy part; the boring infrastructure around it is what makes the system trustworthy.
Challenges
- Designing a feature pipeline robust to missing or noisy screening inputs
- Keeping inference latency under 150 ms while loading large models
- Building an audit trail without storing identifying patient data
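One way to approach the audit-trail challenge is to record a keyed fingerprint of each payload instead of the payload itself. The sketch below is an assumption about how that could look, not a description of the project's implementation. The function names and record fields are hypothetical. It uses HMAC-SHA256 from the standard library so that the same payload always maps to the same fingerprint, while the stored record contains none of the original values.

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone


def payload_fingerprint(payload: dict, secret: bytes) -> str:
    """Keyed SHA-256 digest of the canonicalised payload."""
    # Sorting keys makes the digest independent of field order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hmac.new(secret, canonical.encode("utf-8"), hashlib.sha256).hexdigest()


def audit_record(
    request_id: str,
    payload: dict,
    risk_score: float,
    model_version: str,
    secret: bytes,
) -> dict:
    """Build an audit entry that references the payload without storing it."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "request_id": request_id,
        "model_version": model_version,
        "risk_score": risk_score,
        "payload_fingerprint": payload_fingerprint(payload, secret),
    }
```

A keyed digest is pseudonymisation rather than full anonymisation: anyone holding the key could test guessed payloads against a fingerprint, so the key needs to be stored and rotated with the same care as any other secret.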