harryliu.space / projects / jobhunter
Project · Case Study

JobHunter

An open-source multi-agent pipeline that reads the job market daily, honestly, and at the cost of a coffee.

2026 · Personal build · open-sourced at github.com/hazzaliu/JobHunter
§ 01 — Context

A volume problem with an honesty problem sitting inside it.

Job searching in Melbourne's data and AI market in 2026 is a volume problem. Two hundred-plus new listings a day across multiple boards. A hundred-plus applicants per role. The quality of the match is often hard to tell from the description alone. The default workflow — manual review plus a one-size-fits-all CV — wastes time on bad matches and undersells the good ones.

Underneath the volume problem is an honesty problem. A single LLM asked "is this job a good fit for Harry?" will find reasons to say yes. Language models, by default, are cooperative. The moments in a job search where you most need a cold second opinion are the moments an LLM will cheerfully supply the hottest first opinion.

I wanted a system that read every relevant listing daily, scored them with honesty rather than optimistic enthusiasm, produced tailored application materials only for the genuinely promising ones, learned from my apply/skip decisions over time, and cost less than a coffee per day to operate. That last constraint shaped the design more than any other.

§ 02 — The product decisions

Two decisions that shaped everything downstream.

First: a staged pipeline, not a single model. Running every listing through a frontier LLM costs around $10-20 a day and is wasteful — 95% of listings aren't a fit, and cheap models can identify that. So embeddings and rerankers (open-source, running on CPU, effectively free) handle the first 95%. The expensive LLM only sees the top 20 of 200+.
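The funnel shape is easy to sketch: two sort-and-truncate passes before anything paid runs. The scorers below (`overlap`, `weighted_overlap`, the `TARGET` keyword set) are illustrative stand-ins, not the actual SentenceTransformers and FlashRank models.

```python
def staged_filter(jobs, cheap_score, rerank_score, k1=40, k2=20):
    """Cheap score narrows the pool, the pricier rerank refines it;
    only the k2 survivors ever reach a paid LLM call."""
    shortlist = sorted(jobs, key=cheap_score, reverse=True)[:k1]
    return sorted(shortlist, key=rerank_score, reverse=True)[:k2]

# Stand-in scorers: keyword overlap as the "embedding" stage, a slightly
# richer variant as the "cross-encoder" stage. Both names are hypothetical.
TARGET = {"machine", "learning", "data", "python"}

def overlap(job):
    return len(set(job.lower().split()) & TARGET)

def weighted_overlap(job):
    return overlap(job) + 0.5 * ("senior" in job.lower())

# Toy market: 200 listings, roughly a third relevant
jobs = [f"role {i} data python" if i % 3 == 0 else f"role {i} sales"
        for i in range(200)]
top = staged_filter(jobs, overlap, weighted_overlap)
print(len(top))  # 20 — the other 180 never cost a cent
```

The design point is that the expensive stage's cost is bounded by `k2`, not by market volume.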

Second: adversarial evaluation as a first-class component. Instead of one LLM scoring a job, three do. A Seniority agent, a Skills agent, and a Devil's Advocate whose only job is to find reasons not to apply. The panel's combined score is the honest one — because one of its voices is paid, structurally, to disagree.

A single LLM scoring a CV against a job is optimistic by default. It tells you what you want to hear. An adversarial agent isn't smarter — it's structurally biased the other way, which is how you recover honest signal.
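The panel mechanic itself is small. A minimal sketch, with the three agents reduced to stand-in functions (the real ones are LLM calls with role-specific prompts; the scores and rationales below are invented for illustration):

```python
from statistics import mean

def panel_score(job, agents):
    """Each agent returns (score, rationale). The combined score is the
    mean, so one structurally negative voice drags optimism back down."""
    verdicts = {name: agent(job) for name, agent in agents.items()}
    combined = mean(score for score, _ in verdicts.values())
    return combined, verdicts

# Hypothetical agents; in the real system each is a separate LLM call.
agents = {
    "seniority": lambda j: (0.8, "title matches target level"),
    "skills":    lambda j: (0.9, "core stack overlaps strongly"),
    # Devil's Advocate: prompted to find reasons NOT to apply
    "devil":     lambda j: (0.3, "no cloud experience listed; likely screen-out"),
}

score, verdicts = panel_score({"title": "ML Engineer"}, agents)
```

Two enthusiastic 0.8–0.9 votes and one adversarial 0.3 average out to roughly 0.67: a "maybe", which is usually the honest answer.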

The feedback loop closes it. Every apply/skip decision is logged. Companies skipped three times get auto-blocked from future recommendations. Lane-level conversion analysis surfaces which job sources yield the strongest matches. The pipeline gets sharper with use, not drift-ier.
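The auto-block rule fits in a few lines. A sketch, assuming an in-memory log; the `FeedbackLog` class and its method names are my framing, not the repo's actual schema (only the three-skip threshold comes from the real system):

```python
from collections import Counter

class FeedbackLog:
    """Log apply/skip decisions; companies skipped three times
    are blocked from future recommendations."""
    BLOCK_AFTER = 3

    def __init__(self):
        self.skips = Counter()
        self.blocked = set()

    def record(self, company, decision):
        if decision == "skip":
            self.skips[company] += 1
            if self.skips[company] >= self.BLOCK_AFTER:
                self.blocked.add(company)

    def allowed(self, company):
        return company not in self.blocked

log = FeedbackLog()
for _ in range(3):
    log.record("Acme Corp", "skip")
print(log.allowed("Acme Corp"))  # False — three skips, auto-blocked
```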

§ 03 — How it works

Five stages. Cheap filters first, expensive reasoning last.

01 · Scrape & dedup — multi-board, fingerprint on title + company + location + post date → 200+ jobs · $0.00
02 · Embedding similarity — SentenceTransformers on CPU, all jobs scored against my target lanes → top 40 · $0.00
03 · Cross-encoder rerank — FlashRank refines with a slower but more accurate model → top 20 · $0.00
04 · Three-agent panel — Seniority, Skills, Devil's Advocate; each prompted with role-specific constraints → top 5 · ~$0.50
05 · Material generation — evidence-bank-injected CV + cover letter, tailored per candidate → top 5 · ~$0.50

Daily run · cron · Notion + Discord delivery · 95% filtered at $0 · ~$1/day
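Stage 01's dedup key can be sketched as a normalised hash over the four fields. The field names and the hashing choice here are assumptions for illustration, not the repo's actual implementation:

```python
import hashlib

def fingerprint(job):
    """Same posting scraped from two boards should collapse to one key,
    so fields are lowercased and stripped before hashing."""
    key = "|".join(
        str(job.get(field, "")).strip().lower()
        for field in ("title", "company", "location", "posted")
    )
    return hashlib.sha256(key.encode()).hexdigest()[:16]

a = {"title": "Data Scientist", "company": "Acme",
     "location": "Melbourne", "posted": "2026-03-01"}
b = {**a, "title": "  Data Scientist "}  # same job, messier scrape
print(fingerprint(a) == fingerprint(b))  # True
```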

A few details worth naming:

  • The evidence bank. A Notion database with structured career bullets — each with a STAR story, technologies, strengths, and metadata about when to use them. A keyword match against the scored job pulls the most relevant evidence into the prompt for material generation. CVs are composed, not templated.
  • Lane-aware classification. Jobs get classified into one of three career lanes (applied ML, analytics, data PM) before scoring, using a deterministic keyword-based classifier (zero LLM cost). Each lane has its own positioning, CV base, and priority skills — so the material generation runs with lane-specific context.
  • The Devil's Advocate prompt. Not "score this job negatively" — that produces low-quality adversarial text. The real prompt asks: "Find the three strongest reasons this candidate would not make it past a phone screen for this role. Be specific." That reframing produces useful criticism instead of performative pessimism.
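The lane classifier in the second bullet is deterministic enough to sketch directly. The keyword lists below are invented placeholders, not the project's real lane definitions:

```python
# Hypothetical lane keywords; the real lists are larger and tuned.
LANES = {
    "applied_ml": {"machine learning", "ml engineer", "llm", "model"},
    "analytics":  {"analyst", "dashboard", "sql", "insights"},
    "data_pm":    {"product manager", "roadmap", "stakeholder"},
}

def classify_lane(title, description=""):
    """Pick the lane with the most keyword hits — zero LLM cost."""
    text = f"{title} {description}".lower()
    scores = {lane: sum(kw in text for kw in kws)
              for lane, kws in LANES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None  # unmatched jobs fall through

print(classify_lane("Senior ML Engineer (LLM platform)"))  # applied_ml
```

Because the classifier is deterministic, lane assignment is free, reproducible, and trivially debuggable, which matters when every downstream prompt depends on it.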

The evaluation story, honestly. My production eval signal is the feedback loop — apply rate, skip rate by source, skip reason patterns. When the skip rate climbs above baseline, something drifted: either the scorer, my preferences, or the market. That's weaker than a proper regression test on a golden set, but it's the honest version of continuous eval for a single-user system. For a team-scale system, I'd invest in the golden set and per-prompt regression testing — this is the place I'd expect to learn more than I teach.
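The drift check reduces to comparing a rolling skip rate against a baseline. A sketch; the baseline and tolerance values are placeholders, not measured numbers:

```python
def skip_rate_drift(decisions, baseline=0.6, tolerance=0.15):
    """When the skip rate climbs well above baseline, something changed:
    the scorer, my preferences, or the market."""
    rate = sum(d == "skip" for d in decisions) / len(decisions)
    return rate, rate > baseline + tolerance

rate, drifted = skip_rate_drift(["skip"] * 9 + ["apply"])
print(rate, drifted)  # 0.9 True — worth investigating
```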

§ 04 — Outcome

A system that compounds with use.

95% · of candidates filtered at zero API cost

Runs daily, unattended. Morning Discord briefing delivers top five with scores and reasoning. Application materials are ready the moment I decide to apply. Surfaced roles in the first two weeks that I wouldn't have found via manual review. Total operational spend since launch: under $30. Open-sourced on GitHub under the premise that the architecture pattern — cheap filter, expensive reasoning, adversarial evaluation, feedback loop — generalises to any retrieval-scored-application workflow, not just jobs.

~$1/day · Operating cost
13 modules · Python, open source
3-agent · Honest scoring panel

§ 05 — What I learned

Cost shape is a product decision.

Two product-level lessons worth carrying forward.

Cost shape is a first-class design constraint, not an operational concern. The difference between a system that costs $1/day and $15/day isn't money — it's whether you actually use it. The $1/day system runs every day. The $15/day one gets turned off after a week. When I'm thinking about AI product features, "what does this cost per use at scale" is now one of the first questions, not one of the last.

Honest eval is worth engineering for. The Devil's Advocate pattern is structurally simple — a third agent, a specific prompt, a consensus score — but it prevents the category of failure where the model produces CVs for jobs that will never call back. Anyone can throw an LLM at a problem. Knowing when the LLM is lying to you is the harder craft, and it's where data science training pays off in AI product work.

Generation is commoditised. Knowing when the generator is wrong is not.