PHASE 2

Solution Architectures

Two of the five opportunities, taken to an architecture overview: the moving parts, the explicit model / orchestration / human-in-the-loop calls, and the assumptions and gaps behind each choice. Each runs as a single n8n workflow.

Support Ticket Triage Agent

Support

Enrich every inbound ticket with customer context, let cheap rules clear the obvious (and fast-track VIPs), then classify the rest with a kNN baseline over past labelled tickets, escalating only the uncertain middle band to an LLM adjudicator. One n8n workflow; Zendesk, the model API, the customer + vector stores, and Langfuse tracing are the only external pieces.

01 · The VisualClick a box for details · drag a box to move it · Ctrl/Shift-click a line to add a point, drag to bend (double-click removes) · hover a line to highlight · Save persists
Walkthrough
n8n workflow
webhook
account
ambiguous
confident · VIP
1 · embed
2 · search
all fields high → auto
a mid field
any field low
resolved
unsure
API call
traces
API
log decision
sampled audit
audit findings
on API error
fallback
alert
agent ratings
new labels
Inside the box runs in n8n
n8n trigger
n8n step (logic)
LLM / embeddings call
Human checkpoint
Datastore / feedback
External system / API

Out of scope (noted on purpose): customer-data privacy, tokenizing / redacting PII before it leaves for the external model APIs, is intentionally omitted here to keep the assignment focused. In production this is a real risk we'd very likely implement (a tokenize → call model → de-tokenize layer, plus key management and re-identification controls). It adds significant complexity, so it's flagged as known future work rather than drawn into the diagram. The same applies to the kNN neighbours and customer context the adjudicator reasons over: those are other customers' records, so in production we'd feed it labels / metadata or redacted text rather than raw tickets.

02 · Explicit Calls

LLM tiers

Tiered by job, and most jobs need no LLM at all. Classification is a kNN vote over embeddings; the LLM is reserved for the uncertain middle band.

  • Embeddings, used to place each ticket in the vector space for the kNN classifier. High-volume and cheap.
  • Mid tier (Sonnet), the adjudicator that resolves the uncertain middle band by reasoning over the kNN neighbours. Low volume, so cost stays contained.
  • No first-pass generative model and no frontier dependency. We start with kNN + selective Sonnet and let measured accuracy decide where, if anywhere, more model is worth it.

Orchestration

One n8n workflow, event-driven off the Zendesk webhook. A deterministic context lookup and rules run first; a kNN node classifies the rest; an n8n switch routes on the vote confidence; only the middle band hits the Sonnet adjudicator. Every external call is wrapped with timeouts and retries, and on failure the workflow falls back to a human and posts a Slack alert. Everything outside is an API the workflow calls (Zendesk, the model provider, the customer + vector stores, Slack), and every model call is traced to Langfuse.

Human-in-the-loop

Auto-route fires only above the confidence threshold; the uncertain middle band is adjudicated, and the rest, plus anything an API failure interrupts, drop to a human triage queue. Human corrections, plus a sampled audit of auto-routed tickets, flow back as new labelled examples.

Rollout

Earn autonomy in stages rather than switching on auto-routing day one. Each stage uses real data to set the thresholds before the next, which also solves the cold start (you can't pick a confidence band with zero labels).

  • Shadow: predict on live tickets but take no action; compare kNN + adjudicator decisions against the agents' actual routing to build the eval set and calibrate the bands.
  • Suggest-only: show the predicted route / label to agents as a suggestion they accept or override; measure acceptance and accuracy.
  • Auto-route: enable unattended routing above the calibrated band, with the sampled audit and the human + Slack fallback always on.

What we'd measure

Each metric with a proposed OKR target.

≥ 90%kNN auto-route accuracyThe primary classifier's quality, and whether its vote-share confidence can be trusted by the router.
≥ 70%Tickets auto-routed (vs. escalated / human)How much the cheap path absorbs, and how often the LLM is actually needed.
≥ 90%kNN mis-routes caught by adjudicator + auditWhether the LLM is correcting real kNN mistakes (earning its cost) or just agreeing.
−50%First-response timeThe named primary outcome triage is meant to move.
100%Model calls traced (Langfuse)Makes mis-routes debuggable and the spend auditable.
< 1%Fallback to manual (API errors)How often external calls fail into the human + Slack-alert path; a core reliability signal.
03 · Assumptions & Gaps

Data we'd need

  • Zendesk API access (read tickets, write tags / assignments) and webhook events.
  • Customer / CRM account data (product mix, SLA tier) for context enrichment and VIP fast-tracking.
  • A labelled set of historical tickets indexed as embeddings for the kNN classifier.
  • A held-out eval set to measure kNN accuracy, calibrate the router bands, and check whether the adjudicator beats kNN alone.
  • A Langfuse project (or similar) wired to the model calls, and a Slack channel / webhook for failure alerts.
  • An agreed category / product-module taxonomy to classify against.

Failure modes to instrument

  • Silent mis-routing, a confident wrong answer is worse than an abstention.
  • kNN confident-but-wrong on surface-similar tickets, the failure the adjudicator and sampled audit exist to catch.
  • Cold start / sparse labels, new issue types have no near neighbours, so they land low-confidence and lean on humans until the index fills in.
  • Router band miscalibration, the thresholds drift and either over- or under-automate.
  • External API outage or rate-limit (model, vector DB, Zendesk), caught by the fallback to a human plus a Slack alert, but it caps automation while it lasts.
  • Prompt injection in ticket text attempting to steer the adjudicator, contained by the label allow-list and by passing ticket text as untrusted data, never instructions.
  • Duplicate webhook delivery would double-route; deduped with an idempotency key (ticket + event id) so each event is processed exactly once.
  • Embedding-model upgrades invalidate the index (vectors aren't comparable across models), so a version bump triggers a full re-embed; the index also needs periodic dedup and pruning of retired-taxonomy labels.

Tradeoffs & why

  • Constrain outputs to a fixed label allow-list and treat ticket text as untrusted data.

    kNN can only ever emit labels that exist in the index, so the cheap path is injection-resistant by construction; the adjudicator's output is validated against the same allow-list and the ticket is passed as data, not instructions, so a malicious ticket can't steer the route or the reply.

  • A kNN classifier as the primary, with the LLM only on the uncertain middle band.

    Classification over a fixed label set is cheaper, faster, deterministic, and self-calibrating with kNN; the LLM is reserved for where a cheap classifier is genuinely weak, so cost tracks difficulty.

  • kNN plus an LLM as the cross-check, over two LLMs judging each other.

    kNN (geometric) and an LLM (reasoning) are genuinely different methods, so agreement is meaningful and disagreement is a real signal, unlike two same-family models that share blind spots.

  • Selective adjudication plus a sampled audit of auto-routes, over verifying every ticket.

    Verifying everything roughly doubles cost for little gain; sampling the confident path still catches confident-but-wrong cases without paying on every ticket.

  • Fail to a human plus a Slack alert, over failing the ticket or guessing.

    When an external API is down, the safe default is a person and a loud alert: no ticket is dropped and on-call knows immediately.

  • Langfuse tracing and an offline eval set from day one.

    You can't calibrate the router, justify the adjudicator, or debug mis-routes you can't see or measure.