AI support for drivers over email and Telegram. Every filter in the system appeared after a concrete failure in production traffic.
In response to the request “I can’t log in to my account,” the model once generated: “there is a major technical incident in the system, the developer team is investigating.” No such incident existed. If that answer had reached the driver, they might have concluded the platform was down and skipped their shift.
That is the key constraint of the project: in a transport system with hundreds of thousands of registered drivers, one hallucination can turn into real downtime.
The first prototype was assembled in a week: search the knowledge base, generate the answer, send the email. Test data looked convincing. Production email did not.
The first failure was not the model, but the dataset. The knowledge base was created from QA review tables that mixed real driver questions with service labels like “spam,” “moderation,” or “operator work.” Without cleaning, the search index would happily return those labels as if they were correct answers. We had to add several filtering layers before the data was even allowed into the system.
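One of those filtering layers can be sketched as follows. This is a minimal illustration, not the production schema: the label set and the row shape are assumptions, and the length cutoff is an arbitrary example value.

```python
# Hypothetical service labels that leaked from the QA review tables.
SERVICE_LABELS = {"spam", "moderation", "operator work"}

def clean_rows(rows):
    """Drop rows whose 'answer' is a review label rather than real content."""
    cleaned = []
    for question, answer in rows:
        norm = answer.strip().lower()
        if norm in SERVICE_LABELS:
            continue  # a service label, not an answer -- never index it
        if len(norm) < 10:
            continue  # too short to be a usable answer
        cleaned.append((question, answer))
    return cleaned

rows = [
    ("I can't log in", "Reset your password via the app settings screen."),
    ("Random message", "spam"),
    ("Where is my payout?", "Moderation"),
]
print(clean_rows(rows))  # only the first row survives
```

In production this kind of check is only the first layer; later layers validate that each surviving answer is actually on-topic for its question.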
The second failure was intent drift. A question about updating a phone number could retrieve a document about account management and lead the model to suggest deleting the account and creating a new one. Formally related. Operationally disastrous.
The third failure was hallucination. The answer looked relevant, polite, and structurally correct, but it invented a platform incident that did not exist. That is why the system now includes an explicit hard filter: if an “incident” marker appears in the generated answer and does not appear in any retrieved source, the answer is blocked.
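The hard filter reduces to a simple containment check. The sketch below assumes an English marker list; the production set is presumably broader and language-specific.

```python
import re

# Illustrative incident markers; the real list is an assumption here.
INCIDENT_MARKERS = [r"\bincident\b", r"\boutage\b", r"\btechnical failure\b"]

def blocks_incident_claim(answer: str, sources: list[str]) -> bool:
    """True if the answer claims an incident that no retrieved source mentions."""
    source_text = " ".join(sources).lower()
    for marker in INCIDENT_MARKERS:
        if re.search(marker, answer.lower()) and not re.search(marker, source_text):
            return True  # hallucinated incident -> block the answer
    return False

answer = "There is a major technical incident, the team is investigating."
sources = ["To log in, reset your password in the app settings."]
print(blocks_incident_claim(answer, sources))  # True: the answer is blocked
```

The asymmetry is deliberate: a source may mention incidents without the answer doing so, but the answer may never introduce one on its own.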
The system uses a multi-part confidence formula to decide the route. Retrieval quality, model self-assessment, and explicit penalties all feed into the final score. A strong answer is sent automatically. Anything below the threshold is escalated to an operator instead of being guessed at.
This formula grew from concrete failures: links the model invented, category mismatches, and answers that asked the driver for screenshots even though the bot could not process the follow-up.
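A rough shape of that routing rule is sketched below. The weights, the floor value, the penalty sizes, and the threshold are all placeholder assumptions; the article only states that such components exist and were calibrated on production requests.

```python
from dataclasses import dataclass

@dataclass
class Signals:
    retrieval_score: float   # 0..1, from the search layer
    self_assessment: float   # 0..1, the model's own confidence
    invented_link: bool      # answer cites a URL absent from sources
    category_mismatch: bool  # detected intent != retrieved category
    asks_for_screenshot: bool  # bot cannot process the follow-up

def confidence(s: Signals, floor: float = 0.3) -> float:
    self_score = max(s.self_assessment, floor)  # floor correction
    score = 0.6 * s.retrieval_score + 0.4 * self_score
    if s.invented_link:
        score -= 0.5
    if s.category_mismatch:
        score -= 0.3
    if s.asks_for_screenshot:
        score -= 0.2
    return max(score, 0.0)

def route(s: Signals, threshold: float = 0.7) -> str:
    return "auto-reply" if confidence(s) >= threshold else "operator"

good = Signals(0.9, 0.85, False, False, False)
bad = Signals(0.8, 0.9, True, False, False)
print(route(good), route(bad))  # auto-reply operator
```

The key property is that penalties are subtractive and unconditional: a single invented link can drag an otherwise strong answer below the threshold and into an operator's queue.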
The result is not “a bot that answers email.” It is a guarded support pipeline: retrieval, moderation, validation, confidence scoring, and explicit routing rules all work together so that the system stays useful without becoming operationally dangerous.
Where should a request go: automatic reply or operator?
Five real requests. Try to route them the way the system does: auto-reply or escalation? One of them:
“How do I get my permanent driver ID?”
How the system makes these decisions in milliseconds is covered in the case study below.
A dual retrieval layer: BM25 for keyword search and Qdrant for semantic search, plus intent detection across six driver-support categories.
Qwen 3 235B handles generation, moderation, and reranking. Qwen-3-Embedding-0.6B powers vector search. The stack adapts batch size when providers fail.
Morphology-aware toxicity filtering, business-tone safeguards, LLM moderation with BM25 bypass for domain questions, plus output checks for PII, URLs, intent mismatch, and incident hallucinations.
A three-pass confidence formula combining retrieval geometry, model self-report with floor correction, and penalties calibrated on production requests.
LiteLLM proxy routes between Qwen and DeepSeek, handles fallbacks, and exposes one API across support channels.
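One way the two retrieval layers can be merged is reciprocal rank fusion (RRF). The article does not name the fusion rule, so treat this as an assumption; the two ranked lists below stand in for real BM25 and Qdrant results.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked doc-id lists into one via RRF scoring."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Each list contributes 1/(k + rank + 1); k damps top-rank dominance.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["doc_login", "doc_payments", "doc_account"]   # keyword hits
vector_top = ["doc_login", "doc_account", "doc_profile"]  # semantic hits
print(rrf_fuse([bm25_top, vector_top]))
# -> ['doc_login', 'doc_account', 'doc_payments', 'doc_profile']
```

Documents found by both layers rise to the top, which is exactly the behavior a dual keyword-plus-semantic setup is after.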
We will tell you whether the task fits AI agents and, if it does, outline a concrete plan.
Or write directly to ilya@manaraga.ai.