AI support for drivers over email and Telegram. Every filter in the system appeared after a concrete failure in production traffic.
In response to the request “I can’t log in to my account,” the model once generated: “there is a major technical incident in the system, the developer team is investigating.” No such incident existed. If that answer had reached the driver, they might have concluded the platform was down and skipped their shift.
That is the key constraint of the project: in a transport system with hundreds of thousands of registered drivers, one hallucination can turn into real downtime.
The first prototype was assembled in a week: search the knowledge base, generate the answer, send the email. Test data looked convincing. Production email did not.
The first failure was not the model, but the dataset. The knowledge base was created from QA review tables that mixed real driver questions with service labels like “spam,” “moderation,” or “operator work.” Without cleaning, the search index would happily return those labels as if they were correct answers. We had to add several filtering layers before the data was even allowed into the system.
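One of those filtering layers can be sketched as follows. This is a minimal illustration, not the production schema: the label set and the row shape are assumptions, and the length cutoff is an arbitrary example value.

```python
# Hypothetical service labels that leaked from the QA review tables.
SERVICE_LABELS = {"spam", "moderation", "operator work"}

def clean_rows(rows):
    """Drop rows whose 'answer' is a review label rather than real content."""
    cleaned = []
    for question, answer in rows:
        norm = answer.strip().lower()
        if norm in SERVICE_LABELS:
            continue  # a service label, not an answer -- never index it
        if len(norm) < 10:
            continue  # too short to be a usable answer
        cleaned.append((question, answer))
    return cleaned

rows = [
    ("I can't log in", "Reset your password via the app settings screen."),
    ("Random message", "spam"),
    ("Where is my payout?", "Moderation"),
]
print(clean_rows(rows))  # only the first row survives
```

In production this kind of check is only the first layer; later layers validate that each surviving answer is actually on-topic for its question.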
The second failure was intent drift. A question about updating a phone number could retrieve a document about account management and lead the model to suggest deleting the account and creating a new one. Formally related. Operationally disastrous.
The third failure was hallucination. The answer looked relevant, polite, and structurally correct, but it invented a platform incident that did not exist. That is why the system now includes an explicit hard filter: if an “incident” marker appears in the generated answer and does not appear in any retrieved source, the answer is blocked.
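The hard filter reduces to a simple containment check. The sketch below assumes an English marker list; the production set is presumably broader and language-specific.

```python
import re

# Illustrative incident markers; the real list is an assumption here.
INCIDENT_MARKERS = [r"\bincident\b", r"\boutage\b", r"\btechnical failure\b"]

def blocks_incident_claim(answer: str, sources: list[str]) -> bool:
    """True if the answer claims an incident that no retrieved source mentions."""
    source_text = " ".join(sources).lower()
    for marker in INCIDENT_MARKERS:
        if re.search(marker, answer.lower()) and not re.search(marker, source_text):
            return True  # hallucinated incident -> block the answer
    return False

answer = "There is a major technical incident, the team is investigating."
sources = ["To log in, reset your password in the app settings."]
print(blocks_incident_claim(answer, sources))  # True: the answer is blocked
```

The asymmetry is deliberate: a source may mention incidents without the answer doing so, but the answer may never introduce one on its own.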
The system uses a multi-part confidence formula to decide the route. Retrieval quality, model self-assessment, and explicit penalties all feed into the final score. A strong answer is sent automatically. Anything below the threshold is escalated to an operator instead of being guessed at.
This formula grew from concrete failures: links the model invented, category mismatches, and answers that asked the driver for screenshots even though the bot could not process the follow-up.
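A rough shape of that routing rule is sketched below. The weights, the floor value, the penalty sizes, and the threshold are all placeholder assumptions; the article only states that such components exist and were calibrated on production requests.

```python
from dataclasses import dataclass

@dataclass
class Signals:
    retrieval_score: float   # 0..1, from the search layer
    self_assessment: float   # 0..1, the model's own confidence
    invented_link: bool      # answer cites a URL absent from sources
    category_mismatch: bool  # detected intent != retrieved category
    asks_for_screenshot: bool  # bot cannot process the follow-up

def confidence(s: Signals, floor: float = 0.3) -> float:
    self_score = max(s.self_assessment, floor)  # floor correction
    score = 0.6 * s.retrieval_score + 0.4 * self_score
    if s.invented_link:
        score -= 0.5
    if s.category_mismatch:
        score -= 0.3
    if s.asks_for_screenshot:
        score -= 0.2
    return max(score, 0.0)

def route(s: Signals, threshold: float = 0.7) -> str:
    return "auto-reply" if confidence(s) >= threshold else "operator"

good = Signals(0.9, 0.85, False, False, False)
bad = Signals(0.8, 0.9, True, False, False)
print(route(good), route(bad))  # auto-reply operator
```

The key property is that penalties are subtractive and unconditional: a single invented link can drag an otherwise strong answer below the threshold and into an operator's queue.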
The result is not “a bot that answers email.” It is a guarded support pipeline: retrieval, moderation, validation, confidence scoring, and explicit routing rules all work together so that the system stays useful without becoming operationally dangerous.
Where should a request go: automatic reply or operator?
Five real requests. Try to route them the way the system does: auto-reply or escalation? One of them:
“How do I get my permanent driver ID?”
How the system makes these decisions in milliseconds is covered in the case study below.
A dual retrieval layer: BM25 for keyword search and Qdrant for semantic search, plus intent detection across six driver-support categories.
Qwen 3 235B handles generation, moderation, and reranking. Qwen-3-Embedding-0.6B powers vector search. The stack adapts batch size when providers fail.
Morphology-aware toxicity filtering, business-tone safeguards, LLM moderation with BM25 bypass for domain questions, plus output checks for PII, URLs, intent mismatch, and incident hallucinations.
A three-pass confidence formula combining retrieval geometry, model self-report with floor correction, and penalties calibrated on production requests.
LiteLLM proxy routes between Qwen and DeepSeek, handles fallbacks, and exposes one API across support channels.
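One way the two retrieval layers can be merged is reciprocal rank fusion (RRF). The article does not name the fusion rule, so treat this as an assumption; the two ranked lists below stand in for real BM25 and Qdrant results.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked doc-id lists into one via RRF scoring."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Each list contributes 1/(k + rank + 1); k damps top-rank dominance.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["doc_login", "doc_payments", "doc_account"]   # keyword hits
vector_top = ["doc_login", "doc_account", "doc_profile"]  # semantic hits
print(rrf_fuse([bm25_top, vector_top]))
# -> ['doc_login', 'doc_account', 'doc_payments', 'doc_profile']
```

Documents found by both layers rise to the top, which is exactly the behavior a dual keyword-plus-semantic setup is after.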
We will tell you whether the task fits AI agents and, if it does, outline a concrete plan.
Or write directly to ilya@manaraga.ai.