Platform

Three infrastructure layers and six engineering tracks. Every project takes the exact subset it needs and nothing more.

A telecom operator, an investment business, and a transport system all have different workflows and constraints. The engineering problems still overlap: all of them need inference, guardrails, and document work. The tuning is what changes. In finance, filters block investment recommendations. In transport, they block hallucinated incidents. In telecom, they block tone violations and answers the agent should never give.

Below is the full map: six engineering tracks crossed with three layers of increasing customization. Each cell falls into one of three types. Open source means mature components that do not need rewriting: vLLM for inference, Langfuse for observability, Qdrant for vector search. The Manaraga platform layer contains modules we carry between projects and harden on every deployment: agent orchestration, chat, inference scaling, corporate tone. Custom development is the code that remains process-specific: RAG pipelines, domain agents, CRM and ERP integrations.

Layers: 01 Model infrastructure (open source) · 02 Agent platform (our reusable layer) · 03 Project-specific (custom code)

01 Inference & routing
Model Serving: vLLM × 4 types
Inference Optimization: batching · KV-cache
Prompt Caching: prefix caching
LLM Gateway: LiteLLM
Access Control: project policies
Rate Limiting: quotas · priorities
Auto-scaling: GPU distribution

02 Observability
LLM Observability: OpenTelemetry · Langfuse · ClickHouse
Agent Analytics: funnels · deflection · cost
Cost Tracking: project budgets

03 Guardrails
Guardrails Engine: LiteLLM + rules
Sensitive Data: PII · PHI · PCI · privacy
Attack Protection: injection · jailbreak

04 Evaluation
Real-time Eval: Langfuse · LLM judges
Agent Training: production cases
Synthetic Datasets: domain-specific generation
Evals: project-specific
Fine-tuning: domain adaptation

05 Documents
Vector Database: Qdrant
RAG Pipelines: Temporal · client data
Knowledge Maps: document graphs

06 Agents
Orchestration: pydantic-ai · MCP · A2A
AI-native Chat: streaming · threads
Service Agent: routing · escalation
Memory: session + long-term
Custom Agents: per business process
Integration Adapters: CRM · ERP · API
Content Digital Twin: brand voice & tone

Security (cross-cutting): data protection · AI safety · access control · audit

Cell types: open source · Manaraga platform · custom development

Three layers

Every project is assembled from three layers. The bottom layer does not depend on the industry. The middle layer is reused across projects. The top layer is written for the exact business workflow. Security is not a separate layer but a cross-cutting requirement: data masking, attack filtering, decision audit, and access control live inside every component.

Model infrastructure

Hosting, request routing, vector databases. Mature open source already works here; our job is to tune it for enterprise load.

Agent platform

Observability, guardrails, evaluation, orchestration, and memory. This is where most of our own reusable engineering work lives.

Project-specific development

Client document search, domain agents, CRM and ERP connectors, synthetic datasets, fine-tuning. Code written for the business process and handed over to the client.

Modules

01

Inference and routing

Every project needs several compute profiles at once: classification, generation, vectorization, each with different latency and cost constraints. One model and one shared pool of capacity do not work in an enterprise environment. Tasks compete for resources, and a single provider outage can freeze the whole system — exactly what happened on the transport project before we split workloads across separate instances with automatic fallback.

We split inference into four GPU instance types: reasoning, fast generation, vectorization, and vision. The router distributes requests, flips to a backup model on failure, and enforces quotas and priorities by project.

Model Serving: vLLM × 4 instance types
Auto-scaling: model placement across GPUs
Inference Optimization: batching · KV-cache · prefix caching
LLM Gateway: LiteLLM · fallback · business-priority quotas
Access Control: project and role policies
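The routing idea above can be sketched in a few lines of Python: pick a pool by workload profile, enforce a per-project quota, and fall back down an ordered endpoint list on failure. The pool names, endpoint labels, and quota numbers are illustrative assumptions, not the production configuration, which lives behind the LiteLLM gateway.

```python
# Minimal sketch: route a request to a workload-specific GPU pool
# with quota enforcement and automatic fallback to backup endpoints.
from dataclasses import dataclass

@dataclass
class Pool:
    name: str
    endpoints: list   # ordered: primary first, backups after
    quota: int        # remaining request budget for this project

# Hypothetical pools; real names and capacities differ per deployment.
PROFILES = {
    "reasoning":  Pool("reasoning",  ["vllm-reason-0", "vllm-reason-1"], 100),
    "generation": Pool("generation", ["vllm-fast-0"],                    500),
    "embedding":  Pool("embedding",  ["vllm-embed-0"],                   1000),
    "vision":     Pool("vision",     ["vllm-vision-0"],                  50),
}

def route(profile: str, call, *args):
    """Send the request to the first healthy endpoint of the pool;
    a ConnectionError triggers fallback to the next endpoint."""
    pool = PROFILES[profile]
    if pool.quota <= 0:
        raise RuntimeError(f"quota exhausted for {pool.name}")
    pool.quota -= 1
    last_err = None
    for endpoint in pool.endpoints:
        try:
            return call(endpoint, *args)
        except ConnectionError as err:
            last_err = err            # primary down: try the backup
    raise RuntimeError(f"all endpoints down for {pool.name}") from last_err
```

In production the same decision is expressed as gateway configuration rather than application code, but the failure semantics are the same: a dead primary never freezes the system.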
02

Observability

Latency and error rate do not explain why the agent responded the way it did or how much one outcome actually cost. On the telecom project it was business metrics, not infrastructure metrics, that showed where the agent already beat the operator and where it had to stay out of the loop.

We collect two layers of metrics. The engineering layer traces every call, every tool chain, and token-level cost. The business layer tracks support funnels, automation share, and cost per outcome.

LLM Observability: OpenTelemetry · Langfuse · ClickHouse
Agent Analytics: funnels · deflection rate · cost per outcome
Cost Tracking: token accounting · project budgets
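The business layer reduces to a small aggregation over per-conversation traces. The record shape below (outcome, escalation flag, token cost) is an assumed schema for illustration, not the real trace format stored in ClickHouse:

```python
# Sketch of the business-metric layer: deflection rate (share of all
# conversations resolved without a human) and cost per resolved outcome,
# computed from per-conversation records.
def agent_metrics(conversations):
    resolved  = [c for c in conversations if c["outcome"] == "resolved"]
    deflected = [c for c in resolved if not c["escalated"]]
    total_cost = sum(c["token_cost_usd"] for c in conversations)
    return {
        "deflection_rate": len(deflected) / len(conversations),
        "cost_per_outcome": total_cost / len(resolved) if resolved else None,
    }
```

The engineering layer (per-call traces, tool chains, token counts) feeds these records; the business layer is what tells you where the agent beats the operator.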
03

Guardrails

Prompt injection and data leakage are baseline threats, and standard libraries can catch them. But every industry also has its own prohibitions that no generic library understands. In the investment project, the agent began hinting at answers to qualification tests — something the regulator bans outright.

We place filtering on both sides of every model call: data masking, attack detection, and business-specific rules. On the investment project this grew into a multi-layer compliance system with prompt constraints, refusal scenarios, checker loops, and an audit log for every answer.

Input / Output Filtering: custom rules on top of LiteLLM
Sensitive Data: PII · PHI · PCI · local privacy law
Attack Protection: prompt injection · jailbreak
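The two-sided structure looks roughly like this. The regex patterns, injection markers, and the banned phrase are illustrative stand-ins; the production rule set layered on the gateway is far larger and industry-specific:

```python
# Sketch of two-sided filtering: mask PII and reject likely injections
# on the way in, block business-prohibited content on the way out.
import re

PII_PATTERNS = [
    (re.compile(r"\b\d{16}\b"), "[CARD]"),            # bare card number
    (re.compile(r"\b[\w.]+@[\w.]+\.\w+\b"), "[EMAIL]"),
]
INJECTION_MARKERS = ("ignore previous instructions", "system prompt")
BANNED_OUTPUT = ("the correct test answer is",)        # e.g. qualification-test hints

def filter_input(text: str) -> str:
    lowered = text.lower()
    if any(marker in lowered for marker in INJECTION_MARKERS):
        raise ValueError("possible prompt injection")
    for pattern, mask in PII_PATTERNS:
        text = pattern.sub(mask, text)                 # mask before the model sees it
    return text

def filter_output(text: str) -> str:
    if any(phrase in text.lower() for phrase in BANNED_OUTPUT):
        return "I can't help with that. Let me connect you with a specialist."
    return text
```

The key design point is that output filtering is not optional even with a well-constrained prompt: the qualification-test incident was caught on the output side, not the input side.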


04

Evaluation

Quality cannot be checked once and forgotten. Models change, data changes, prompts drift, and answers get worse. On the transport project, a binary “confident / not confident” threshold created too many false escalations: the system sent tickets to operators that it was actually capable of answering safely.

We built a three-pass confidence formula with 30+ parameters calibrated on real requests. It decides when the agent can answer and when a human is still required. Alongside it run LLM judges, production benchmark cases, and synthetic datasets so quality regressions are caught before production, not after a complaint.

Real-time Evaluation: Langfuse · LLM judges
Agent Training: production benchmark cases
Synthetic Datasets: domain-specific generation
Regression Testing: project-specific eval suites
Fine-tuning: adaptation to the domain and terminology
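The gating idea, stripped to its core: combine several calibrated signals into one score and escalate below a threshold instead of flipping on a single binary flag. The three signals and the weights below are illustrative stand-ins for the 30+ production-calibrated parameters:

```python
# Sketch of confidence gating: a weighted blend of signals replaces
# a single binary confident/not-confident check, cutting false escalations.
def confidence(retrieval_score, judge_score, coverage, weights=(0.4, 0.4, 0.2)):
    """All signals in [0, 1]; returns a weighted score in [0, 1]."""
    return sum(w * s for w, s in zip(weights, (retrieval_score, judge_score, coverage)))

def decide(score, threshold=0.7):
    """Below the calibrated threshold, a human is still required."""
    return "answer" if score >= threshold else "escalate"
```

The weights and threshold are where the real work lives: they were calibrated on real production requests, not chosen by hand.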


05

Documents

Every company has its own regulations, knowledge base, and normative documentation. Standard RAG can retrieve a vaguely similar paragraph, but enterprise tasks are stricter. On the telecom project, tariff questions required exact numbers from tables while vector search kept returning approximate narrative matches.

We built a dual index: one branch for semantic retrieval and another for exact data such as tariff tables, prices, and technical parameters. Pure vector search tends to lose numbers and tables because they embed poorly.

Vector Database: Qdrant
RAG Pipelines: Temporal · tuned to client data
Knowledge Maps: document relationship graphs
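The routing between the two branches can be sketched as follows. The keyword trigger stands in for the real query classifier, and the tiny in-memory "table" stands in for the structured store; the vector branch is a placeholder for Qdrant retrieval:

```python
# Sketch of dual-index routing: queries about exact figures go to a
# structured table lookup, everything else to semantic vector search.
EXACT_TRIGGERS = ("tariff", "price", "cost", "limit", "fee")

TARIFF_TABLE = {  # illustrative rows; real data lives in a structured store
    "start": {"monthly_fee": 10, "data_gb": 5},
    "pro":   {"monthly_fee": 25, "data_gb": 50},
}

def semantic_search(query):
    return f"[vector hit for: {query}]"    # placeholder for Qdrant retrieval

def answer(query: str):
    lowered = query.lower()
    if any(t in lowered for t in EXACT_TRIGGERS):
        for plan, row in TARIFF_TABLE.items():
            if plan in lowered:
                return row                 # exact numbers, never paraphrased
        return "unknown plan"
    return semantic_search(query)
```

The design choice is that numeric answers are returned verbatim from the structured branch and never pass through generation, so the model cannot approximate a tariff.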


06

Agents

An agent in a demo answers questions. An agent in production must remember context across sessions, call tools, follow a scenario, and escalate to a human at the right boundary. In the finance project, the agent ran a strict sales funnel, remembered previous client conversations without mixing products, and stayed inside compliance limits — effectively a finite-state machine with multiple control loops.

We built orchestration, chat, and memory infrastructure so it does not need to be rebuilt on every project. A dedicated component, Content Digital Twin, is responsible for corporate tone: it took more than sixty iterations before the agent sounded like an actual employee rather than a chatbot.

Orchestration: pydantic-ai · MCP · A2A
AI-native Chat: streaming · threads · auth
Service Agent: request handling · routing · escalation
Memory: session context + cross-session knowledge
Custom Agents: per business process
Integration Adapters: CRM · ERP · ticketing · internal APIs
Content Digital Twin: brand voice, tone, and terminology
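The "finite-state machine with control loops" framing can be made concrete with a toy transition table. The states, intents, and the escalate-on-unknown default below are illustrative, not the actual funnel from the finance project:

```python
# Sketch of a production agent as a finite-state machine: each turn
# either advances the scripted funnel or escalates to a human.
TRANSITIONS = {
    ("greeting",  "question"):  "answering",
    ("greeting",  "complaint"): "escalated",
    ("answering", "question"):  "answering",
    ("answering", "complaint"): "escalated",
    ("answering", "thanks"):    "closed",
}

class ServiceAgent:
    def __init__(self):
        self.state = "greeting"
        self.memory = []   # session context; the long-term store is omitted here

    def handle(self, intent: str) -> str:
        self.memory.append(intent)
        # Unknown (state, intent) pairs escalate by default: the agent
        # never improvises outside the scripted funnel.
        self.state = TRANSITIONS.get((self.state, intent), "escalated")
        return self.state
```

The escalate-by-default transition is the boundary the text describes: whenever the scenario has no defined move, a human takes over.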

How a project is assembled

Every project pulls its own subset from the map. Observability and evaluation are required everywhere. Guardrails are tuned to the industry: multi-layer compliance in finance, hallucination filters in transport, corporate tone and escalation boundaries in telecom. Document and agent modules are always assembled around the exact process.

Everything is deployed inside the client perimeter. Every component ships as a standard container.

18 services we tuned across projects in four industries

INFERENCE AND ROUTING
vLLM slow thinking: reasoning model
vLLM fast generation: fast text generation
vLLM embedding: vectorization
vLLM vision: image and document processing
LiteLLM gateway: single API, fallback, compliance guardrails
PostgreSQL config: settings, virtual keys, access policies

OBSERVABILITY AND EVALUATION
OpenTelemetry Collector: trace collection and routing
Langfuse web + worker: UI, dashboards, eval flows, datasets
ClickHouse storage: trace and eval result storage
Redis queues: background processing queues

DOCUMENTS AND PIPELINES
Temporal server + web + admin: pipeline orchestration and monitoring
Qdrant vector index: document, contract, and knowledge-base chunks
S3 storage: documents, guidelines, media

AGENTS
Agents Service runtime: agent business logic and session management
PostgreSQL history: dialog history and session state

Tell us which process you want to break down.

We will tell you whether the task fits AI agents and, if it does, outline a concrete plan.

or write directly to ilya@manaraga.ai