Platform

Three infrastructure layers and six engineering tracks. Every project takes the exact subset it needs and nothing more.

A telecom operator, an investment business, and a transport system all have different workflows and constraints. The engineering problems still overlap: all of them need inference, guardrails, and document work. The tuning is what changes. In finance, filters block investment recommendations. In transport, they block hallucinated incidents. In telecom, they block tone violations and answers the agent should never give.

Below is the full map: six engineering tracks crossed with three layers of increasing customization. Each cell falls into one of three types. Open source means mature components that do not need rewriting: vLLM for inference, Langfuse for observability, Qdrant for vector search. The Manaraga platform layer contains modules we carry between projects and harden on every deployment: agent orchestration, chat, inference scaling, corporate tone. Custom development is the code that remains process-specific: RAG pipelines, domain agents, CRM and ERP integrations.

Layers: 01 Model infrastructure (open source) · 02 Agent platform (our reusable layer) · 03 Project-specific (custom code)

01 Inference & routing
Model Serving: vLLM × 4 types
Inference Optimization: batching · KV-cache
Prompt Caching: prefix caching
LLM Gateway: LiteLLM
Access Control: project policies
Rate Limiting: quotas · priorities
Auto-scaling: GPU distribution

02 Observability
LLM Observability: OpenTelemetry · Langfuse · ClickHouse
Agent Analytics: funnels · deflection · cost
Cost Tracking: project budgets

03 Guardrails
Guardrails Engine: LiteLLM + rules
Sensitive Data: PII · PHI · PCI · privacy
Attack Protection: injection · jailbreak

04 Evaluation
Real-time Eval: Langfuse · LLM judges
Agent Training: production cases
Synthetic Datasets: domain-specific generation
Evals: project-specific
Fine-tuning: domain adaptation

05 Documents
Vector Database: Qdrant
RAG Pipelines: Temporal · client data
Knowledge Maps: document graphs

06 Agents
Orchestration: pydantic-ai · MCP · A2A
AI-native Chat: streaming · threads
Service Agent: routing · escalation
Memory: session + long-term
Custom Agents: per business process
Integration Adapters: CRM · ERP · API
Content Digital Twin: brand voice & tone

Security (cross-cutting): data protection · AI safety · access control · audit

Cell types: open source · Manaraga platform · custom development

Three layers

Every project is assembled from three layers. The bottom layer does not depend on the industry. The middle layer is reused across projects. The top layer is written for the exact business workflow. Security is not a separate layer but a cross-cutting requirement: data masking, attack filtering, decision audit, and access control live inside every component.

Model infrastructure

Hosting, request routing, vector databases. Mature open source already works here; our job is to tune it for enterprise load.

Agent platform

Observability, guardrails, evaluation, orchestration, and memory. This is where most of our own reusable engineering work lives.

Project-specific development

Client document search, domain agents, CRM and ERP connectors, synthetic datasets, fine-tuning. Code written for the business process and handed over to the client.

Modules

01

Inference and routing

Every project needs several compute profiles at once: classification, generation, vectorization, each with different latency and cost constraints. One model and one shared pool of capacity do not work in an enterprise environment. Tasks compete for resources, and a single provider outage can freeze the whole system — exactly what happened on the transport project before we split workloads across separate instances with automatic fallback.

We split inference into four GPU instance types: reasoning, fast generation, vectorization, and vision. The router distributes requests, flips to a backup model on failure, and enforces quotas and priorities by project.

Model Serving: vLLM × 4 instance types
Auto-scaling: model placement across GPUs
Inference Optimization: batching · KV-cache · prefix caching
LLM Gateway: LiteLLM · fallback · business-priority quotas
Access Control: project and role policies
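The routing idea above can be sketched in a few lines of Python: pick a pool by workload profile, enforce a per-project quota, and fall back down an ordered endpoint list on failure. The pool names, endpoint labels, and quota numbers are illustrative assumptions, not the production configuration, which lives behind the LiteLLM gateway.

```python
# Minimal sketch: route a request to a workload-specific GPU pool
# with quota enforcement and automatic fallback to backup endpoints.
from dataclasses import dataclass

@dataclass
class Pool:
    name: str
    endpoints: list   # ordered: primary first, backups after
    quota: int        # remaining request budget for this project

# Hypothetical pools; real names and capacities differ per deployment.
PROFILES = {
    "reasoning":  Pool("reasoning",  ["vllm-reason-0", "vllm-reason-1"], 100),
    "generation": Pool("generation", ["vllm-fast-0"],                    500),
    "embedding":  Pool("embedding",  ["vllm-embed-0"],                   1000),
    "vision":     Pool("vision",     ["vllm-vision-0"],                  50),
}

def route(profile: str, call, *args):
    """Send the request to the first healthy endpoint of the pool;
    a ConnectionError triggers fallback to the next endpoint."""
    pool = PROFILES[profile]
    if pool.quota <= 0:
        raise RuntimeError(f"quota exhausted for {pool.name}")
    pool.quota -= 1
    last_err = None
    for endpoint in pool.endpoints:
        try:
            return call(endpoint, *args)
        except ConnectionError as err:
            last_err = err            # primary down: try the backup
    raise RuntimeError(f"all endpoints down for {pool.name}") from last_err
```

In production the same decision is expressed as gateway configuration rather than application code, but the failure semantics are the same: a dead primary never freezes the system.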
02

Observability

Latency and error rate do not explain why the agent responded the way it did or how much one outcome actually cost. On the telecom project it was business metrics, not infrastructure metrics, that showed where the agent already beat the operator and where it had to stay out of the loop.

We collect two layers of metrics. The engineering layer traces every call, every tool chain, and token-level cost. The business layer tracks support funnels, automation share, and cost per outcome.

LLM Observability: OpenTelemetry · Langfuse · ClickHouse
Agent Analytics: funnels · deflection rate · cost per outcome
Cost Tracking: token accounting · project budgets
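The business layer reduces to a small aggregation over per-conversation traces. The record shape below (outcome, escalation flag, token cost) is an assumed schema for illustration, not the real trace format stored in ClickHouse:

```python
# Sketch of the business-metric layer: deflection rate (share of all
# conversations resolved without a human) and cost per resolved outcome,
# computed from per-conversation records.
def agent_metrics(conversations):
    resolved  = [c for c in conversations if c["outcome"] == "resolved"]
    deflected = [c for c in resolved if not c["escalated"]]
    total_cost = sum(c["token_cost_usd"] for c in conversations)
    return {
        "deflection_rate": len(deflected) / len(conversations),
        "cost_per_outcome": total_cost / len(resolved) if resolved else None,
    }
```

The engineering layer (per-call traces, tool chains, token counts) feeds these records; the business layer is what tells you where the agent beats the operator.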
03

Guardrails

Prompt injection and data leakage are baseline threats, and standard libraries can catch them. But every industry also has its own prohibitions that no generic library understands. In the investment project, the agent began hinting at answers to qualification tests — something the regulator bans outright.

We place filtering on both sides of every model call: data masking, attack detection, and business-specific rules. On the investment project this grew into a multi-layer compliance system with prompt constraints, refusal scenarios, checker loops, and an audit log for every answer.

Input / Output Filtering: custom rules on top of LiteLLM
Sensitive Data: PII · PHI · PCI · local privacy law
Attack Protection: prompt injection · jailbreak
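The two-sided structure looks roughly like this. The regex patterns, injection markers, and the banned phrase are illustrative stand-ins; the production rule set layered on the gateway is far larger and industry-specific:

```python
# Sketch of two-sided filtering: mask PII and reject likely injections
# on the way in, block business-prohibited content on the way out.
import re

PII_PATTERNS = [
    (re.compile(r"\b\d{16}\b"), "[CARD]"),            # bare card number
    (re.compile(r"\b[\w.]+@[\w.]+\.\w+\b"), "[EMAIL]"),
]
INJECTION_MARKERS = ("ignore previous instructions", "system prompt")
BANNED_OUTPUT = ("the correct test answer is",)        # e.g. qualification-test hints

def filter_input(text: str) -> str:
    lowered = text.lower()
    if any(marker in lowered for marker in INJECTION_MARKERS):
        raise ValueError("possible prompt injection")
    for pattern, mask in PII_PATTERNS:
        text = pattern.sub(mask, text)                 # mask before the model sees it
    return text

def filter_output(text: str) -> str:
    if any(phrase in text.lower() for phrase in BANNED_OUTPUT):
        return "I can't help with that. Let me connect you with a specialist."
    return text
```

The key design point is that output filtering is not optional even with a well-constrained prompt: the qualification-test incident was caught on the output side, not the input side.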


04

Evaluation

Quality cannot be checked once and forgotten. Models change, data changes, prompts drift, and answers get worse. On the transport project, a binary “confident / not confident” threshold created too many false escalations: the system sent tickets to operators that it was actually capable of answering safely.

We built a three-pass confidence formula with 30+ parameters calibrated on real requests. It decides when the agent can answer and when a human is still required. Alongside it run LLM judges, production benchmark cases, and synthetic datasets so quality regressions are caught before production, not after a complaint.

Real-time Evaluation: Langfuse · LLM judges
Agent Training: production benchmark cases
Synthetic Datasets: domain-specific generation
Regression Testing: project-specific eval suites
Fine-tuning: adaptation to the domain and terminology
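The gating idea, stripped to its core: combine several calibrated signals into one score and escalate below a threshold instead of flipping on a single binary flag. The three signals and the weights below are illustrative stand-ins for the 30+ production-calibrated parameters:

```python
# Sketch of confidence gating: a weighted blend of signals replaces
# a single binary confident/not-confident check, cutting false escalations.
def confidence(retrieval_score, judge_score, coverage, weights=(0.4, 0.4, 0.2)):
    """All signals in [0, 1]; returns a weighted score in [0, 1]."""
    return sum(w * s for w, s in zip(weights, (retrieval_score, judge_score, coverage)))

def decide(score, threshold=0.7):
    """Below the calibrated threshold, a human is still required."""
    return "answer" if score >= threshold else "escalate"
```

The weights and threshold are where the real work lives: they were calibrated on real production requests, not chosen by hand.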


05

Documents

Every company has its own regulations, knowledge base, and normative documentation. Standard RAG can retrieve a vaguely similar paragraph, but enterprise tasks are stricter. On the telecom project, tariff questions required exact numbers from tables while vector search kept returning approximate narrative matches.

We built a dual index: one branch for semantic retrieval and another for exact data such as tariff tables, prices, and technical parameters. Pure vector search tends to lose numbers and tables because they embed poorly.

Vector Database: Qdrant
RAG Pipelines: Temporal · tuned to client data
Knowledge Maps: document relationship graphs
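The routing between the two branches can be sketched as follows. The keyword trigger stands in for the real query classifier, and the tiny in-memory "table" stands in for the structured store; the vector branch is a placeholder for Qdrant retrieval:

```python
# Sketch of dual-index routing: queries about exact figures go to a
# structured table lookup, everything else to semantic vector search.
EXACT_TRIGGERS = ("tariff", "price", "cost", "limit", "fee")

TARIFF_TABLE = {  # illustrative rows; real data lives in a structured store
    "start": {"monthly_fee": 10, "data_gb": 5},
    "pro":   {"monthly_fee": 25, "data_gb": 50},
}

def semantic_search(query):
    return f"[vector hit for: {query}]"    # placeholder for Qdrant retrieval

def answer(query: str):
    lowered = query.lower()
    if any(t in lowered for t in EXACT_TRIGGERS):
        for plan, row in TARIFF_TABLE.items():
            if plan in lowered:
                return row                 # exact numbers, never paraphrased
        return "unknown plan"
    return semantic_search(query)
```

The design choice is that numeric answers are returned verbatim from the structured branch and never pass through generation, so the model cannot approximate a tariff.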


06

Agents

An agent in a demo answers questions. An agent in production must remember context across sessions, call tools, follow a scenario, and escalate to a human at the right boundary. In the finance project, the agent ran a strict sales funnel, remembered previous client conversations without mixing products, and stayed inside compliance limits — effectively a finite-state machine with multiple control loops.

We built orchestration, chat, and memory infrastructure so it does not need to be rebuilt on every project. A dedicated component, Content Digital Twin, is responsible for corporate tone: it took more than sixty iterations before the agent sounded like an actual employee rather than a chatbot.

Orchestration: pydantic-ai · MCP · A2A
AI-native Chat: streaming · threads · auth
Service Agent: request handling · routing · escalation
Memory: session context + cross-session knowledge
Custom Agents: per business process
Integration Adapters: CRM · ERP · ticketing · internal APIs
Content Digital Twin: brand voice, tone, and terminology
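The "finite-state machine with control loops" framing can be made concrete with a toy transition table. The states, intents, and the escalate-on-unknown default below are illustrative, not the actual funnel from the finance project:

```python
# Sketch of a production agent as a finite-state machine: each turn
# either advances the scripted funnel or escalates to a human.
TRANSITIONS = {
    ("greeting",  "question"):  "answering",
    ("greeting",  "complaint"): "escalated",
    ("answering", "question"):  "answering",
    ("answering", "complaint"): "escalated",
    ("answering", "thanks"):    "closed",
}

class ServiceAgent:
    def __init__(self):
        self.state = "greeting"
        self.memory = []   # session context; the long-term store is omitted here

    def handle(self, intent: str) -> str:
        self.memory.append(intent)
        # Unknown (state, intent) pairs escalate by default: the agent
        # never improvises outside the scripted funnel.
        self.state = TRANSITIONS.get((self.state, intent), "escalated")
        return self.state
```

The escalate-by-default transition is the boundary the text describes: whenever the scenario has no defined move, a human takes over.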

How a project is assembled

Every project pulls its own subset from the map. Observability and evaluation are required everywhere. Guardrails are tuned to the industry: multi-layer compliance in finance, hallucination filters in transport, corporate tone and escalation boundaries in telecom. Document and agent modules are always assembled around the exact process.

Everything is deployed inside the client perimeter. Every component ships as a standard container.

18 services we tuned across projects in four industries

INFERENCE AND ROUTING
vLLM slow thinking: reasoning model
vLLM fast generation: fast text generation
vLLM embedding: vectorization
vLLM vision: image and document processing
LiteLLM gateway: single API, fallback, compliance guardrails
PostgreSQL config: settings, virtual keys, access policies

OBSERVABILITY AND EVALUATION
OpenTelemetry Collector: trace collection and routing
Langfuse web + worker: UI, dashboards, eval flows, datasets
ClickHouse storage: trace and eval result storage
Redis queues: background processing queues

DOCUMENTS AND PIPELINES
Temporal server + web + admin: pipeline orchestration and monitoring
Qdrant vector index: document, contract, and knowledge-base chunks
S3 storage: documents, guidelines, media

AGENTS
Agents Service runtime: agent business logic and session management
PostgreSQL history: dialog history and session state

Tell us which process you want to break down.

We will tell you whether the task fits AI agents and, if it does, outline a concrete plan.

or write directly to ilya@manaraga.ai