ServeAI is a REST API server that exposes Axion's AI capabilities to other services. Instead of each service importing the AI stack directly, they send HTTP requests to ServeAI, which routes each prompt through the appropriate provider.
python ServeAI.py main --port 9700
When it starts, the server:
1. Binds to the port given by the --port option (default 9700).
2. Creates an ObjData instance and runs the preflight check (database connectivity, config loading).
3. Calls ensure_ollama(), which runs resource.bin/start_ollama.sh to verify/start the Ollama service and pull the default model (mistral). If Ollama fails, the server continues — cloud MCP providers remain available.
4. Calls display_ollama_models() to print a rich table of installed Ollama models to the console.

The startup script (resource.bin/start_ollama.sh) handles three tasks:
1. Verifies the ollama binary is on PATH.
2. Starts ollama serve in the background and waits up to 30 seconds for it to become ready.
3. If the requested model ($1, defaults to mistral) is not already downloaded, pulls it from the Ollama registry.

When a service sends a POST /prompt request:
Client request
|
v
FastAPI (WebServer.app on port 9700)
|
v
ServeAI endpoint handler
|
v
ObjAI(db=0, model="llm:ollama:mistral")
|
+-- model string parsed: factory="llm", provider="ollama", model="mistral"
|
v
ObjAI.prompt(role, prompt, image_base64)
|
+-- factory == "llm" --> llm_factory() --> factory.ai/package.llm/ObjAILlmOllama.py
+-- factory == "mcp" --> mcp_factory() --> factory.ai/package.mcp/ObjAiMcp{Provider}.py
|
v
Provider executes the prompt (Ollama local, OpenAI API, Anthropic API, etc.)
|
v
JSON response returned to caller
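The same flow, written as a minimal handler sketch. The ObjAI constructor arguments and prompt() signature are taken from the diagram above; handle_prompt and make_objai are illustrative names, not the actual ServeAI code:

```python
def handle_prompt(payload: dict, make_objai) -> dict:
    """Sketch of the POST /prompt flow shown above.

    `make_objai` stands in for the ObjAI constructor (factory.core/ObjAI.py);
    the call signatures follow the diagram and are not verified against the source.
    """
    model = payload.get("model", "llm:ollama:mistral")
    ai = make_objai(db=0, model=model)        # parses "<factory>:<provider>:<model>"
    answer = ai.prompt(
        payload.get("role", ""),              # system role / context
        payload["prompt"],                    # required user prompt text
        payload.get("image_base64", ""),      # optional image for vision models
    )
    return {"response": answer, "model": model}
```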
Models use a colon-separated string: <factory>:<provider>:<model>
| Identifier | Factory | Provider | Model |
|---|---|---|---|
| `llm:ollama:mistral` | llm | ollama | mistral |
| `llm:llava` | llm | llava | (default) |
| `mcp:openai:gpt-4` | mcp | openai | gpt-4 |
| `mcp:anthropic:claude-3-opus-20240229` | mcp | anthropic | claude-3-opus-20240229 |
| `mcp:google:gemini-pro` | mcp | google | gemini-pro |
| `mcp:ollama:mistral` | mcp | ollama | mistral |
| `mcp:mistral:mistral-large-latest` | mcp | mistral | mistral-large-latest |
| `mcp:huggingface:mistralai/Mistral-7B-Instruct-v0.2` | mcp | huggingface | mistralai/Mistral-7B-Instruct-v0.2 |
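For illustration, such an identifier can be split like this (a sketch only; the real parsing happens inside ObjAI, and its defaulting rules, for example for llm:llava, may differ):

```python
def parse_model_id(identifier: str = "llm:ollama:mistral"):
    """Split '<factory>:<provider>:<model>' into its three parts (illustrative)."""
    parts = identifier.split(":", 2)          # at most 3 parts; the model may contain "/"
    factory = parts[0]                        # "llm" or "mcp"
    provider = parts[1] if len(parts) > 1 else "ollama"
    model = parts[2] if len(parts) > 2 else ""   # empty string -> provider default
    return factory, provider, model

print(parse_model_id("mcp:huggingface:mistralai/Mistral-7B-Instruct-v0.2"))
# ('mcp', 'huggingface', 'mistralai/Mistral-7B-Instruct-v0.2')
```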
The llm factory loads local providers from factory.ai/package.llm/ modules; the mcp factory loads cloud providers from factory.ai/package.mcp/ modules. API keys are read from the config file under sections like [ai_mcp_openai].

The embed endpoint combines document indexing with semantic search:
1. ObjAI loads documents from the database using the provided SQL query
2. Each document is processed (cleaned, split into lines, deduplicated)
3. Ollama generates vector embeddings for each document
4. Embeddings are stored in an in-memory ChromaDB collection
5. The search query is embedded and matched against the collection
6. The best matching document is returned
7. If a role is provided, an enriched prompt is sent to the AI model
using the matched document as context
This enables Retrieval-Augmented Generation (RAG) where the AI answer is grounded in actual data from the database.
Returns the service health status with diagnostics. Checks Ollama reachability and database connectivity.
Response:
{
"status": "ok",
"service": "mcp",
"uptime_seconds": 3621.5,
"ollama": true,
"database": true
}
status — "ok" when both Ollama and database are healthy, "degraded" when either is down.uptime_seconds — Seconds since the server process started.ollama — true if the Ollama service responds at http://localhost:11434/.database — true if a SELECT 1 query succeeds.Checks whether the Ollama service is running on a GPU.
Response:
{"gpu_available": true}
Lists available AI providers by scanning the factory directories. Responses are cached for 30 seconds (CACHE_TTL).
Response:
{
"mcp_providers": ["anthropic", "google", "huggingface", "mistral", "ollama", "openai"],
"llm_types": ["llava", "ollama"]
}
Lists locally installed Ollama models. Runs ollama list and returns the parsed output as JSON. Successful responses are cached for 30 seconds (CACHE_TTL); error responses are never cached.
Response:
{
"models": [
{"name": "mistral:latest", "id": "2ae6f6dd7a3d", "size": "4.1 GB"},
{"name": "llama2:latest", "id": "78e26419b446", "size": "3.8 GB"},
{"name": "codellama:13b", "id": "9f438cb9cd58", "size": "7.4 GB"}
]
}
Errors:
- 500 — Ollama is not running or `ollama list` failed.
- 504 — `ollama list` timed out (10s limit).

Sends a prompt to any AI model and returns the response.
Request body:
{
"model": "llm:ollama:mistral",
"role": "You are a SQL developer.",
"prompt": "Write a query to find duplicate emails.",
"image_base64": ""
}
- `model` (string, optional): Model identifier. Defaults to "llm:ollama:mistral".
- `role` (string, optional): System role / context for the model.
- `prompt` (string, required): The user prompt text.
- `image_base64` (string, optional): Base64-encoded image data or a file path for vision models.

Response:
{
"response": "SELECT email, COUNT(*) FROM users GROUP BY email HAVING COUNT(*) > 1;",
"model": "llm:ollama:mistral"
}
Errors:
- 400 — Missing or empty prompt field, or invalid JSON.
- 429 — Too many concurrent requests (all PROMPT_MAX_CONCURRENT slots in use).
- 500 — Model not found or provider failure.
- 504 — Request timed out (default 120s). Configurable via PROMPT_TIMEOUT.

Indexes documents from a SQL query, computes embeddings, and performs a semantic search. Optionally sends the matched context to an AI model for a RAG response.
Request body:
{
"sql": "SELECT description FROM def_product LIMIT 100",
"query": "2 seater couch",
"embedding_model": "mxbai-embed-large",
"model": "llm:ollama:mistral",
"role": "You are a product search assistant."
}
- `sql` (string, required): SQL query to load documents for indexing.
- `query` (string, required): Search query to match against indexed documents.
- `embedding_model` (string, optional): Ollama embedding model. Defaults to "mxbai-embed-large".
- `model` (string, optional): AI model for the follow-up prompt. Defaults to "llm:ollama:mistral".
- `role` (string, optional): If provided alongside a match, sends an enriched prompt using the matched document as context.

Response:
{
"match": "Comfortable 2-seat fabric sofa in charcoal grey",
"response": "Based on the catalogue, the closest match is a comfortable 2-seat fabric sofa in charcoal grey.",
"model": "llm:ollama:mistral"
}
Errors:
- 400 — Missing sql or query, or invalid JSON.
- 429 — Too many concurrent requests (all PROMPT_MAX_CONCURRENT slots in use).
- 500 — Database, embedding, or model failure.
- 504 — Request timed out (default 120s). Configurable via PROMPT_TIMEOUT.

Streams a prompt response token-by-token via Server-Sent Events (SSE). Only supports local Ollama models (llm:ollama:*). The endpoint calls ollama.chat(stream=True) directly, bypassing the ObjAI layer.
Request body:
{
"model": "llm:ollama:mistral",
"role": "You are a helpful assistant.",
"prompt": "Explain REST APIs."
}
- `model` (string, optional): Must be llm:ollama:<model_name>. Defaults to "llm:ollama:mistral".
- `role` (string, optional): System role / context.
- `prompt` (string, required): The user prompt text.

Response: SSE stream (text/event-stream). Each token is sent as a separate event:
data: {"token": "REST"}
data: {"token": " APIs"}
data: {"token": " are"}
...
data: {"done": true}
On error during generation:
data: {"error": "connection refused"}
Errors:
- 400 — Non-Ollama model (e.g. mcp:openai:gpt-4), missing prompt, or invalid JSON.
- 401 — Invalid or missing API token.
- 429 — Too many concurrent requests.

curl example:
curl -N -X POST http://localhost:9700/prompt/stream \
-H "Content-Type: application/json" \
-d '{"model": "llm:ollama:mistral", "prompt": "What is Python?"}'
Default port is 9700. Override via CLI:
python ServeAI.py main --port 9800
Or via the PORT environment variable.
All endpoints except /health require a valid API token when authentication is enabled. Set the shared key in config.yaml:
base:
ai_mcp_server:
api_key: "your-shared-secret-key"
When api_key is empty (default), authentication is disabled and all requests pass.
The token can be provided via any of these headers (checked in order):
| Header | Example |
|---|---|
| `token` | token: your-shared-secret-key |
| `api-key` | api-key: your-shared-secret-key |
| `Authorization` | Authorization: Bearer your-shared-secret-key |
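For example, a client can supply the shared key in whichever of those headers it prefers (a sketch; the key value is whatever you set under base.ai_mcp_server.api_key):

```python
import requests

API_KEY = "your-shared-secret-key"   # value from config.yaml (base.ai_mcp_server.api_key)

resp = requests.post(
    "http://localhost:9700/prompt",
    headers={"Authorization": f"Bearer {API_KEY}"},  # or {"token": API_KEY} / {"api-key": API_KEY}
    json={"model": "llm:ollama:mistral", "prompt": "ping"},
)
print(resp.status_code)   # 401 means the token was missing or did not match
```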
Errors:
- 401 — Invalid or missing API token.

Every request is logged to the track_mcp database table. The table is auto-created at startup via create_tables_from_yaml("ServeAI") using the schema in factory.core/ServeAI.yaml.
| Column | Type | Description |
|---|---|---|
| `Guid` | char(50) PK | Unique GUID generated via get_uuid("ai") |
| `Endpoint` | varchar(100) | Request path (e.g. /prompt) |
| `Model` | varchar(255) | AI model identifier (empty for non-AI endpoints) |
| `StatusCode` | int | HTTP response status |
| `Latency` | decimal(10,3) | Request duration in seconds |
| `Token` | varchar(50) | First 8 characters of the auth token (masked) |
| `Outcome` | text | JSON response body |
| `CreatedDate` | datetime | Timestamp (auto-set) |
Tracking failures are silently ignored and never affect the API response.
The /prompt, /embed, and /prompt/stream endpoints are protected by a shared concurrency semaphore (PROMPT_MAX_CONCURRENT, default 5). When all slots are in use, new requests receive an immediate 429 Too Many Requests response instead of queueing. The semaphore is released after each request completes (or after streaming finishes).
/providers and /models responses are cached in memory with a TTL of 30 seconds (CACHE_TTL). This avoids repeated filesystem scans and ollama list subprocess calls. Error responses from /models are never cached. The cache is cleared automatically on server restart.
AI prompt requests (/prompt and /embed) are subject to a PROMPT_TIMEOUT of 120 seconds. If the AI model does not respond within this window the server returns a 504 error instead of blocking the worker indefinitely. Adjust the PROMPT_TIMEOUT constant in ServeAI.py to change the limit.
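A rough sketch of how the concurrency gate and the timeout combine (illustrative only, not the actual ServeAI.py code; the constant names come from this document and run_model stands in for whatever coroutine performs the AI call):

```python
import asyncio

PROMPT_MAX_CONCURRENT = 5     # shared slots for /prompt, /embed, /prompt/stream
PROMPT_TIMEOUT = 120          # seconds before giving up and returning 504
_slots = asyncio.Semaphore(PROMPT_MAX_CONCURRENT)

async def guarded_call(run_model):
    """Apply the 429 / 504 policy described above to a single AI request."""
    if _slots.locked():                       # every slot is in use: reject immediately
        return 429, {"error": "Too many concurrent requests"}
    async with _slots:                        # slot released when the request completes
        try:
            result = await asyncio.wait_for(run_model(), timeout=PROMPT_TIMEOUT)
        except asyncio.TimeoutError:
            return 504, {"error": "AI model did not respond in time"}
    return 200, {"response": result}
```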
SSL certificates are loaded automatically from the package configuration. When available, the server starts with HTTPS. Set DO_DEBUG = True in the source to force HTTP during development.
Cloud providers require API keys configured in the INI file:
[ai_mcp_openai]
api_key = sk-...
[ai_mcp_anthropic]
api_key = sk-ant-...
[ai_mcp_google]
api_key = AIza...
[ai_mcp_mistral]
api_key = ...
[ai_mcp_huggingface]
api_key = hf_...
Local Ollama models require no API key — only a running Ollama service.
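A sketch of reading these sections with the standard library (section and option names come from the snippet above; the config.ini file name is an assumption):

```python
import configparser

config = configparser.ConfigParser()
config.read("config.ini")   # assumed path; use your deployment's config file

# Each cloud provider reads its key from an [ai_mcp_<provider>] section.
openai_key = config.get("ai_mcp_openai", "api_key", fallback="")
anthropic_key = config.get("ai_mcp_anthropic", "api_key", fallback="")

if not openai_key:
    print("No OpenAI key configured: mcp:openai:* models will fail.")
```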
# Health check
curl http://localhost:9700/health
# List providers
curl http://localhost:9700/providers
# List installed Ollama models
curl http://localhost:9700/models
# Prompt a local Ollama model
curl -X POST http://localhost:9700/prompt \
-H "Content-Type: application/json" \
-d '{"model": "llm:ollama:mistral", "role": "You are a helpful assistant.", "prompt": "What is Python?"}'
# Prompt OpenAI via MCP
curl -X POST http://localhost:9700/prompt \
-H "Content-Type: application/json" \
-d '{"model": "mcp:openai:gpt-4", "role": "You are a helpful assistant.", "prompt": "Explain REST APIs."}'
# RAG query
curl -X POST http://localhost:9700/embed \
-H "Content-Type: application/json" \
-d '{"sql": "SELECT description FROM def_product LIMIT 50", "query": "red running shoes", "role": "You are a product search assistant."}'
import requests
import json
BASE_URL = "http://localhost:9700"
# ── 1. Health check ──────────────────────────────────────────────────
resp = requests.get(f"{BASE_URL}/health")
print(resp.json())
# Expected:
# {
# "status": "ok",
# "service": "mcp",
# "uptime_seconds": 42.3,
# "ollama": true,
# "database": true
# }
# ── 2. List installed models ─────────────────────────────────────────
resp = requests.get(f"{BASE_URL}/models")
print(resp.json())
# Expected:
# {
# "models": [
# {"name": "mistral:latest", "id": "6577803aa9a0", "size": "4.4 GB"},
# {"name": "llama2:latest", "id": "78e26419b446", "size": "3.8 GB"}
# ]
# }
# ── 3. Send a prompt ─────────────────────────────────────────────────
resp = requests.post(f"{BASE_URL}/prompt", json={
"model": "llm:ollama:mistral",
"role": "You are a concise assistant. Reply in one sentence.",
"prompt": "What is the capital of Japan?",
})
data = resp.json()
print(f"Status: {resp.status_code}")
print(f"Model: {data['model']}")
print(f"Answer: {data['response']}")
# Expected:
# Status: 200
# Model: llm:ollama:mistral
# Answer: The capital of Japan is Tokyo.
#
# Full response JSON:
# {
# "response": " The capital of Japan is Tokyo.",
# "model": "llm:ollama:mistral"
# }
# ── 4. Stream a prompt (Server-Sent Events) ──────────────────────────
resp = requests.post(
f"{BASE_URL}/prompt/stream",
json={
"model": "llm:ollama:mistral",
"role": "You are a concise assistant. Reply in one sentence.",
"prompt": "What is the largest ocean?",
},
stream=True,
headers={"Accept": "text/event-stream"},
)
full_response = ""
for line in resp.iter_lines(decode_unicode=True):
if line.startswith("data: "):
event = json.loads(line[6:])
if "token" in event:
full_response += event["token"]
print(event["token"], end="", flush=True)
elif "done" in event:
print() # newline after final token
print(f"\nFull: {full_response}")
# Expected output (token by token):
# The Pacific Ocean is the largest ocean on Earth.
#
# Each SSE event:
# data: {"token": " The"}
# data: {"token": " Pacific"}
# data: {"token": " Ocean"}
# data: {"token": " is"}
# data: {"token": " the"}
# data: {"token": " largest"}
# data: {"token": " ocean"}
# data: {"token": " on"}
# data: {"token": " Earth"}
# data: {"token": "."}
# data: {"done": true}
# ── 5. RAG query ─────────────────────────────────────────────────────
resp = requests.post(f"{BASE_URL}/embed", json={
"sql": "SELECT description FROM def_product LIMIT 100",
"query": "leather wallet",
"embedding_model": "mxbai-embed-large",
"model": "llm:ollama:mistral",
"role": "You are a product catalogue assistant.",
})
data = resp.json()
print(f"Match: {data['match']}")
print(f"Response: {data['response']}")
# Expected:
# {
# "match": "Premium Italian leather bifold wallet in brown with RFID blocking",
# "response": "Based on our catalogue, the closest match is a premium Italian leather bifold wallet...",
# "model": "llm:ollama:mistral"
# }
# ── 6. Error cases ───────────────────────────────────────────────────
# Missing prompt → 400
resp = requests.post(f"{BASE_URL}/prompt", json={"model": "llm:ollama:mistral"})
print(resp.status_code, resp.json())
# Expected: 400 {"error": "prompt is required"}
# Non-Ollama streaming → 400
resp = requests.post(f"{BASE_URL}/prompt/stream", json={
"model": "mcp:openai:gpt-4",
"prompt": "hello",
})
print(resp.status_code, resp.json())
# Expected: 400 {"error": "Streaming only supports llm:ollama:* models"}
# Rate limited → 429 (when all 5 concurrent slots are occupied)
# Expected: 429 {"error": "Too many concurrent requests", "model": "llm:ollama:mistral"}
Large language models are trained on general knowledge up to a cutoff date. They do not know about your private data — your product catalogue, your customer records, your internal documentation. If you ask a model "What leather wallets do we sell?", it will either hallucinate an answer or admit it does not know.
RAG solves this by fetching relevant data from your own sources and injecting it into the prompt before the model generates a response. The model still does the reasoning, but now it has your actual data to reason about.
Your Database Embedding Model Vector Store
┌──────────┐ ┌──────────────┐ ┌──────────┐
│ Products │──── SQL query ──>│ mxbai-embed │── vectors ─>│ ChromaDB │
│ Documents │ │ -large │ │ (memory) │
│ Articles │ └──────────────┘ └────┬─────┘
└──────────┘ │
│
User Query Embedding Model │
┌──────────┐ ┌──────────────┐ │
│ "leather │── embed query ──>│ mxbai-embed │── vector ──> similarity
│ wallet" │ │ -large │ search
└──────────┘ └──────────────┘ │
│
best match
│
v
LLM (Mistral, GPT-4, etc.) ┌──────────────┐
┌──────────────────────┐ │ "Premium │
│ Role: You are a │<── enriched prompt ──────────────│ Italian │
│ product assistant. │ "Using this context: │ leather │
│ │ Premium Italian leather │ bifold │
│ Response: Based on │ bifold wallet in brown │ wallet in │
│ our catalogue, the │ with RFID blocking... │ brown..." │
│ best match is the │ └──────────────┘
│ Premium Italian... │ Query: "leather wallet"
└──────────────────────┘
Before you can search, you need to build a knowledge base.
Load documents — A SQL query pulls text from your database. This could be product descriptions, help articles, policy documents, or any text you want the AI to reference.
Clean and split — Each result row is processed: newlines, underscores, and formatting artifacts are stripped. Long text is split into individual lines. Duplicates are removed. This produces a clean list of text chunks.
Generate embeddings — Each text chunk is passed through an embedding model (default: mxbai-embed-large running on Ollama). The model converts the text into a high-dimensional numeric vector (a list of floats) that captures the semantic meaning of the text. Words like "sofa" and "couch" produce vectors that are close together, even though the strings are different.
Store in vector database — Each embedding is stored in a ChromaDB collection alongside the original text. ChromaDB is an in-memory vector database optimised for similarity search.
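The indexing phase can be sketched directly against the ollama and chromadb Python packages (illustrative only; in this codebase ObjAI wraps these steps, and the inline document list stands in for the SQL-loaded, cleaned chunks):

```python
import chromadb
import ollama

# Stand-in for documents loaded via SQL and then cleaned, split, and deduplicated.
documents = [
    "Comfortable 2-seat fabric sofa in charcoal grey",
    "Premium Italian leather bifold wallet in brown with RFID blocking",
    "Waterproof trail running shoe with breathable mesh upper",
]

client = chromadb.Client()                    # in-memory vector store
collection = client.create_collection("docs")

for i, text in enumerate(documents):
    emb = ollama.embeddings(model="mxbai-embed-large", prompt=text)["embedding"]
    collection.add(ids=[str(i)], embeddings=[emb], documents=[text])
```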
When a user query arrives:
Embed the query — The same embedding model converts the search query into a vector using the same dimensional space as the stored documents.
Similarity search — ChromaDB compares the query vector against all stored vectors using cosine similarity. The document whose vector is closest to the query vector is the best semantic match. This works on meaning, not keywords — "2 seater couch" will match "comfortable two-seat sofa" even though they share no words.
Return the match — The original text of the closest document is returned. This is the "retrieved" part of RAG.
If the caller provides a role in the request:
Build enriched prompt — The matched document is prepended to the query as context: "Using this context: {matched_document}\n\n{original_query}".
Send to LLM — The enriched prompt is sent to the configured AI model (local Ollama, OpenAI, Anthropic, etc.) with the specified system role.
Return grounded response — The model generates a response that is grounded in your actual data rather than its general training knowledge. This dramatically reduces hallucination because the answer is based on a real document from your database.
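Continuing that sketch, retrieval and generation reduce to a query embedding, a nearest-neighbour lookup, and one chat call with the enriched prompt (again illustrative; ObjAI's find_prompt_match() and prompt() encapsulate this):

```python
# Continues the indexing sketch above (reuses `collection` and the ollama import).
query = "leather wallet"

# Retrieval: embed the query with the same model and find the closest document.
query_emb = ollama.embeddings(model="mxbai-embed-large", prompt=query)["embedding"]
hit = collection.query(query_embeddings=[query_emb], n_results=1)
match = hit["documents"][0][0]                # original text of the best match

# Generation: ground the model in the matched document.
enriched = f"Using this context: {match}\n\n{query}"
reply = ollama.chat(
    model="mistral",
    messages=[
        {"role": "system", "content": "You are a product catalogue assistant."},
        {"role": "user", "content": enriched},
    ],
)
print(match)
print(reply["message"]["content"])
```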
| Component | Role | What happens without it |
|---|---|---|
| SQL source | Provides the raw knowledge base | The AI has no private data to reference |
| Embedding model | Converts text to semantic vectors | Search would be keyword-only, missing synonyms and meaning |
| ChromaDB | Fast vector similarity search | No way to find the most relevant document from thousands |
| Similarity search | Finds the best match by meaning | User would need to know exact keywords in the database |
| LLM | Generates a natural language answer | User gets a raw database row instead of a coherent response |
| Context injection | Grounds the LLM in real data | The LLM hallucinates or gives generic answers |
A customer service bot needs to answer product questions:
Database contains 500 product descriptions.
User asks: "Do you have anything for running in the rain?"
Without RAG, the AI guesses. With RAG, the question is embedded, matched against the 500 indexed descriptions, and the best-matching product description is injected into the prompt as context before the model answers.
The answer is accurate because it comes from your actual product data, not the model's imagination.
In this codebase, the RAG workflow lives in ObjAI (factory.core/ObjAI.py):
| Method | Phase | What it does |
|---|---|---|
| `get_documents_query(sql)` | Indexing | Loads and cleans documents from database |
| `compute_embeddings(model)` | Indexing | Generates vectors via Ollama and stores in ChromaDB |
| `find_prompt_match(query)` | Retrieval | Embeds the query and runs similarity search |
| `prompt(role, text)` | Generation | Sends the enriched prompt to any AI model |
The /embed endpoint in ServeAI wraps all four steps into a single HTTP call, making RAG available to any service in the ecosystem without importing the AI stack.
ServeAI connects to whichever Ollama instance is configured for the deployment host. The TechnoCore LAN has multiple Ollama nodes with different GPU backends and performance profiles.
| IP | Host | GPU | Backend | Models | tok/s (8B) | EC2 equivalent | Cost/mo (ZAR) |
|---|---|---|---|---|---|---|---|
| 10.0.10.69 | AI Server (llama.axion VM on Beast) | NVIDIA RTX 4070 SUPER 12GB | CUDA | qwen3:latest, qwen3:14b, mxbai-embed-large, llava, vicuna | ~80–100 | g5.xlarge (A10G 24GB) | R0 (≈R450 elec.) |
| 10.0.10.52 | Silenus (workstation) | NVIDIA RTX 2060 SUPER 8GB | CUDA | qwen3:8b | ~40–60 | g4dn.xlarge (T4 16GB) | R0 |
| 10.0.10.48 | ollama LXC (CT135 on Beast) | AMD Raphael iGPU | CPU-only (active); Vulkan via OLLAMA_VULKAN=1 | qwen3:8b | ~4–8 | m7i.xlarge (CPU) | R0 |
| 10.0.10.56 | Minisforum | AMD Radeon 680M iGPU (RDNA 2) | ROCm (HSA_OVERRIDE_GFX_VERSION=10.3.1) | (to be added) | ~8–12 | m7i.xlarge (CPU) | R0 |
| 10.0.10.51 | Lasercut | Intel UHD 600 (Gemini Lake) | CPU-only | mistral | ~3–5 | t3.medium (CPU) | R0 |
EC2 on-demand equivalent if running 24/7:
g5.xlarge = R13,450/mo · g4dn.xlarge = R7,030/mo · m7i.xlarge = R2,660/mo (R18.50/USD, April 2026).
Costs in ZAR at ~R18.50/USD, April 2026. On-prem nodes at R0 marginal cost (hardware already owned and running). EC2 priced on-demand 24/7. Gemini API is pay-per-token — priced per 1M output tokens.
| Option | GPU | tok/s (8B) | EC2 equivalent | Monthly (ZAR) |
|---|---|---|---|---|
| 10.0.10.69 on-prem | RTX 4070 SUPER 12GB | ~80–100 | g5.xlarge | R0 (≈R450 elec.) |
| 10.0.10.48 CT135 on-prem | AMD Raphael Vulkan/CPU | ~4–8 | m7i.xlarge | R0 |
| m7i.xlarge EC2 (CPU) | none | ~5–8 | — | R2,660/mo |
| g4dn.xlarge EC2 | NVIDIA T4 16GB | ~25–40 | — | R7,030/mo |
| g5.xlarge EC2 | NVIDIA A10G 24GB | ~50–70 | — | R13,450/mo |
| Model | Context | Cost / 1M input | Cost / 1M output | Quality tier |
|---|---|---|---|---|
| Gemini 1.5 Flash | 1M tokens | R1.39 ($0.075) | R5.55 ($0.30) | Fast, cost-optimised |
| Gemini 2.0 Flash | 1M tokens | R1.85 ($0.10) | R7.40 ($0.40) | Fast, general purpose |
| Gemini 1.5 Pro | 128K tokens | R23.13 ($1.25) | R92.50 ($5.00) | Highest quality |
| DeepSeek Chat (V3) | 64K tokens | R0.28 ($0.015) | R4.44 ($0.24) | Strong reasoning, very cheap |
| DeepSeek Reasoner (R1) | 64K tokens | R2.78 ($0.15) | R13.88 ($0.75) | Chain-of-thought, o1-class |
Pricing sourced April 2026 — verify before use as API rates change frequently.
Gemini: ai.google.dev/pricing · DeepSeek: platform.deepseek.com/pricing
At a sustained workload of ~20M output tokens per day (for scale, CT135 at ~7.7 tok/s produces roughly 0.67M tokens/day, or about 20M per month, if run flat out):
| Option | Daily output cost |
|---|---|
| On-prem CT135 | R0 |
| DeepSeek Chat (V3) | R89/day → R2,670/mo |
| Gemini 1.5 Flash | R111/day → R3,330/mo |
| Gemini 2.0 Flash | R148/day → R4,440/mo |
| DeepSeek Reasoner (R1) | R278/day → R8,340/mo |
| Gemini 1.5 Pro | R1,850/day → R55,500/mo |
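The per-day figures follow directly from tokens multiplied by the per-million rate; a quick sanity check (output rates copied from the API pricing table above, and the 20M tokens/day volume is the table's assumption):

```python
# Back-of-envelope check of the daily/monthly cost table (ZAR per 1M output tokens).
rates = {
    "DeepSeek Chat (V3)": 4.44,
    "Gemini 1.5 Flash": 5.55,
    "Gemini 2.0 Flash": 7.40,
    "DeepSeek Reasoner (R1)": 13.88,
    "Gemini 1.5 Pro": 92.50,
}
tokens_per_day = 20_000_000                   # the table's assumed sustained output

for name, rate in rates.items():
    daily = round(tokens_per_day / 1_000_000 * rate)       # rand per day
    print(f"{name}: R{daily:,}/day -> R{daily * 30:,}/mo")

# For scale: CT135 alone sustains ~7.7 tok/s, i.e. roughly
# 7.7 * 86_400 ≈ 665k output tokens per day if run continuously.
```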
Rule of thumb: API wins for sporadic/low-volume use (< a few million tokens/month). On-prem wins at sustained load. DeepSeek V3 is the cheapest capable API option — strong reasoning at ~R4.44/1M output tokens. Gemini 1.5 Pro and DeepSeek R1 are only cost-effective where chain-of-thought quality justifies the premium.
Exchange rate: R18.50/USD. EC2 prices us-east-1 on-demand. Electricity ~R2.50/kWh.
10.0.10.69 runs regardless for other workloads — marginal AI cost is effectively R0.
| Use Case | Node |
|---|---|
| Production / latency-sensitive | 10.0.10.69 (RTX 4070 SUPER) |
| Workstation / interactive dev | 10.0.10.52 (Silenus, RTX 2060 SUPER) |
| Background jobs / hypervisor-safe | 10.0.10.48 (Beast CT135, iGPU Vulkan) |
| Vector DB + inference co-located | 10.0.10.51 (Lasercut, ChromaDB :8000 + Ollama :11434) |
Default is http://localhost:11434. To target a different node:
# config.yaml
base:
ai:
ollama_url: "http://10.0.10.69:11434"
Or at runtime:
OLLAMA_HOST=http://10.0.10.69:11434 python ServeAI.py main
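When scripting against a specific node, the ollama Python client also accepts the host directly (a sketch; qwen3:latest is one of the models listed for 10.0.10.69 above):

```python
import ollama

# Target the RTX 4070 SUPER node instead of the local default.
client = ollama.Client(host="http://10.0.10.69:11434")
reply = client.chat(
    model="qwen3:latest",
    messages=[{"role": "user", "content": "ping"}],
)
print(reply["message"]["content"])
```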
ServeAI fits into the Axion service ecosystem alongside other servers:
| Server | Port | Purpose |
|---|---|---|
| ServeWebsite | 9000 | Main web application |
| ServeReport | 9400 | Report rendering |
| ServeWebHook | 9500 | Webhook processing |
| ServeConversation | 9550 | Conversation engine |
| ServeWorkflow | 9650 | Workflow execution |
| ServeAI | 9700 | AI / Multi-Cloud Provider API |