
NOTICE: All information contained herein is, and remains, the property of TechnoCore. The intellectual and technical concepts contained herein are proprietary to TechnoCore; dissemination of this information or reproduction of this material is strictly forbidden unless prior written permission is obtained from TechnoCore.
The ObjAI class provides a high-level interface for interacting with various AI models and services. It handles model management, document embeddings, and prompting.
The SQL queries used by the ObjAI class are stored in factory.core/ObjAI.yaml. This allows for easy modification of queries without changing the Python code.
`__init__(self, db=0, model: str = "llm:mistral", collectdb: str = "")`
Initializes the ObjAI instance.
- `db`: The database instance to use.
- `model`: The AI model to use.
- `collectdb`: The ChromaDB collection to use.

`set_model(self, model: str = "")`
Sets the AI model to use for prompting.

`check_service_on_gpu(self, service_name: str = "ollama")`
Checks if a given service is running on the GPU.

`local_process_text(self, text_to_process: str)`
Processes a string by removing newlines, underscores, and asterisks.

`get_documents_query(self, sql="")`
Executes a SQL query and processes the results to populate the documents list.

`compute_embeddings(self, embedding_model=LLM_DEFAULT_EMBEDDING)`
Computes embeddings for the documents in the documents list and adds them to the collection.

`find_prompt_match(self, prompt: str = "")`
Finds the best matching document in the collection for a given prompt.

`llm_factory(self, model_type: str = "", model_set: str = "")`
A factory for creating LLM objects.

`prompt(self, role: str = "", prompt: str = "", image_base64: str = "")`
Sends a prompt to the AI model and returns the response.
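A minimal usage sketch of the document/embedding flow implied by these methods. The SQL, collection name, prompt text, and the return type of `find_prompt_match()` are assumptions for illustration, not documented behaviour:

```python
from factory.core import ObjAI

# Hypothetical usage sketch: the SQL, collection name, and prompt text are
# illustrative, and the exact return type of find_prompt_match() is assumed.
ai = ObjAI(db=0, model="llm:ollama:mistral", collectdb="documents")

# Populate the documents list from a query, embed it into the collection,
# then find the stored document closest to a user prompt.
ai.get_documents_query(sql="SELECT id, body FROM documents")
ai.compute_embeddings()
match = ai.find_prompt_match(prompt="How do I reset my password?")

# Use the match as grounding context for the actual prompt call.
response = ai.prompt(
    role="You are a support assistant.",
    prompt=f"<context>{match}</context>\n<task>Answer the question.</task>",
)
```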
`list_providers() -> list[str]` (static method)
Scans `factory.ai/package.mcp/` and returns a sorted list of available MCP provider names (e.g. `["anthropic", "ollama", "openai"]`).
- Strips the `ObjAiMcp` prefix and `.py` suffix from file names.
- Excludes `ObjAiMcpBase.py` (the abstract base class).
- Excludes `.so` files and other non-`.py` files.
- Returns `[]` when the folder does not exist.

`has_gpu() -> bool` (static method, cached)
Pure hardware detection of an AI accelerator. Cached for the process lifetime via `functools.lru_cache` — call `has_gpu.cache_clear()` in tests that mock the underlying detection.
- Runs `nvidia-smi --query-gpu=name --format=csv,noheader` and returns True if the output contains a non-empty GPU name.
- Falls back to `torch.cuda.is_available()` (covers both CUDA and ROCm builds), with `torch.version.hip` checked for explicit ROCm.
- Checks `torch.backends.mps.is_available()` for Apple Silicon.
- Returns False when no accelerator is detected.
- No env or config gates — use `has_ai()` for the "can this machine run AI?" check.
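A simplified sketch of that detection order (not the actual implementation; the real method is a cached static method on ObjAI):

```python
import functools
import subprocess


@functools.lru_cache(maxsize=1)
def has_gpu() -> bool:
    """Sketch of the pure hardware probe described above."""
    # 1. NVIDIA: a non-empty name from nvidia-smi means a GPU is present.
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=5,
        )
        if out.returncode == 0 and out.stdout.strip():
            return True
    except (FileNotFoundError, subprocess.TimeoutExpired):
        pass
    # 2. PyTorch fallbacks: CUDA/ROCm builds, then Apple Silicon MPS.
    try:
        import torch
        if torch.cuda.is_available() or getattr(torch.version, "hip", None):
            return True
        if torch.backends.mps.is_available():
            return True
    except ImportError:
        pass
    return False
```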
`has_ai() -> bool` (static method, cached)
Gate for AI workloads. Returns True when this machine may run AI, whether via accelerator or explicit CPU-AI opt-in. Cached via `functools.lru_cache(maxsize=1)` — tests that vary env or config gates must call `has_ai.cache_clear()` between assertions.
- `AXION_AI_ENABLED=0` → False (global kill switch).
- `AXION_CPU_AI=1` or `system.cpuai` truthy in config.yaml → True.
- Otherwise falls back to `has_gpu()`.
- Call-sites that formerly used `has_gpu()` for capability routing (`ObjDocument.maybe_enhance`, `ObjDocumentSet.process_enhancement_queue`) now use `has_ai()` so CPU-only machines with `system.cpuai: true` can participate.
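The gate order can be pictured roughly as follows; `get_config` is a hypothetical config accessor used only for illustration:

```python
import functools
import os


@functools.lru_cache(maxsize=1)
def has_ai() -> bool:
    """Sketch of the capability gate described above."""
    # Global kill switch always wins.
    if os.environ.get("AXION_AI_ENABLED") == "0":
        return False
    # Explicit CPU-AI opt-in via environment or config.
    if os.environ.get("AXION_CPU_AI") == "1":
        return True
    if get_config("system.cpuai"):  # hypothetical config accessor
        return True
    # Otherwise only accelerator-equipped machines run AI.
    return has_gpu()
```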
`vram_gb() -> int` (static method, cached)
Total VRAM across detected GPUs, in GB. Returns 0 on CPU-only systems (Apple MPS reports 0 even though `has_gpu()` returns True).
- Reads `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits`.
- Falls back to `torch.cuda.get_device_properties().total_memory`.

`cpu_cores() -> int` (static method, cached)
Logical CPU core count via `os.cpu_count()`, with a floor of 1.
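For illustration, the MiB-to-GB conversion and the core-count floor might look like this (a sketch with a hypothetical helper, not the actual code):

```python
import os


def _vram_gb_from_smi(output: str) -> int:
    """Sum per-GPU MiB lines from nvidia-smi --query-gpu=memory.total output."""
    mib = sum(int(line) for line in output.split() if line.strip())
    return mib // 1024  # nvidia-smi reports MiB; ObjAI exposes whole GB


def cpu_cores() -> int:
    """Logical core count with a floor of 1 (os.cpu_count() may return None)."""
    return max(os.cpu_count() or 1, 1)


# Example: two GPUs reporting 12288 MiB and 8192 MiB -> 20 GB total.
print(_vram_gb_from_smi("12288\n8192"))  # 20
```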
`_active_tier() -> str` (static method)
Resolves one of four hardware tier names, in order:
| Tier | Rule |
|---|---|
| gpu_high | vram_gb >= TIER_GPU_HIGH_MIN_VRAM_GB (default 16) |
| gpu_low | Any GPU present (VRAM ≥ 4 GB or has_gpu()) |
| cpu_high | No GPU, cpu_cores >= TIER_CPU_HIGH_MIN_CORES (default 8) |
| cpu_low | No GPU, fewer cores |
Thresholds are defined in AiConstants (ObjConstants.py) and can
be overridden at the constant level.
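A sketch of the resolution order under the documented defaults (the gpu_low branch is simplified to `has_gpu()`; the table also mentions a VRAM ≥ 4 GB variant):

```python
from factory.core import ObjAI

# Assumed defaults from AiConstants (ObjConstants.py).
TIER_GPU_HIGH_MIN_VRAM_GB = 16
TIER_CPU_HIGH_MIN_CORES = 8


def active_tier() -> str:
    """Sketch of the resolution order in the table above."""
    if ObjAI.vram_gb() >= TIER_GPU_HIGH_MIN_VRAM_GB:
        return "gpu_high"
    if ObjAI.has_gpu():          # any remaining GPU counts as gpu_low
        return "gpu_low"
    if ObjAI.cpu_cores() >= TIER_CPU_HIGH_MIN_CORES:
        return "cpu_high"
    return "cpu_low"
```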
`get_model() -> str`, `get_model_vision() -> str`, `get_model_embedding() -> str` (static methods)
Return the configured model name for the current machine. Each delegates to `_pick_model_tier(category)`, which walks:
1. `<package>.ai.model.<category>.<tier>` in config.yaml for the active tier.
2. `base.ai.model.<category>.<tier>` with the same walk.
3. `AiConstants.<CATEGORY>_<TIER>` fallback constants.

An explicit empty string in config for the active tier is honoured as a "skip" signal — `get_model_vision()` returns `""` when the vision model is intentionally disabled on this tier (e.g. CPU hosts that shouldn't attempt multimodal inference). Downstream callers must check for `""` before using the value.
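For example, a caller guarding the vision path might look like this:

```python
from factory.core import ObjAI

vision_model = ObjAI.get_model_vision()
if not vision_model:
    # "" is the intentional skip signal: this tier has no vision model.
    print("Vision disabled on this hardware tier; skipping multimodal step.")
else:
    print(f"Using vision model: {vision_model}")
```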
The `ai.model` section is added under `base:` in config.yaml. Per-package overrides live under `<package>.ai.model` using the same structure and win over `base`.
```yaml
base:
  ai:
    # Legacy flat keys (still honoured by some call-sites).
    default: "mcp:ollama:mistral"
    summary: "mcp:ollama:qwen3:8b"
    # Per-category / per-tier model defaults. ObjAI picks the tier
    # from available hardware (VRAM + CPU cores) and reads the
    # matching model name. An empty string means "skip on this tier".
    # Fallback walks gpu_high → gpu_low → cpu_high → cpu_low.
    model:
      generation:
        cpu_low: qwen3:1.7b
        cpu_high: qwen2.5:3b
        gpu_low: gemma3
        gpu_high: qwen3:14b
      vision:
        cpu_low: ""
        cpu_high: qwen2.5vl:3b
        gpu_low: qwen2.5vl:7b
        gpu_high: llama3.2-vision:11b
      embedding:
        cpu_low: nomic-embed-text
        cpu_high: nomic-embed-text
        gpu_low: nomic-embed-text
        gpu_high: mxbai-embed-large
  system:
    # Opt-in for running AI on a CPU-only machine. When true,
    # has_ai() returns True regardless of has_gpu(). AXION_CPU_AI=1
    # in the environment has the same effect per-process.
    cpuai: false
```
| Question | Method | Honours env/config? |
|---|---|---|
| Is there a GPU attached? | has_gpu() | No — pure hardware probe. |
| Should this process run AI? | has_ai() | Yes — AXION_AI_ENABLED, AXION_CPU_AI, system.cpuai. |
| How much VRAM? | vram_gb() | No. |
| How many CPU cores? | cpu_cores() | No. |
| Which tier is this machine? | _active_tier() | No — purely hardware-derived. |
| Which model should I use? | get_model{,_vision,_embedding}() | Yes (via config). |
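Putting the gates together, a capability-routing call-site could look roughly like this (the model-string format passed to ObjAI and the return type of prompt() are assumptions):

```python
from factory.core import ObjAI


def maybe_run_ai(task_prompt: str) -> str:
    """Sketch of a call-site honouring the gates in the table above."""
    if not ObjAI.has_ai():
        return ""  # AI disabled (kill switch) or unsupported on this machine
    model = ObjAI.get_model()
    if not model:
        return ""  # no generation model configured for this tier
    # Model-string format assumed; real call-sites may prefix a provider.
    ai = ObjAI(db=0, model=model)
    return ai.prompt(role="You are a helpful assistant.", prompt=task_prompt)
```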
The ObjAI class is designed to be used with typer for creating command-line applications.
```python
import typer

from factory.core import ObjAI

app = typer.Typer()


@app.command()
def test():
    """Runs a test of the AI model with a sample prompt."""
    ai_obj = ObjAI(db=0, model="llm:ollama:mistral")
    # LLM_SAMPLE_ROLE is a constant imported elsewhere (e.g. from ObjConstants).
    ai_obj.prompt(LLM_SAMPLE_ROLE, ai_obj.queries['LLM_SAMPLE_PROMPT'])


if __name__ == "__main__":
    app()
```
Ollama can be GPU-accelerated in several ways depending on the host hardware. Each mode requires different setup and has different performance characteristics.
Standard CUDA path. Requires NVIDIA driver + CUDA toolkit installed on the host.
# Install NVIDIA driver (Ubuntu/Zorin)
sudo ubuntu-drivers install
# or:
sudo apt install -y nvidia-driver-580
# Install Ollama (auto-detects CUDA)
curl -fsSL https://ollama.com/install.sh | sh
No special env vars needed — Ollama detects CUDA automatically via nvidia-smi.
TechnoCore nodes: 10.0.10.69 (RTX 4070 SUPER, ~80–100 tok/s, EC2 equiv: g5.xlarge = R13,450/mo), 10.0.10.52 (RTX 2060 SUPER, ~40–60 tok/s, EC2 equiv: g4dn.xlarge = R7,030/mo)
For AMD integrated GPUs that ROCm doesn't officially list. Requires overriding the GFX version to the closest supported RDNA 2 target.
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Add to /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="HSA_OVERRIDE_GFX_VERSION=10.3.1"
Or in ~/.zshrc for interactive use:
export HSA_OVERRIDE_GFX_VERSION=10.3.1
Applies to: AMD Radeon 680M (Ryzen 6800U/7000 series), AMD Raphael graphics — any RDNA 2 iGPU not natively in ROCm's support list.
TechnoCore nodes: 10.0.10.56 (Minisforum, Radeon 680M, ~8–12 tok/s, EC2 equiv: m7i.xlarge = R2,660/mo)
For cases where ROCm is unavailable or insufficient — uses the Vulkan compute backend instead. Works inside LXC containers with /dev/kfd + /dev/dri/renderD128 passed through.
LXC config (/etc/pve/lxc/<CTID>.conf on Proxmox):
lxc.cgroup2.devices.allow: c 226:128 rwm
lxc.cgroup2.devices.allow: c 234:0 rwm
lxc.mount.entry: /dev/dri/renderD128 dev/dri/renderD128 none bind,optional,create=file
lxc.mount.entry: /dev/kfd dev/kfd none bind,optional,create=file
Group alignment — the host render group GID must match inside the container:
# Check host GID
getent group render # e.g. render:x:993:
# Inside container, align to match
groupmod -g 993 render
Systemd override (/etc/systemd/system/ollama.service.d/override.conf):
[Service]
Environment="OLLAMA_VULKAN=1"
Environment="HSA_OVERRIDE_GFX_VERSION=10.3.1"
Environment="OLLAMA_HOST=0.0.0.0:11434"
Vulkan ICD for AMD:
apt install -y mesa-vulkan-drivers
Performance: ~3–4 tok/s for 8B models on a minimal iGPU (2 CUs). Slower than CPU-only on the same host, but uses 0% CPU — valuable on a hypervisor where CPU headroom matters more than inference speed.
Available memory: The Vulkan backend on AMD APUs can see all system RAM as unified memory (up to 63 GB on a 128 GB system), so even large models load without swapping.
TechnoCore nodes: 10.0.10.48 (Beast CT135) — Vulkan active (~3.7 tok/s) or CPU-only (~7.7 tok/s). EC2 equiv: m7i.xlarge = R2,660/mo.
No GPU required. Enable explicitly for machines with no GPU or where GPU acceleration isn't worth the setup cost.
# Install Ollama normally
curl -fsSL https://ollama.com/install.sh | sh
# No GPU env vars needed; Ollama falls back to CPU automatically.
# To force CPU even when a GPU is detected:
# CUDA_VISIBLE_DEVICES="" OLLAMA_VULKAN=0 ollama serve
In Axion, enable CPU inference in config.yaml:
```yaml
base:
  system:
    cpuai: true
```
Or per-process: AXION_CPU_AI=1 python ServeAI.py main
TechnoCore nodes: 10.0.10.51 (Lasercut, Celeron N4020, mistral, ~3–5 tok/s, EC2 equiv: t3.medium = R690/mo)
| Mode | Backend | Example Hardware | tok/s (8B) | CPU Usage |
|---|---|---|---|---|
| CUDA | libggml-cuda.so | RTX 4070 SUPER | 80–100 | ~0% |
| CUDA | libggml-cuda.so | RTX 2060 SUPER | 40–60 | ~0% |
| ROCm iGPU | libggml-hip.so | Radeon 680M | 8–12 | ~0% |
| Vulkan iGPU | libggml-vulkan.so | AMD Raphael (2 CU) | 3–4 | ~0% |
| CPU | libggml-cpu-*.so | Ryzen 9 7950X3D | 7–10 | 100% of allocated cores |
| CPU | libggml-cpu-*.so | Celeron N4020 | 3–5 | 100% |
Inference speed is memory-bandwidth bound, not compute bound: every generated token must stream the full weight set from memory. A CPU with 90 GB/s DDR5 bandwidth running an 8B model (~5 GB of weights) therefore tops out at roughly 18 tok/s. GPUs have dedicated GDDR bandwidth (400+ GB/s), which is why they outperform CPUs.
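The estimate is simply bandwidth divided by the bytes streamed per token; a quick sketch:

```python
def peak_tok_per_s(mem_bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed for a dense model: each token reads
    the full weight set from memory once."""
    return mem_bandwidth_gb_s / model_size_gb


print(peak_tok_per_s(90, 5))   # DDR5 CPU, 8B model at ~5 GB -> 18.0 tok/s
print(peak_tok_per_s(400, 5))  # GPU with 400 GB/s GDDR      -> 80.0 tok/s
```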
On-demand pricing, 24/7 runtime, ZAR at ~R18.50/USD (April 2026). On-prem nodes have R0 marginal cost — hardware already owned and running.
| Option | GPU | tok/s (8B) | Monthly (USD) | Monthly (ZAR) |
|---|---|---|---|---|
| 10.0.10.69 on-prem | RTX 4070 SUPER 12GB | ~80–100 | — | R0 (≈R450 elec.) |
| 10.0.10.48 CT135 on-prem | AMD Raphael Vulkan / CPU | ~4–8 | — | R0 |
| m7i.xlarge EC2 (CPU only) | none | ~5–8 | $144 | R2,660 |
| g4dn.xlarge EC2 | NVIDIA T4 16GB | ~25–40 | $380 | R7,030 |
| g5.xlarge EC2 | NVIDIA A10G 24GB | ~50–70 | $727 | R13,450 |
| Model | Cost / 1M input (ZAR) | Cost / 1M output (ZAR) |
|---|---|---|
| Gemini 1.5 Flash | R1.39 | R5.55 |
| Gemini 2.0 Flash | R1.85 | R7.40 |
| Gemini 1.5 Pro | R23.13 | R92.50 |
| DeepSeek Chat (V3) | R0.28 | R4.44 |
| DeepSeek Reasoner (R1) | R2.78 | R13.88 |
R18.50/USD, April 2026. Gemini: ai.google.dev/pricing. DeepSeek: platform.deepseek.com/pricing. Verify before use — API pricing changes frequently.
At a sustained workload of ~20M output tokens/day (for scale, a single CT135 stream at ~7.7 tok/s produces only ~0.7M output tokens/day, roughly 20M per month); a per-volume cost sketch follows the table:
| Option | Daily output cost | Monthly (ZAR) |
|---|---|---|
| On-prem CT135 | R0 | R0 |
| DeepSeek Chat (V3) | R89/day | R2,670 |
| Gemini 1.5 Flash | R111/day | R3,330 |
| Gemini 2.0 Flash | R148/day | R4,440 |
| DeepSeek Reasoner (R1) | R278/day | R8,340 |
| Gemini 1.5 Pro | R1,850/day | R55,500 |
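The rows above scale linearly with volume, so they can be re-derived for any workload from the per-million rates in the API pricing table (re-verify rates before relying on them):

```python
def api_cost_zar(output_tokens_per_day: float,
                 rate_zar_per_million: float) -> tuple[float, float]:
    """Return (daily, monthly) output cost in ZAR for a given volume."""
    daily = output_tokens_per_day / 1_000_000 * rate_zar_per_million
    return daily, daily * 30


# DeepSeek Chat V3 output at R4.44 / 1M tokens, 20M tokens/day:
daily, monthly = api_cost_zar(20_000_000, 4.44)
print(f"R{daily:.0f}/day, R{monthly:.0f}/month")  # R89/day, R2664/month (table rounds to R2,670)
```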
Rule of thumb: API wins for sporadic/low-volume use (< a few million tokens/month). On-prem wins at sustained load. DeepSeek V3 is the cheapest capable API option. Gemini 1.5 Pro and DeepSeek R1 are only cost-effective where chain-of-thought quality justifies the premium over local models.
g5.xlarge is the closest EC2 equivalent to 10.0.10.69 — running it 24/7 costs R13,450/mo vs effectively R0 on-prem.
Install the CUDA toolkit and drivers on a host with an NVIDIA (CUDA-capable) GPU.
wget https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-8
sudo apt-get install -y cuda-drivers
The mxbai-embed-large model is known for its ability to generalize well
across several domains, tasks, and text lengths. It has been trained
without overlapping the MTEB data, indicating robust performance in
diverse scenarios. This model is suitable for applications that require
understanding and processing large and varied datasets.
ollama pull mxbai-embed-large
When writing prompts for ObjAI.prompt(), follow these best practices
from the Anthropic prompt engineering docs.
Wrap distinct sections of a prompt in XML tags so the model can parse
them unambiguously:
<task>Interpret these simulation results.</task>
<baselines>
scored: 53%, no-score: 47%
D001 decline reason: 37%
</baselines>
<output_format>
Use bullet points. No more than 20 lines.
</output_format>
Recommended tags: <task>, <context>, <baselines>, <output_format>,
<example>, <documents>.
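As a sketch, the tags can be assembled in Python and handed to `ObjAI.prompt()` (the content is illustrative, taken from the snippets above):

```python
from factory.core import ObjAI

# Illustrative composition of an XML-tagged prompt.
full_prompt = "\n".join([
    "<task>Interpret these simulation results.</task>",
    "<baselines>",
    "scored: 53%, no-score: 47%",
    "D001 decline reason: 37%",
    "</baselines>",
    "<output_format>",
    "Use bullet points. No more than 20 lines.",
    "</output_format>",
])

ai = ObjAI(db=0, model="llm:ollama:mistral")
response = ai.prompt(role="You are a senior credit risk analyst.",
                     prompt=full_prompt)
```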
Set the role parameter to a specific persona. One sentence is enough:
ai.prompt(
role="You are a senior credit risk analyst.",
prompt=full_prompt,
)
When asking the model to review data, supply expected values so it can
flag anomalies:
<baselines>
scored: 53%, no-score: 47%
PayingSeg: 53%, NonPayingSeg: 35%, ThinFile: 11%
</baselines>
Flag any metric deviating more than 10% from these baselines.
Ask the model to cite specific numbers from the input before drawing
conclusions. Phrases like "state specific counts and percentages" and
"ground every claim in the data" reduce hallucination.
Wrap examples in <example> tags. 3–5 diverse examples improve
accuracy and consistency:
<examples>
<example>
Input: score=524, strikethrough=1
Output: Declined — score below cutoff, strikethrough=Decline
</example>
</examples>
For analytical prompts, use an explicit chain for each finding:
observation → business impact → recommended action
This ensures the model produces actionable output, not just
descriptions.
Add explicit instructions to prevent the model from describing column
names instead of analysing values:
You analyse numbers, distributions, and business patterns.
You DO NOT describe data structure or column names.
| Test Suite | Tests | Status | Purpose |
|---|---|---|---|
| test_ObjAI.py | 8 | ✅ All passed | list_providers() static method |
dev-env/bin/pytest resource.test/pytests/factory.ai/test_ObjAI.py -v
TestListProviders — ObjAI.list_providers() scans factory.ai/package.mcp/
and returns the sorted list of MCP provider names available to ObjAI.
| Test | What it checks |
|---|---|
| test_returns_sorted_list | Result is alphabetically sorted |
| test_excludes_base_module | "base" (from ObjAiMcpBase.py) is not in the list |
| test_includes_known_providers | ollama, anthropic, playwright are present |
| test_excludes_non_py_files | .so compiled files and .md docs are excluded |
| test_returns_empty_list_when_folder_missing | Returns [] when package.mcp/ does not exist |
| test_returns_list_type | Return type is list |
| test_all_entries_are_strings | Every item in the list is a str |
| test_real_package_mcp_folder_has_providers | Live scan confirms ollama is registered |
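Two of these checks, sketched as plain pytest functions (the real suite groups them in TestListProviders and may use fixtures):

```python
from factory.core import ObjAI


def test_returns_sorted_list():
    providers = ObjAI.list_providers()
    assert providers == sorted(providers)


def test_all_entries_are_strings():
    assert all(isinstance(p, str) for p in ObjAI.list_providers())
```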
Note: `has_gpu()` is exercised by `test_ObjDocumentEnhancement.py` (see ObjDocument.md).