NOTICE: All information contained herein is, and remains
the property of TechnoCore Automate.
Updated : 2026-03-16
ObjServiceFilefeed scans folders for document files (.docx, .doc,
.pdf), extracts text content using system tools, runs NLTK sentence
tokenization and POS tagging, and stores results in bloom_filefeed.
Supports recursive folder scanning with first-level folder names
used as client identifiers.
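A minimal sketch of how a client identifier could be derived from the first-level folder during a recursive scan. The function name `iter_documents` and the choice to give files sitting directly in the scan root an empty client are assumptions for illustration, not the module's confirmed behavior.

```python
from pathlib import Path

# Extensions the service is documented to handle.
SUPPORTED_EXTENSIONS = {".docx", ".doc", ".pdf"}


def iter_documents(scan_root):
    """Yield (client, folder, filename, doctype) for each supported file.

    Hypothetical helper: the client is the first path component below
    scan_root; files directly in scan_root get an empty client (assumption).
    """
    root = Path(scan_root)
    for path in sorted(root.rglob("*")):
        suffix = path.suffix.lower()
        if not path.is_file() or suffix not in SUPPORTED_EXTENSIONS:
            continue
        relative = path.relative_to(root)
        client = relative.parts[0] if len(relative.parts) > 1 else ""
        yield client, str(path.parent), path.name, suffix.lstrip(".")
```

With a layout such as `acme/contracts/a.pdf`, the file is attributed to the client `acme` no matter how deeply it is nested.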
| Extension | Tool | Install |
|---|---|---|
| .docx | docx2txt | apt-get install docx2txt |
| .doc | catdoc | apt-get install catdoc |
| .pdf | pdftotext | apt-get install poppler-utils |
Run `python ObjServiceFilefeed.py check` to verify which extraction tools
are installed.
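A tool check of this kind can be approximated with `shutil.which`, which looks up an executable on `PATH`. This is a sketch only: `REQUIRED_TOOLS` and `check_tools` are hypothetical names, and the real `check` command may report its results differently.

```python
import shutil

# Extension -> executable name, taken from the table above.
REQUIRED_TOOLS = {".docx": "docx2txt", ".doc": "catdoc", ".pdf": "pdftotext"}


def check_tools(tools=REQUIRED_TOOLS):
    """Return {extension: True/False} depending on whether the tool is on PATH."""
    return {ext: shutil.which(tool) is not None for ext, tool in tools.items()}
```

Calling `check_tools()` returns a dictionary that can be printed or used to skip extensions whose extractor is missing.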
settings:
scan_directory: ""
If scan_directory is empty, the scan defaults to
{local_folder}/local.documents/filefeed/.
Document extraction uses subprocess.run() with a list of
arguments (no shell interpolation), preventing command injection
from filenames. Each tool runs with a 60-second timeout.
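The following sketch shows the argument-list pattern described above. The exact command lines are assumptions (each tool is asked to write to stdout), and `build_extract_command` and `run_tool` are hypothetical names rather than the module's actual functions.

```python
import os
import subprocess

# Assumed invocations; "-" asks docx2txt and pdftotext to write to stdout,
# while catdoc writes to stdout by default.
EXTRACTORS = {
    ".docx": lambda p: ["docx2txt", p, "-"],
    ".doc": lambda p: ["catdoc", p],
    ".pdf": lambda p: ["pdftotext", p, "-"],
}


def build_extract_command(path):
    """Build the argument list for the extractor matching the file extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in EXTRACTORS:
        raise ValueError(f"unsupported document type: {ext!r}")
    # An absolute path never starts with "-", so it cannot be read as an option.
    return EXTRACTORS[ext](os.path.abspath(path))


def run_tool(command, timeout=60):
    """Run a command given as a list (no shell) and return its stdout as text.

    Raises subprocess.TimeoutExpired after `timeout` seconds and
    subprocess.CalledProcessError on a non-zero exit status.
    """
    result = subprocess.run(
        command, capture_output=True, text=True, timeout=timeout, check=True
    )
    return result.stdout
```

Because the filename is passed as a single list element, a name such as `report; rm -rf x.pdf` reaches the tool as one literal argument and is never interpreted by a shell.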
1. Scan folder recursively
2. Skip already-processed files (check bloom_filefeed)
3. Extract text: docx2txt / catdoc / pdftotext
4. Store raw content → bloom_filefeed.Content
5. NLTK sentence tokenize + POS tag
6. Store tokens → ContentSentence + ContentTokens
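Steps 5 and 6 can be sketched as below. The JSON shapes are assumptions: sentences as a flat array of strings and tokens as a flat array of `[word, tag]` pairs. `tokenize_content` is a hypothetical name, and the tokenizers are injectable so the sketch can be exercised without NLTK data; by default it falls back to `nltk.sent_tokenize`, `nltk.word_tokenize`, and `nltk.pos_tag`, which need the NLTK tokenizer and tagger resources to be downloaded.

```python
import json


def tokenize_content(text, sent_tokenize=None, word_tokenize=None, pos_tag=None):
    """Return (ContentSentence, ContentTokens) as JSON strings.

    The output layout is an assumption, not the module's confirmed format.
    """
    if sent_tokenize is None or word_tokenize is None or pos_tag is None:
        import nltk  # imported lazily; requires downloaded NLTK data

        sent_tokenize = sent_tokenize or nltk.sent_tokenize
        word_tokenize = word_tokenize or nltk.word_tokenize
        pos_tag = pos_tag or nltk.pos_tag

    sentences = sent_tokenize(text)
    # Tag sentence by sentence, then flatten into one list of (word, tag) pairs.
    tagged = [pair for s in sentences for pair in pos_tag(word_tokenize(s))]
    return json.dumps(sentences), json.dumps(tagged)
```

Tagging one sentence at a time keeps the tagger's context within sentence boundaries; the flattening step could be dropped if per-sentence grouping is preferred.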
| Column | Type | Description |
|---|---|---|
| Client | VARCHAR(255) | Client identifier (first-level folder name) |
| Folder | VARCHAR(512) PK | Source folder path |
| Filename | VARCHAR(255) PK | Document filename |
| DocType | VARCHAR(16) PK | File extension (docx, doc, pdf) |
| Loadtime | DATETIME | When the document was processed |
| Content | LONGTEXT | Extracted raw text |
| ContentSentence | LONGTEXT | JSON array of tokenized sentences |
| ContentTokens | LONGTEXT | JSON array of POS-tagged tokens |
| Module | VARCHAR(255) | ObjServiceFilefeed |
Primary key: (Folder, Filename, DocType)
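The composite primary key is what makes the "skip already-processed files" check cheap. The sketch below illustrates this with an in-memory SQLite database as a stand-in; the column types suggest the real table lives in MySQL or MariaDB, where the driver would typically use `%s` placeholders instead of `?`. The helper names are hypothetical.

```python
import sqlite3


def open_demo_db():
    """Create an in-memory stand-in for bloom_filefeed (SQLite, not the real backend)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE bloom_filefeed (
            Client VARCHAR(255),
            Folder VARCHAR(512) NOT NULL,
            Filename VARCHAR(255) NOT NULL,
            DocType VARCHAR(16) NOT NULL,
            Loadtime DATETIME,
            Content LONGTEXT,
            ContentSentence LONGTEXT,
            ContentTokens LONGTEXT,
            Module VARCHAR(255),
            PRIMARY KEY (Folder, Filename, DocType)
        )
        """
    )
    return conn


def is_processed(conn, folder, filename, doctype):
    """True if a row with this (Folder, Filename, DocType) key already exists."""
    row = conn.execute(
        "SELECT 1 FROM bloom_filefeed "
        "WHERE Folder = ? AND Filename = ? AND DocType = ?",
        (folder, filename, doctype),
    ).fetchone()
    return row is not None


def store_document(conn, client, folder, filename, doctype, content):
    """Insert one document; a duplicate key raises sqlite3.IntegrityError."""
    conn.execute(
        "INSERT INTO bloom_filefeed "
        "(Client, Folder, Filename, DocType, Loadtime, Content, Module) "
        "VALUES (?, ?, ?, ?, datetime('now'), ?, ?)",
        (client, folder, filename, doctype, content, "ObjServiceFilefeed"),
    )
```

Because DocType is part of the key, `report.doc` and `report.pdf` in the same folder are stored as separate rows.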
Recursively scans a folder for documents, extracts and tokenizes
their text, and stores the results in bloom_filefeed. Returns the
count of newly processed documents.
| Command | Context Keys | Result Key |
|---|---|---|
| scan | scan_path (optional), client (optional) | _filefeed_result |
# Scan default directory
python ObjServiceFilefeed.py scan
# Scan a specific folder
python ObjServiceFilefeed.py scan /path/to/documents
# Check which extraction tools are installed
python ObjServiceFilefeed.py check
from ObjServiceFilefeed import ObjServiceApi
svc = ObjServiceApi()
count = svc.scan_folder("/path/to/documents")
print(f"Processed {count} documents")
cythonize -3 -a -i ObjServiceFilefeed.py
Compiling /home/axion/projects/axion/factory.service/package.core/ObjServiceFilefeed.py because it changed.
[1/1] Cythonizing /home/axion/projects/axion/factory.service/package.core/ObjServiceFilefeed.py