NOTICE: All information contained herein is, and remains
the property of TechnoCore Automate.
This class contains a method, build_name(), that sets the properties
Filename, Filepublish, Iconname, Iconpublish, and Clipname according to the file type (_Doctype) of the document.
The file name, its "publish" name, the icon name, the "publish" icon name,
and the clip name (where applicable, such as for audio files) are all derived from the document type.
The document types handled by build_name() are:
"JPG" or "JPEG"
"GIF"
"PDF"
"DOCX"
"CSV"
"XLSX"
"EPUB"
"WAV"
"MP3"
"HTML"
"JS"
"CSS"
"SCSS"
For unknown document types, it leaves the Iconname, Clipname, and Iconpublish as empty strings, indicating an undefined or unsupported format.
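The behaviour described above can be sketched roughly as follows. This is an illustration only, not the actual build_name() implementation: the grouping of types, the guid and publish_dir parameters, and the naming patterns (such as the "_icon.png" and "_clip" suffixes) are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical grouping of the document types listed above; the real
# class may group or name them differently.
AUDIO_TYPES = {"WAV", "MP3"}
KNOWN_TYPES = AUDIO_TYPES | {
    "JPG", "JPEG", "GIF", "PDF", "DOCX", "CSV", "XLSX",
    "EPUB", "HTML", "JS", "CSS", "SCSS",
}


@dataclass
class DocumentNames:
    Filename: str = ""
    Filepublish: str = ""
    Iconname: str = ""
    Iconpublish: str = ""
    Clipname: str = ""


def build_name(guid: str, doctype: str, publish_dir: str = "publish") -> DocumentNames:
    """Derive the five name properties from a document type (illustrative only)."""
    kind = doctype.upper()
    ext = doctype.lower()
    names = DocumentNames(
        Filename=f"{guid}.{ext}",
        Filepublish=f"{publish_dir}/{guid}.{ext}",
    )
    if kind in KNOWN_TYPES:
        # Assumed icon naming pattern, not taken from the source.
        names.Iconname = f"{guid}_icon.png"
        names.Iconpublish = f"{publish_dir}/{guid}_icon.png"
    if kind in AUDIO_TYPES:
        # Only audio types get a clip name in this sketch.
        names.Clipname = f"{guid}_clip.{ext}"
    # Unknown types fall through with Iconname, Iconpublish and Clipname empty.
    return names
```

In this sketch, an unknown type still receives a Filename and Filepublish, while the icon and clip properties stay as empty strings.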
Update the package list to ensure you have the latest information about available packages:
sudo apt update
Install the Poppler utilities and libraries using the following command:
sudo apt install poppler-utils
This command installs the Poppler utilities, which include pdftotext, pdfinfo, and others, as well as the Poppler libraries.
To verify that Poppler is installed, you can use the pdftotext command as an example:
pdftotext -v
This command should display the version of Poppler that was installed.
import pytesseract
from PIL import Image


def ocr_image_to_text(image_path):
    """
    This function takes the path of an image as input and uses OCR
    (Optical Character Recognition) to extract the text from the image.

    Args:
        image_path (str): The path to the image file.

    Returns:
        str: The extracted text from the image.
    """
    # Open the image file
    image = Image.open(image_path)
    # Use pytesseract to perform OCR on the image and extract the text
    extracted_text = pytesseract.image_to_string(image)
    return extracted_text
import PyPDF2


def extract_text_from_pdf(file_path):
    # PdfReader is the current reader class; the older PdfFileReader,
    # numPages, getPage() and extractText() were removed in PyPDF2 3.0.
    with open(file_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        text = ''
        # Concatenate the text of every page in order
        for page in pdf_reader.pages:
            text += page.extract_text() or ''
    return text


# Example usage
pdf_file_path = 'path/to/your/pdf/file.pdf'
extracted_text = extract_text_from_pdf(pdf_file_path)
print(extracted_text)
rsync --mkpath -r local.documents/ data.documents/
If I want to re-enable PostScript and PDF formats for ImageMagick,
I could make the following changes in the file
/etc/ImageMagick-6/policy.xml
[... about 70 lines deleted ...]
<!-- disable ghostscript format types -->
<!--
<policy domain="coder" rights="none" pattern="PS" />
<policy domain="coder" rights="none" pattern="EPI" />
<policy domain="coder" rights="none" pattern="PDF" />
<policy domain="coder" rights="none" pattern="XPS" />
-->
<policy domain="coder" rights="read|write" pattern="PDF,PS" />
</policymap>
cythonize -3 -a -i ObjDocument.py
15295 |   if (__pyx_t_7) {
      |      ^
/home/axion/projects/axion/factory.core/ObjDocument.c:14642:14: note: ‘__pyx_v_pdf_page_count’ was declared here
14642 |   Py_ssize_t __pyx_v_pdf_page_count;
      |              ^~~~~~~~~~~~~~~~~~~~~~
Updated : 2025-09-10
The def_document table includes a CacheTtl column for
per-document browser cache control:
CacheTtl INT DEFAULT NULL
When a document is served via the /documents/{path} route in
WebServer, the CacheTtl value determines the Cache-Control
max-age header on the HTTP response.
| CacheTtl value | Behaviour |
|---|---|
| NULL or 0 | No explicit Cache-Control header; browser uses its own heuristic |
| Positive integer | Cache-Control: public, max-age=<value> in seconds |
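The mapping in the table can be sketched as a small helper. This is a minimal illustration, not the WebServer code: the function name cache_control_header is hypothetical, and treating negative values the same as 0 is an assumption, since the table does not cover them.

```python
from typing import Optional


def cache_control_header(cache_ttl: Optional[int]) -> Optional[str]:
    """Return the Cache-Control value for a CacheTtl, or None for no header."""
    # NULL (None) or 0 means no explicit header; negative values are
    # treated the same way here, which is an assumption of this sketch.
    if cache_ttl is None or cache_ttl <= 0:
        return None
    return f"public, max-age={cache_ttl}"
```

A caller would add the header to the response only when the helper returns a value, leaving the browser to apply its own heuristic otherwise.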
-- Cache a logo for 24 hours
UPDATE def_document SET CacheTtl = 86400
WHERE DocName = 'company_logo';
-- Cache a stylesheet for 1 hour
UPDATE def_document SET CacheTtl = 3600
WHERE DocName = 'custom_theme';
-- Disable caching (browser heuristic)
UPDATE def_document SET CacheTtl = NULL
WHERE DocName = 'volatile_report';
The TTL values are cached in memory by WebServer in a
_DOC_TTL_CACHE dictionary that refreshes every 10 minutes,
so changes take effect within that window without a service
restart.
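One way such a periodically refreshed dictionary could work is sketched below. It is an assumption-laden illustration rather than the actual _DOC_TTL_CACHE code: the DocTtlCache class, the loader callable (standing in for the query against def_document), and the injectable clock are all hypothetical.

```python
import time
from typing import Callable, Dict, Optional

REFRESH_SECONDS = 600  # 10 minutes, as described above


class DocTtlCache:
    """In-memory map of DocName -> CacheTtl, reloaded once it is stale."""

    def __init__(
        self,
        loader: Callable[[], Dict[str, Optional[int]]],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader  # hypothetical: would query def_document
        self._clock = clock
        self._data: Dict[str, Optional[int]] = {}
        self._loaded_at: Optional[float] = None

    def get(self, doc_name: str) -> Optional[int]:
        now = self._clock()
        # Reload on first use and whenever the refresh window has elapsed.
        if self._loaded_at is None or now - self._loaded_at >= REFRESH_SECONDS:
            self._data = dict(self._loader())
            self._loaded_at = now
        return self._data.get(doc_name)
```

With this design an UPDATE to CacheTtl becomes visible on the first lookup after the window expires, which matches the "within that window" behaviour described above.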
| Test Suite | Tests | Status | Purpose |
|---|---|---|---|
| test_ObjDocument.py | ~67 (3 skipped) | ✅ Passed | Core document lifecycle and file operations |
| test_ObjDocumentEnhancement.py | 31 | ✅ All passed | AI enhancement pipeline |
dev-env/bin/pytest resource.test/pytests/factory.core/test_ObjDocument.py \
resource.test/pytests/factory.core/test_ObjDocumentEnhancement.py -v
test_ObjDocument.py
Tests the fundamental document lifecycle without requiring a database
connection. Fixtures use __new__ to bypass the ObjData init chain and
set attributes manually, mirroring post-read() state.
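The fixture technique can be illustrated with the sketch below. The ObjData and Document classes here are simplified stand-ins written for the example, and make_document and the _Filename attribute are hypothetical; only the use of __new__ to skip the init chain reflects the description above.

```python
class ObjData:
    """Stand-in for the real base class, whose __init__ needs a database."""

    def __init__(self):
        raise RuntimeError("database connection required")


class Document(ObjData):
    """Stand-in for the real Document class."""

    def has_file(self):
        return bool(getattr(self, "_Filename", ""))


def make_document(**attrs):
    """Build a Document without running __init__, then set attributes by hand."""
    doc = Document.__new__(Document)  # allocates the object, skips the init chain
    for name, value in attrs.items():
        setattr(doc, name, value)
    return doc
```

In a test module, a helper like this would typically be wrapped in a pytest fixture so each test receives a fresh object in the state it would have after read().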
Utility functions
| Test class / function | What is tested |
|---|---|
| test_ocr_pdf_* | ocr_pdf() delegates to Tesseract for small files and pdfplumber for large ones |
| test_retry_* | Retry helper succeeds on first attempt, eventually succeeds after failures, raises after all attempts exhausted |
| test_download_image_* | download_image() handles success, HTTP failure, and retries |
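The three retry behaviours listed in the table can be sketched with a generic helper. This is not the module's own retry helper, whose name and signature are not shown here; the retry function, its attempts and delay parameters, and the catch-all exception handling are assumptions for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(func: Callable[[], T], attempts: int = 3, delay: float = 0.0) -> T:
    """Call func until it succeeds, re-raising the last error when attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # all attempts exhausted
            time.sleep(delay)
    raise AssertionError("unreachable")
```

A first-attempt success returns immediately, a transient failure is retried after the delay, and a persistent failure surfaces the original exception to the caller.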
TestDocumentFileIntegration
Covers DocumentFile — the low-level file record tied to a document.
- build_name() produces correct filenames for JPG, PDF, WAV and a full
- set_document_folder() and set_document_folder_context()
- has_file() returns False when no file is present and True when one exists
- compress(), gcd(), patch_param(), read_on_guid(), get_base_file_name()

TestDocumentUserFileIntegration (3 skipped — require DB)
- create_and_read, read_on_guid, update

TestDocumentIntegration
Covers Document — the main document class.
- read(), read_no_context(), build_name(), has_file(), has_icon(), has_clip(), get_file_name()
- update_context(), run_workflow_empty(), create()
- get_icon_url_no_icon(), get_image_url_no_file(), get_disposition()
- remove_icon(), remove_file(), refresh_icon(), balance()

TestDocumentsIntegration
Covers Documents — the collection class.
- read(), find_mask(), find_mask_single(), apply_mask(), balance(), mark_tracker()

TestDocumentSetIntegration
Covers DocumentSet.
- init(), build(), scan_storage_no_folder(), build_detail(), extract_to_storage(), download_to_storage()

TestDocumentToolsIntegration
- init(), convert_mp3_to_wav_missing()

test_ObjDocumentEnhancement.py
Tests the AI-powered document enhancement pipeline. All tests are unit
tests; the Document and DocumentSet fixtures are built via __new__
without a database.
TestHasGpu (4 tests) — ObjAI.has_gpu() static method (pure hardware detection).
| Test | What is checked |
|---|---|
| test_detects_gpu_via_nvidia_smi | Returns True when nvidia-smi reports a GPU name |
| test_nvidia_smi_empty_output_does_not_detect | Returns False when nvidia-smi output is blank |
| test_falls_back_to_torch_cuda | Falls back to torch.cuda.is_available() when nvidia-smi is absent |
| test_returns_false_when_no_gpu | Returns False when neither check finds a GPU |
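A detection routine consistent with the four checks above might look like the following sketch. It is not the ObjAI.has_gpu() source: the injectable run and torch_check parameters are added here so the logic can be exercised without hardware, and consulting the torch check after a blank nvidia-smi result is an assumption.

```python
import subprocess
from typing import Callable


def _torch_cuda_available() -> bool:
    """Fallback check; torch is treated as an optional dependency."""
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


def has_gpu(
    run: Callable = subprocess.run,
    torch_check: Callable[[], bool] = _torch_cuda_available,
) -> bool:
    """Return True if nvidia-smi names a GPU, else defer to the torch check."""
    try:
        result = run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=5, check=False,
        )
        if result.returncode == 0 and result.stdout.strip():
            return True
    except (OSError, subprocess.SubprocessError):
        pass  # nvidia-smi is absent or failed to run
    return torch_check()
```

Passing fake callables for run and torch_check is one way to reproduce each row of the table in a unit test without mocking module globals.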
TestMaybeEnhance (5 tests) — Document.maybe_enhance() routing.
- Skips enhancement when _Doenhance is 'N' or not set
- Skips enhancement when the document has no file (has_file() is False)
- Calls _enhance_realtime() when a GPU is available
- Calls _queue_for_enhancement() when no GPU is present

TestExtractText (5 tests) — Document._extract_text() per-type dispatch.
| Doc type | Extraction path |
|---|---|
| PDF | pdfplumber — concatenates page text, truncates to DocumentEnhancementConstants.MAX_TEXT_CHARS |
| JPG / JPEG / PNG | ObjAiVision.analyse_image() |
| DOCX (unsupported) | Returns "" |
| Exception during extraction | Returns "" and calls debug() |
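The per-type dispatch in the table can be sketched as below. This is an illustration under stated assumptions, not Document._extract_text() itself: the extractors mapping, the join_and_truncate helper, and the MAX_TEXT_CHARS value of 4000 are placeholders, and the real PDF and image extractors (pdfplumber and ObjAiVision.analyse_image()) are represented by plain callables.

```python
from typing import Callable, Dict, Iterable, Optional

# Placeholder value; the real limit is DocumentEnhancementConstants.MAX_TEXT_CHARS.
MAX_TEXT_CHARS = 4000


def join_and_truncate(page_texts: Iterable[Optional[str]], limit: int = MAX_TEXT_CHARS) -> str:
    """Concatenate page texts, skipping empty pages, and cap the length."""
    return "\n".join(t for t in page_texts if t)[:limit]


def extract_text(
    doctype: str,
    path: str,
    extractors: Dict[str, Callable[[str], str]],
    debug: Callable[[str], None] = print,
) -> str:
    """Dispatch on document type; return "" for unsupported types or errors."""
    extractor = extractors.get(doctype.upper())
    if extractor is None:
        return ""  # unsupported type, such as DOCX
    try:
        return extractor(path)
    except Exception as exc:
        debug(f"text extraction failed for {path}: {exc}")
        return ""
```

A PDF extractor registered in the mapping would pass each page's text to join_and_truncate, and the JPG, JPEG and PNG keys would all point at the same image-analysis callable.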
TestEnhanceRealtime (3 tests) — Document._enhance_realtime().
- Does not call _save_enhancement() when _extract_text() returns ""
- Calls _save_enhancement(summary) with the AI response
- Does not call _save_enhancement() when the AI model returns ""

TestSaveEnhancement (3 tests) — Document._save_enhancement().
- sql_execute is called with the UPDATE containing the summary text
- get_queries is called with the key 'save_enhancement'
- debug() is called after saving

TestQueueForEnhancement (3 tests) — Document._queue_for_enhancement().
- sql_execute is called with the UPDATE flagging the record for later processing
- get_queries is called with the key 'queue_for_enhancement'
- debug() is called after queuing

TestProcessEnhancementQueue (6 tests) — DocumentSet.process_enhancement_queue().
| Test | What is checked |
|---|---|
| test_returns_zero_without_gpu | Returns 0 immediately when no GPU is detected |
| test_returns_zero_for_empty_queue | Returns 0 when the SQL query returns no rows |
| test_skips_docs_without_file | Rows where has_file() is False are not enhanced and not counted |
| test_counts_successfully_enhanced_docs | Returns the count of documents that were enhanced |
| test_handles_exception_per_doc_and_continues | An exception on one document is caught; remaining documents still process |
| test_respects_batch_limit | The batch_limit parameter is passed through to the SQL query |
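The queue-processing behaviour covered by these tests can be sketched as a standalone function. It is a hedged illustration, not DocumentSet.process_enhancement_queue(): the fetch_queue callable stands in for the SQL query, the Enhanceable protocol and its enhance() method are hypothetical, and the default batch_limit of 10 is an arbitrary placeholder.

```python
from typing import Callable, Iterable, Protocol


class Enhanceable(Protocol):
    """Hypothetical minimal interface a queued document needs in this sketch."""

    def has_file(self) -> bool: ...
    def enhance(self) -> None: ...


def process_enhancement_queue(
    fetch_queue: Callable[[int], Iterable[Enhanceable]],
    gpu_available: bool,
    batch_limit: int = 10,
    debug: Callable[[str], None] = print,
) -> int:
    """Enhance queued documents and return how many succeeded."""
    if not gpu_available:
        return 0  # nothing is fetched or processed without a GPU
    enhanced = 0
    for doc in fetch_queue(batch_limit):  # batch_limit is passed to the query
        if not doc.has_file():
            continue  # skipped and not counted
        try:
            doc.enhance()
        except Exception as exc:
            debug(f"enhancement failed: {exc}")
            continue  # one failure does not stop the batch
        enhanced += 1
    return enhanced
```

Catching the exception inside the loop is what lets the remaining documents in the batch proceed after one of them fails.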