Versuro
MVP and (demo) description
November 2023
Denis D. Dell
TechnoCore Automate (Pty) Ltd t/a TechnoCore
Registration Number: 2016/464893/07
Directors: D. Dell, F. Heyningen, P. Shortridge
Address: 13 Norita Crescent, Rosendal, Durbanville, Cape Town, South Africa, 7550
Phone: 082-512 8506
E-Mail: info@TechnoCore.co.za
Contents
Introduction
Structure
Data Sources
Bureaux (Vericred)
Credit Providers - include the PUB part
Research
Data Layer
Data Models
Data transport
Analytical Model Store - SAS/SPSS/Python hooks
Feature Store
Reporting Marts
Generative AI
Decision Layer
Designers
Execution Switch
Data Layer Integration
Intervention Layer
PUB workers
Action Services
Payment Gateways
Data Flows
Client decision switch
Customer List Preparation
Variable and Model selection
Data Collection
Decision Switch Run
Decision Switch Write to Actions
Decision Switch Write to Data Layer
Bureau data preparation (creation of variables)
Client own data preparation (anonymisation, variable creation, model creation)
Models and features data preparation (variable creation, model creation)
Introduction
The purpose of this document is to define a Minimum Viable Product (MVP) specification. A Minimum Viable Product (“MVP”) is the most basic version of a product or software that includes just enough essential features to satisfy early users and gather feedback. Its primary goal is to quickly and cost-effectively test and validate the product concept and its potential in the market, allowing for adjustments and improvements based on user input. By focusing on the core functionalities, it will help the team minimise development time and resources while delivering a functional solution that can be iteratively expanded upon as the project progresses and more features are added in response to user needs and market demands.
This document aims both to describe the complete MVP and to describe a "demo" variant: one that presumes certain dependencies (such as bureau data streaming) may not be met and sets out planned work-arounds, so that a demonstrable product is still delivered even if the full MVP is not yet possible. Given the importance of this document to the final scope of the MVP, it is envisaged that it will undergo a number of iterations.
It should be noted that the deliverable within the November Sprint reflects only TechnoCore's understanding of what is expected from the MVP. It is anticipated that the final version of the MVP will only be signed off by all parties after it has undergone a few iterations. To meet the March/April MVP deadlines, the final version is expected to be signed off before the middle of January, failing which the overall project timeline may come under pressure.
Structure
Data Sources
Bureaux (Vericred)
Data will be provided by the credit bureau (likely Vericred). The precise format of the data is not yet known; however, broadly speaking, it is expected to include search and customer identifiers, account-level tradeline data (such as balance and delinquency histories), and aggregate variables.
Structurally, it is likely this data will be supplied to the rest of the process in a compact serialised format such as Avro, but conceptually it can be visualised as in the following example:
{
  "Search_id": "",         # a unique id for this bureau search
  "IDnumber": "",          # customer SA ID number
  "Passportnumber": "",    # customer passport number
  "Accounts": [
    {
      "Type": 52,          # e.g. 52 might mean home loan, or credit card
      "Industry": 7,       # e.g. 7 might mean secured credit
      "IndType": "7-52",   # combination variable
      "Unique_id": "",
      "Balances": [        # balances ordered t-0 to t-x, where x is the number of months
        534,
        635,
        755,
        800,
        …
      ],
      "Delinquency": [3, 2, 1, 1, 1, 0, 0, …]  # delinquencies ordered t-0 to t-x, where x is the number of months
    },
    {
      …
    }
  ],
  "AvgBalAll": 6500        # aggregate variable
}
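As a minimal sketch of consuming such records (assuming the feed arrives as Avro container files, which is not yet confirmed), a Python reader could look as follows; the file name is hypothetical and the field access mirrors the conceptual example above:

import fastavro  # assumption: Avro container files as the transport format

def read_bureau_records(path: str):
    """Yield bureau search records from an Avro container file."""
    with open(path, "rb") as fo:
        for record in fastavro.reader(fo):
            yield record

for rec in read_bureau_records("bureau_batch.avro"):  # hypothetical file name
    # Accounts is a list of tradelines; Balances and Delinquency run t-0 back to t-x
    for account in rec.get("Accounts", []):
        latest_balance = account["Balances"][0]
        worst_recent_delq = max(account["Delinquency"][:6])
        print(rec["Search_id"], account["IndType"], latest_balance, worst_recent_delq)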
Integrations
It will be critical to consider batch versus single-record (one-by-one) processing options when designing this.
As such, there are two data integrations with the credit bureau to be considered:
In both cases, the ability to pull only the necessary data, as opposed to entire packets, would be useful.
Credit Providers (Clients)
The decision switch will operate in some form in the client's environment, with two key applications.
Integrations
Research
Research data will be created from the research campaigns that Versuro sends out to gather customer insights.
It is envisaged that this data can be linked back to the bureau and client customer data in order to create additional features, models, and journeys.
Data Layer
The data layer, often referred to as the data tier, is a fundamental component of a layered software architecture that focuses on the storage, retrieval, and management of data. It serves as the foundational infrastructure where data is stored in databases, file systems, or other storage mechanisms, making it accessible to various parts of an application or system. The data layer is crucial for data-driven applications, as it enables the secure storage, retrieval, and manipulation of data, supporting functions such as data validation, indexing, and data integrity maintenance. It also plays a pivotal role in ensuring data consistency, scalability, and accessibility for other layers of an application, making it a key element in modern software development and data management.
We will use modern data stack lineage processes and principles in building and managing the data layer. This is critical to future-proofing our data science plans for the data. We will make use of metadata, data lineage, and data cataloguing.
The above principles allow the establishment of robust change controls to data structures. This should prevent or at the very least, limit the natural confusion that arises as data disperses into the various structures and marts as its use cases proliferate. We will aim to implement the right framework to achieve the following:
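As a minimal sketch of what such controls could capture, a catalogue entry per dataset might record its schema, lineage, and ownership; the field names below are illustrative assumptions, not a final schema:

# Illustrative catalogue entry; field names are assumptions, not a final schema
catalog_entry = {
    "dataset": "bureau.summary_variables",
    "owner": "data-engineering",
    "schema_version": "1.0.0",
    "upstream": ["kafka://bureau-feed"],                 # lineage: where the data came from
    "downstream": ["analytics_db.summary_variables"],    # lineage: where it flows to
    "fields": {
        "BureauRef": {"type": "string", "description": "Unique reference for each bureau record"},
        "CustId": {"type": "string", "description": "One-way encrypted customer key"},
    },
}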
Integrations
Data Models
Data models are used to outline the relationships between different entities and attributes within our system. Most of our MVP will be a flow of data through various systems and these models are to help us with describing how we expect the data to move, interact and be stored as well as the constraints that will govern them.
At this stage, not much is known specifically about what data will be included in the data models; however, broadly speaking, they will be built for the data sources described in the sections below.
As much as possible, the data modelling process will also be used to set standardised data naming conventions.
Conceptual models
These models provide a high-level overview of the data requirements for a system, focusing on the business concepts and their relationships without delving into technical details.
Data Sources - Credit Bureau
An example of the sorts of data that would make up the credit bureau data model is shown below. Note that no assumptions are made at this stage about how the data is stored.
Name | Type | Description | Keys | Relationships
Summary_Variables | Simple key-value | | PK: BureauRef; K: CustId | [M:1] [CustId : CustId] with DataLake_CustomerTable
Tradeline | Complex key-[{value}] | | PK: BureauRef; K: CustId | [M:1] [CustId : CustId] with DataLake_CustomerTable
An example of the field list for the Summary_variables table from the bureau:

Summary_variables field list

Field Name | Type | Description | Index
BureauRef | string | Unique reference for each bureau record | Primary Key
CustId | string | One-way encrypted key indicating the customer, which can be joined with other tables | Key
AverageBalL6M | currency | Average balance of all credit lines over the last 6 months |
TotUnsecLines | integer | Total number of currently open unsecured credit lines |
… | | |
Data Sources - Client
Standardised data models for financial services companies exist; an example set of entities is shown below.
Name | Type | Description | Keys | Relationships
Customer (Party) | Simple key-value | | PK: CustId |
Accounts | Simple key-value | | PK: AccountId; K: CustId |
Transactions | Complex key-[{value}] | | PK: TransactionId; K: AccountId; K: CustId |
Events | Complex key-[{value}] | | PK: EventId; K: AccountId; K: CustId |
Location | Simple key-value | | PK: LocationId; K: CustId |
Document | Complex key-[{value}] | | PK: DocumentId; K: AccountId; K: CustId |
Data Sources - Research
Research data shows the results of research campaigns
Data Sources - Analytical Models
These data models would encompass data created in the modelling processes and include data covering:
Logical models
These models refine the conceptual model by introducing more specific information about the data elements, their attributes, and the constraints that govern them. They represent a more technical view of the data structure.
The logical models will be based on decisions about how the data should be stored - for example, if they are to be stored in a structured MySQL table they will be described as follows:
Table Summary_variables

Field Name | Type | Description | Index
BureauRef | varchar(50) | Unique reference for each bureau record | Primary Key
If, however, it is to be stored as JSON or in a document store, it will be described as follows:

Key | Value
BureauRef | varchar(50)
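As a minimal sketch of the structured variant (assuming MySQL is chosen; fields beyond BureauRef and their types are illustrative, not part of the logical model above):

import mysql.connector  # assumption: MySQL for the structured variant

# Illustrative DDL; only BureauRef is specified in the logical model above
DDL = """
CREATE TABLE IF NOT EXISTS Summary_variables (
    BureauRef     VARCHAR(50) NOT NULL,
    CustId        VARCHAR(64) NOT NULL,      -- one-way encrypted customer key
    AverageBalL6M DECIMAL(15, 2),            -- 'currency' in the conceptual model
    TotUnsecLines INT,
    PRIMARY KEY (BureauRef),
    KEY idx_custid (CustId)
);
"""

conn = mysql.connector.connect(host="localhost", user="etl", password="...", database="bureau")
conn.cursor().execute(DDL)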
Physical models
These models provide the most detailed representation of the data, specifying how the data will be physically stored and managed in the database. They consider factors such as storage allocation, data access methods, and database performance.
Here we will deal with issues such as whether to partition tables and what storage and backup requirements to apply (for example, whether to allow auto-scaling of an RDS instance's storage up to 1 TB, what backup schedule to run, and what replication to set up).
We will also unpack the data control and storage requirements for Kafka.
Data transport
Kafka will be used as the transport mechanism for most of the data as it moves between systems. The expected Kafka touchpoints are listed below. Note that instead of using master-slave replication from a primary database to a secondary database, it is more efficient to have both the primary and secondary stores subscribe to the same Kafka publisher.
Publisher | Subscriber
Credit Bureau | Primary DB
Credit Bureau | Analytics DB
Client | Primary DB
Client | Analytics DB
Research | Primary DB
Research | Analytics DB
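As a minimal sketch of this pattern (assuming the kafka-python client and an illustrative topic name), each target store runs its own consumer group against the same topic, so both receive every message without database-to-database replication:

from kafka import KafkaConsumer  # assumption: kafka-python client
import json

def run_loader(group_id: str, handle_record):
    """Each loader has its own consumer group, so every loader sees every message."""
    consumer = KafkaConsumer(
        "bureau-data",                        # illustrative topic name
        group_id=group_id,                    # e.g. "primary-db-loader" or "analytics-db-loader"
        bootstrap_servers=["localhost:9092"],
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for message in consumer:
        handle_record(message.value)          # e.g. upsert into the target database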
Kafka won't be used in the modelling development; instead, data connectors that allow the modelling tools to engage with the modelling data marts will be used - for example, an SPSS ODBC connector directly to the data source.
Data replication and data transformation workers will be used to generate the following data repositories:
When models or features are updated, robust change processes will be required for their changes to be applied to the in-production code that is making use of the models and features.
Integrations
Analytical Model Store - SAS/SPSS/Python hooks
The data required for the modelling will be stored as is most appropriate; this is anticipated to be:
For the MVP, data-lake-style storage models (using Spark, HDFS, etc.) will not be implemented.
Modelling tools like SAS/WPS, SPSS or Python will be able to engage with these data sources using standard database connectors.
For the MVP, specialist GPU servers (should modelling require CUDA) will not be provisioned.
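As a minimal sketch of such a connection from Python (assuming an ODBC DSN has been configured for the modelling mart; all names are illustrative):

import pyodbc  # assumption: ODBC, matching the SPSS connector mentioned earlier

conn = pyodbc.connect("DSN=modelling_mart;UID=analyst;PWD=...")  # hypothetical DSN
cursor = conn.cursor()
cursor.execute("SELECT BureauRef, CustId, AverageBalL6M FROM Summary_variables")
for row in cursor.fetchmany(5):
    print(row)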
Integrations
Feature Store
A feature store is a critical component in modern data-driven organizations, providing a centralized repository for managing and organizing the features or data attributes used to build machine learning models and analytical applications. It serves as a bridge between data engineering and data science, facilitating the seamless transition from raw data to model development and deployment. Feature stores enable data scientists and machine learning engineers to efficiently access, share, and reuse features, reducing redundancy and promoting consistency in feature engineering. By promoting the standardization and versioning of features, feature stores enhance collaboration, speed up model development, and improve the overall governance of machine learning assets. They play a vital role in ensuring that organizations can leverage data effectively to drive insights and make informed decisions, while also maintaining data quality and lineage throughout the machine learning lifecycle.
A feature store will be created to hold the common, curated, signed-off features that have been developed by the modelling team for use in the models.
Each feature will be clearly defined in terms of the source variables and calculations used.
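As a minimal sketch of how such a definition could be captured (the field names are illustrative assumptions, not a final schema):

from dataclasses import dataclass

@dataclass
class FeatureDefinition:
    """A curated, signed-off feature as it might be registered in the feature store."""
    name: str
    version: str
    source_variables: list       # the raw inputs the feature is derived from
    calculation: str             # documented, reviewable derivation logic
    signed_off_by: str = ""

avg_bal_l6m = FeatureDefinition(
    name="AverageBalL6M",
    version="1.0.0",
    source_variables=["Accounts[].Balances[0:6]"],
    calculation="mean of all credit-line balances over the last 6 months",
)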
Integrations
Reporting Marts
A reporting mart, also known as a data mart, is a specialized database or data repository designed to store and manage data for the specific purpose of generating reports and supporting business intelligence (BI) and data analytics initiatives. It typically contains a subset of data from an organization's larger data warehouse, tailored to the needs of a particular business unit or department. Reporting marts are optimized for quick and efficient retrieval of data, making them ideal for generating ad-hoc and pre-defined reports, dashboards, and data visualizations. They play a crucial role in providing decision-makers with timely, relevant, and structured information, helping organizations make informed choices, monitor performance, and gain valuable insights into their operations. Reporting marts simplify the reporting process, improve data accessibility, and empower users to explore and analyze data independently, contributing to better decision-making and strategic planning.
For the MVP, a list of no more than five reports will be created, based on core specifications provided by the business. Every attempt will be made to make these reports performant. Initial reports will be delivered via MS Excel or MS Power BI.
Integrations
Generative AI
Generative AI, a subset of artificial intelligence, is a technology that empowers machines to create content, data, or other output that is not explicitly programmed but generated autonomously. It leverages techniques such as neural networks, deep learning, and natural language processing to produce text, images, audio, and even video. Generative AI has been instrumental in various applications, including natural language generation, where it can craft human-like text or creative writing, as well as in image generation, where it can produce artwork and even realistic faces. This technology has sparked excitement and debate for its creative potential, while also raising ethical concerns regarding issues like misinformation, copyright, and privacy. Generative AI is an area of ongoing research and development, with ever-evolving capabilities and potential applications across diverse domains.
For the MVP, no work will be done on Generative AI other than setting up the data structures to allow for future application of genAI models like Llama2.
Integrations
Decision Layer
The decision layer, often referred to as the decision-making layer, is a crucial component within various computational systems and software architectures. It plays a central role in processing data and information to make informed choices and determine the appropriate course of action based on predefined rules, algorithms, or logic. This layer encompasses decision engines, business rules, and decision support systems that analyze data, assess conditions, and trigger actions or outputs in response to specific criteria. Decision layers are prevalent in fields such as artificial intelligence, business intelligence, and process automation, where they aid in automating routine decision-making, enhancing efficiency, and ensuring consistency in complex systems. They are instrumental in driving intelligent processes, optimizing resource allocation, and supporting real-time decision-making in a wide range of applications, from finance and healthcare to manufacturing and logistics.
The decision layer is where processes run, decisions are generated, and models are deployed.
The tools created will be able to read in business specifications from the following standards:
The code provided will be able to take instructions in the format of these designs and apply them in a serverless high speed architecture to convert payload inputs into appropriate outputs.
The components involved in the decision layer will include:
BPMN
BPMN, or Business Process Model and Notation, is a standardized visual language and notation used in business process management and workflow modeling. It provides a clear and universally accepted way to graphically represent and document various aspects of a business process, such as tasks, events, gateways, and the flow of activities. BPMN diagrams enable organizations to depict their processes with precision, facilitating communication and collaboration among stakeholders, including business analysts, process designers, and IT professionals. The notation uses a set of symbols and conventions to represent the sequence, structure, and behavior of processes, making it a valuable tool for modeling, analyzing, and optimizing business workflows. BPMN's versatility and ability to capture both high-level and detailed aspects of a process make it a cornerstone in the field of process management, helping organizations improve efficiency, transparency, and alignment with business objectives.
DMN
DMN, or Decision Model and Notation, is a standardized graphical notation used for modeling and representing decision logic and business rules within an organization's processes and systems. It provides a clear and structured way to express decision-making logic and rules, enabling organizations to document, analyze, and automate complex decisions. DMN diagrams use a set of symbols and elements to represent decision tables, input data, and output results, making it easier for business analysts and decision-makers to define, understand, and manage decision rules. DMN complements BPMN (Business Process Model and Notation) by focusing on the specific logic governing business decisions, allowing organizations to create transparent, consistent, and reusable decision models that can be integrated into various applications and systems. DMN has become an essential tool for businesses seeking to enhance decision-making processes, reduce risk, and improve agility in a rapidly changing business landscape.
CMMN
CMMN, or Case Management Model and Notation, is a standardized graphical notation used for modeling and managing dynamic and unstructured business processes and cases. Unlike traditional business process modeling notations like BPMN (Business Process Model and Notation), CMMN is specifically designed to address complex and adaptive scenarios where the sequence of activities may not be predetermined. CMMN diagrams use visual elements to represent cases, stages, tasks, events, and milestones, allowing organizations to create flexible and adaptive models for handling cases, incidents, or processes with a high degree of variability. CMMN is valuable where cases can evolve unpredictably and require agile and collaborative management. It enables organizations to improve case management efficiency, provide better customer experiences, and respond to changing requirements in dynamic environments.
PMML
PMML, or Predictive Model Markup Language, is an XML-based standard used for representing and exchanging predictive models created in various data analytics and machine learning tools. PMML allows organizations to export trained machine learning and predictive models from one platform and import them into another, without the need for complex and time-consuming model rebuilding. This interoperability helps bridge the gap between data analytics, model development, and deployment, facilitating the integration of predictive models into various applications and systems. PMML supports a wide range of predictive models, including decision trees, regression models, neural networks, and more, making it a versatile and valuable tool in the field of data science and predictive analytics. It plays a significant role in enhancing the portability and scalability of predictive modeling solutions across diverse industries and applications.
CJMN
CJMN, or Customer Journey Model Notation is a standardized framework and visual notation used for mapping, documenting, and analyzing customer journeys. It offers a structured way to represent the various stages and touchpoints in a customer's interaction with a product, service, or brand. CJMN typically uses symbols, diagrams, and notations to depict customer experiences, emotions, and behaviors throughout the journey. This notation helps businesses gain insights into the customer experience, identify pain points, and optimize interactions to enhance customer satisfaction and loyalty. By providing a common language for visualizing customer journeys, CJMN facilitates communication and collaboration among teams and stakeholders, making it a valuable tool in the field of customer experience management and design.
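As one example of how these standards plug into the decision layer, here is a minimal sketch of scoring a PMML-exported model from Python (assuming the pypmml package; the actual scoring engine has not yet been chosen):

from pypmml import Model  # assumption: pypmml as the PMML scoring engine

# Load a trained model exported as PMML from, e.g., SAS or SPSS
model = Model.fromFile("credit_b1.pmml")  # hypothetical exported model file

# Score one payload's variables; input names must match the PMML data dictionary
result = model.predict({"AverageBalL6M": 2567, "TotUnsecLines": 3})
print(result)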
Execution Switch
An execution switch is a component or mechanism in computer systems and network architecture that facilitates the efficient routing and management of data, processes, or tasks within a system. It operates as a controller that can determine the flow of operations or instructions based on specific conditions, commands, or triggers. Execution switches are prevalent in various computing contexts, including network switches that route data packets to their destination, process schedulers in operating systems that manage task allocation, and decision-making components in automation and control systems. They play a vital role in optimizing system performance, ensuring efficient resource allocation, and responding to dynamic operational demands, making them a fundamental part of modern computing and communication infrastructures.
Data Layer Integration
Data layer integration refers to the process of seamlessly connecting and harmonizing data from various sources and systems within an organization. It plays a pivotal role in data management and analytics, allowing businesses to aggregate, transform, and make sense of diverse data types, whether they originate from databases, applications, IoT devices, or external sources. This integration typically involves the use of middleware, ETL (Extract, Transform, Load) processes, APIs, and data connectors to ensure data compatibility and consistency. A well-executed data layer integration strategy empowers organizations to achieve a unified and holistic view of their data, which, in turn, supports informed decision-making, enables data-driven insights, and enhances overall operational efficiency. It also plays a fundamental role in modern data architectures and data-driven initiatives, such as business intelligence, machine learning, and advanced analytics.
Intervention Layer
An intervention layer, often found in software and information systems, is a critical component designed to interact with and influence the behavior or processes of a system. It acts as an intermediary between various system components and external stakeholders, allowing for real-time monitoring, control, and intervention when specific conditions or events occur. The intervention layer is employed to enforce rules, trigger automated actions, and provide decision support to maintain the system's desired state, security, and performance. It plays a key role in areas like cybersecurity, process automation, and network management, where timely responses to anomalies or deviations are essential. The intervention layer is instrumental in ensuring system integrity, protecting against potential threats, and optimizing system performance by enabling proactive responses to emerging issues or opportunities.
PUB workers
Credit Providers
Action Services
Email
Email, short for electronic mail, is a fundamental communication tool in the digital age. It allows individuals and organizations to send messages, documents, and multimedia content across the internet, delivering information quickly and conveniently. With its widespread use, email has become a primary means of professional and personal communication, enabling correspondence, collaboration, and information sharing on a global scale. It provides the flexibility to access messages from various devices and platforms, and email services often include features like attachment sharing, filtering, and organization. Despite the evolution of other communication channels, email remains a ubiquitous and essential tool for both formal and informal exchanges, making it a cornerstone of modern communication.
SMS
SMS, or Short Message Service, is a widely used telecommunications service that enables the exchange of short text messages between mobile devices. It has become an integral part of modern communication, allowing individuals to send brief, text-based messages quickly and efficiently. SMS is not only used for person-to-person communication but also serves various business and information-sharing purposes, including two-factor authentication, appointment reminders, and marketing campaigns. With its broad accessibility and ubiquity, SMS remains a reliable and convenient means of staying connected and conveying information in a fast-paced, mobile-centric world.
Whatsapp
WhatsApp is a popular messaging application that has revolutionized the way people communicate globally. Acquired by Facebook, it allows users to exchange text messages, voice calls, and multimedia content such as photos and videos, using an internet connection. Known for its end-to-end encryption, WhatsApp prioritizes privacy and security, making it a trusted platform for personal and business communication. With features like group chats, voice and video calls, and status updates, WhatsApp offers a versatile and user-friendly experience. It has also become a significant tool for businesses, enabling customer support, appointment reminders, and even e-commerce transactions through its Business API. WhatsApp's widespread adoption has made it a central hub for social interaction, helping users connect with friends, family, and colleagues, regardless of geographical boundaries.
SFTP
SFTP, or Secure File Transfer Protocol, is a network protocol used for securely transferring files and managing data across networks. It is designed to provide a secure and encrypted method of file transfer, making it a reliable choice for organizations and individuals who need to exchange sensitive data over the internet or within private networks. SFTP uses SSH (Secure Shell) to establish a secure connection, ensuring data integrity and confidentiality. Unlike its predecessor, FTP (File Transfer Protocol), SFTP encrypts both the data being transmitted and the authentication credentials, protecting against eavesdropping and unauthorized access. SFTP is widely adopted in various industries, including IT, finance, and healthcare, where secure data exchange is crucial. It offers a robust solution for secure file sharing, remote server management, and automated data transfer processes, making it a valuable tool in modern data management and network administration.
Payment Gateways
Payment gateways are essential components of e-commerce and online transactions, facilitating the secure transfer of funds between customers and businesses. These platforms act as intermediaries that connect online merchants with payment networks, ensuring the smooth processing of payments from various sources, including credit cards, digital wallets, and bank transfers. Payment gateways play a crucial role in verifying and encrypting sensitive financial data, protecting both buyers and sellers from fraudulent activities. They enable businesses to accept a wide range of payment methods, offering convenience to customers and expanding the reach of online sales. With their role in ensuring secure and efficient online payments, payment gateways are integral to the success of e-commerce and the broader digital economy.
Data Flows
There are a number of data flows through the system.
An example of a data journey would be "CreditHealthCheck".
Variable and Model selection
The chosen Data_journey will be linked to the variables and models required for this customer, e.g.:
For variables it might have:
Data_journey | Variable Source | Variable
CreditHealthCheck | Bureau | Average bal L6M
 | Bureau | Worst Delq on unsecured
 | Model A | xyz
… | … | …
For models it might be:

Data_journey | Model Type | Model
CreditHealthCheck | Logistic | credit_b1
 | CatBoost | attrition_a5
… | … | …
For Data Flows it might be:

Data_journey | BPMN | Flow
CreditHealthCheck | BPMN | credit_health
 | BPMN | finance_health
… | … | …

These will inform the system as to what data and models it needs to invoke.
Data Collection
Based on the previous section, the JSON packet will be enhanced with the required information, e.g.:
{
  "Run_id": "",  # a reference that uniquely identifies this specific run
  …
  "Data Flows": [
    "credit_health",
    "finance_health"
  ],
  "Variables": {
    "Bureau": {
      "AvgBalL6M": 2567,
      "WorstDelq": …
    },
    "Model A": {
      "Xyz": 5
    }
  },
  "Models": {
    "Logistic": ["credit_b1"],
    "CatBoost": ["attrition_a5"]
  }
}
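As a minimal sketch of this enrichment step (assuming the journey configuration tables above are available as simple lookups; all names are illustrative):

JOURNEY_CONFIG = {  # hypothetical lookup built from the selection tables above
    "CreditHealthCheck": {
        "data_flows": ["credit_health", "finance_health"],
        "variables": {"Bureau": ["AvgBalL6M", "WorstDelq"], "Model A": ["Xyz"]},
        "models": {"Logistic": ["credit_b1"], "CatBoost": ["attrition_a5"]},
    }
}

def enrich_packet(packet: dict, journey: str, variable_values: dict) -> dict:
    """Attach the flows, variables and models a journey needs to the run packet."""
    config = JOURNEY_CONFIG[journey]
    packet["Data Flows"] = config["data_flows"]
    packet["Variables"] = {
        source: {var: variable_values.get(var) for var in var_names}
        for source, var_names in config["variables"].items()
    }
    packet["Models"] = config["models"]
    return packet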
Decision Switch Run
The Decision Switch run will have the following already set up:
The payload will enter the DS node and the node will use the information in the payload to run the data through the correct models and data flows and craft the outputs required.
The exit payload will be enhanced with the outputs of the activity, e.g. the payload now might have the following additional sections:
{
  "Run_id": "",  # a reference that uniquely identifies this specific run
  …
  "Outputs": {
    "Credit health": "Diamond",
    "Finance health": "Bronze",
    "Data Flows": [
      {
        "Name": "credit_health",
        "Nodes Activated": ["a1", "a2", "a7", "b5", …]
      },
      …
    ]
  }
}
The output of this run will provide the following:
Decision Switch Write to Actions
Using Kafka, a subset of the decision switch payload outcomes will be used to write to action topics that are being monitored by action services that will then be triggered.
Ideally, everything the action needs will be part of the write to the topic, so the action service does not have to fetch customer-level data from the data layer.
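As a minimal sketch of such a self-contained action message (assuming kafka-python; the topic and field names are illustrative):

from kafka import KafkaProducer
import json

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Everything the SMS action service needs travels with the message itself,
# so the service never has to query the data layer for customer-level data.
producer.send("actions-sms", {               # hypothetical action topic
    "Run_id": "...",
    "msisdn": "+27825551234",                # illustrative recipient
    "template": "credit_health_result",
    "params": {"tier": "Diamond"},
})
producer.flush()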
Decision Switch Write to Data Layer
Using Kafka, the decision switch will write to the data layer in 3 ways:
Bureau data preparation (creation of variables)
Bureau data journey triggered by data changes (f700, applications)
It is presumed that this might be covered by Hardy Jonck’s software
There may be a need to implement feature creation from the bureau data - ideally within the bureau.
Client own data preparation (anonymisation, variable creation, model creation)
This is where data needs to be anonymised or masked for interaction with other data sources.
It also covers the creation of special variables for customers with specific correlations to desired outcomes, and the creation of models based on data available in the client space (which could include bureau data); a sketch of the anonymisation step follows below.
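As a minimal sketch of one-way anonymisation (assuming a keyed HMAC, so the same customer always maps to the same CustId without being reversible; key management details are omitted):

import hashlib
import hmac

SECRET_KEY = b"replace-with-a-secret-key"  # held securely in the client environment, never shared

def anonymise_customer_id(sa_id_number: str) -> str:
    """One-way encrypt an identifier into the CustId used as a join key."""
    return hmac.new(SECRET_KEY, sa_id_number.encode("utf-8"), hashlib.sha256).hexdigest()

# The same input always yields the same CustId, so anonymised datasets
# from different sources can still be joined on CustId.
cust_id = anonymise_customer_id("8001015009087")  # illustrative ID number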
Models and features data preparation (variable creation, model creation)
The data will be stored in an RDBMS style to support the modelling and analytical tools currently available to the team (WPS, SPSS, SQL).
A normalised database will be created that will give the analysts play areas in which to work with the data and create models and variables/features.
A structured, well-controlled process will be created that will allow models and features to be uploaded for production use in the decision switch data flow.