cbaWorkfow application

Custom Built Apps ltd offers the cbaWorkflow application

cbaWorkflow application is a Linux based workflow application which creates a set of execution nodes for Big Data and warehouse projects. It is written in C++ and uses free Linux libraries

cbaWorkflow keeps track of the execution schedule, states of the nodes, involves data consistency verification in a set of nonblocked components coordinated by the main execution engine. The execution status is kept in shared memory, so service applications can have access to the map to modify it, if required or obtain a real time information. Other applications can check the execution status from the database repository table. CbaWorkflow persists the map to database on a set interval.

General description

Workflow will obtain the description of the nodes from a repository (postgreSQL database) and will instantiate node objects.
Workflow will start a set of nodes which will execute themselves
the nodes are starting in levels of dependency, so that resources are not wasted on the nodes which do not yet hav to execute
workflow will obtain the dependencies of the current run from repository and pass the dependency to every node it creates.
each node knows what it depends on by storing an vector of node ids they depend on(reverse adjacency list)
the nodes will know how to execute themselves
the nodes will have a method which will run the execution when the dependencies have completed, or sleep if the dependency have not resolved and keep polling the execution status in shared memory segment
the nodes will update a map in shared memory with their progress
the nodes will check a shared memory map whether the nodes they depend on have completed execution
workflow will keep the list of the allocated node objects and delete them after their job is completed
workflow will rerun the node which failed execution
workflow will monitor the progress of the execution and asynchronously update the database with the information on start time, end time and progress

Overall process

Resources

repository database,hosted in PostgreSQL or MariaDB or Sqlite3
repository stores information about the nodes
repository stores information about the current execution status in a table node_execution_status (node_id, execution_status)
repository stores information about the current execution status(node_id,start_time,complete_time)
repository stores information about node dependencies (node_id, parent_node_id)
Framework is implemented using C++

key elements used:

cbaWFRepositoryConnector - populates application structures from repository(PostrgreSQL or OS files)
C libpq library - to retrieve data from postgress, needs to exist on the system where application runs. a cbaWFRepositoryConnector derived class is linked to libpq
boost_managed_memory, boost_interprocess - to keep the execution status in the shared memory and named semaphores to synchronize the access to shared memory
cbaGraph - calling the graphviz libraries to generate a graph of execution. The data for the nodes are retrieved from the nodes table in postgres, the dependencies are retrieved from the adjacency_lists table in PostrgreSQL
cbaNodeWalker - reverse engineering a set of hql/sql scripts to obtain information on the nodes(a script) and execution dependency. It can store the information if files or in nodes, ajacency)lists tables in PostgreSQL as nodeFrom→nodeTo format).

workflow framework will call the applications written in different languages and get the result codes of execution and logs parsing

scala generated jars for spark-submit nodes
java generated jars for hdfs access
spark sql scripts for spark-sql nodes
shell scripts for shell nodes
Tomcat servlets for generating reports of current status of execution
Tomcat application to create a set of nodes and populate the dependencies
Neo4j use is investigated to generate a dependency graph, which could be exported as Cypher query to get a complex list of dependencies.

installation and configuration instructions

Installation of RPM

login as root
if the repository for yum is set up then follow the instructions in packaging cbaWorkflow and execute: yum install cbaWorkflow
If the yum hosted Nexus repository has not been set up ,then obtain the rpm , currently cbaWorkflow-1.0.5-1.x86_64.rpm from Nexus or ftp it : wget http://192.168.2.22:8081/repository/cbayum-hosted/cbaWorkflow/1.0.5/1/cbaWorkflow-1.0.5-1.x86_64.rpm
In the directory where the rpm is located issue the following command: yum install local cbaWorkflow-1.0.0-1.x86_64.rpm
Installing: cbaWorkflow x86_64 1.0.0-1 /cbaWorkflow-1.0.0-1.x86_64 8.9 M
Verify the installation:
ls -l /app/cbaWorkflow

rpm -qi cbaWorkflow

Name : cbaWorkflow

Version : 1.0.0

Release : 1

Architecture: x86_64

Install Date: Sat 27 Mar 2021 09:38:29 PM EDT

Group : Applications/File

Size : 9320360

License : available for purchase, free license up to 5 nodes

Signature : (none)

Source RPM : cbaWorkflow-1.0.0-1.src.rpm

Build Date : Sat 27 Mar 2021 04:29:19 PM EDT

Build Host : r02edge.custom-built-apps.com

Relocations : (not relocatable)

Packager : Boris Alexandrov, Custom Built Apps ltd

Vendor : Custom Built Apps ltd, Toronto, Ontario

URL : www.custom-built-apps.com

Summary : workflow application for data pipelines

Description :

The application runs a data pipeline of execution nodes like spark-sql, spark-submit, shell, hdfs commands in orchestration and cadence.

Contains a tomcat based interface for adding the nodes and monitoring the process

check the files list:
rpm -ql cbaWorkflow
/app/cbaWorkflow/bin/cbaWFMonitorTest

/app/cbaWorkflow/bin/cbaWorkflowTest

/app/cbaWorkflow/bin/repobuilder

/app/cbaWorkflow/bin/shmcreate

/app/cbaWorkflow/bin/shmupdate

/app/cbaWorkflow/bin/shmview

/app/cbaWorkflow/etc/cbaWorkflow.ini

/app/cbaWorkflow/etc/config6

/app/cbaWorkflow/etc/env.sh

/app/cbaWorkflow/etc/run.sh

/app/cbaWorkflow/lib/libcbaEndNode.so

/app/cbaWorkflow/lib/libcbaGraph.so

/app/cbaWorkflow/lib/libcbaHdfsExecNode.so

/app/cbaWorkflow/lib/libcbaNodeWalker.so

/app/cbaWorkflow/lib/libcbaShellNode.so

/app/cbaWorkflow/lib/libcbaSparkSQLNode.so

/app/cbaWorkflow/lib/libcbaSparkSubmitNode.so

/app/cbaWorkflow/lib/libcbaStartNode.so

/app/cbaWorkflow/lib/libcbaUtils.so

/app/cbaWorkflow/lib/libcbaWFRepositoryConnector.so

/app/cbaWorkflow/lib/libcbaWFRepositoryConnectorPG.so

/app/cbaWorkflow/lib/libcbaWFSharedMemory.so

/app/cbaWorkflow/lib/libcbaWorkflowNode.so

/app/cbaWorkflow/wars/cbaWorkflow.war

Configuring the cbaWorkflow.ini file

the file is located in the /app/cbaWorkflow/etc
-- FILE CONTENTS
dbname=cbaworkflowdb
user=dataexplorer1
password=2MuchTime
hostaddr=192.168.2.30
port=5432
polling_time=15
nodes_log_dir=/home/dataexplorer1/logs
connector_lib=libcbaWFRepositoryConnectorPG.so
--- END OF FILE
repository connector libcbaWFRepositoryConnectorPG.so is used to connect to PostgreSQL server
set database to the database where a repository is going to be created
update the parameters as needed
login as the user who is going to run the pipeline
copy the cbaWorkflow.ini into your home directory as a hidden file
cp /app/cbaWorkflow/etc/cbaWorkflow.ini ${HOME}/.cbaWorkflow.ini
create the directory for logs:
mkdir /home/dataexplorer1/logs

Setting the environment

login as the user which is going to be running datapipeline.
copy the /app/cbaWorkflow/etc/env.sh to your directory
cp /app/cbaWorkflow/etc/env.sh $HOME
chmod 755 env.sh
vi env.sh
verify the variables match your environment.
add the path to libstdc++.so.6x to LD_LIBRARY_PATH as the first entry, e.g if /usr/local/lib64/libstdc++.so
then edit the LD_LIBRARY_PATH as follows:
export LD_LIBRARY_PATH=/usr/local/lib64:${CBAWF_HOME}/lib:${LD_LIBRARY_PATH}
source the env.sh:
. env.sh
verify the environment:
[dataexplorer1@r01edge ~]$ env | grep CBA
CBAWFINI=/home/dataexplorer1/.cbaWorkflow.ini
CBAWF_HOME=/app/cbaWorkflow
env | grep PATH
LD_LIBRARY_PATH=/usr/local/lib64:/app/cbaWorkflow/lib:/app/cbaWorkflow/lib:/app/hadoop/lib/native:
PATH=/app/cbaWorkflow/bin:/app/cbaWorkflow/bin:/app/spark/bin:/app/spark/sbin:/app/oozie/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/app/hadoop/sbin:/app/hadoop/bin}:/app/hadoop/sbin:/app/hadoop/bin:/home/hdfs/hadoop/sbin:/home/hdfs/hadoop/bin:/app/hive/bin:/app/spark/bin:/home/dataexplorer1/.local/bin:/home/dataexplorer1/bin
which repobuilder
/app/cbaWorkflow/bin/repobuilder
verify there are no unresolved symbols in repobuilder
[dataexplorer1@r01edge ~]$ ldd -r $(which repobuilder)
linux-vdso.so.1 => (0x00007ffca33a8000)
librt.so.1 => /lib64/librt.so.1 (0x00007fc2a7f92000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fc2a7d76000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007fc2a7b72000)
libcbaWFRepositoryConnector.so => /app/cbaWorkflow/lib/libcbaWFRepositoryConnector.so (0x00007fc2a796f000)
libcbaUtils.so => /app/cbaWorkflow/lib/libcbaUtils.so (0x00007fc2a772b000)
libstdc++.so.6 => /usr/local/lib64/libstdc++.so.6 (0x00007fc2a73b0000)
libm.so.6 => /lib64/libm.so.6 (0x00007fc2a70ae000)
libgcc_s.so.1 => /usr/local/lib64/libgcc_s.so.1 (0x00007fc2a6e97000)
libc.so.6 => /lib64/libc.so.6 (0x00007fc2a6ac9000)
/lib64/ld-linux-x86-64.so.2 (0x00007fc2a819a000)

Creating the repository

verify the values in cbaWorkflow.ini are correct for the database, which is going to be a repository.
the user in the connection parameters needs to be able to create tables in the database
cat $CBAWFINI
edit if needed
run the generate repository program repobuilder
repobuilder
Generating a new cbaWorkflow repository
Verify the tables have been created in the repository database,using PostgreSQL interface: