cbaWorkflow application
Custom Built Apps Ltd offers the cbaWorkflow application.
cbaWorkflow is a Linux-based workflow application which creates a set of execution nodes for Big Data and warehouse projects. It is written in C++ and uses free Linux libraries.
cbaWorkflow keeps track of the execution schedule and the states of the nodes, and verifies data consistency in a set of non-blocked components coordinated by the main execution engine.
The execution status is kept in a map in shared memory, so service applications can access the map to modify it if required, or to obtain real-time information. Other applications can check the execution status from the database repository table. cbaWorkflow persists the map to the database at a set interval.
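The status map and its periodic persistence can be sketched roughly as follows. This is a minimal illustration that uses an in-process `std::map` guarded by a mutex as a stand-in for the shared memory segment. The names `StatusMap`, `NodeStatus` and `persistIfDue`, the status values and the callback-based sink are all hypothetical and are not taken from cbaWorkflow.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <map>
#include <mutex>

// Hypothetical status values; the real set of states is not given in the text.
enum class NodeStatus { Pending, Running, Completed, Failed };

// In-process stand-in for the shared-memory status map (an assumption).
class StatusMap {
public:
    using Clock = std::chrono::steady_clock;
    using Snapshot = std::map<int, NodeStatus>;

    explicit StatusMap(std::chrono::seconds interval) : interval_(interval) {
        // A non-positive interval would persist on every call.
        assert(interval.count() > 0);
    }

    // Called by nodes (or service applications) to record progress.
    void set(int nodeId, NodeStatus status) {
        std::lock_guard<std::mutex> lock(mutex_);
        statuses_[nodeId] = status;
    }

    // Returns Pending for nodes that have not reported yet.
    NodeStatus get(int nodeId) const {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = statuses_.find(nodeId);
        return it == statuses_.end() ? NodeStatus::Pending : it->second;
    }

    // Hands a copy of the map to `sink` (for example a database writer) if at
    // least `interval_` has passed since the previous persist.
    // Returns true if the sink was called.
    bool persistIfDue(Clock::time_point now,
                      const std::function<void(const Snapshot&)>& sink) {
        Snapshot copy;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (persistedOnce_ && now - lastPersist_ < interval_) return false;
            lastPersist_ = now;
            persistedOnce_ = true;
            copy = statuses_;
        }
        sink(copy);  // runs outside the lock so writers are not held up
        return true;
    }

private:
    mutable std::mutex mutex_;
    Snapshot statuses_;
    std::chrono::seconds interval_;
    Clock::time_point lastPersist_{};
    bool persistedOnce_ = false;
};
```

Passing the current time in as a parameter keeps the interval logic easy to exercise without waiting; a real persister would call it with `StatusMap::Clock::now()` from its own loop.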
General description
- Workflow will obtain the description of the nodes from a repository (PostgreSQL database) and will instantiate node objects.
- Workflow will start a set of nodes which will execute themselves.
- The nodes are started in levels of dependency, so that memory is not wasted on nodes which do not yet have to execute.
- Workflow will obtain the dependencies of the current run from the repository and pass the dependencies to every node it creates.
- Each node knows what it depends on by storing a vector of the ids of the nodes it depends on (a reverse adjacency list).
- The nodes will know how to execute themselves.
- The nodes will have a method which will run the execution when the dependencies have completed, or, if the dependencies have not resolved, sleep and keep polling the execution status in the shared memory segment.
- The nodes will update a map in shared memory with their progress.
- The nodes will check the shared memory map to see whether the nodes they depend on have completed execution.
- Workflow will keep the list of the allocated node objects and delete them after their job is completed.
- Workflow will rerun a node which failed execution.
- Workflow will monitor the progress of the execution and asynchronously update the database with the information on start time, end time and progress.
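The level-by-level start and the dependency check described in the list above can be illustrated with a short sketch. It assumes the reverse adjacency list is held as a map from node id to the ids it depends on, and that completion is tracked as a simple id-to-bool map; `Dependencies`, `computeLevels` and `dependenciesCompleted` are hypothetical names, not cbaWorkflow's own.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <stdexcept>
#include <vector>

// Reverse adjacency list: node id -> ids of the nodes it depends on.
using Dependencies = std::map<int, std::vector<int>>;

// Level 0 holds nodes with no dependencies; every other node sits one level
// above its deepest dependency. Throws if the dependencies contain a cycle.
std::map<int, int> computeLevels(const Dependencies& deps) {
    std::map<int, int> level;        // finished results
    std::map<int, bool> inProgress;  // nodes on the current recursion path

    struct Visitor {
        const Dependencies& deps;
        std::map<int, int>& level;
        std::map<int, bool>& inProgress;

        int visit(int node) {
            auto done = level.find(node);
            if (done != level.end()) return done->second;
            if (inProgress[node]) throw std::runtime_error("dependency cycle");
            inProgress[node] = true;
            int result = 0;
            auto it = deps.find(node);
            if (it != deps.end())
                for (int parent : it->second)
                    result = std::max(result, visit(parent) + 1);
            inProgress[node] = false;
            level[node] = result;
            return result;
        }
    };

    Visitor visitor{deps, level, inProgress};
    for (const auto& entry : deps) visitor.visit(entry.first);
    // Postcondition: every node with a dependency entry received a level.
    assert(level.size() >= deps.size());
    return level;
}

// True when every dependency of `node` is recorded as completed.
bool dependenciesCompleted(int node, const Dependencies& deps,
                           const std::map<int, bool>& completed) {
    auto it = deps.find(node);
    if (it == deps.end()) return true;  // no dependencies at all
    for (int parent : it->second) {
        auto c = completed.find(parent);
        if (c == completed.end() || !c->second) return false;
    }
    return true;
}
```

With levels computed up front, the workflow could instantiate only the nodes of the lowest unfinished level, while each node's polling loop would call something like `dependenciesCompleted` between sleeps.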
Overall process
Resources
- PostgreSQL database (repository)
- repository stores information about the nodes
- repository stores information about the current execution status in a table node_execution_status (node_id, execution_status)
- repository stores the start and completion times of the current execution (node_id, start_time, complete_time)
- repository stores information about node dependencies (node_id, parent_node_id)
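As an illustration of how the (node_id, parent_node_id) rows could be turned into the per-node dependency vectors, here is a minimal sketch. It assumes the rows have already been fetched into memory; `DependencyRow` and `buildReverseAdjacency` are hypothetical names, and no database access is shown.

```cpp
#include <cassert>
#include <map>
#include <vector>

// One row of the dependency table: (node_id, parent_node_id).
struct DependencyRow {
    int nodeId;
    int parentNodeId;
};

// Groups the rows by node_id, giving node id -> ids of the nodes it depends on.
std::map<int, std::vector<int>> buildReverseAdjacency(
        const std::vector<DependencyRow>& rows) {
    std::map<int, std::vector<int>> deps;
    for (const auto& row : rows) {
        // A node that depends on itself could never start.
        assert(row.nodeId != row.parentNodeId);
        deps[row.nodeId].push_back(row.parentNodeId);
        deps[row.parentNodeId];  // make sure root nodes appear with an empty list
    }
    return deps;
}
```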
- Framework is implemented using C++
key elements used:
- cbaWFRepositoryConnector - populates application structures from the repository (PostgreSQL or OS files)
- C libpq library - retrieves data from PostgreSQL; it needs to exist on the system where the application runs. A class derived from cbaWFRepositoryConnector is linked to libpq
- boost_managed_memory, boost_interprocess - keep the execution status in shared memory, with named semaphores to synchronize access to the shared memory
- cbaGraph - calls the Graphviz libraries to generate a graph of execution. The data for the nodes is retrieved from the nodes table in PostgreSQL; the dependencies are retrieved from the adjacency_lists table in PostgreSQL
- cbaNodeWalker - reverse-engineers a set of hql/sql scripts to obtain information on the nodes (each node is a script) and the execution dependencies. It can store the information in files or in the nodes and adjacency_lists tables in PostgreSQL, in nodeFrom→nodeTo format.
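To make the nodeFrom→nodeTo idea concrete, the sketch below turns a list of such edges into Graphviz DOT text. It only builds a string and does not call the Graphviz libraries the way cbaGraph does; `Edge`, `quoteDot` and `toDot` are hypothetical names, and the script names in the usage note are made up for illustration.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// One dependency edge in nodeFrom -> nodeTo form.
using Edge = std::pair<std::string, std::string>;

// Wraps a name in double quotes for DOT, escaping embedded double quotes.
std::string quoteDot(const std::string& name) {
    std::string out = "\"";
    for (char c : name) {
        if (c == '"') out += '\\';
        out += c;
    }
    return out + "\"";
}

// Produces a DOT digraph with one "from" -> "to" statement per edge.
std::string toDot(const std::string& graphName, const std::vector<Edge>& edges) {
    assert(!graphName.empty());
    std::ostringstream dot;
    dot << "digraph " << quoteDot(graphName) << " {\n";
    for (const auto& edge : edges)
        dot << "  " << quoteDot(edge.first) << " -> " << quoteDot(edge.second) << ";\n";
    dot << "}\n";
    return dot.str();
}
```

For example, the edges load.hql→stage.hql and stage.hql→report.sql would yield a three-node chain; the resulting text could be saved to a file and rendered with the Graphviz `dot` command-line tool.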
The workflow framework will call applications written in different languages, get the result codes of their execution and parse their logs.
- scala generated jars for spark-submit nodes
- java generated jars for hdfs access
- spark sql scripts for spark-sql nodes
- shell scripts for shell nodes
- Tomcat servlets for generating reports of current status of execution
- Tomcat application to create a set of nodes and populate the dependencies
- The use of Neo4j is being investigated to generate a dependency graph, which could be exported as a Cypher query to get a complex list of dependencies.
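Calling an external application and collecting its result code, together with a rerun on failure, might look like the sketch below. It assumes a POSIX system and launches the command through the shell with `std::system`; how cbaWorkflow actually starts its nodes is not stated in the text. `runNode`, `runWithRetry` and the `maxAttempts` limit are hypothetical.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <sys/wait.h>  // WIFEXITED, WEXITSTATUS (POSIX)

// Runs `command` through the shell and returns its exit code, or -1 if the
// command could not be started or did not exit normally (e.g. a signal).
int runNode(const std::string& command) {
    assert(!command.empty());
    int status = std::system(command.c_str());
    if (status == -1) return -1;
    if (WIFEXITED(status)) return WEXITSTATUS(status);
    return -1;
}

// Runs the command up to `maxAttempts` times, stopping at the first success.
// Returns the exit code of the last attempt.
int runWithRetry(const std::string& command, int maxAttempts) {
    int code = -1;
    for (int attempt = 0; attempt < maxAttempts; ++attempt) {
        code = runNode(command);
        if (code == 0) break;
    }
    return code;
}
```

A shell node could then be started with a call such as `runWithRetry("sh my_node.sh", 3)`, where the script name and the attempt count are placeholders.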