overview of the application

lifecycle of cbaWorkflow

  1. Read information from the Repository and populate the in-memory structures to work with:
    1. nodes
    2. adjacency lists
    3. execution schedule
  2. In parallel do the following tasks:
    1. create a shared memory segment holding a map of node_id to execution status
    2. calculate execution levels from the adjacency lists (generating an adjacency matrix along the way) and store the levels in a vector<vector<node_id>> (see the first sketch after this list)
    3. create cbaWorkflowNode objects for the nodes to be executed in the current run and store the pointers in a map<node_id, cbaWorkflowNode*>
  3. Read the levels vector and start the nodes of levels 0 and 1, each in its own thread; every node polls the shared memory segment for the completion of the nodes it depends on. Update the current execution level counter to 1, which happens to be the index into the levels vector.
  4. Start node 0, which triggers the execution of the level 0 nodes
  5. Start a loop polling the shared memory segment (see the second sketch after this list):
    1. check the nodes' completion status, and for each completed node of level N do the following:
      1. start level N+2 if not already started, and update the current level counter
      2. clean the node's memory if not already cleaned
    2. restart failed nodes up to a set number of times (this does not look feasible; manual restart is the safer option, see below)
  6. Exit the loop when the end node's status has changed to completed.
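
Step 2.2 is essentially a level-by-level topological sort. Below is a minimal C++ sketch, assuming the adjacency lists map each node to the nodes that depend on it; the names (computeLevels, node_id as an integer type) are illustrative, not the actual cbaWorkflow code.

    #include <cstdint>
    #include <map>
    #include <vector>

    using node_id = std::uint32_t;

    // adj[u] lists the nodes that depend on u (edges u -> v).
    std::vector<std::vector<node_id>>
    computeLevels(const std::map<node_id, std::vector<node_id>>& adj)
    {
        // Count incoming edges per node.
        std::map<node_id, int> indegree;
        for (const auto& [u, outs] : adj) {
            indegree.try_emplace(u, 0);
            for (node_id v : outs)
                ++indegree[v];
        }

        // Level 0: nodes with no dependencies.
        std::vector<node_id> current;
        for (const auto& [u, deg] : indegree)
            if (deg == 0)
                current.push_back(u);

        // Peel off one level at a time: a node lands in level N+1 when
        // its last dependency is removed while processing level N.
        std::vector<std::vector<node_id>> levels;
        while (!current.empty()) {
            levels.push_back(current);
            std::vector<node_id> next;
            for (node_id u : current) {
                auto it = adj.find(u);
                if (it == adj.end())
                    continue;
                for (node_id v : it->second)
                    if (--indegree[v] == 0)
                        next.push_back(v);
            }
            current = std::move(next);
        }
        return levels;   // levels[N] holds the node_ids runnable at level N
    }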
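The polling loop of step 5 could look roughly like the sketch below. This is only an illustration: the shared segment is stood in for by a plain array of per-node statuses, the status codes (-1 failed, 4 completed) follow the shmview/shmupdate examples further down this page, and startLevel/cleanNode are stubs, not the real workflow helpers.

    #include <atomic>
    #include <chrono>
    #include <cstddef>
    #include <set>
    #include <thread>
    #include <vector>

    using node_id = std::size_t;

    constexpr int STATUS_FAILED    = -1;   // shmview shows -1 for failed nodes
    constexpr int STATUS_COMPLETED =  4;   // shmupdate <node_id> 4 marks completion

    // Stand-ins for the shared memory segment and the workflow helpers.
    static std::atomic<int> g_status[1024];                 // status per node_id
    static void startLevel(const std::vector<node_id>&) {}  // spawn a thread per node
    static void cleanNode(node_id) {}                       // free the node's memory

    void pollLoop(const std::vector<std::vector<node_id>>& levels, node_id endNode)
    {
        std::size_t currentLevel = 1;   // levels 0 and 1 were started already
        std::set<node_id> cleaned;      // nodes whose memory was already freed

        while (g_status[endNode].load() != STATUS_COMPLETED) {
            // Sweep every started level for newly completed nodes.
            for (std::size_t lvl = 0; lvl <= currentLevel && lvl < levels.size(); ++lvl) {
                for (node_id id : levels[lvl]) {
                    if (g_status[id].load() != STATUS_COMPLETED)
                        continue;
                    // A node of level N completed: start level N+2 if not started.
                    if (lvl + 2 == currentLevel + 1 && lvl + 2 < levels.size()) {
                        startLevel(levels[lvl + 2]);
                        ++currentLevel;
                    }
                    // Clean the node's memory exactly once.
                    if (cleaned.insert(id).second)
                        cleanNode(id);
                }
            }
            std::this_thread::sleep_for(std::chrono::milliseconds(200));
        }
    }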


scheduled nodes

Nodes may have their own execution schedule.

The repository table nodes_execution_schedule contains two fields:

node_id - the id of the node, matching the nodes table

scheduled_days - a crontab-based schedule of five fields (positions 0-4); a sketch of matching it against the current date follows the examples below

    minutes        0-59
    hours          00-23
    day of month   1-31
    month          1-12
    day of week    0-6 (0 = Sunday, 6 = Saturday)

  1. * * * * 0,5,6 → the node should run only on Sundays, Fridays, and Saturdays; otherwise the workflow should mark it as skipped. This is useful for weekly reports or for data sources consumed on specific days of the week.
  2. * * 15 * * → the node should run only on the 15th of the month; otherwise the workflow should mark it as skipped.
  3. 30 16 * * * → the node cannot start before 16:30; the workflow should start the node (polling) at the appointed time, not before, e.g. a trading report on the open market. If the next day is meant, then day of month and month are required: e.g. if the pipeline starts at 20:00 on March 6, a script that must start at 00:01 should be scheduled as 01 00 7 3 *
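
A minimal sketch of checking the day-of-week field of a scheduled_days value against today's date (the other four fields would be parsed the same way); dayOfWeekMatches is a hypothetical name, not the actual parser:

    #include <ctime>
    #include <sstream>
    #include <string>
    #include <vector>

    // True if today's day of week matches the fifth crontab field of the
    // entry, e.g. "*" or "0,5,6" (0 = Sunday ... 6 = Saturday).
    bool dayOfWeekMatches(const std::string& scheduledDays)
    {
        // Split the entry into its five fields.
        std::istringstream in(scheduledDays);
        std::vector<std::string> fields;
        for (std::string f; in >> f; )
            fields.push_back(f);
        if (fields.size() != 5)
            return false;                  // malformed schedule

        const std::string& dow = fields[4];
        if (dow == "*")
            return true;                   // no day-of-week restriction

        std::time_t now = std::time(nullptr);
        int today = std::localtime(&now)->tm_wday;  // 0 = Sunday, as in crontab

        // Walk the comma-separated day list, e.g. "0,5,6".
        std::istringstream days(dow);
        for (std::string d; std::getline(days, d, ','); )
            if (std::stoi(d) == today)
                return true;
        return false;                      // the workflow marks the node skipped
    }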



restarting failed nodes

automatic restart

It is probably not a good idea to restart a node automatically if the stage at which it failed is unknown,

but if a node can be restarted safely, there should be a way to restart it once or twice, as in the sketch below.
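
If a bounded automatic restart were implemented anyway, it could be as simple as the following sketch; maybeRestart and restartNode are hypothetical names, and the retry budget would live alongside the node's other bookkeeping.

    #include <cstddef>
    #include <map>

    using node_id = std::size_t;

    // Stand-in: re-queue the failed node (the real mechanism would
    // re-spawn the node's thread and reset its shared memory status).
    static bool restartNode(node_id) { return true; }

    // Retry a failed node at most maxRetries times, then give up and
    // leave it for manual intervention.
    bool maybeRestart(node_id id, std::map<node_id, int>& retries,
                      int maxRetries = 2)
    {
        if (retries[id] >= maxRetries)
            return false;   // manual restart is the safer option
        ++retries[id];
        return restartNode(id);
    }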


manual restart

A node which failed during automatic execution can be restarted manually with the following procedure:

  1. Identify the failed node using the graph of execution (red node), the nodes_execution_stats table (status Failure), or by checking the shared memory for -1 values: shmview | grep -- -1
  2. Identify the kind of task performed by the node using the nodes table, e.g. node 154 - SparkSQL node
  3. Check the script for errors (wrong field name, etc.) and correct them
  4. Run the script outside the data pipeline and verify the results
  5. Mark the node as completed so the pipeline can continue: shmupdate 154 4
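
For reference, a helper like shmupdate might boil down to the sketch below, assuming a POSIX shared memory segment holding one int status per node_id; the segment name /cbaWorkflow_status and the layout are assumptions for illustration, not the real tool.

    #include <cstddef>
    #include <cstdio>
    #include <cstdlib>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(int argc, char** argv)
    {
        if (argc != 3) {
            std::fprintf(stderr, "usage: shmupdate <node_id> <status>\n");
            return 1;
        }
        int node_id = std::atoi(argv[1]);
        int status  = std::atoi(argv[2]);   // e.g. 4 = completed, -1 = failed

        // Open the existing segment created by the workflow (name assumed).
        int fd = shm_open("/cbaWorkflow_status", O_RDWR, 0);
        if (fd == -1) {
            std::perror("shm_open");
            return 1;
        }

        // Map enough of the segment to reach this node's slot.
        std::size_t len = (node_id + 1) * sizeof(int);
        int* statuses = static_cast<int*>(
            mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
        if (statuses == MAP_FAILED) {
            std::perror("mmap");
            return 1;
        }

        statuses[node_id] = status;   // the main polling loop will pick this up

        munmap(statuses, len);
        close(fd);
        return 0;
    }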






Attachments:

cbaWorkflow.vsdx
cbaWorkflow.png
image2021-3-4_10-5-45.png