overview

cbaWFMonitor is a separate process with the following functionality:

obtains the nodes' execution status from shared memory and saves it to the repository

obtains and parses the nodes' execution logs and stores the resulting statistics in the repository

cbaWFMonitor class description

#include <map>
#include <string>

class cbaWFMonitor
{
public:
    cbaWFMonitor(std::string configFile);
    ~cbaWFMonitor();
    void updateNodesExecutionStatus();  // push m_executionMap contents to the repository
    void run();                         // polling loop
private:
    std::map<std::string, std::string> m_variables;  // configuration variables
    std::map<std::string, int> m_executionMap;       // node_id -> execution status
    std::string m_shmName;                           // shared memory segment name
    std::string m_mapName;                           // execution map name within the segment
    int m_pollingTime;                               // polling interval
    bool _populateMap();                             // read the execution map from shared memory
};
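
A possible shape of the run() polling loop, shown for illustration only; the use of std::this_thread::sleep_for and the assumption that m_pollingTime is in seconds are not prescribed by the class itself:

#include <chrono>
#include <thread>

void cbaWFMonitor::run()
{
    for (;;) {
        if (_populateMap())                // refresh m_executionMap from shared memory
            updateNodesExecutionStatus();  // push the current statuses to the repository
        std::this_thread::sleep_for(std::chrono::seconds(m_pollingTime));
    }
}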

updating repository with execution status

The monitor reads the shared memory segment, finds the execution map and updates the repository nodes_execution_status table with this information (a read-side sketch follows below):

node_id

status


node_id is unique for each run.
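
A minimal sketch of _populateMap(), assuming the execution map is published through a Boost.Interprocess managed segment and is keyed by integer node_id; the segment layout and key type are assumptions, not a confirmed interface:

#include <boost/interprocess/managed_shared_memory.hpp>
#include <boost/interprocess/containers/map.hpp>
#include <boost/interprocess/allocators/allocator.hpp>
#include <string>

namespace bip = boost::interprocess;

// assumed shared-memory map layout: node_id -> status, both integers
using ShmAllocator = bip::allocator<std::pair<const int, int>,
                                    bip::managed_shared_memory::segment_manager>;
using ShmMap = bip::map<int, int, std::less<int>, ShmAllocator>;

bool cbaWFMonitor::_populateMap()
{
    // open_only throws if the writer has not created the segment yet
    bip::managed_shared_memory segment(bip::open_only, m_shmName.c_str());
    auto found = segment.find<ShmMap>(m_mapName.c_str());
    if (found.first == nullptr)
        return false;                          // execution map not published yet
    for (const auto& entry : *found.first)
        m_executionMap[std::to_string(entry.first)] = entry.second;
    return true;
}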

collecting information from logs and updating repository with execution statistics

Each node writes its log to a file node_id.log (e.g. 1304.log) and its stderr stream to node_id.stderr (e.g. 1304.stderr).

The monitor looks for the logs of the nodes which have started execution (status <> 0, 1), i.e. completed, failed, skipped or executing.

The monitor reads the node logs and extracts the following metrics (a parsing sketch follows the sample):

[START_TIME]Sat Jan 30 18:46:32 2021
[STATUS]Failure, exit code:3
[END_TIME]Sat Jan 30 18:46:57 2021
[RUN_TIME]25
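
A minimal parsing sketch for these tagged lines; the NodeStats struct and helper names are illustrative, only the [TAG] prefixes come from the sample above:

#include <fstream>
#include <string>

struct NodeStats {
    std::string startTime;
    std::string endTime;
    std::string status;
    long runTime = -1;
};

// extract the value following a [TAG] prefix, if the line starts with it
static bool tagged(const std::string& line, const std::string& tag, std::string& value)
{
    if (line.compare(0, tag.size(), tag) != 0)
        return false;
    value = line.substr(tag.size());
    return true;
}

bool parseNodeLog(const std::string& path, NodeStats& stats)
{
    std::ifstream log(path);
    if (!log)
        return false;
    std::string line, value;
    while (std::getline(log, line)) {
        if (tagged(line, "[START_TIME]", value))      stats.startTime = value;
        else if (tagged(line, "[STATUS]", value))     stats.status = value;
        else if (tagged(line, "[END_TIME]", value))   stats.endTime = value;
        else if (tagged(line, "[RUN_TIME]", value))   stats.runTime = std::stol(value);
    }
    return !stats.status.empty();   // a populated status means the node finished
}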


The monitor will update the repository nodes_execution_stats table (or file) with the following information:

node_id

start_time

end_time

run_time

status

Once the status field has been populated, the monitor stops polling that node's logs.

A node_id may recur within the same run if the node failed and was restarted; in that case a new log and new stats will be generated.

The failed node's logs will be renamed to node_id_failed_1.log and node_id_failed_1.stderr (see the renaming sketch below).
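
A minimal renaming sketch using std::filesystem; the attempt counter parameter is an assumption generalizing the _failed_1 suffix:

#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// rename node_id.log / node_id.stderr to node_id_failed_<attempt>.{log,stderr}
void renameFailedLogs(const std::string& nodeId, int attempt)
{
    const std::string suffix = "_failed_" + std::to_string(attempt);
    for (const char* ext : {".log", ".stderr"}) {
        fs::path from = nodeId + ext;
        if (fs::exists(from))
            fs::rename(from, nodeId + suffix + ext);
    }
}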


After the run completes, the nodes_execution_stats table contents will be moved to the nodes_execution_stats_history table (or file) with the following columns (a file-based sketch follows the list):

run_id

node_id

start_time

end_time

run_time

status
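
A minimal sketch of the file-backed variant of this archival step; the CSV layout and file names are assumptions:

#include <fstream>
#include <string>

// append each per-run stats row to the history file, prefixed with run_id,
// then truncate the stats file for the next run
void archiveRunStats(const std::string& runId,
                     const std::string& statsFile,
                     const std::string& historyFile)
{
    std::ifstream in(statsFile);
    std::ofstream out(historyFile, std::ios::app);
    std::string row;   // node_id,start_time,end_time,run_time,status
    while (std::getline(in, row))
        out << runId << ',' << row << '\n';
    in.close();
    std::ofstream(statsFile, std::ios::trunc);   // clear the per-run stats file
}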

notification about the failed nodes

The monitor will provide notifications to the subscribers (users and the cbaWorkflow application) about the failed nodes so that they can be restarted.
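
A minimal observer sketch for such notifications; the FailureNotifier class and its callback signature are illustrative assumptions, not an existing interface:

#include <functional>
#include <string>
#include <vector>

class FailureNotifier {
public:
    using Subscriber = std::function<void(const std::string& nodeId, int exitCode)>;

    void subscribe(Subscriber s) { m_subscribers.push_back(std::move(s)); }

    // called by the monitor when a node's status indicates failure
    void notifyFailure(const std::string& nodeId, int exitCode) const {
        for (const auto& s : m_subscribers)
            s(nodeId, exitCode);
    }

private:
    std::vector<Subscriber> m_subscribers;
};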