Overview

  • The functionality under test is the integrated execution of cbaWorkflow and cbaWFMonitor.
  • Test data should be generated/prepared.
  • The node scripts need to be generated.
  • Shell scripts, spark-sql scripts, and spark-submit jars should be deployed into the directory pointed to by the RUN_DIR variable.
  • The run.sh script should be deployed into the same ${RUN_DIR} directory (a deployment sketch follows this list).
  • Node and adjacency-list information should be entered into the repository.
  • The application run should start in the ${RUN_DIR} directory.
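
A minimal deployment sketch of the steps above, assuming the artifacts sit in a local build directory; the script and jar names other than run.sh are illustrative, not the actual test artifacts:

export RUN_DIR=${HOME}/run
mkdir -p ${RUN_DIR}
# shell scripts, spark-sql scripts and the spark-submit jar go into ${RUN_DIR}
cp run.sh *.sh *.sql ${RUN_DIR}/
cp wordDistribution.jar ${RUN_DIR}/        # hypothetical jar name
# the application run is started from ${RUN_DIR}
cd ${RUN_DIR} && sh run.sh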


Testing schematics


Nodes

Recreating the OS directory - Shell node (single command)

Generating the OS files - Shell node (scripts)

Putting the OS files into HDFS - HDFS node

Creating the external tables - Spark SQL node

Generating the tables with the word distribution - Spark Submit node

Storing the distributions in a single partitioned table - Spark SQL node

There are 31 work nodes, a start node, and an end node; a sketch of the commands each node type runs follows.
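
A hedged sketch of the kind of command each node type executes; the paths, script, jar, and class names are illustrative, not the actual test artifacts:

# shellNode (single command): recreate the OS directory
rm -rf ${RUN_DIR}/data && mkdir -p ${RUN_DIR}/data

# shellNode (script): generate the OS files
sh createPlutarch1.sh ${RUN_DIR}/data

# hdfsExecNode: put the OS files into HDFS
hdfs dfs -mkdir -p /user/dataexplorer1/plutarch1
hdfs dfs -put -f ${RUN_DIR}/data/* /user/dataexplorer1/plutarch1/

# sparkSQLNode: create the external tables / store the distributions
spark-sql -f createPlutarch1Tables.sql

# sparkSubmitNode: compute the word distribution
spark-submit --master yarn --class WordCount wordDistribution.jar plutarch1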

Testing graphs

The cbaWorkflow and cbaWFMonitor processes are started 20 seconds apart to accommodate the shared memory segment allocation (see the run.sh script in the cron section below).

As seen in the diagram above, while the level 1 nodes have completed, the level 2 and 3 nodes are polling.

Levels 4 and 5 have not started yet. Level 3 started polling when the first node of level 1 completed.

In the diagram above it can be observed that the level 3 nodes are running, the level 4 nodes are polling, and the level 5 nodes have not started yet.


The final diagram shows all the nodes completed, with the end node's label changed to completed.

Querying stats table

select n.node_id, n.name, nt.name as nodeType,
       nes.start_time, nes.end_time, nes.run_time, nes.status
from nodes_execution_stats nes
join nodes n on n.node_id = nes.node_id
join node_types nt on nt.node_type_id = n.node_type_id
where n.node_id not in ('0', '99999')
order by nes.start_time desc;

 node_id | name                   | nodeType        | start_time               | end_time                 | run_time(sec) | status
---------+------------------------+-----------------+--------------------------+--------------------------+---------------+---------
      55 | populateFinalTable     | sparkSQLNode    | Tue Mar  2 16:42:28 2021 | Tue Mar  2 16:43:05 2021 |            37 | Success
      27 | herodotus1Count        | sparkSubmitNode | Tue Mar  2 16:36:13 2021 | Tue Mar  2 16:41:36 2021 |           323 | Success
      25 | plutarch1Count         | sparkSubmitNode | Tue Mar  2 16:35:13 2021 | Tue Mar  2 16:41:15 2021 |           362 | Success
      26 | plutarch2Count         | sparkSubmitNode | Tue Mar  2 16:35:13 2021 | Tue Mar  2 16:41:08 2021 |           355 | Success
      28 | herodotus2Count        | sparkSubmitNode | Tue Mar  2 16:35:13 2021 | Tue Mar  2 16:40:23 2021 |           310 | Success
      12 | createHerodotus1Tables | sparkSQLNode    | Tue Mar  2 16:34:43 2021 | Tue Mar  2 16:35:18 2021 |            35 | Success
      10 | createPlutarch1Tables  | sparkSQLNode    | Tue Mar  2 16:33:43 2021 | Tue Mar  2 16:34:24 2021 |            41 | Success
      11 | createPlutarch2Tables  | sparkSQLNode    | Tue Mar  2 16:33:43 2021 | Tue Mar  2 16:34:21 2021 |            38 | Success
      13 | createHerodotus2Tables | sparkSQLNode    | Tue Mar  2 16:33:43 2021 | Tue Mar  2 16:34:22 2021 |            39 | Success
       7 | createHdfsHerodotus1   | hdfsExecNode    | Tue Mar  2 16:33:28 2021 | Tue Mar  2 16:33:57 2021 |            29 | Success
       5 | createHdfsPlutarch2    | hdfsExecNode    | Tue Mar  2 16:32:28 2021 | Tue Mar  2 16:33:04 2021 |            36 | Success
       6 | createHerodotus1       | shellNode       | Tue Mar  2 16:32:28 2021 | Tue Mar  2 16:32:47 2021 |            19 | Success
       3 | createHdfsPlutarch1    | hdfsExecNode    | Tue Mar  2 16:32:28 2021 | Tue Mar  2 16:33:05 2021 |            37 | Success
       9 | createHdfsHerodotus2   | hdfsExecNode    | Tue Mar  2 16:32:28 2021 | Tue Mar  2 16:32:59 2021 |            31 | Success
       2 | createPlutarch1        | shellNode       | Tue Mar  2 16:31:28 2021 | Tue Mar  2 16:31:44 2021 |            16 | Success
       8 | createHerodotus2       | shellNode       | Tue Mar  2 16:31:28 2021 | Tue Mar  2 16:31:43 2021 |            15 | Success
       4 | createPlutarch2        | shellNode       | Tue Mar  2 16:31:28 2021 | Tue Mar  2 16:31:44 2021 |            16 | Success
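
The query can also be run from the shell; a sketch assuming the repository is the PostgreSQL instance pointed to by PG_HOME and that the query above is saved as stats.sql (the database name and connection details are assumptions):

# run the stats query against the repository (database name is an assumption)
${PG_HOME}/bin/psql -d cbaworkflow -f stats.sql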

Testing using cronjob

  1. Set all the environment variables in the run.sh script, since the cron environment is minimal:

export CBAWF_HOME=/home/dataexplorer1/cbaWorkflow
export SPARK_HOME=/app/spark
export SHELL=/bin/bash
export HADOOP_HOME=/app/hadoop
export PG_HOME=/usr/pgsql-10
export YARN_HOME=/app/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=/app/hadoop/lib/native
export MAIL=/var/spool/mail/dataexplorer1
export HADOOP_HDFS_HOME=/app/hadoop
export HIVE_HOME=/app/hive
export HADOOP_COMMON_HOME=/app/hadoop
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64
export CBAWFINI=/home/dataexplorer1/.cbaWorkflow.ini
export HADOOP_INSTALL=/app/hadoop
export LANG=en_US.UTF-8
export HADOOP_CONF_DIR=/app/hadoop/etc/hadoop
export HADOOP_OPTS=-Djava.library.path=/app/hadoop/lib/native
export HOME=/home/dataexplorer1

export RUN_DIR=${HOME}/run
export YARN_CONF_DIR=/app/hadoop/etc/hadoop
export HADOOP_MAPRED_HOME=/app/hadoop
export CLASSPATH=/app/spark/jars/*:/app/hive/jdbc/*:/app/hive/lib/*:/app/tez/*:/app/tez/lib/*:
export LD_LIBRARY_PATH=/usr/local/lib64:${CBAWF_HOME}/lib:${LD_LIBRARY_PATH}
export PATH=/app/spark/bin:/app/hadoop/bin/:${RUN_DIR}:${CBAWF_HOME}/bin:${PATH}
cd ${RUN_DIR}
# start the workflow engine first, then the monitor 20 seconds later
# so the shared memory segment is already allocated
nohup cbaWorkflowTest 1>cbaWorkflowTest.log 2>cbaWorkflowTest.err &
sleep 20
nohup cbaWFMonitorTest 1>cbaWFMonitorTest.log 2>cbaWFMonitorTest.err &

2. Add the line to the crontab:

crontab -e

01 12 * * * sh /home/dataexplorer1/run/run.sh

The entry above starts run.sh every day at 12:01.
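
After the scheduled time, the run can be checked from the shell (the log file names are the ones set in run.sh above):

crontab -l                                    # confirm the entry is installed
pgrep -af 'cbaWorkflowTest|cbaWFMonitorTest'  # both processes should be running
tail -f ${RUN_DIR}/cbaWorkflowTest.log        # follow the workflow log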

Testing skipped nodes


The graph above demonstrates that the nodes entered in the nodes_execution_schedule table are excluded from execution.
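
The same can be verified in the repository; a minimal sketch that assumes only the nodes_execution_schedule table named above (database name and connection details as before):

# list the nodes currently excluded from execution
${PG_HOME}/bin/psql -d cbaworkflow -c 'select * from nodes_execution_schedule;'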


The node after the skipped ones starts and completes.