Ops guide for requesting on-demand s1-gunw on Pleiades

AOI Processing Plan

https://docs.google.com/spreadsheets/d/1PH9bOU0jE6bUWqkuf2wCJ_o3Chh-cMu49GKPlNl5M14/edit#gid=0

HEC Group ID allocations on Pleiades and their rabbitmq queues

  • HEC group id s2037

    • program_pi_id: ESI2017-owen-HEC_s2037

    • rabbitmq queue: standard_product-s1gunw-topsapp-pleiades_s2037

  • HEC group id s2252

    • program_pi_id: NISARST-bekaert-HEC_s2252

    • rabbitmq queue: standard_product-s1gunw-topsapp-pleiades_s2252

  • HEC group id s2310

    • program_pi_id: CA-HEC_s2310

    • rabbitmq queue: standard_product-s1gunw-topsapp-pleiades_s2310
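
The three queue names above share one pattern, so the queue for a group id can be derived instead of typed by hand. A minimal sketch, assuming the pattern holds for every group; the helper name `queue_for_group` is made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the rabbitmq queue name for a HEC group id,
# assuming the naming pattern seen in the list above.
queue_for_group() {
    printf 'standard_product-s1gunw-topsapp-pleiades_%s\n' "$1"
}

# Print the queue name for each of the three group ids.
for gid in s2037 s2252 s2310; do
    queue_for_group "$gid"
done
```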

PGEs that run on Pleiades job worker singularity

Job Metrics for pipeline

RabbitMQ

  • https://mamba-mozart.aria.hysds.io:15673/#/queues

    • regex: ^(?!celery)
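
The regex `^(?!celery)` is a negative lookahead: it matches every queue name that does not start with `celery`, which hides the celery-internal queues in the web UI. On the command line the same filter can be sketched with `grep -v`. The queue names below are sample input only, and `hide_celery` is a made-up helper name:

```shell
#!/usr/bin/env bash
# Keep only queue names that do not start with "celery"
# (same effect as the ^(?!celery) filter in the RabbitMQ UI).
hide_celery() {
    grep -v '^celery'
}

# Sample names for illustration; only the last line is printed.
printf '%s\n' \
    'celery@worker.pidbox' \
    'celeryev.1234' \
    'standard_product-s1gunw-topsapp-pleiades_s2037' \
    | hide_celery
```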

Repo of utils for Pleiades

GitHub - hysds/hysds-hec-utils: HySDS HEC Utilities

SSH Tunnel from mamba cluster to Pleiades head node

From mamba-factotum, run the screen command; then, inside the screen session, ssh with the tunnel to the tpfe2 head node.

Basic use of screen

Troubleshooting

Problem(s):

  • SSH tunnel is down

Sign(s):

  • topsapp queues are stuck: the queues are full, but no messages are moving to the unacked state (no worker is picking up jobs)

  • nothing reported in mozart/figaro – job,topsapp,job-started

  • Port checker shell script (in hysds-hec-utils repo) indicates that ports are not forwarded (all should give pass):

esi_sar@tpfe2:~/github/hysds-hec-utils> ./hysds_pcm_check_port_forwarded_tunnel_services.sh
[pass] mozart rabbitmq AMQP
[pass] mozart rabbitmq REST
[pass] mozart elasticsearch for figaro
[pass] mozart redis for ES figaro
[pass] mozart rest api
[pass] grq elasticsearch for tosca
[pass] grq http api
[pass] metrics redis for ES metrics
connect_to 100.67.33.56 port 25: failed.
# [fail] factotum smtp
http://tpfe2.nas.nasa.gov:10025
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (52) Empty reply from server

Note that the mail server (factotum smtp) failed to respond. That is expected, because the mail service is no longer used, so the output above indicates nominal status.
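
To illustrate what a pass/fail line means, here is a minimal sketch of a TCP port probe in bash. It is not the checker from hysds-hec-utils; the function name `check_port` and the localhost port used in the example call are placeholders, and the probe relies on bash's `/dev/tcp` redirection and the coreutils `timeout` command:

```shell
#!/usr/bin/env bash
# Sketch only: NOT the hysds-hec-utils checker. Probes one TCP port and
# prints a [pass]/[fail] line in the same style as the output above.
check_port() {
    local label="$1" host="$2" port="$3"
    # bash's /dev/tcp pseudo-device attempts a TCP connection; give up after 2 s.
    if timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
        echo "[pass] ${label}"
    else
        echo "[fail] ${label}"
    fi
}

# Port 1 on localhost is a placeholder; substitute a real forwarded port.
check_port "example service" 127.0.0.1 1
```

A probe like this only shows that something accepts connections on the port; it does not verify that the service behind the tunnel is healthy.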

Remedy:

  1. from mamba-factotum, open screen

    1. optionally, attach existing screen (32207, tagged pleiades)

  2. ssh with tunnel to tpfe2 head node: ssh tpfe2-tunnel

    1. note the command is aliased: alias pleiades='ssh tpfe2-tunnel'

  3. run sudo -u esi_sar /bin/bash

  4. detach the screen (ctrl-a + d)
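
The first remedy step (open a screen, or attach an existing one) can be sketched as below. This assumes the session is named `pleiades`, as the tag mentioned above suggests; `session_exists` and `open_tunnel_screen` are made-up helper names. The remaining steps are interactive and appear only as comments:

```shell
#!/usr/bin/env bash
# Sketch: reattach to a screen session named "pleiades" if one exists,
# otherwise start a new one. The session name is an assumption.
session_exists() {
    # Reads `screen -ls` output on stdin; entries look like "<pid>.<name>".
    grep -Eq "[0-9]+\.$1([[:space:]]|\$)"
}

open_tunnel_screen() {
    if screen -ls | session_exists pleiades; then
        screen -r pleiades    # attach to the existing (detached) session
    else
        screen -S pleiades    # create a new named session
    fi
    # Inside the session, continue by hand:
    #   ssh tpfe2-tunnel
    #   sudo -u esi_sar /bin/bash
    # then detach with ctrl-a d.
}
```

If the session is still attached elsewhere, `screen -r` refuses to attach; `screen -d -r pleiades` detaches it there first.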

Auto-scaling job-workers singularity via PBS scripts

Run auto-scaling for each group id in the background with nohup (no hangup), with a maximum of 140 nodes in total across all group ids:
esi_sar@tpfe2:~/github/hysds-hec-utils> nohup pbs_auto_scale_up.sh s2037 140 > pbs_auto_scale_up-s2037.log 2>&1 &
esi_sar@tpfe2:~/github/hysds-hec-utils> nohup pbs_auto_scale_up.sh s2310 140 > pbs_auto_scale_up-s2310.log 2>&1 &
esi_sar@tpfe2:~/github/hysds-hec-utils> nohup pbs_auto_scale_up.sh s2252 140 > pbs_auto_scale_up-s2252.log 2>&1 &

Note: these commands are wrapped in the following shell script:

esi_sar@tpfe2:~/github/hysds-hec-utils> ./all_pbs_auto_scale_up.sh <num_workers> 
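
For orientation, a wrapper of this kind could look like the sketch below. It is not the actual contents of all_pbs_auto_scale_up.sh; the function name `start_all_autoscalers` and the `DRY_RUN` switch are hypothetical additions that let you preview the commands before launching anything:

```shell
#!/usr/bin/env bash
# Sketch in the spirit of all_pbs_auto_scale_up.sh (not its actual contents):
# start one auto-scaler per group id in the background.
# DRY_RUN is a hypothetical switch added here for illustration.
start_all_autoscalers() {
    local num_workers="$1" gid
    for gid in s2037 s2310 s2252; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Only print the command that would be run.
            echo "nohup pbs_auto_scale_up.sh ${gid} ${num_workers} > pbs_auto_scale_up-${gid}.log 2>&1 &"
        else
            nohup pbs_auto_scale_up.sh "${gid}" "${num_workers}" \
                > "pbs_auto_scale_up-${gid}.log" 2>&1 &
        fi
    done
}

# Preview the three commands without launching anything.
DRY_RUN=1 start_all_autoscalers 140
```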

Daily purge of older job work dirs

Pleiades has allocated us a quota of 200 TB and 5,000,000 files. The purge_old_files.sh script finds and deletes all files older than 2.1 days under each of the group id worker directories, then removes any empty directories left behind.

crontab that runs twice daily, at 1:37 am and 1:37 pm Pacific Time:

esi_sar@hfe1:~/github/hysds-hec-utils> crontab -l
37 1 * * * /home4/esi_sar/github/hysds-hec-utils/purge_old_files.sh
37 13 * * * /home4/esi_sar/github/hysds-hec-utils/purge_old_files.sh

cat /home4/esi_sar/github/hysds-hec-utils/purge_old_files.sh
#!/usr/bin/env bash
find /nobackupp12/esi_sar/s2037/worker/ -type f -mtime +2.1 | xargs rm -f
find /nobackupp12/esi_sar/s2037/worker/ -type d -empty -delete
find /nobackupp12/esi_sar/s2252/worker/ -type f -mtime +2.1 | xargs rm -f
find /nobackupp12/esi_sar/s2252/worker/ -type d -empty -delete
find /nobackupp12/esi_sar/s2310/worker/ -type f -mtime +2.1 | xargs rm -f
find /nobackupp12/esi_sar/s2310/worker/ -type d -empty -delete
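
The script above pipes newline-separated file names into xargs, which mis-parses paths that contain whitespace. A possible variant, offered as a sketch and not as the deployed script, loops over the group ids and passes NUL-delimited names. `WORK_ROOT` and `purge_group` are hypothetical names, the fractional `-mtime` value assumes GNU find (as in the original), and `-mindepth 1` is an added choice that keeps the worker directory itself in place:

```shell
#!/usr/bin/env bash
# Sketch of a purge variant (not the deployed script). WORK_ROOT and
# purge_group are hypothetical names; -print0 / -0 are GNU/BSD extensions.
WORK_ROOT="${WORK_ROOT:-/nobackupp12/esi_sar}"

purge_group() {
    local dir="${WORK_ROOT}/$1/worker" max_age_days="${2:-2.1}"
    [ -d "$dir" ] || return 0    # nothing to do if the directory is absent
    # NUL-delimited names survive spaces and newlines in paths.
    find "$dir" -type f -mtime "+${max_age_days}" -print0 | xargs -0 rm -f
    # Remove directories left empty; -mindepth 1 spares the worker dir itself.
    find "$dir" -mindepth 1 -type d -empty -delete
}

for gid in s2037 s2252 s2310; do
    purge_group "$gid"
done
```

To see what would be removed before deleting anything, run the first find on its own with `-print` in place of `-print0 | xargs -0 rm -f`.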

How to stop, flush, and restart production on Pleiades

  1. stop auto-scaling scripts

    1. pbs_auto_scale_up.sh in the hysds/hysds-hec-utils repo (master branch)

  2. in mozart/figaro, revoke all jobs of type job-request-s1gunw-topsapp-local-singularity:ARIA-446_singularity that are in the running or queued state.

  3. qdel all jobs

    1. qdel_all.sh in the hysds/hysds-hec-utils repo (master branch)

      1. qstat -u esi_sar | awk '{ if ($8 == "R" || $8 == "Q") print "qdel "$1; }' | sh

  4. then nuke all of the work dirs for the three group ids:

    1. /nobackupp12/esi_sar/s2037/worker/2021/02/**

    2. /nobackupp12/esi_sar/s2252/worker/2021/02/**

    3. /nobackupp12/esi_sar/s2310/worker/2021/02/**

  5. retry all failed topsapp jobs / on-demand submit from runconfig-topsapp

  6. start up auto scaling scripts
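
Before running the qdel one-liner from step 3, it can help to preview which jobs it would delete. The sketch below reuses the same awk filter but prints the qdel commands instead of piping them to sh. It assumes, as the one-liner does, that the job state is in column 8; the sample rows are made up and only stand in for `qstat -u esi_sar` output, and `qdel_commands` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Dry-run version of the qdel one-liner in step 3: print the qdel commands
# instead of executing them. Assumes the job state is in column 8.
qdel_commands() {
    awk '$8 == "R" || $8 == "Q" { print "qdel " $1 }'
}

# Hypothetical sample rows standing in for `qstat -u esi_sar` output;
# only the running (R) and queued (Q) jobs produce a qdel line.
printf '%s\n' \
    '1001.pbs esi_sar normal job_a 1 1 02:00 R 00:10 95%' \
    '1002.pbs esi_sar normal job_b 1 1 02:00 Q -- --' \
    '1003.pbs esi_sar normal job_c 1 1 02:00 H -- --' \
    | qdel_commands
```

On the head node, `qstat -u esi_sar | qdel_commands` previews the deletions; appending `| sh` executes them.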