...
from mamba-factotum, run the screen command; then, inside the screen session, ssh with a tunnel to the tpfe2 head node.
Basic use of screen
screen -ls                             # list sessions
screen -U -R -D pleiades<screen_id>    # reattach, detaching any other attachment
screen -x pleiades <screen_id>         # attach as a shared terminal
to split the screen: ctrl-a, then shift-s
to detach the screen: ctrl-a, then d
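When scripting a reattach, the numeric session id can be pulled out of `screen -ls` output. A minimal sketch; the sample output below is illustrative, and the exact `screen -ls` format can vary slightly between versions:

```shell
#!/usr/bin/env bash
# Extract the numeric id of the "pleiades" session from `screen -ls`-style output.
# The sample text here is invented for illustration.
sample='There are screens on:
	32207.pleiades	(Detached)
	4411.scratch	(Attached)
2 Sockets in /run/screen/S-esi_sar.'
# $1 of a session line looks like "32207.pleiades"; split on "." to get the id
sid=$(echo "$sample" | awk '/\.pleiades/ {split($1, a, "."); print a[1]}')
echo "$sid"
# then reattach with: screen -U -R -D "pleiades${sid}"  (not executed here)
```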
Troubleshooting
Problem(s):
SSH tunnel is down
Sign(s):
topsapp queues are stuck: nothing is being consumed (no messages move to unacked), yet the queues are full
nothing reported in mozart/figaro (job, topsapp, job-started)
the port-checker shell script (in the hysds-hec-utils repo) indicates that ports are not forwarded (all checks should report pass):
```
esi_sar@tpfe2:~/github/hysds-hec-utils> ./hysds_pcm_check_port_forwarded_tunnel_services.sh
[pass] mozart rabbitmq AMQP
[pass] mozart rabbitmq REST
[pass] mozart elasticsearch for figaro
[pass] mozart redis for ES figaro
[pass] mozart rest api
[pass] grq elasticsearch for tosca
[pass] grq http api
[pass] metrics redis for ES metrics
connect_to 100.67.33.56 port 25: failed.
# [fail] factotum smtp http://tpfe2.nas.nasa.gov:10025
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (52) Empty reply from server
```
Note that the mail server failed to respond; that is nominal, since the mail service is no longer used. The output above therefore indicates nominal status.
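The pass/fail checks above can be approximated with bash's `/dev/tcp` connect test. A minimal sketch, not the real service map from hysds_pcm_check_port_forwarded_tunnel_services.sh; the label and port below are placeholders:

```shell
#!/usr/bin/env bash
# check_port LABEL HOST PORT: print [pass] if a TCP connect succeeds, else [fail].
check_port() {
  local label=$1 host=$2 port=$3
  # bash's /dev/tcp pseudo-device attempts a TCP connect when opened
  if timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "[pass] ${label}"
  else
    echo "[fail] ${label}"
  fi
}

# port 9 (discard) is almost never open locally, so this exercises the fail path
check_port "example service" 127.0.0.1 9
```

In the real script each forwarded service port would get one `check_port` line, mirroring the pass/fail list shown above.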
Remedy:
from mamba-factotum, open screen
optionally, attach the existing screen (32207, tagged pleiades)
ssh with a tunnel to the tpfe2 head node:
ssh tpfe2-tunnel
note that the command is aliased:
alias pleiades='ssh tpfe2-tunnel'
run:
sudo -u esi_sar /bin/bash
detach the screen (ctrl-a, then d)
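The `tpfe2-tunnel` name presumably resolves to a `Host` entry in `~/.ssh/config` on mamba-factotum that carries the port forwards the checker script verifies. A hypothetical sketch only; the host options and the forward list here are placeholders, except port 10025, which appears in the port-checker output above:

```
# Hypothetical ~/.ssh/config entry (illustrative; the real forward list
# lives on mamba-factotum and is not reproduced here)
Host tpfe2-tunnel
    HostName tpfe2.nas.nasa.gov
    User esi_sar
    ServerAliveInterval 60
    ServerAliveCountMax 3
    # one RemoteForward per service, e.g. (placeholder):
    # RemoteForward 10025 <smtp-host>:25
```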
Auto-scaling job workers (Singularity) via PBS scripts
...
Daily purge of older job work dirs
Pleiades has allocated us a quota of 200 TB and 5,000,000 files. This script finds and deletes all files older than 2.1 days under each of the group-ID worker directories.
crontab that runs twice daily, at 1:37am and 1:37pm Pacific Time:
```
esi_sar@hfe1:~/github/hysds-hec-utils> crontab -l
37 1 * * * /home4/esi_sar/github/hysds-hec-utils/purge_old_files.sh
37 13 * * * /home4/esi_sar/github/hysds-hec-utils/purge_old_files.sh
```
```
cat /home4/esi_sar/github/hysds-hec-utils/purge_old_files.sh
#!/usr/bin/env bash
find /nobackupp12/esi_sar/s2037/worker/ -type f -mtime +2.1 | xargs rm -f
find /nobackupp12/esi_sar/s2037/worker/ -type d -empty -delete
find /nobackupp12/esi_sar/s2252/worker/ -type f -mtime +2.1 | xargs rm -f
find /nobackupp12/esi_sar/s2252/worker/ -type d -empty -delete
find /nobackupp12/esi_sar/s2310/worker/ -type f -mtime +2.1 | xargs rm -f
find /nobackupp12/esi_sar/s2310/worker/ -type d -empty -delete
```
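The two-step purge pattern (delete old files, then sweep empty directories) can be rehearsed safely in a scratch directory before it is ever pointed at /nobackupp12. A sketch with an invented layout; it uses an integer `-mtime` cutoff, since fractional `-mtime` values are not portable across find implementations:

```shell
#!/usr/bin/env bash
# Rehearse the purge_old_files.sh pattern against a throwaway directory.
set -euo pipefail
scratch=$(mktemp -d)
mkdir -p "$scratch/worker/old" "$scratch/worker/new"
touch -d '5 days ago' "$scratch/worker/old/stale.dat"  # well past the cutoff
touch "$scratch/worker/new/fresh.dat"                  # recent; must survive
# same two-step pattern as the real script: old files first, then empty dirs
find "$scratch/worker/" -type f -mtime +2 | xargs -r rm -f
find "$scratch/worker/" -type d -empty -delete
remaining=$(ls "$scratch/worker")
echo "$remaining"   # only "new" should remain
```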
...
stop the auto-scaling scripts
revoke jobs of type job-request-s1gunw-topsapp-local-singularity:ARIA-446_singularity in mozart/figaro that are in the running/queued states
qdel all PBS jobs:
https://github.com/hysds/hysds-hec-utils/blob/master/qdel_all.sh
qstat -u esi_sar | awk '{ if ($8 == "R" || $8 == "Q") print "qdel "$1; }' | sh
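Before piping into `sh`, the awk filter can be dry-run against captured qstat output to preview exactly which jobs would be deleted. The sample qstat lines below are invented for illustration; column 8 holds the PBS job state:

```shell
#!/usr/bin/env bash
# Dry run: print the qdel commands without executing them (no trailing "| sh").
sample='Job ID    Username Queue  Jobname SessID NDS TSK S Elapsed
--------- -------- ------ ------- ------ --- --- - -------
123.pbs   esi_sar  normal topsapp 4567   1   8   R 01:23
124.pbs   esi_sar  normal topsapp 4568   1   8   Q 00:00
125.pbs   esi_sar  normal topsapp 4569   1   8   E 02:10'
# same filter as qdel_all.sh: keep only running (R) and queued (Q) jobs
plan=$(echo "$sample" | awk '{ if ($8 == "R" || $8 == "Q") print "qdel "$1; }')
echo "$plan"   # job 125.pbs (state E) is excluded
```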
then nuke all of the work dirs for the three group IDs:
/nobackupp12/esi_sar/s2037/worker/2021/1102/**
/nobackupp12/esi_sar/s2252/worker/2021/1102/**
/nobackupp12/esi_sar/s2310/worker/2021/1102/**
retry all failed topsapp jobs, or on-demand submit from runconfig-topsapp
start up the auto-scaling scripts