Operations Checklist
This task list outlines the items Operations Engineers should review on a daily and weekly basis to ensure ARIA operations are running nominally. Operations support is limited to the standard work day unless a prior agreement states otherwise.
Daily Items
Check that services are up and running
Confirm Mozart, Tosca, and ARIA Products pages are all accessible.
Confirm jobs are being processed by reviewing the job status in Mozart and the queue status in RabbitMQ
Confirm there are no stale queues in RabbitMQ or stale jobs in Mozart.
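As an illustration of the stale-queue check above, here is a minimal Python sketch. It assumes queue information has already been retrieved from the RabbitMQ management API (GET /api/queues) and parsed from JSON, and it treats a queue as stale when it holds messages but has no consumers. That definition, the function name find_stale_queues, and the sample queue names are assumptions for illustration, not part of ARIA's tooling.

```python
from typing import Any


def find_stale_queues(queues: list[dict[str, Any]]) -> list[str]:
    """Return the names of queues that hold messages but have no consumers.

    Assumption: "stale" means a non-empty backlog with nobody consuming it.
    Adjust the rule to match how operations actually defines a stale queue.
    """
    stale = []
    for queue in queues:
        backlog = queue.get("messages", 0)
        consumers = queue.get("consumers", 0)
        if backlog > 0 and consumers == 0:
            stale.append(queue["name"])
    return sorted(stale)


# Hypothetical sample shaped like entries returned by GET /api/queues.
sample = [
    {"name": "queue-a", "messages": 12, "consumers": 0},
    {"name": "queue-b", "messages": 3, "consumers": 2},
    {"name": "queue-c", "messages": 0, "consumers": 0},
]

print(find_stale_queues(sample))  # prints ['queue-a']
```

Keeping the check as a pure function over already-parsed data makes it easy to run against a saved API response when the management endpoint is not reachable.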
Review Slack alert messages
Resolve the alerts defined in the messages.
Review failed jobs
Investigate the cause of each failure. Resolve it if possible, or contact the relevant PGE developer for assistance.
Generate product accountability reports
Generate the AOI reports for the recently processed AOIs to assess the status of processing campaigns, and update the AOI Processing Plan.
Reporting
Notify customers of processing updates
Update any appropriate Jira tickets
Review AWS
Ensure there are no runaway EC2 instances in any ASG
Terminate stale EC2 instances
Verify that the AWS Billing Daily Cost View is at expected levels
Update the Cost AWS In tab in the AOI Processing Plan with the newest costs
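The runaway EC2 instance check under Review AWS can be sketched in Python as follows. The function takes a dictionary shaped like the response from boto3's EC2 describe_instances call and flags running instances whose launch time is older than a threshold. The function name find_long_running, the 24-hour threshold, and the sample instance IDs are assumptions for illustration. The right cutoff depends on how long ARIA jobs normally run.

```python
from datetime import datetime, timedelta, timezone
from typing import Any


def find_long_running(
    response: dict[str, Any], now: datetime, max_age: timedelta
) -> list[str]:
    """Return IDs of running instances launched more than max_age before now.

    Assumption: an instance running longer than max_age is a candidate
    runaway and should be reviewed by a person before it is terminated.
    """
    flagged = []
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            if instance["State"]["Name"] != "running":
                continue
            if now - instance["LaunchTime"] > max_age:
                flagged.append(instance["InstanceId"])
    return flagged


# Hypothetical sample shaped like a describe_instances response.
now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
sample_response = {
    "Reservations": [
        {
            "Instances": [
                {
                    "InstanceId": "i-old",
                    "State": {"Name": "running"},
                    "LaunchTime": now - timedelta(hours=30),
                },
                {
                    "InstanceId": "i-new",
                    "State": {"Name": "running"},
                    "LaunchTime": now - timedelta(hours=2),
                },
                {
                    "InstanceId": "i-stopped",
                    "State": {"Name": "stopped"},
                    "LaunchTime": now - timedelta(hours=90),
                },
            ]
        }
    ]
}

print(find_long_running(sample_response, now, timedelta(hours=24)))  # prints ['i-old']
```

The sketch only reports candidates. Termination is left as a manual step so that a legitimately long job is not killed by mistake.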
Weekly Items
Review and update AOI Processing Plan
Check with customers, per AOI, about possible cleanup of unneeded SLCs and other intermediate files (e.g. S1-SLCP, S1-COD, S1-LAR, S1-DPM)
Reduce storage costs by purging data from AWS S3 and on Pleiades
Assess AOIs that are end-dating soon
Purge SLCs in the AOI’s region (in both AWS S3 and on Pleiades)
Purge intermediate datasets (S1-SLCP, S1-COD, S1-LAR) for urgent response areas that are no longer needed (mostly in AWS S3)
Clean up trigger rules in Tosca
Check with stakeholders about the trigger rules relevant to their AOIs
Delete any trigger rules for AOIs that have finished
Deactivate, rather than delete, any trigger rules whose parameters you may want to reference in the future