Senior Associate - Workload Automation Engineer
Location Designation: Hybrid - 3 days per quarter
Role Summary
Serve as the engineering owner for New York Life's enterprise workload automation ecosystem. You'll operate and harden scheduling platforms and calendars, design resilient restart/rerun patterns, and standardize job definitions, logging, and audit evidence across environments. Your work will ensure critical batch chains run predictably, meet SLAs, and support a consistent, automation-first operating model.
What You'll Do:
Run & Harden the Platform
• Operate and maintain scheduling controllers and agents across environments.
• Manage calendars and holiday tables; configure SLA jeopardy thresholds, alerting, and escalation paths.
• Implement platform upgrades, patches, and configuration changes in line with standards and change governance.
Engineer Reliability & Resilience
• Design restart/rerun patterns (checkpointing, idempotent wrappers) and failure-handling flows for critical batches.
• Model dependencies and schedules as code (job-as-code) in version control with CI/CD-based promotion.
• Reduce single points of failure and improve consistency across job chains and environments.
Standardize & Govern
• Define and maintain standard naming conventions, templates, parameters, and calendars across schedulers.
• Engineer common audit-evidence and log schemas to support internal and external reviews.
• Ensure data retention, traceability, and segregation of duties align with policies and regulatory requirements.
Guardrails, Health & Service Readiness
• Implement pre/post checks, synthetic probes, and health validations for batch workflows.
• Define and maintain SLIs/SLOs for batch completion, success rates, and recovery times.
• Build safeguards that detect anomalies and misconfigurations before they impact downstream processes.
Observability & Operational Excellence
• Integrate schedulers with observability tools (logs, metrics, dashboards) to improve visibility.
• Tune job concurrency, execution windows, and resource usage for performance and cost efficiency.
• Reduce noisy alerts and improve the signal-to-noise ratio for incident responders.
Change, Incident & Release Coordination
• Align scheduler changes, maintenance, and releases with APSO/Change Management processes.
• Lead incident triage and resolution for batch failures, including rapid root-cause analysis and safe restarts/reruns.
• Contribute to post-incident reviews and drive remediation actions into platform and pattern improvements.
Partner & Influence Across Teams
• Collaborate with Application Owners/Developers, DBAs/Data teams, SRE/Observability, Security, and Vendors to keep batch chains healthy and compliant.
• Provide guidance on best practices for job design, scheduling windows, dependencies, and error handling.
• Document patterns, playbooks, and standards; mentor peers and junior engineers in workload automation.
What You'll Bring:
• 5-8+ years of experience in enterprise workload automation, SRE, or production operations supporting mission-critical batch processing.
• Hands-on experience with Stonebranch or at least one other major enterprise scheduler (e.g., ESP, Control-M, AutoSys, IBM Workload Scheduler/TWS, Redwood), including:
o Operating controllers/agents across environments.
o Managing calendars/holiday tables and SLA jeopardy configurations.
• Strong scripting and automation skills in PowerShell, Bash, or Python, plus familiarity with YAML/JSON and REST APIs.
• Experience with Git-based workflows and CI/CD pipelines for job-as-code and configuration promotion.
• Proven experience designing and implementing restart/rerun patterns, dependency models, and idempotent batch frameworks.
• Experience integrating schedulers with observability platforms (logs/metrics/dashboards) and defining SLIs/SLOs.
• Excellent coordination skills across incident and change processes, with clear, concise communication to technical and non-technical stakeholders.
Nice to Have
• Experience in financial services or other highly regulated industries.
• Background standardizing multiple schedulers and creating common audit schemas and evidence-capture patterns.
• Relevant certifications such as ITIL, cloud architect/operations, DR/BC (e.g., DRII/BCI), or security (e.g., CISSP).
How Success Will Be Measured
• Reduction in SLA jeopardy and breaches; lower mean time to recover (MTTR) from failed jobs.
• Percentage of batch chains using standardized templates, restart/rerun patterns, and automated pre/post checks.
• Completeness and consistency of logs and evidence for audits and reviews, and the time needed to produce them.
• Reduction in manual interventions and alert noise; improved rate of on-time, successful batch completion.
Working Model
Hybrid role based in New York, NY with periodic on-site participation for key release and batch events. Participation in an on-call rotation for critical batch windows is expected. You'll work within clear governance, established change processes, and close cross-technology collaboration to keep job automation reliable, consistent, and audit-ready.
Pay Transparency
Salary Range: $90,000-$128,500
Overtime eligible: Exempt
Discretionary bonus eligible: Yes
Sales bonus eligible: No
Actual base salary will be determined based on several factors, including but not limited to the individual's experience, skills, qualifications, and job location. Additionally, employees are eligible for an annual discretionary bonus. In addition to base salary, employees may also be eligible to participate in an incentive program.