Lead and manage a multidisciplinary team of operations staff, support engineers, and administrators responsible for the IAM system’s availability, performance, and security.
Ensure the ongoing health of the IAM system by maintaining high availability and reliability and by monitoring performance.
Oversee incident response, root cause analysis, and problem resolution to minimize service disruptions and downtime.
Collaborate closely with the Site Reliability Engineering (SRE) team to implement reliability improvement strategies, monitoring solutions, and automation processes to enhance system uptime and performance.
Develop and optimize Standard Operating Procedures (SOPs) for the deployment, configuration, monitoring, and management of IAM resources.
Work with agency stakeholders to maintain and improve processes around user access, identity lifecycle management, and compliance.
Coordinate the implementation of new IAM features and upgrades in line with system reliability and security best practices.
Collaborate with security teams to ensure IAM systems adhere to all security requirements, including periodic vulnerability assessments, logging, and auditing.
Drive continuous improvement through automation, monitoring, and proactive issue identification to ensure system scalability and cost-effectiveness.
Ensure compliance with disaster recovery and business continuity requirements, including executing regular failover testing and recovery plans.
Monitor key performance indicators (KPIs) to measure and improve system uptime and service levels.
Requirements
10+ years of experience managing mission-critical IT operations, preferably in a federal government or highly regulated environment.
7+ years of direct experience in Identity and Access Management (IAM) operations or related fields.
Experience working in an Agile DevOps environment and with the Scaled Agile Framework (SAFe).
Hands-on experience with automation tools (e.g., Terraform, Ansible, Puppet, Jenkins) and CI/CD processes, or experience managing a team that performs this automation.
Proven experience collaborating with Site Reliability Engineering (SRE) teams to enhance system reliability, optimize monitoring strategies, and automate issue resolution.
Strong leadership and communication skills to mentor and manage technical teams, as well as interface with senior agency officials.
Proven experience implementing and supporting IAM Single Sign-On (SSO) solutions.
Solid understanding of identity lifecycle management, authentication and authorization methods (SAML, OAuth, MFA), and directory services (LDAP, Active Directory).
Familiarity with federal IT security and compliance standards (FISMA, NIST SP 800-53).
Strong knowledge of system monitoring tools, incident management frameworks, and automation practices (e.g., Splunk, ServiceNow).
Experience with cloud services (AWS, Azure) and hybrid environments.
Excellent problem-solving and decision-making skills.
Tech Stack
Ansible
AWS
Azure
Cloud
Jenkins
Puppet
ServiceNow
Splunk
Terraform
Benefits
Medical, Dental & Vision Coverage
Wellness Program
401(k) Matching
Disability (Short Term & Long Term)
Employee Assistance Program
Life Insurance
Education & Training
Generous Leave Policy (11 Federal Holidays, PTO, and Military Leave)