Site Reliability Engineer
Lekki, Nigeria
Nomba is a leading payments company with a mission to revolutionise the way businesses manage their financial transactions and affairs. We provide innovative, secure, and user-friendly solutions that enable businesses to streamline their payment processes, optimise their financial operations, and grow with confidence.
As a Site Reliability Engineer at Nomba, you will play a crucial role in bridging the gap between software development and IT operations, with a strong focus on solving IT operations problems using software engineering. You will be responsible for designing, implementing, and maintaining the infrastructure, tools, and processes required to support our development and deployment pipelines. You will react in real time to production incidents and work to contain and resolve them as quickly as possible. Your expertise in automation, cloud technologies, and continuous integration/continuous deployment (CI/CD) will ensure that our software is delivered efficiently, reliably, and at scale.
About the role
Implement and maintain highly available, scalable, and secure production systems, emphasising automation and Infrastructure as Code (IaC) principles.
Collaborate with software development teams to influence the architecture and design of applications for better scalability, reliability, and performance.
Develop and maintain monitoring, alerting, and logging solutions to proactively detect and resolve system issues.
Respond to incidents and outages, conduct root cause analysis, and implement preventative measures to minimise recurrence.
Participate in on-call rotations and respond promptly to critical incidents.
Continuously improve system performance through tuning, capacity planning, and load testing.
Implement security best practices, ensuring that systems are compliant with industry standards and regulations.
Automate routine operational tasks using scripting and programming languages.
Work with cross-functional teams to define and document operational procedures and runbooks.
Contribute to the improvement of the CI/CD pipelines to ensure seamless deployments.
Keep abreast of industry trends, emerging technologies, and best practices in SRE and cloud infrastructure management.
Requirements
Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent practical experience).
Proven experience as a Site Reliability Engineer or DevOps Engineer, or in a similar role, managing large-scale, highly available production systems.
Solid experience with cloud platforms (e.g., AWS, Azure, GCP), including proficiency in provisioning and managing resources.
Strong understanding of Linux/Unix systems and command-line utilities.
Proficiency in at least one programming or scripting language (e.g., Python, Ruby, Bash, PowerShell).
Experience with Infrastructure as Code (IaC) tools like Terraform or CloudFormation.
Knowledge of containerisation and orchestration technologies (e.g., Docker, Kubernetes).
Familiarity with monitoring tools and concepts (e.g., Prometheus, Grafana, ELK stack).
Understanding of networking protocols, load balancing, and firewalls.
Strong problem-solving and troubleshooting skills, with a focus on root cause analysis.
Excellent communication and collaboration skills to work effectively with cross-functional teams.
Nice to have
Relevant certifications in SRE, DevOps, or cloud technologies.
Experience with databases and data management (e.g., SQL, NoSQL, caching systems).
Knowledge of configuration management tools (e.g., Ansible, Puppet, Chef).
Understanding of Agile methodologies and experience in Agile/Scrum environments.
Familiarity with security practices and compliance frameworks.
Note: The job description provided is a general overview and may be customised to align with the specific needs and requirements of the company and its ongoing projects.