Site Reliability Engineer (SRE) for Infrastructure or Devops Eng

This job posting is outdated and the position may be filled.
Job Location

San Francisco - USA

Monthly Salary

Not Disclosed

Job Description

Site Reliability Engineer (SRE) for Infrastructure (SRE for Infra_V4)

The client has an opportunity for a Site Reliability Engineer (SRE) for Infrastructure with a major social media platform. Qualified candidates must be self-motivated and have a learning attitude. They need recent and extensive knowledge of Linux server operations and should be proactive problem solvers interested in root cause analysis and in troubleshooting system issues such as performance, process debugging, and log analysis. Candidates should also have networking skills (TOR, subnets, DNS, TCP, SSH, and SSL) and automation experience beyond Bash, being comfortable with at least one more language such as Golang or Python. Storage skills, including familiarity with RAID concepts, LVM, fdisk, multipathing, replication, and snapshots, are also required.

Location: Temporarily remote; San Francisco, LA, or Seattle, WA preferred. Candidates outside these areas must be willing to relocate.

Responsibilities:
Ensure the reliability, availability, and performance of services through stability and automation product development, disaster recovery planning, emergency response, chaos engineering, and system resilience improvements
Manage services, with responsibility for operational support, 24x7 troubleshooting, and automation design and development, including deployment
Troubleshoot and diagnose issues, and propose and implement solutions to reduce their frequency
Meet service-level agreements (SLAs) or service-level objectives (SLOs) by measuring and monitoring service availability, performance, and overall system health
Provide production system management, change management, and incident response, including emergency response and postmortems
Perform various SRE operations, including scaling up and down and building and maintaining clusters
Automate various services and workflows
On-call rotation is required

Minimum qualifications:
Bachelor's degree or above, majoring in Computer Science or a related field
Must be responsible self-starters with interpersonal skills, comfortable with ambiguity, excellent communicators, and problem solvers, with 5 to 7 years' experience in technical operations, DevOps, and/or infrastructure support and excellent Linux skills
5+ years of experience in one or more of the following types of systems at their newest versions:
Strong hands-on experience with Linux and TCP/IP networking
Strong hands-on experience with Python, Golang, and shell scripting
Prior experience with configuration and maintenance of common applications such as DNS, Nginx, Docker, Kubernetes, and MySQL
Available on a 24x7x365 basis when needed for production-impacting incidents or key customer events
Experience in debugging and automating routine tasks
Oracle cloud support, automation experience, and technical writing and design experience are a plus
Excellent team player focused on getting things done
Must be a motivated, fast learner with solid programming skills

Site Reliability Engineer (SRE) for R&D

The client has an opportunity for a Site Reliability Engineer (SRE) for Infrastructure with a major social media platform. Qualified candidates must be experienced DevOps engineers with a background in automation, scripting, and object-oriented languages used for automation. Candidates must have Linux admin experience with file systems, networks, and the core OS, as well as experience with K8 administration and deployments.

Location: Temporarily remote; San Francisco, LA, or Seattle, WA preferred. Candidates outside these areas must be willing to relocate.

Responsibilities:
Ensure the reliability, availability, and performance of services through stability and automation product development, disaster recovery planning, emergency response, chaos engineering, and system resilience improvements
Manage services, with responsibility for operational support, 24x7 troubleshooting, and automation design and development, including deployment
Troubleshoot and diagnose issues, and propose and implement solutions to reduce their frequency
Meet service-level agreements (SLAs) or service-level objectives (SLOs) by measuring and monitoring service availability, performance, and overall system health
Provide production system management, change management, and incident response, including emergency response and postmortems
Perform various SRE operations, including scaling up and down and building and maintaining clusters
Automate various services and workflows
On-call rotation is required

Minimum qualifications:
Bachelor's degree or above, majoring in Computer Science or a related field
Must be responsible self-starters with interpersonal skills, comfortable with ambiguity, excellent communicators, and problem solvers, with 5 to 7 years' experience in technical operations, DevOps, and/or infrastructure support and excellent Linux skills
Must have the ability to work in a fast-paced environment without constant supervision
Must have good troubleshooting skills
5+ years of experience in one or more of the following types of systems at their newest versions:
Strong hands-on experience with K8, Salestack, NoSQL, Argos, and Linux
Python, Golang, and Bash scripts
Available on a 24x7x365 basis when needed for production-impacting incidents or key customer events
Experience in debugging and automating routine tasks
DevOps exposure, CI/CD pipelines, SLI/SLO exposure, service ownership, on-call support experience, and technical writing and design experience are a plus
Excellent team player focused on getting things done
Must be a motivated, fast learner with solid programming skills

Site Reliability Engineer (SRE) for NoSQL Infra_V6

The client has an opportunity for a Site Reliability Engineer (SRE) for Infrastructure with a major social media platform. Qualified candidates must be self-motivated and have a learning attitude. They need recent and extensive knowledge of basic Linux file systems, memory management, process management, and basic networking, along with Linux troubleshooting experience. Candidates must have a good understanding of at least one of NoSQL, databases, or Redis/Cassandra, and previous experience with Python, Bash, Go, or another programming language. They must be proactive problem solvers interested in root cause analysis and in troubleshooting system issues such as performance, process debugging, and log analysis.

Location: Temporarily remote; San Francisco, LA, or Seattle, WA preferred. Candidates outside these areas must be willing to relocate.

Responsibilities:
Responsible for key-value stores and graph databases
Ensure the reliability, availability, and performance of services through stability and automation product development, disaster recovery planning, emergency response, chaos engineering, and system resilience improvements
Manage services, with responsibility for operational support, 24x7 troubleshooting, and automation design and development, including deployment
Troubleshoot and diagnose issues, and propose and implement solutions to reduce their frequency
Meet service-level agreements (SLAs) or service-level objectives (SLOs) by measuring and monitoring service availability, performance, and overall system health
Provide production system management, change management, and incident response, including emergency response and postmortems
Perform various SRE operations, including scaling up and down and building and maintaining clusters
Automate various services and workflows
On-call rotation is required

Minimum qualifications:
Bachelor's degree or above, majoring in Computer Science or a related field
Must be responsible self-starters with interpersonal skills, comfortable with ambiguity, excellent communicators, and problem solvers, with 5 to 7 years' experience in technical operations, DevOps, and/or infrastructure support and excellent Linux skills
5+ years of experience in one or more of the following types of systems at their newest versions:
Strong hands-on experience with Linux, Python, Bash (preferred), Golang, and graph databases such as Neo4j or AWS Neptune
Strong hands-on experience with Python, Golang, and shell scripting
Available on a 24x7x365 basis when needed for production-impacting incidents or key customer events
Experience in debugging and automating routine tasks
Oracle cloud support, automation experience, and technical writing and design experience are a plus
Excellent team player focused on getting things done
Must be a motivated, fast learner with solid programming skills

Site Reliability Engineer (SRE) for Motley's Crew_V8

The client has an opportunity for a Site Reliability Engineer (SRE) for Infrastructure with a major social media platform. Qualified candidates must be self-motivated and have a learning attitude. They need recent and extensive knowledge of basic Linux file systems, memory management, process management, and basic networking, along with Linux troubleshooting experience. Candidates must have exposure to distributed systems such as Consul, ZooKeeper, and MongoDB, and basic knowledge of container concepts (cgroups, namespaces, overlay volumes). Scripting skills, including Bash and Python, are required. Candidates should be curious, motivated learners with excellent communication skills who do not require constant supervision.

Location: Temporarily remote; San Francisco, LA, or Seattle, WA preferred. Candidates outside these areas must be willing to relocate.

Responsibilities:
Daily responsibility for cluster operations and maintenance
Ensure the reliability, availability, and performance of services through stability and automation product development, disaster recovery planning, emergency response, chaos engineering, and system resilience improvements
Manage services, with responsibility for operational support, 24x7 troubleshooting, and automation design and development, including deployment
Troubleshoot and diagnose issues, and propose and implement solutions to reduce their frequency
Meet service-level agreements (SLAs) or service-level objectives (SLOs) by measuring and monitoring service availability, performance, and overall system health
Provide production system management, change management, and incident response, including emergency response and postmortems
Perform various SRE operations including scale

Employment Type

Full Time

Company Industry

About Company

10 employees