Training: Network Administrator to Site Reliability Engineer - Part 3: Chaos Engineer
DevOps
29 hours
English (US)

Product information

This training focuses on troubleshooting and bringing order to chaos as a Site Reliability Engineer. You learn how to identify and address system problems effectively, what the different approaches to troubleshooting are, and how to simplify and streamline the process while avoiding common pitfalls. With a focus on issue reporting, examination, diagnosis, and testing, you gain insight into observing recent changes and tracing causes, which makes troubleshooting more efficient. You also explore a variety of troubleshooting tools, including logging, monitoring techniques, and cloud-based solutions such as Google Cloud Dataflow. Mastering these tools lets you diagnose and resolve system problems quickly, helping to ensure optimal performance and reliability.

Next, you learn proactive planning techniques and response strategies to prepare for unexpected emergencies. You examine different types of emergencies, understand the importance of documentation, and develop incident response plans to minimize downtime and limit risk effectively. You also build your expertise in software reliability testing by examining various testing techniques and reliability metrics aimed at failure-free software operation. Finally, you gain insight into coordinating and executing successful product launches, with an emphasis on reliability, scalability, and consistency.

Course content

Network Administrator to Site Reliability Engineer - Part 3: Chaos Engineer

29 hours

SRE Troubleshooting Processes

Troubleshooting is a critical skill for site reliability engineers (SREs). Using past experiences, a proper mindset, and a stable troubleshooting process, SREs can effectively report, triage, examine, diagnose, test, and cure system issues. In this course, you'll explore troubleshooting approaches and best practices, while also learning how to avoid common pitfalls. You'll explore issue reporting, triaging, examination, diagnosis, and testing. You'll recognize how to simplify and reduce troubleshooting, use the "what, why, and where" technique, and examine negative results. You'll also investigate how to observe and interpret recent changes to identify what went wrong with a system. Lastly, you'll locate probable cause factors and outline the steps used to make troubleshooting more effective.

SRE Troubleshooting: Tools

Site reliability engineers (SREs) are typically good problem solvers. They need to think logically to identify problems, correct them, and prevent them from happening again. In this course, you'll explore several built-in and open-source troubleshooting tools SREs can use for resolving system issues. You'll start by examining the techniques of logging and whitebox and blackbox monitoring used to monitor system events. You'll then work with the various built-in Windows troubleshooting tools, namely the Event Viewer, Resource Monitor, and System Information tools. Next, you'll use Google Cloud Dataflow to process logs, before outlining the purpose and benefits of the StatsD standard and the /api/search endpoint. Lastly, you'll identify how Google's Dapper is used for troubleshooting distributed systems, and the open standards tool, Prometheus, for instrumenting software and exposing metrics.

SRE Emergency & Incident Response: Responding to Emergencies

Site Reliability Engineers (SREs) are responsible for assigning the appropriate resources and responsibilities to effectively deal with unexpected emergencies. To do this, SREs should ensure the proper processes and teams are in place before an emergency occurs. In this course, you'll explore the different emergency types and outline how to plan for them. You'll examine the causes of and how to respond to test-induced, change-induced, and process-induced emergencies and what's involved in proactive approaches to emergency testing and planning. You'll then outline the critical steps to correctly documenting emergencies, including the history of outages and mistakes. You'll then differentiate between business continuity and disaster recovery planning and outline how to create both types of plans and conduct a business impact analysis. Lastly, you'll explore some IT recovery strategies.

SRE Emergency & Incident Response: Incident Response

A well-prepared and organized approach is key to addressing and managing the aftermath of a system failure, security breach, or cyberattack. In this course, you'll explore the fundamental principles an SRE needs to be familiar with when responding to and managing incidents. You'll identify the goals, requirements, best practices, and key players involved in incident management. You'll learn how to deal with managed and unmanaged incidents and what's involved in an incident response plan. You'll identify incident response roles and responsibilities, and how to use incident metrics to manage incidents at scale. You'll outline what's involved in establishing a computer security incident response team (CSIRT), including each key team member's roles and responsibilities. Lastly, you'll examine what goes into an incident response policy.

SRE Testing Tasks: Software Reliability & Testing

Site reliability engineers (SREs) can use various testing techniques to ensure software operations are as failure-free as possible for a specified time in a specified environment. In this course, you'll explore multiple testing techniques, their purposes, and the tasks involved in their execution. You'll start by examining traditional software testing approaches, such as unit tests, integration tests, and system tests. Next, you'll investigate the components and use cases of various reliability metrics applied to SRE testing, including mean time to failure (MTTF), mean time to recover (MTTR), and mean time between failures (MTBF). Lastly, you'll outline several software testing approaches, such as stress, configuration, integration, acceptance, production, and canary testing, among others. You'll identify when, how, and by whom each of these testing types is carried out.
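As an illustration of the reliability metrics named above, here is a minimal Python sketch that is not taken from the course material. It assumes a single service observed over a fixed window, a hypothetical `Outage` record, and the common convention that MTBF equals MTTF plus the repair time (MTTR); other definitions exist.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    start_hour: float  # hours since observation began when the failure occurred
    end_hour: float    # hours since observation began when service was restored


def reliability_metrics(outages, window_hours):
    """Return (MTTF, MTTR, MTBF) in hours for one service over a window."""
    if not outages:
        raise ValueError("at least one outage is needed to compute the means")
    downtime = sum(o.end_hour - o.start_hour for o in outages)
    uptime = window_hours - downtime
    n = len(outages)
    mttf = uptime / n      # approximate mean operating time per failure
    mttr = downtime / n    # mean time to restore service
    mtbf = mttf + mttr     # assumed convention: one full failure-to-failure cycle
    return mttf, mttr, mtbf


# Hypothetical data: three outages in a 30-day (720-hour) window.
outages = [Outage(100, 102), Outage(400, 401), Outage(650, 653)]
mttf, mttr, mtbf = reliability_metrics(outages, window_hours=720)
availability = mttf / (mttf + mttr)
print(mttf, mttr, mtbf, round(availability, 4))  # 238.0 2.0 240.0 0.9917
```

The availability figure follows directly from the two means, which is one reason these metrics are usually reported together.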

SRE Testing Tasks: Testing Considerations

Site reliability engineers (SREs) need to create a healthy test and build environment to ensure that products being distributed integrate and function as expected. In this course, you'll explore the fundamentals of creating a robust SRE test and build environment, looking at the standard tools and techniques available for testing at scale. You'll examine disaster and statistical testing, and learn about working with deadlines and production configurations. You'll investigate the topic of test failures, identifying why an SRE should expect specific tests to fail and how results for test failures can help maximize knowledge about operations and end-users. Lastly, you'll look at the why and how of incorporating break glass procedures, integration testing configuration files, and fake back-end versions into your testing procedures.

SRE Load Balancing Techniques: Front-end Load Balancing

Today's distributed systems can consist of hundreds or even thousands of servers, and getting them to work together efficiently is a challenge. Load balancing is a multifaceted concept whose many techniques can help SREs face this challenge. In this course, you'll explore how front-end load balancing works and its associated techniques, concepts, and capabilities. You'll examine the characteristics of load balancers, their use in application delivery and security, and the use of DNS load balancers. You'll outline strategies for virtual IP load balancing, cloud load balancing, and handling overload. Finally, you'll learn how the Google Front End Service, Andromeda virtualization stack, Maglev network load balancing service, and the Envoy edge and service proxy are used for load balancing-related tasks.

SRE Load Balancing Techniques: Data Center Load Balancing

A Site Reliability Engineer (SRE) must know how to perform load balancing within the data center, both internally and externally. In this course, you'll learn about load balancing, including various methods for balancing loads in the data center. You'll begin by examining what data center load balancing is and its importance to performance, as well as load balancing policies. You'll then learn how to deal with unhealthy tasks using flow control, and tips and tricks for optimizing load balancing. Next, you'll examine methods for limiting connection pools with subsetting, and the various load balancing components. Lastly, you'll learn how to balance loads internally and externally using HTTPS and TCP/UDP, and how to balance loads using SSL and TCP proxy load balancing.
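To make the idea of limiting connection pools with subsetting more concrete, here is a small Python sketch modeled on the deterministic subsetting approach described in Google's Site Reliability Engineering book. It is an illustration only, not the course's own code; the function name and the integer `client_id` numbering are assumptions.

```python
import random


def deterministic_subset(backends, client_id, subset_size):
    """Pick subset_size backends for a client (deterministic subsetting sketch)."""
    subset_count = len(backends) // subset_size
    if subset_count == 0:
        raise ValueError("subset_size must not exceed the number of backends")
    # Clients are grouped into rounds; every client in a round uses the same shuffle.
    round_id = client_id // subset_count
    shuffled = list(backends)
    random.Random(round_id).shuffle(shuffled)
    # Within a round, each client takes a different slice of the shuffled list.
    subset_id = client_id % subset_count
    start = subset_id * subset_size
    return shuffled[start:start + subset_size]


# Hypothetical fleet: 12 backends, each client connects to 3 of them.
backends = list(range(12))
round_zero = [deterministic_subset(backends, c, 3) for c in range(4)]
print(round_zero)
```

Because the four clients of one round take disjoint slices of the same shuffle, together they cover every backend exactly once, which spreads connections evenly.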

Site Reliability Engineer: Managing Overloads

Site reliability engineers (SREs) are typically responsible for preventing and managing overloads. A common misconception is that overloads only affect computer systems. However, overloads also comprise types of occupational stress, which invariably negatively affect an organization. In this course, you'll explore the fundamental concepts and methods involved in managing overloads. You'll start by identifying operational load types and how they relate to performance. You'll then outline how to mitigate workloads and prioritize work before recognizing the specific consequences of overloads. You'll then describe how to manage client-side traffic using per customer limitations and client-side throttling. You'll examine tools such as criticality values and utilization signals. Finally, you'll explore approaches used for handling overload errors and learn how to identify issues caused by loads associated with connections.
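Client-side throttling, mentioned above, can be sketched in a few lines of Python. The example follows the adaptive throttling formula from Google's Site Reliability Engineering book, where a client rejects a request locally with probability max(0, (requests - K * accepts) / (requests + 1)). The class name is hypothetical, and the sketch uses lifetime counters for brevity, whereas a real client would count over a sliding time window.

```python
import random


class AdaptiveThrottle:
    """Client-side adaptive throttling sketch (simplified: no time window)."""

    def __init__(self, k=2.0, rng=None):
        self.k = k            # multiplier: higher values throttle less aggressively
        self.requests = 0     # requests the application attempted
        self.accepts = 0      # requests the backend accepted
        self.rng = rng or random.Random()

    def reject_probability(self):
        return max(0.0, (self.requests - self.k * self.accepts) / (self.requests + 1))

    def allow(self):
        """Decide locally whether to send the next request to the backend."""
        return self.rng.random() >= self.reject_probability()

    def record(self, accepted):
        """Record the outcome of one attempted request."""
        self.requests += 1
        if accepted:
            self.accepts += 1


# Hypothetical overload: the backend accepted only 20 of the last 100 requests.
throttle = AdaptiveThrottle(k=2.0)
for i in range(100):
    throttle.record(accepted=(i < 20))
print(round(throttle.reject_probability(), 3))  # 0.594
```

While the backend accepts at least 1/K of the traffic, the probability stays at zero and no request is dropped on the client side.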

Site Reliability Engineer: Managing Cascading Failures

Cascading failures are a concern for site reliability engineers (SREs) because they often stem from positive feedback and grow over time. In this course, you'll examine the various cascading failure triggers, such as overloads, CPU, and memory issues. You'll also explore the resource exhaustion issues resulting from cascading failures and the adverse effects on overall performance and stability. You'll outline steps to prevent server overloads, ensure efficient queue management, deal with latency, and manage slow startups. You'll explore terms such as "load shedding" and "code retries." You'll also identify the benefits of setting deadlines and how propagating cancellations can reduce or eliminate unneeded work and preserve resources for other needs. Finally, you'll outline the steps involved in testing cascading failures and in addressing them immediately.
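The interplay of retries and deadlines described above can be sketched as follows. This Python example is an assumption-laden illustration, not course material: `call_with_retries` and `flaky` are hypothetical names, the backoff is a plain exponential without the random jitter that production clients usually add, and the callee is assumed to accept its remaining time budget as an argument.

```python
import time


def call_with_retries(operation, deadline_s, max_attempts=3, base_delay_s=0.1,
                      clock=time.monotonic, sleep=time.sleep):
    """Retry operation with exponential backoff, never exceeding the deadline."""
    start = clock()
    last_error = None
    for attempt in range(max_attempts):
        remaining = deadline_s - (clock() - start)
        if remaining <= 0:
            break  # deadline passed: stop rather than do work nobody is waiting for
        try:
            # Pass the remaining budget down so callees can propagate the deadline.
            return operation(remaining)
        except Exception as err:
            last_error = err
        delay = base_delay_s * (2 ** attempt)
        if attempt == max_attempts - 1 or delay >= deadline_s - (clock() - start):
            break  # no attempts left, or no time left for another try
        sleep(delay)
    raise TimeoutError("operation did not succeed within its retry budget") from last_error


# Hypothetical backend call that fails twice before succeeding.
attempts = []


def flaky(remaining_s):
    attempts.append(remaining_s)
    if len(attempts) < 3:
        raise ConnectionError("backend overloaded")
    return "ok"


result = call_with_retries(flaky, deadline_s=5.0, base_delay_s=0.01)
print(result, len(attempts))  # ok 3
```

Capping both the number of attempts and the total time keeps retries from amplifying load on a backend that is already struggling.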

Distributed Reliability: SRE Critical State Management

Anticipating failures that will affect your company's systems is a crucial site reliability engineer duty. These failures are especially significant when they affect distributed systems, which is why efficient algorithms and strategies are essential in minimizing the likelihood of failures. In this course, you'll explore both critical state management and the CAP theorem, identifying how both concepts relate to distributed systems. Next, you'll examine several distributed system management algorithms and strategies, including deterministic and nondeterministic algorithms, distributed system models, and Byzantine faults. You'll then outline how each of these benefits distributed system management. You'll also investigate the Multi-Paxos message flow protocol and how it works with distributed systems. Finally, you'll describe what's involved in deploying and monitoring a consensus-based system to increase distributed system performance.

Distributed Reliability: SRE Distributed Periodic Scheduling

A distributed system requires constant maintenance to ensure failures don't interfere with its reliability and availability. Using periodic scheduling and replication, site reliability engineers can minimize the effect failures may have on a system's performance. One way to automate this process is to use the system daemon cron. In this course, you'll explore how to use cron for task scheduling; the purpose, components, and operators involved in cron jobs; and the format and characters of cron syntax. You'll outline how cron works with distributed periodic scheduling and idempotency, and in large-scale deployments. Next, you'll review the Paxos distributed consensus algorithm, best practices for its use, and how it applies to distributed replication. Lastly, you'll practice scheduling a cron job and using cron syntax generators.
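To illustrate the format and characters of cron syntax, here is a short Python sketch that checks whether a five-field cron expression fires at a given minute. It is a simplified teaching aid, not a replacement for cron: the function names are hypothetical, and it supports only numbers, `*`, lists (`,`), ranges (`-`), and steps (`/`), without month or weekday names and without `7` as an alias for Sunday.

```python
from datetime import datetime

# (minimum, maximum) for minute, hour, day of month, month, day of week
FIELD_RANGES = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 6)]


def expand_field(field, low, high):
    """Expand one cron field such as '*/15', '1-5' or '0,30' into a set of ints."""
    values = set()
    for part in field.split(","):
        base, _, step = part.partition("/")
        step = int(step) if step else 1
        if base == "*":
            start, end = low, high
        elif "-" in base:
            start, end = (int(x) for x in base.split("-"))
        else:
            start = end = int(base)
        values.update(range(start, end + 1, step))
    return values


def cron_matches(expression, moment):
    """Return True if the five-field cron expression fires at the given minute."""
    fields = expression.split()
    if len(fields) != 5:
        raise ValueError("expected: minute hour day-of-month month day-of-week")
    minute, hour, dom, month, dow = (
        expand_field(f, lo, hi) for f, (lo, hi) in zip(fields, FIELD_RANGES)
    )
    cron_weekday = (moment.weekday() + 1) % 7  # cron counts Sunday as 0
    if fields[2] != "*" and fields[4] != "*":
        # When both day fields are restricted, cron fires if either one matches.
        day_ok = moment.day in dom or cron_weekday in dow
    else:
        day_ok = moment.day in dom and cron_weekday in dow
    return (moment.minute in minute and moment.hour in hour
            and moment.month in month and day_ok)


# Every 15 minutes during the 02:00 hour, Monday through Friday.
schedule = "*/15 2 * * 1-5"
print(cron_matches(schedule, datetime(2024, 1, 1, 2, 30)))  # True (a Monday)
print(cron_matches(schedule, datetime(2024, 1, 6, 2, 30)))  # False (a Saturday)
```

In a distributed setting, a job fired by such a schedule may run twice or not at all after a failover, which is why the course pairs cron with idempotency.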

SRE Data Pipelines & Integrity: Data Pipelines

Site reliability engineers often find data processing complex as demands for faster, more reliable, and more cost-effective results continue to evolve. In this course, you'll explore techniques and best practices for managing a data pipeline. You'll start by examining the various pipeline application models and their recommended uses. You'll then learn how to define and measure service level objectives, plan for dependency failures, and create and maintain pipeline documentation. Next, you'll outline the phases of a pipeline development lifecycle's typical release flow before investigating more challenging topics such as managing data processing pipelines, using big data with simple data pipelines, and using periodic pipeline patterns. Lastly, you'll delve into the components of Google Workflow and recognize how to work with this system.

SRE Data Pipelines & Integrity: Pipeline Design

Site reliability engineers (SREs) encounter numerous and varied pipeline technologies and frameworks in their work. When building a pipeline, SREs need to invest considerable time during the design phase to ensure the results work best for the specific case. In this course, you'll explore the numerous features of a pipeline, such as latency, high availability, development, and operations. You'll also examine the two different pipeline mutations: idempotent and two-phase, as well as the checkpointing technique and various code patterns. You'll then investigate the five core characteristics of the pipeline maturity matrix and outline how they should be used to design the pipeline technology. You'll then identify potential failure modes, outage causes, and different prevention and response techniques. Finally, you'll outline event delivery system design and operations and how to plan for customer integration and support.
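An idempotent mutation, one of the two mutation types named above, gives the same end state no matter how often it is replayed. The Python sketch below shows one common way to achieve that, by recording the identifiers of mutations already applied. The `Ledger` class and its in-memory storage are hypothetical stand-ins for a real pipeline sink.

```python
class Ledger:
    """In-memory stand-in for a pipeline sink (hypothetical, for illustration)."""

    def __init__(self):
        self.balances = {}
        self.applied_ids = set()

    def apply(self, mutation_id, account, amount):
        """Apply a credit exactly once, even if the same mutation is replayed."""
        if mutation_id in self.applied_ids:
            return False  # already applied: the replay is a no-op
        self.balances[account] = self.balances.get(account, 0) + amount
        self.applied_ids.add(mutation_id)
        return True


# A pipeline stage that crashes and restarts may deliver the same batch twice.
batch = [("m1", "a", 10), ("m2", "b", 7), ("m3", "a", 5)]
ledger = Ledger()
for _ in range(2):  # second pass simulates the replay after a restart
    for mutation in batch:
        ledger.apply(*mutation)
print(ledger.balances)  # {'a': 15, 'b': 7}
```

Without the identifier check, the replay would double every balance, which is the kind of failure mode the design phase is meant to rule out.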

SRE Data Pipelines & Integrity: Data Integrity

Data integrity is vital as it ensures end-user data accuracy and consistency in conjunction with an adequate level of service and availability. In this course, you'll learn how to choose a strategy for data integrity, including how to account for any potential upsides and tradeoffs. You'll explore various types of failures that lead to data loss and the many data failure modes. You'll also identify data integrity challenges. Next, you'll examine in detail the soft deletion, backup and recovery, and early detection layers of defense-in-depth, before investigating the data integrity challenges a cloud developer may encounter in high-velocity environments. Finally, you'll outline considerations for implementing out-of-band data validation and successful data recovery and identify how the primary SRE principles apply to data integrity.

SRE Products at Scale: Product Launches

Site Reliability Engineers (SREs) often contribute to the launch of new products and features. These launches can occur in rapid iterations and at scale, so SREs need to be prepared to help them succeed. In this course, you'll examine launch coordination engineering to build and release reliable and fast products. You'll identify the criteria for a successful product launch and how to develop and use launch checklists to reduce failure and ensure consistency and completeness. Next, you'll outline the techniques used for reliable launches and how launch coordination engineers can help mitigate the repetition of launch mistakes. You'll investigate the production readiness review model used to identify a service's reliability needs. Lastly, you'll outline the characteristics of SRE engagement and early engagement models, as well as SRE engagement frameworks.

Final Exam: Chaos Engineer

Final Exam: Chaos Engineer will test your knowledge and application of the topics presented throughout the Chaos Engineer track of the Skillsoft Aspire Network Admin to Site Reliability Engineer Journey.

Features

Instructor included
Prepares for official exam
English (US)
29 hours
DevOps
180 days of online access
HBO (higher professional education) level

More information

Target audience: System Administrator, Network Administrator
Prerequisites

No formal prerequisites are required. However, some prior knowledge of Site Reliability Engineering, networking, and DevOps is recommended.

It is also recommended that you first complete Parts 1 and 2 of the "Network Admin to Site Reliability Engineer" learning path.

  • Part 1: Network Administrator
  • Part 2: DevOps Engineer
Result

After completing this training, you are ready to bring order to chaos as a Site Reliability Engineer. You have gained a solid understanding of topics such as emergency and incident response, reliability testing, load balancing, overload and cascading failures, distributed reliability, data pipelines and integrity, and launching products at scale.

Positive feedback from students

Training: Leading the AI Transformation

Useful training. The ordering process went smoothly, and I could start right away.

- Mike van Manen

Unlimited Learning Subscription

I bought Unlimited Learning because you get a lot of value for your money. I have only been using it for a short time, but my first impression is good.

- Floor van Dijk

Training: Leading the AI Transformation

For years, icttrainingen.nl has been our trusted partner in knowledge development for our IT staff. We are glad that the icttrainingen.nl platform lets us offer our personnel customized training and a wide range of courses.

- Loranne, Team Lead at Inwork

How does it work?

1

Order the training

After you order the training, you receive a confirmation by e-mail.

2

Access to the learning platform

The e-mail contains a link that gives you access to our learning platform.

3

Start right away

You can start immediately. From now on, study where and when you want.

4

Complete the training

Complete the training successfully and receive a certificate from us!

Frequently asked questions

Which payment methods can I use?

You can pay with iDEAL, PayPal, credit card, Bancontact, or by invoice. If you pay by invoice, you can start the training as soon as the payment has been received.

How long do I have access to the training?

This differs per training, but it is usually 180 days. You can find this under the heading "Features".

Where can I go if I have questions?

You can always reach our Learning & Development colleagues during office hours at support@icttrainingen.nl or by phone at 026-8402941.

Unlimited learning

With our Unlimited concept, you can make unlimited use of the trainings on the website for a fixed monthly fee.

Still have doubts?

Or just a question about the training? Don't keep it to yourself. We are happy to help you further. That's what we're here for!