I've been working in IT for over ten years and have faced many challenges along the way: working in an operations team providing support directly to the core business; managing large, complex infrastructure environments; monitoring everything from application and machine metrics to business-level monitoring that collects metrics from many sources and combines the data to show the health of the business; and building cloud environments from scratch or maintaining very large ones. For the last four years I have worked as an SRE, helping companies build and maintain the best platforms possible and turn simple projects into truly great products, drawing on everything mentioned above.
On the technical side, I am a qualified professional in network and infrastructure management, cloud computing management, pipeline development for microservice management, and everything that involves DevOps culture, as well as automation development to reduce manual tasks, cut costs, and standardize environments through automation tools or Python development. I also have advanced experience with Infrastructure as Code to create large, complex, and secure environments, along with solid knowledge of Kubernetes cluster management and everything that makes up its ecosystem.
Client: Dell
Project: Serasa Experian
As an SRE, I am responsible for creating, configuring, maintaining, and monitoring all development and production environments that support the services developed by the team.
To standardize the environment, I created several custom Terraform modules for our team to fully automate resource creation, which significantly reduced the time to make resources available. Among the most commonly used resources were DynamoDB, RDS, SQS, SNS, S3, EMR, and EKS. This also made it possible to create more comprehensive solutions combining multiple resource types at once, enabling configurations for more specific situations, such as audit or scanning systems required by global policies or compliance.
One of the main tools used by the team is Airflow, and I led a project to start running Airflow on Kubernetes. A Terraform project was created so that all the resources and configuration needed for the solution were ready for use in just a few minutes. It covered the creation of Kubernetes resources, Helm releases, databases (RDS), external logs (S3), Docker, DNS, firewall rules, and automatic DAG synchronization. The project reduced costs by almost 70% while also improving performance. Furthermore, it was presented to the entire company, and other squads used it as a reference for their own environments.
Cost management needed to change as the number of services and applications grew. To address this, we decided to map AWS resources, Airflow DAGs, and Kubernetes workloads with tags and/or labels to determine which resources belonged to each product or flow. Python scripts connected to the AWS and Kubernetes (Kubecost) APIs to collect cost information, which was then displayed in Grafana. This made it possible to manage resources per application and know exactly how much each service cost the company.
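The aggregation step of a tag-based cost mapping like this can be sketched in a few lines of Python. This is a minimal illustration only: the record shape, tag names, and values below are hypothetical, and a real implementation would pull data from the AWS and Kubecost APIs rather than a local list.

```python
from collections import defaultdict

def aggregate_costs_by_product(cost_records):
    """Group raw cost entries by their 'product' tag.

    Each record is a dict like {"resource": ..., "tags": {...}, "cost": float}.
    Untagged resources are grouped under 'untagged' so tagging gaps stay visible.
    """
    totals = defaultdict(float)
    for record in cost_records:
        product = record.get("tags", {}).get("product", "untagged")
        totals[product] += record["cost"]
    return dict(totals)

# Hypothetical records; real data would come from the AWS/Kubecost APIs
records = [
    {"resource": "rds-orders", "tags": {"product": "checkout"}, "cost": 120.0},
    {"resource": "sqs-events", "tags": {"product": "checkout"}, "cost": 15.5},
    {"resource": "eks-node-1", "tags": {}, "cost": 80.0},
]
print(aggregate_costs_by_product(records))
# {'checkout': 135.5, 'untagged': 80.0}
```

Grouping untagged spend explicitly, instead of dropping it, is what lets a dashboard show how much of the bill is still unattributed.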
As a DevOps Tech Lead, my mission was to align business expectations with DevOps development. In addition, I made technical decisions on which technologies and methods we should adopt.
By working more closely with the business team, I was able to better understand their needs and take advantage of the best opportunities for DevOps in the area. The focus was on bringing more business monitoring metrics through Observability so the rest of the area could closely track product results.
While still collaborating technically with the team, I helped improve performance and monitor applications more actively using techniques and tools such as APM, Prometheus, Elasticsearch, and log monitoring, all with a focus on Observability.
Beyond the technical side, I was responsible for guiding the team to give them more autonomy in day-to-day decisions, help them create smart strategies to monitor and maintain the environment, and foster a good relationship with the business team.
As a DevOps Engineer, I was responsible for managing the entire product and service pipeline for the New Business area, in addition to spreading the DevOps culture across the team.
The New Business area within the bank was relatively new, so my responsibility was to build the entire environment from scratch. Initially, the focus was on constructing the entire network architecture within AWS, considering all other existing On-Premise and cloud networks within the bank.
The applications were developed using microservices, and a project was created to automate the Kubernetes cluster via AWS's managed service, EKS. The entire project was automated using Terraform and implemented through an infrastructure pipeline exclusively created for this in Azure Pipelines. The project also included all infrastructure resources that would be used by Kubernetes, such as Rancher, Karpenter, and Traefik.
For the applications, which were mostly .NET, pipelines were also created to perform the necessary tests, builds, and deployments in both development and production environments.
My scope also included the creation, configuration, and maintenance of the AWS-hosted resources required by the applications, such as RabbitMQ, MongoDB, and Elasticsearch. These services were part of a dedicated project for building resources via Terraform and also went through the testing, building, and deployment processes in the pipeline.
With the environments, pipelines, and resources in place, it was possible to develop the first products for the area. The entire implementation process, both for infrastructure and applications, was automated and monitored.
Working within a squad dedicated to monitoring and observability, my responsibility was to keep all monitoring platforms available.
The main focus was to configure and improve the monitoring systems. For this, there was an exclusive Kubernetes cluster for monitoring, where we implemented resources such as Prometheus, Grafana, Logstash, and custom applications that sent metrics to other platforms like databases, Splunk, and New Relic. It was necessary to maintain this cluster by monitoring two key metrics: resource consumption and request volume.
In the same cluster, I managed logs from all company applications. The logs were processed and sent to Kafka, then to New Relic. This was a delicate process, and we used real-time metrics and alerts to track the log processing.
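A log flow like the one described (applications, then processing, then Kafka, then New Relic) hinges on a parse-and-enrich step. The sketch below shows that step only, under stated assumptions: the field names are hypothetical, and the Kafka producer and New Relic shipper are replaced by a returned record so the logic stays self-contained.

```python
import json

def process_log_line(raw_line, app_name):
    """Parse one raw JSON log line and enrich it before forwarding.

    Returns None for malformed lines so they can be counted and alerted on
    instead of silently dropped.
    """
    try:
        record = json.loads(raw_line)
    except json.JSONDecodeError:
        return None
    # Hypothetical enrichment fields; a real pipeline would hand the record
    # to a Kafka producer, with a downstream consumer shipping to New Relic.
    record["app"] = app_name
    record["forward_to"] = "kafka"
    return record

print(process_log_line('{"level": "error", "msg": "timeout"}', "billing"))
print(process_log_line("not json", "billing"))  # None
```

Returning None for bad lines, rather than raising, matches the "delicate process" concern above: malformed input becomes a metric you can alert on in real time.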
Configuring which metrics to monitor and how to do so was also an important task, so a deep knowledge of tools like Prometheus, Grafana, and New Relic was required to extract the most useful information for the business. I also managed alerts and how the workflow should proceed after an incident.
As an SRE specialized in monitoring, I was responsible for spreading the culture of monitoring within teams and helping them build their own monitoring systems, whether with dashboards, more complex queries, or understanding the monitoring itself.
One ongoing project that I was able to continue was the main business monitoring service we used. It was a PHP application that gathered transaction metrics from various sources, categorized them, and sent them to another database, allowing us to view the information almost in real-time. This monitoring dashboard was used by everyone in the company, including the board, to assess the health of the business at any given moment.
During this period I worked as an SRE in a team that supported all squads. My main duty was to keep all environments stable. In addition, I helped specific squads build their products, pipelines, cloud resources, and monitoring, and shared knowledge about the core business so their products could perform better in our environments. It was also common to troubleshoot alongside the squads or the SRE team whenever there was a problem or incident.
I worked with the main AWS services, such as Beanstalk, Lambda, RDS, EKS, ElastiCache, and EC2, from creating and configuring them to monitoring their health. It was also our duty to spread the DevOps culture in the company, which meant working hard to build deployment pipelines that developers could rely on to run their tests, build images, and deploy their applications with a simple experience.
I also pursued improvements such as reducing costs, improving application performance, creating automations to cut the risks of manual tasks, and establishing standards for building or running anything new.
I was involved in two big projects there as the main contributor. The first was migrating all applications to a new EKS cluster, since they were spread across many other resource types. The entire project was built in Terraform and deployed by pipeline; afterwards, I helped the squads migrate their applications. The second was building the company's first Observability project, focused on business metrics. I built monitors for every step of the client's journey through the product, making it possible to see aggregate values and conversion rates that reflected the health of the business. To do that, I collected metrics from many sources, combined and joined the data, built many dashboards, and finally created alert notifications to warn the right teams about potential business problems.
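The conversion-rate calculation at the heart of a client-journey monitor like that can be sketched as follows. The step names and counts are invented for illustration; real counts would be aggregated from the metric sources mentioned above.

```python
def funnel_conversion(step_counts):
    """Compute step-to-step conversion rates for an ordered funnel.

    step_counts: list of (step_name, count) pairs, ordered from the first
    step of the client journey to the last. Returns a dict mapping each
    transition to its conversion rate.
    """
    rates = {}
    for (prev_name, prev_count), (name, count) in zip(step_counts, step_counts[1:]):
        # Guard against division by zero when an upstream step saw no traffic
        rates[f"{prev_name} -> {name}"] = count / prev_count if prev_count else 0.0
    return rates

# Hypothetical step counts for a client's path through the product
steps = [("landing", 1000), ("signup", 400), ("purchase", 100)]
print(funnel_conversion(steps))
# {'landing -> signup': 0.4, 'signup -> purchase': 0.25}
```

Tracking per-transition rates, rather than only the end-to-end rate, is what pinpoints which step of the journey a business problem starts at.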
In the operations area, my objective was to coordinate the service desk team, manage the local infrastructure, provide second-level support for tickets, monitor business, infrastructure, and applications, and assist the squads with development, implementation, bug resolution, and any other production activities.
First, I coordinated the implementation of the service desk at Geru. Through roadmaps and a strong initial focus on automation, we were able to manage the entire infrastructure excellently in a short time, while also being 100% compliant with info-sec. We used tools like Ansible to automate processes and Zabbix to monitor the machines.
Another important task I handled was monitoring the local infrastructure. I was also responsible for managing this part, including creating VPNs, making adjustments to the firewall, etc. To ensure better control, the first step was to implement Zabbix in an AWS architecture.
Above all, though, the most important tasks were resolving customer issues and monitoring the business.
As the lead for second-level ticket resolution, I worked directly in production using Python and SQL. I also worked a lot with MongoDB and Influx to help with daily tasks. Additionally, it was important to stay close to developers to identify and fix bugs. I also built a knowledge base so that the rest of the team could absorb the information.
Regarding monitoring, we always used tools like New Relic to monitor our applications, but we had a gap in centralizing other types of monitoring and understanding the real business problem. For this reason, we initiated a large project that allowed us to analyze each stage of the system. We collected metrics from Influx, Prometheus, and CloudWatch, and aggregated the data aligned with the business. From there, we were able to build effective alerts and define the correct actions to be taken.
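The "effective alerts with correct actions" step can be sketched as a simple rule-evaluation pass over aggregated metrics. This is an illustrative sketch only: the metric names, thresholds, and team names are hypothetical, and real values would be aggregated from Influx, Prometheus, and CloudWatch as described above.

```python
def evaluate_alerts(metrics, rules):
    """Return alerts for metrics that breach their threshold.

    metrics: dict of metric name -> current aggregated value.
    rules: dict of metric name -> (threshold, team_to_notify), so each
    alert already carries the correct action: who to notify.
    """
    alerts = []
    for name, (threshold, team) in rules.items():
        value = metrics.get(name)
        if value is not None and value > threshold:
            alerts.append({"metric": name, "value": value, "notify": team})
    return alerts

# Hypothetical aggregated metrics and alerting rules
metrics = {"error_rate": 0.07, "p95_latency_ms": 310}
rules = {"error_rate": (0.05, "payments"), "p95_latency_ms": (500, "platform")}
print(evaluate_alerts(metrics, rules))
# [{'metric': 'error_rate', 'value': 0.07, 'notify': 'payments'}]
```

Binding each rule to a responsible team up front is what turns a raw threshold breach into a defined action rather than an unowned notification.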
I was responsible for managing the entire ICT area of the headquarters in São Paulo and the branch in Belo Horizonte. There were around 200 users, and among my main responsibilities were: managing the entire network and infrastructure, resolving tickets of all levels, managing the team, costs, projects, and seeking infrastructure innovations.
On a daily basis, I managed virtualization technologies like Hyper-V and VMware, Microsoft servers (AD, SQL Server, TFS), and Linux servers for the development team. I was also responsible for security, configuring VPNs, firewall rules, and Wi-Fi networks (Cisco, Fortigate).
I implemented improvements like deploying Asterisk to reduce costs and improve the user experience with telephony. I also initiated a monitoring project with Zabbix to keep the entire infrastructure 100% monitored.
I introduced the company to cloud computing (AWS and Azure), starting by backing up all the servers and then migrating some services as well.
The biggest challenge was aligning infrastructure with the company's business. I believe this was only possible after mapping all the processes. After that, it was necessary to standardize them, and this allowed me to understand the impact of infrastructure on the business. With the proposed improvements, the business no longer suffered from failures.
It was definitely my biggest experience so far. I joined as a Junior Support Analyst and left as a Senior Analyst, performing tasks alongside the infrastructure team. I even coordinated the service desk team for a brief period and was assigned to the 2014 World Cup project, where I spent a few days in Rio de Janeiro providing support during the event.
My mission was to handle requests as quickly as possible, as there were more than 400 users who might need assistance. The support process required effective troubleshooting, and if necessary, escalating the request to level 3 (infrastructure or systems). We maintained constant communication with the global team, as they were responsible for system-specific support and security.
One of the objectives was to maintain the most controlled environment possible, using proactivity to solve problems. Managing stock and inventory was also part of the mission.
I had my first contact with tools such as Cisco's Call Manager for VOIP, VDI with Citrix, and Symantec for backups. Over time, I began managing AD and Exchange as well. Additionally, I executed tasks in systems (Microsoft Dynamics, back-office system for stores) alongside the third-level team.
In another Adidas unit, in the warehouse, I provided support for thermal printers (Zebra) and the supply chain system (Manhattan).
Over time, I also managed purchases, corporate lines, and aligned processes with HR.
All this experience took me to a new level of maturity, technical knowledge, soft skills, and allowed me to understand more closely how technology can influence the business, in this case, the retail business. I gained enough experience to dive deeper into infrastructure and management.
AWS
Kubernetes
Terraform
Linux
Observability
New Relic
CI/CD Pipelines
Python