Data Engineer
Building robust data-processing pipelines

7 weeks of ETL and pipelines

The Data Engineer program is intended for software developers and database administrators who want to shift their careers toward a new, data-related profession. Data engineering draws directly on your technical background and plays to your strengths: building robust pipelines and launching them in production. The program is also useful for data scientists and data managers who would like a better understanding of data collection and processing, and of how robust pipelines make data available to everyone in an organization who needs it.

Behind every product or service, whether it is a recommender system on a website, a personalized-offer mailing, or a campaign to retain current clients, there is data. The quality of your data determines the quality of your decisions: garbage in, garbage out. Data has to be up to date, correct, and preprocessed so that data analysts and data scientists can do their jobs efficiently. The data engineer is responsible for delivering that data from different sources (the company website, CRM, social networks).

The goal of the program is to teach you to build robust pipelines, from data collection all the way to visualization.

Learning by building. Over the program's 7 weeks, each participant individually works on one core project: collecting clickstream data from their own copy of a website and feeding it into two pipelines (a minimal sketch follows the list below) for:

  • batch processing,
  • real-time processing.
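
For orientation, here is a minimal sketch of that fan-out in Python, assuming the kafka-python package, a broker on localhost:9092, and an illustrative topic name "clickstream" (in the program itself the events come from Divolte):

    import json
    import time

    from kafka import KafkaProducer

    # Both pipelines read the same Kafka topic, so each event is published
    # once and consumed independently by the batch and real-time consumers.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    event = {"user_id": "42", "url": "/pricing", "ts": int(time.time() * 1000)}
    producer.send("clickstream", event)
    producer.flush()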

During the program each participant works on their own cloud cluster, configuring all the necessary tools for data processing: Divolte, Kafka, ELK, Spark, Mesos, Sqoop, Druid, ClickHouse, Superset. This foundation will let you explore and master other tools and build any other pipelines in the future.

TIMING

25 Feb - 15 Apr 2019

SCHEDULE

Mon, Wed, Fri, 19:00-22:00

PARTICIPATION FORMAT

In-person classes at the MegaFon office

PRICE

120 000 ₽

Instructors


    • Anton Pilipenko, Big Data Engineer, Mail.ru Group: "Nowadays, most companies have managed to store large volumes of data and to build various models on top of it, but they often pay little attention to efficient storage and preprocessing. As a result, many issues come up around sizing and application scaling, and around real-time and near-real-time processing.
      In my experience, the division into data scientists and data engineers has some grounds. A data engineer is an engineer who understands in detail what exactly they are doing and why, what is under the hood, and which architecture is not going to work out."
    • Nikolay Markov, Senior Data Science Engineer, Aligned Research Group LLC: "Why do people dive into data engineering? I think it is a logical way to build a career in the data field for those who already know how to write code and have experience in software engineering. It seems quite rare that people are equally interested in data science and data engineering: it is hard for one person to combine decent knowledge of math with the same level of computer science skills. So let's leave mathematicians with what they are good at: research, models, and visualizations, while we concentrate on how to make a good product out of an analytical idea."
    • Artyom Moskvin, Senior Software Engineer, Agoda: "The data engineer is the person who makes all this big data possible. Work with data can be divided into two parts: engineering and science. To make the second possible, you have to work thoroughly on the first. In our program we will teach you how to build ETL pipelines. These pipelines can become the foundation for all data processing in your company: you will be able to process your data in real time or in batches, tune BI tools, serve ad-hoc queries for other users, and automate machine learning."
    • Andrey Sutugin, Data Engineer, E-Contenta: "In the world of data analysis, everything is not as rosy as it may seem after you have solved the Titanic problem on Kaggle. Before you start analyzing your data you need to do a titanic amount of work, and if you want an automated process you will have to do even more. Unfortunately, there is no silver bullet in the big data world, and the variety of frameworks and tools can be confusing. Our program won't solve everything for you or hand you definitive answers on how to create an ideal ETL system, but it can show you a direction in which to grow and provide you with data-processing best practices that you can adapt to your own business cases."


Program

MODULE 1


Lambda and kappa architectures

Lab: connect the clickstream to Kafka and store it in Elasticsearch
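
A minimal sketch of this lab's shape, assuming kafka-python and the official elasticsearch client; the topic and index names are illustrative:

    import json

    from elasticsearch import Elasticsearch
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "clickstream",
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    es = Elasticsearch(["http://localhost:9200"])

    # Every Kafka message becomes one Elasticsearch document.
    for message in consumer:
        es.index(index="clickstream", body=message.value)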

Schedulers: Cron, Luigi, Airflow
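
As a taste of what a scheduler gives you over plain Cron (dependencies, retries, backfills), here is a toy Airflow DAG with a single daily task; it assumes Airflow 2.x import paths, and the task body is a placeholder, not actual course material:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def pull_batch():
        print("pulling yesterday's clickstream batch")

    dag = DAG(
        dag_id="daily_clickstream",
        start_date=datetime(2019, 2, 25),
        schedule_interval="@daily",
        catchup=False,
    )

    pull = PythonOperator(task_id="pull_batch", python_callable=pull_batch, dag=dag)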

Working with the environment: virtualenv, Docker, Ansible

Command-line tools for data engineers

Working with relational databases. Druid
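
Druid speaks SQL over HTTP, so ad-hoc queries need nothing beyond an HTTP client. A sketch, assuming the requests package, a broker on the default port 8082, and an illustrative "clickstream" datasource:

    import requests

    response = requests.post(
        "http://localhost:8082/druid/v2/sql",
        json={"query": "SELECT url, COUNT(*) AS hits FROM clickstream GROUP BY url"},
    )
    # The SQL endpoint returns a JSON array of row objects.
    for row in response.json():
        print(row)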

Lab: creating a scheduled script for tokenizing data from Elasticsearch
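
A sketch of what such a script might look like, assuming the official elasticsearch client; the index and field names are illustrative, and in the lab the script would run under one of the schedulers above:

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch(["http://localhost:9200"])

    # Scan every document in the index and split the URL into path tokens.
    for doc in helpers.scan(es, index="clickstream",
                            query={"query": {"match_all": {}}}):
        tokens = doc["_source"].get("url", "").strip("/").split("/")
        print(doc["_id"], tokens)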

Spark configuration. spark-submit
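
A minimal PySpark configuration sketch; the Mesos master URL and resource numbers are illustrative, and the same settings can equally be passed to spark-submit via --master and --conf:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("clickstream-batch")
        .master("mesos://zk://master:2181/mesos")  # the cluster runs on Mesos
        .config("spark.executor.memory", "2g")
        .config("spark.executor.cores", "2")
        .getOrCreate()
    )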

Lab: building an ML model in Spark that sends the results to Druid
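
A sketch of this lab's shape: train a simple Spark ML model and publish the scored rows to a Kafka topic that Druid can ingest through its Kafka indexing service. The path, column, and topic names are illustrative assumptions, and writing to Kafka requires the spark-sql-kafka package:

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import struct, to_json

    spark = SparkSession.builder.appName("score-clicks").getOrCreate()

    df = spark.read.parquet("/data/clickstream_features")  # illustrative path
    assembler = VectorAssembler(inputCols=["clicks", "session_len"],
                                outputCol="features")
    model = LogisticRegression(labelCol="converted").fit(assembler.transform(df))

    # Serialize scored rows as JSON and publish them for Druid to ingest.
    scored = model.transform(assembler.transform(df))
    (scored
        .select(to_json(struct("user_id", "prediction")).alias("value"))
        .write.format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("topic", "predictions")
        .save())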

Working with BI tools. Superset

NoSQL databases. ClickHouse, Tarantool
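
A minimal ClickHouse round trip, assuming the clickhouse-driver package and a server on the default native port; the table is illustrative:

    from datetime import datetime

    from clickhouse_driver import Client

    client = Client("localhost")
    client.execute(
        "CREATE TABLE IF NOT EXISTS hits "
        "(url String, ts DateTime) ENGINE = MergeTree() ORDER BY ts"
    )
    client.execute("INSERT INTO hits (url, ts) VALUES",
                   [("/pricing", datetime.now())])
    print(client.execute("SELECT url, count() FROM hits GROUP BY url"))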

Lab: making an analytical report with Superset and a Druid backend

Real-time pipelines. Storm

Dashboards. Grafana, Graphite
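
Graphite's plaintext protocol is just "metric value timestamp" over TCP port 2003, which is why a pipeline can report metrics with a few lines of Python (the metric name here is illustrative); Grafana then reads the series back for dashboards:

    import socket
    import time

    line = "pipelines.clickstream.events_per_min 1250 %d\n" % int(time.time())
    with socket.create_connection(("localhost", 2003)) as sock:
        sock.sendall(line.encode("ascii"))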

Lab: visualization of Storm preprocessing

Log analysis systems. Sentry
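
A minimal Sentry sketch, assuming the sentry-sdk package; the DSN is a placeholder from your project settings, and run_pipeline_step is a hypothetical function standing in for a real pipeline stage:

    import sentry_sdk

    sentry_sdk.init(dsn="https://<key>@sentry.example.com/<project>")

    try:
        run_pipeline_step()  # hypothetical pipeline step
    except Exception:
        sentry_sdk.capture_exception()  # report with a full stack trace
        raise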

Enterprise pipelines

Monitoring and troubleshooting a pipeline

Project presentations

