Data Engineering

Curated by Anoop-Viswan

The term Data Engineering is used loosely in different contexts. Here we’re trying to provide some guidance on the learning path and how you can start the journey towards big data engineering one step at a time.

Why Learn Data Engineering ❓

The simple answer: look around, and all you see is data.

  • The world is generating more and more data every second, in many forms, and we need technology to harness it and make sense of it.
  • Also, as you may know, Data Science and Machine Learning are thriving these days, and the success of these projects largely depends on the quality of the data being used. It is the responsibility of Data Engineers to provide that data in the required form and format.
  • Beyond that, there is a lot of innovation and advancement happening in data engineering technologies, and we think it may be exciting for you to be part of it.
  • On the lucrative side, data engineering/science is one of the most in-demand jobs 😎 today and usually comes with a heavy paycheck πŸ’°

First things to learn

First and foremost, all you need is a little curiosity, interest, and willingness to learn. Everything else will follow.

A couple of points to note before we get into the thick of things:

πŸ‘‰ This learning path is intended for novice and aspiring engineers who want to step into the data engineering world, not for people with intermediate or advanced knowledge and skills who are looking for mastery in data engineering.

πŸ‘‰ One of the challenges newbies face is information overload: the sheer number of tools and technologies in the data engineering world. People easily get overwhelmed and confused about where to start. So you may not see many of the latest and greatest tools and buzzwords here, but that is intentional.

Ok enough is enough, let’s get started πŸ§—πŸΌ

When it comes to data engineering, there are essentially only two problems you need to focus on as a beginner:

  1. How you store your data: Many organizations are embracing the idea of a Data Lake as their centralized repository for structured and unstructured data. You can read a little bit about Data Lake concepts here: Data Lake. Try to get an understanding of the various distributed storage options available, such as HDFS (mostly used for on-prem storage) or S3, Google Cloud Storage, etc. It is also important to understand the differences between the file formats in use. Popular formats are ORC and Parquet, and in some cases data is stored as JSON or CSV as well. Read a little bit here: Big Data File Formats
  2. How you process the data: While it is true that there are a ton of data processing tools and technologies available, as a beginner we would recommend getting some insight into the de facto data processing framework, Spark. In reality, this is a Swiss Army knife for Data Engineers. Get an overview here: SPARK
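To make the row-vs-columnar distinction behind formats like Parquet and ORC concrete, here is a minimal pure-Python sketch (standard library only; the sample records and field names are made up for illustration). It writes the same records in a row-based layout (CSV) and then regroups them by column, which is the core idea that lets columnar formats read only the columns a query needs:

```python
import csv
import io

# Hypothetical sample records, row-oriented (one record per dict),
# the way CSV or JSON-lines data arrives.
rows = [
    {"user": "alice", "bytes": 120},
    {"user": "bob", "bytes": 300},
    {"user": "carol", "bytes": 90},
]

# Row-based storage (CSV): every field of every record is written in order.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["user", "bytes"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()

# Columnar layout (the idea behind Parquet/ORC): values grouped by column,
# so a query that only needs "bytes" can skip the "user" column entirely.
columns = {key: [r[key] for r in rows] for key in ["user", "bytes"]}

total_bytes = sum(columns["bytes"])  # touches just one column
print(csv_text.splitlines()[0])
print(total_bytes)
```

Real columnar formats add compression, encoding, and metadata on top of this grouping, but the access pattern, scanning one column without deserializing the others, is the same.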
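Spark's core processing model, chaining lazy transformations over a dataset and only computing when an action runs, can be previewed with plain Python built-ins. The sketch below is a rough single-machine analogy (not real Spark code; the input lines are invented for illustration):

```python
# A single-machine analogy of Spark's transformation model:
# generators stand in for lazy RDD/DataFrame transformations,
# and reduce() plays the role of an action that forces evaluation.
from functools import reduce

lines = [
    "spark makes data processing easy",
    "data engineers love spark",
]

# Transformations: lazy, nothing is computed yet
words = (w for line in lines for w in line.split())       # like flatMap
spark_words = (w for w in words if w == "spark")          # like filter

# Action: triggers the whole pipeline, like .count() in Spark
count = reduce(lambda acc, _: acc + 1, spark_words, 0)
print(count)  # -> 2
```

In real Spark the same chain of transformations is distributed across a cluster, but the mental model of "describe the pipeline first, execute on demand" carries over directly.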

Now that we have an idea of 'WHAT' needs to be learned for starters, let’s take a look at 'HOW' to learn and where to start.

  • Cognitive Class, hosted by IBM, is a good resource to start with. It is completely free and gives you some (limited) cloud-based notebooks to learn and practice with. Take a look at this:

https://cognitiveclass.ai/learn/big-data

Next Things to learn

  • Another great resource is the book πŸ“š Hadoop Application Architecture. It is a bit dated, but the concepts and architecture basics are still relevant and well organized.
  • For those who are impatient (like me 🀭 ), take a look at the Table of Contents; it will at least give you an overview of the data engineering life cycle - how to model/store the data => then how to ingest it => then what the processing options are, and so on.
  • The official Spark documentation is pretty decent. Check it out here: Spark Documentation
  • A note on programming language - Spark APIs are available in Scala, Python, and Java. πŸ”˜ The choice is yours.
  • Code written in Scala tends to be more concise and less verbose. You don’t need to be an expert in Scala to use Spark; just enough Scala will do. Take a look here: JUST ENOUGH SCALA.

However, if you are already familiar with Python 🐍 or are inclined towards learning Data Science, then the Python API is a natural choice.

Free, unsolicited, out of context advice 🧐

One thing I wish I had understood earlier is how to learn. Some of the habits we follow, or assume are good ways of learning, may not actually be effective or efficient.

Happy learning πŸ˜€