Data Engineering

From PKC
Revision as of 06:09, 18 September 2022 by Ntustma012 (talk | contribs)

The word "engineering" is the key to understanding what data engineering is. Engineers design and build things. Data engineers design and build pipelines that transform data and deliver it in a format that is readily usable by data scientists and other end users. These pipelines gather data from many unrelated sources and combine it in a single warehouse that acts as the single source of truth for all of the data.

This sounds straightforward enough, but the position requires strong data literacy skills. Data engineers are also in short supply, which contributes to the role's ambiguity. One example of a data engineering activity is shown in the figure below.

[Figure: Software Engineering Flow.png]
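The pipeline idea described above (gather data from unrelated sources, transform it into one format, load it into a single warehouse) can be sketched in a few lines of Python. This is only a minimal illustration, not a production design: the two sources, their field names, the `customers` table, and all function names are hypothetical, and an in-memory SQLite database stands in for a real data warehouse.

```python
import csv
import io
import json
import sqlite3

# Hypothetical source 1: a CSV export from a CRM system.
CRM_CSV = """customer_id,full_name,signup_date
1,Ada Lovelace,2022-01-15
2,Alan Turing,2022-02-03
"""

# Hypothetical source 2: JSON records from a web shop, using different field names.
SHOP_JSON = '[{"id": 3, "name": "grace hopper", "created": "2022-03-09"}]'


def extract_crm(text):
    """Extract: read CSV rows into dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def extract_shop(text):
    """Extract: parse the JSON payload into a list of dictionaries."""
    return json.loads(text)


def transform(crm_rows, shop_rows):
    """Transform: map both sources onto one schema and normalize name casing."""
    unified = []
    for row in crm_rows:
        unified.append((int(row["customer_id"]), row["full_name"].strip().title(),
                        row["signup_date"], "crm"))
    for row in shop_rows:
        unified.append((int(row["id"]), row["name"].strip().title(),
                        row["created"], "shop"))
    return unified


def load(conn, rows):
    """Load: write the unified rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers ("
        "customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL, "
        "signup_date TEXT NOT NULL, source TEXT NOT NULL)"
    )
    # INSERT OR REPLACE keeps the load idempotent when the pipeline is re-run.
    conn.executemany("INSERT OR REPLACE INTO customers VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def run_pipeline(conn):
    """Run extract, transform, and load, then return the warehouse contents."""
    rows = transform(extract_crm(CRM_CSV), extract_shop(SHOP_JSON))
    load(conn, rows)
    return conn.execute(
        "SELECT customer_id, name, source FROM customers ORDER BY customer_id"
    ).fetchall()


if __name__ == "__main__":
    warehouse = sqlite3.connect(":memory:")  # stand-in for a real warehouse
    for record in run_pipeline(warehouse):
        print(record)
```

Keying the table on `customer_id` and loading with `INSERT OR REPLACE` means a repeated run overwrites rows instead of duplicating them, which is one simple way to keep the warehouse a single source of truth.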





Why Is Data Engineering Popular Now?

Many of us have heard or read about Gartner's 2017 finding that 85% of big data projects fail. This was mostly caused by the absence of trustworthy data infrastructure: data could hardly be relied upon for important business decisions. 2019 has come and gone, and little has changed. According to IBM's CTO, 87% of data science projects never reach production. Gartner has since reaffirmed its forecast, estimating that about 80% of initiatives would still fail. A NewVantage Partners report produced similar figures.


Why is this, then?

The majority of businesses have gone through digital transformations over the past ten years. As a result, vast amounts of new types of data are now being generated at an ever-increasing rate. While the need for data scientists to make sense of it all was clear early on, it was less clear who would manage this data and guarantee its quality, security, and accessibility so that data scientists could do their work.


In the early days of big data analytics, data scientists were frequently asked to build the infrastructure and data pipelines they needed to complete their tasks. This work was not necessarily part of their job expectations or their skill sets. As a result, data modeling was not carried out properly, data scientists did not use data consistently, and effort was duplicated. These problems made it difficult for companies to get the most out of their data efforts, and many initiatives failed. They also led to a high turnover rate among data scientists that continues to this day.


The Internet of Things, the wave of corporate digital transformations, and the rush to become AI-driven make it abundantly clear that businesses today need many data engineers to lay the groundwork for successful data science efforts.

Because of this, the significance and scope of the data engineer's role will only increase. Businesses need teams whose main responsibility is to process data in a way that lets the business derive value from it.

Skills Needed

  • Foundational software engineering – Agile, DevOps, architecture design, service-oriented architecture.
  • Distributed systems – This includes both software engineering and software architecture skills.
  • Open-source frameworks – Apache Spark, Hadoop, perhaps Hive, MapReduce, Kafka, and others.
  • SQL – This is a database staple and remains that way.
  • Programming – Python has become the favored language for working with data. Java, on the other hand, while still widely sought after, has fallen out of favor with most data scientists and engineers. Scala is another relevant language: Apache Spark is written mostly in Scala, and Kafka in Scala and Java.
  • Pandas – a Python library for cleaning and manipulating data.
  • Visualization/dashboards
  • Cloud platforms – AWS is probably the most prevalent cloud skill set for data engineers. Google Cloud Platform and Microsoft Azure are right behind.
  • Analytics – While analytics is mainly the realm of data scientists, a data engineer needs statistical analysis skills, or at least an understanding of the underlying mathematical and probabilistic principles, in order to shape the data so that it is accessible to the people doing the final analysis.
  • Data Modeling – Data modeling knowledge is important because a data engineer needs to decide how to structure tables and partitions, where to normalize and denormalize data in the warehouse, and how particular attributes will be retrieved.
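The SQL and data modeling points above can be made concrete with a small example. The sketch below is an illustration only, under stated assumptions: the `customers`, `orders`, and `orders_wide` tables, their columns, and the sample rows are all hypothetical, and an in-memory SQLite database stands in for a warehouse. It keeps the source data normalized in two tables and then derives a denormalized table for reporting.

```python
import sqlite3


def build_warehouse():
    """Create normalized tables, then derive a denormalized reporting table."""
    conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse

    # Normalized model: each customer attribute is stored exactly once.
    conn.executescript("""
        CREATE TABLE customers (
            customer_id INTEGER PRIMARY KEY,
            name        TEXT NOT NULL,
            country     TEXT NOT NULL
        );
        CREATE TABLE orders (
            order_id    INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
            order_date  TEXT NOT NULL,
            amount      REAL NOT NULL
        );
    """)
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [(1, "Ada", "UK"), (2, "Grace", "US")],
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?, ?)",
        [
            (10, 1, "2022-09-01", 20.0),
            (11, 1, "2022-09-02", 5.0),
            (12, 2, "2022-09-02", 12.5),
        ],
    )

    # Denormalized model: customer attributes are copied onto every order row,
    # so analysts can aggregate without writing the join themselves.
    conn.execute("""
        CREATE TABLE orders_wide AS
        SELECT o.order_id, o.order_date, o.amount,
               c.name AS customer_name, c.country
        FROM orders AS o
        JOIN customers AS c ON c.customer_id = o.customer_id
    """)
    conn.commit()
    return conn


def revenue_by_country(conn):
    """Aggregate directly on the denormalized table; no join is needed here."""
    return conn.execute(
        "SELECT country, SUM(amount) FROM orders_wide "
        "GROUP BY country ORDER BY country"
    ).fetchall()


if __name__ == "__main__":
    print(revenue_by_country(build_warehouse()))
```

The trade-off is the usual one: the normalized tables avoid duplicated customer data and are easier to keep consistent, while the wide table is simpler and often faster to query but must be rebuilt or updated whenever the underlying tables change.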