Data engineering allows a company to take advantage of the large quantities of data it has generated. In many companies, new data has been produced rapidly for years, but the company has not been able to take full advantage of it.
Creating large data sets does not, by itself, provide immediate value. A company needs to perform data engineering and data science to take full advantage of them.
When data is generated, it is stored in a database, a data lake, or an API backend such as Google Analytics. To manipulate that data, it is often pulled into a data warehouse, which provides fast access to large quantities of data.
Pulling data from a source like a database or data lake into a data warehouse requires a process known as extract and load. Once the data is in the warehouse, it may also undergo a transform, which enriches the data or puts it into a format that is easier to use. From there, the data can be used to build models, interactive dashboards, and Jupyter notebooks.
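The extract-load-transform flow described above can be sketched in a few lines of Python. This is a minimal illustration, not Meltano's implementation: the source rows, table names, and in-memory SQLite "warehouse" are all hypothetical stand-ins for a real database, data lake, or API source.

```python
import sqlite3

def extract():
    # Extract: pull raw records from the source system.
    # Here a hardcoded list stands in for a database query or API call.
    return [
        {"user_id": 1, "event": "page_view", "duration_ms": 1200},
        {"user_id": 1, "event": "click", "duration_ms": 300},
        {"user_id": 2, "event": "page_view", "duration_ms": 800},
    ]

def load(conn, rows):
    # Load: land the raw records in the warehouse unchanged.
    conn.execute(
        "CREATE TABLE raw_events (user_id INTEGER, event TEXT, duration_ms INTEGER)"
    )
    conn.executemany(
        "INSERT INTO raw_events VALUES (:user_id, :event, :duration_ms)", rows
    )

def transform(conn):
    # Transform: derive an enriched, analysis-friendly table inside the
    # warehouse (here, per-user activity aggregates).
    conn.execute(
        """
        CREATE TABLE user_activity AS
        SELECT user_id,
               COUNT(*) AS event_count,
               SUM(duration_ms) AS total_duration_ms
        FROM raw_events
        GROUP BY user_id
        """
    )

conn = sqlite3.connect(":memory:")
load(conn, extract())
transform(conn)
print(conn.execute("SELECT * FROM user_activity ORDER BY user_id").fetchall())
# [(1, 2, 1500), (2, 1, 800)]
```

Note that the transform runs inside the warehouse after loading, which is what distinguishes ELT (the pattern described above) from the older ETL pattern, where data is transformed before it reaches the warehouse.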
The data engineering lifecycle has many different components, which is why data engineering can often be intimidating to a company that is trying to make use of its data. Meltano is a project with the goal of providing a system of conventions for managing the data engineering lifecycle. Meltano was started by GitLab, and the project has some strategic similarities to GitLab itself.
Danielle Morrill is the general manager of Meltano at GitLab. She joins the show to discuss the world of data engineering and the architecture of Meltano. We touch on the different components of a data engineering pipeline, and the most acute pain points for data engineers.
The post Meltano: Data Engineering Lifecycle with Danielle Morrill appeared first on Software Engineering Daily.