
Case Study: How Cox Automotive Solved Data Drift and ETL Problems

There is so much “data drift” these days that data analysts spend only about one-fifth of their time actually looking at the data. The rest goes to “wrangling it into shape and getting it from where it is to where it will be used.” At the Enterprise Data World Conference, Patterson of StreamSets and Michael Gay, a technical architect at Cox Automotive, talked about how StreamSets helped Cox Automotive deal with data drift and build an enterprise data lake.

Cox Automotive comprises many different companies across the automotive field, including Kelley Blue Book, Autotrader, VinSolutions, NextGear, and businesses in China and the UK. Patterson said:

All kinds of things: we buy and sell cars, move them around, and handle maintenance, scheduling, and much more.

The problem is that all of this data drifts.

Cox has a big advantage because it can share data across different parts of the same industry; as its website says, “data is the point of integration for all 25+ companies.” But the way that data was being shared was far from ideal. Gay described the situation:

Kelley Blue Book (KBB) would ask Autotrader for information about cars. Autotrader, in turn, would ask VinSolutions for a dataset. Then KBB would also ask VinSolutions for the same dataset, or Autotrader would ask KBB for VinSolutions’ dataset. So there was this big spider web of overlapping copies, but they were never the same.

Extract, transform, and load (ETL) processes would change or modify the data along the way, opening the door to three types of data drift: structure drift, semantic drift, and infrastructure drift. Patterson told us:

In traditional ETL, you would build a map of all the fields that came in and how they had to be transformed. If you were lucky, that map would stay up to date for a few weeks at most. Since then, the pace of change has only sped up.
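To make that concrete, here is a minimal sketch (in Python) of the kind of hand-built field map Patterson is describing. Every field name and transformation below is hypothetical, invented purely for illustration:

```python
# A hand-maintained field map of the kind traditional ETL relies on.
# The moment an upstream system adds, renames, or retypes a field,
# this map is silently out of date. (All names here are hypothetical.)

FIELD_MAP = {
    "cust_name":  ("customer_name", str.strip),  # trim stray whitespace
    "cust_zip":   ("postal_code",   str),        # assumed numeric for years
    "sale_price": ("price_usd",     float),      # assumes a plain number
}

def transform(record: dict) -> dict:
    """Apply the static field map to one incoming record."""
    out = {}
    for src_field, (dest_field, convert) in FIELD_MAP.items():
        # A renamed source field raises KeyError; a retyped one may
        # raise ValueError, or worse, convert cleanly but wrongly.
        out[dest_field] = convert(record[src_field])
    return out

print(transform({"cust_name": " Jane Doe ", "cust_zip": "30301", "sale_price": "7999"}))
```

The map works until the source drifts; nothing in it can tell the pipeline that the world has changed.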

Data Sources

At the same time, the number of data sources has grown, Patterson said. Data now comes from devices, log files, and click-streams, and it is far more diverse than the traditional databases inside the enterprise. When new latitude and longitude columns are added to a customer address, for example, the schema changes; this is what Patterson calls “structure drift.”
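A pipeline cannot stop upstream systems from changing, but it can at least notice. Here is a minimal sketch of detecting structure drift in Patterson’s latitude/longitude scenario; the expected schema and the record are assumptions for illustration, not Cox’s actual data:

```python
# Detect structure drift: flag fields that appear in an incoming record
# but were not part of the schema the pipeline was built against.
# (The schema and record contents are invented for illustration.)

EXPECTED_FIELDS = {"customer_id", "street", "city", "state", "zip"}

def unknown_fields(record: dict) -> set:
    """Return any fields the downstream mapping does not know about."""
    return set(record) - EXPECTED_FIELDS

record = {
    "customer_id": 42,
    "street": "1 Main St",
    "city": "Atlanta",
    "state": "GA",
    "zip": "30301",
    "latitude": 33.7490,    # new columns added upstream --
    "longitude": -84.3880,  # this is structure drift
}

drifted = unknown_fields(record)
if drifted:
    print(f"structure drift detected, unknown fields: {sorted(drifted)}")
```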

“Semantic drift,” he said, “is a little bit more subtle.” That’s when the structure doesn’t change, but your interpretation of the data does. Zip codes held in a numeric field are a classic example: when a company starts selling outside of the United States, the field has to become alphanumeric. “When you move that data around, what’s going to happen? Does it still work, or has some component made an assumption about how the data should look?” Patterson asked.
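The zip code example translates directly into code. The sketch below contrasts the old numeric assumption with a safer treatment of postal codes as opaque text; the sample values are invented:

```python
# Semantic drift: the field is still there, but its meaning has changed.
# A component that assumed numeric US zip codes breaks, or silently
# mangles data, once international alphanumeric codes arrive.

def zip_as_int(zip_code: str) -> int:
    """The old assumption: zip codes are numbers."""
    return int(zip_code)  # raises ValueError for "SW1A 1AA"

def zip_as_text(zip_code: str) -> str:
    """The safer treatment: postal codes are opaque text."""
    return zip_code.strip().upper()

for code in ["30301", "SW1A 1AA"]:  # a US zip, then a UK postcode
    try:
        print(zip_as_int(code))
    except ValueError:
        print(f"semantic drift: {code!r} no longer fits the numeric assumption,"
              f" storing as text: {zip_as_text(code)!r}")
```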

“Infrastructure drift” happens when some component in the chain is upgraded and the log files it produces change as a result. “The best case is that data drifts, something breaks, and you know about it. The worst case is that it just slowly changes the data.”
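As a rough illustration of that failure mode, the sketch below parses a hypothetical log whose format gained a hostname field after an upgrade. Both formats are invented, and the fallback logic itself shows how easily such heuristics can misread ambiguous lines:

```python
# Infrastructure drift: a component upgrade changes its log format.
# A parser pinned to one format either breaks loudly (the best case)
# or keeps matching and quietly misreads fields (the worst case).
import re

OLD_FORMAT = re.compile(r"(?P<ts>\S+) (?P<level>\w+) (?P<msg>.*)")
NEW_FORMAT = re.compile(r"(?P<ts>\S+) (?P<host>\S+) (?P<level>\w+) (?P<msg>.*)")

def parse_line(line: str) -> dict:
    """Try the post-upgrade format first, then fall back to the old one.
    Note: an old-format line with a long message can still match the new
    pattern, which is exactly the "slowly changes the data" hazard."""
    for pattern in (NEW_FORMAT, OLD_FORMAT):
        match = pattern.fullmatch(line.rstrip("\n"))
        if match:
            return match.groupdict()
    raise ValueError(f"unparseable log line: {line!r}")

print(parse_line("2017-04-05T10:00:00Z INFO started"))         # pre-upgrade
print(parse_line("2017-04-05T10:00:00Z web-01 INFO started"))  # post-upgrade
```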

Filling the Data Lake

To keep the data assets from all of the business units in one place, Cox built a data lake, a single place where all of these teams come to explore and share their data. It runs on a Cloudera Hadoop cluster and is accessed through Hive, Spark, or MapReduce. Getting data from 25+ different companies into one place involves a lot of work, though. As Gay put it:

There is one Oracle system in Autotrader that has more than 1,600 tables. If we were to write a custom Sqoop job to get this data, we’d have to do it 1,600 times. Our best estimate was that it would take a developer about six hours to build the full workflow for each one.
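To get a feel for that scale, here is a back-of-the-envelope sketch that generates Sqoop import commands in a loop. The JDBC URL, credentials path, and table names are placeholders rather than Cox’s systems, and even scripted generation leaves untouched the per-table tuning that makes each workflow take hours:

```python
# The scale problem Gay describes: one Sqoop import job per table means
# roughly 1,600 near-identical jobs for a single Oracle source. This
# only generates the commands; connection details are placeholders.

TABLES = ["VEHICLES", "DEALERS", "LISTINGS"]  # imagine 1,600 of these

def sqoop_import_cmd(table: str) -> str:
    return (
        "sqoop import"
        " --connect jdbc:oracle:thin:@db.example.com:1521/ORCL"
        " --username etl_user --password-file /user/etl/.pw"
        " --table {t}"
        " --target-dir /data/lake/raw/autotrader/{t}"
        " --num-mappers 4"
    ).format(t=table)

for table in TABLES:
    print(sqoop_import_cmd(table))
```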

Gay showed a picture of the data lake with a tiny pinprick-sized dot on it to illustrate how much data they had managed to load so far. After extensive testing, they concluded that their custom tool “just didn’t work, and we couldn’t do half of the things we wanted to do. So, we went looking for a new way to get data in.”

In their search for the right ingestion tool for their Hadoop platform, they tried eight candidates, which Gay evaluated against one another. Each tool was ranked on how well it supported strategy, data architecture, operations management, development, and quality and monitoring features. The candidates included Apache NiFi, Gobblin, RedPoint Global, Informatica, and StreamSets Data Collector. Judged on how it would perform in those scenarios, Data Collector came out on top.

StreamSets Data Collector: The Answer

Patterson said that StreamSets was founded just over two years earlier “with the goal of making it easier to move data between systems, with a focus on big data.” The team’s backgrounds include Cloudera, Informatica, Apache, Salesforce, Elastic, and Facebook. Data Collector, StreamSets’ first product, was designed to run complex dataflows between any two systems.

Data Collector is aimed at “data engineers, data scientists, and developers” who want to build pipelines that move data from where it is to where they want it to be, with “optional transformations along the way,” Patterson said. It is a Java application: users connect to it through a web browser and build their pipelines visually.

As Gay explained: “We wanted to be able to separate acquisition and ingestion. That meant we could troubleshoot and find problems faster without breaking ingestion. It turns into a black box where we always do the same thing, no matter what kind of data we have.”
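As an illustration of that separation (a sketch of the pattern, not Cox’s actual implementation), the code below uses an in-memory queue as a stand-in for whatever buffer sits between a source-specific acquisition stage and a source-agnostic ingestion stage:

```python
# Separating acquisition from ingestion: the acquirer handles the quirks
# of each source; the ingester is a "black box" that always does the same
# thing, regardless of where a record came from. (The queue, the source
# name, and the record shapes are stand-ins for illustration.)
import json
import queue

buffer: queue.Queue = queue.Queue()

def acquire(raw_lines: list) -> None:
    """Source-specific stage: parse, tag, and buffer records."""
    for line in raw_lines:
        record = json.loads(line)          # parsing specific to this source
        record["_source"] = "autotrader"   # provenance tag
        buffer.put(record)

def ingest() -> None:
    """Source-agnostic stage: drain the buffer into the lake."""
    while not buffer.empty():
        record = buffer.get()
        print(f"writing to lake: {record}")  # stand-in for the HDFS write

# If acquisition breaks, ingestion is untouched; fix the acquirer, replay.
acquire(['{"vin": "1HGCM82633A004352", "price": 7999}'])
ingest()
```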
