IT executives need to decide how far data should travel before it’s streamlined and analyzed. The two most practical choices each have strengths and drawbacks.
On the other hand, having to sift through raw data can slow down analysis, and data lakes inevitably store data that ultimately isn’t needed.
For Patricia Florissi, global chief technology officer for sales and distinguished engineer at EMC, the pros outweigh the cons.
“You should be able to do analytics without moving the data,” she says.
In its data lake solutions, EMC stores raw data from different sources in multiple formats. The approach means that analysts have access to more information and can discover things that might get lost if the data were cleaned first or some of it were thrown away.
Florissi adds that big analytics efforts might require multiple data lakes.
Media conglomerate AOL also uses data lakes, says James LaPlaine, the company’s chief information officer. The company handles billions of transactions per day, and “the time it takes to copy huge data sets is a problem,” he says. Leaving data in its native format and moving it from the point of capture directly to the public cloud avoids the cost of copying it over the internal network.
“We want all of our rich data in one place so we can have a single source of truth across the company.”
What Kind of Database to Use
It’s important to choose the right database for an analytics project, with factors like data quantity, formatting, and latency all playing a role.
The project where Intel switched databases involved an advanced query “using data from a bunch of noncorrelated sources,” Safa says. Running on an SQL database, the query took four hours. On an in-memory database, the same query took 10 minutes. But he notes that the speedup doesn’t make in-memory the right choice for every application. The decision always comes back to the business goals for the task at hand.
As a starting point, Safa says, consider whether a project is looking for patterns or requires pinpoint accuracy.
Distributed data platforms like Hadoop, which can store data in different formats, work well for projects focused on finding trends, he says. In these cases, a few inaccurate data points won’t materially change the result.
On the other hand, he says, “If you are trying to determine where specific materials are at a given moment in your manufacturing process, then you need 100 percent accuracy with no latency.”
That requires a database with more structure or controls, tuned for real-time results. Depending on its specific needs, a company might choose an in-memory data-processing framework or a performance-focused NoSQL database. Although many analytics database types have overlapping capabilities, their features are materially different.
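The difference between hunting for trends and needing pinpoint accuracy can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: the data set, the noise level, and the `ols_slope` helper are all hypothetical, not anything Intel or Safa describes. It fits a simple trend line to 10,000 synthetic readings, corrupts five of them, and then compares how far the overall trend moves with how wrong a lookup of a single corrupted record becomes.

```python
import random
import statistics


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical data: 10,000 readings that rise by about 2 units per step,
# plus random noise.
n = 10_000
xs = list(range(n))
readings = [2.0 * x + rng.gauss(0, 50) for x in xs]

# Corrupt five readings (0.05 percent of the data) by zeroing them out.
corrupted = list(readings)
bad_indexes = rng.sample(range(n), 5)
for i in bad_indexes:
    corrupted[i] = 0.0

# Trend analysis: compare the fitted slope before and after corruption.
clean_slope = ols_slope(xs, readings)
corrupted_slope = ols_slope(xs, corrupted)
slope_shift = abs(clean_slope - corrupted_slope)

# Point lookup: how wrong is the worst single corrupted record?
worst_lookup_error = max(abs(readings[i] - corrupted[i]) for i in bad_indexes)

print(f"shift in fitted trend: {slope_shift:.5f}")
print(f"worst single-record error: {worst_lookup_error:.1f}")
```

In a run like this, the fitted trend barely moves, while a query for any one of the corrupted records returns a value that is simply wrong. That is the practical reason a trend-finding project can tolerate a looser data store and a materials-tracking project cannot.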
“Data classification is...labor-intensive, but it’s an important thing to get right.”