I recently caught up with Haoyuan Li, Founder and CTO of Alluxio, to examine the next critical piece in the data evolution: Data Orchestration, the missing piece in building hybrid and multi-cloud analytical architectures. Haoyuan Li received his Ph.D. in Computer Science from UC Berkeley's AMPLab. At the AMPLab, he created the Alluxio (formerly Tachyon) open source data orchestration system, co-created Apache Spark Streaming, and became an Apache Spark founding committer. Before UC Berkeley, he earned an M.S. from Cornell University and a B.S. from Peking University, both also in Computer Science.
insideBIGDATA: What are some of the factors that make creating high performance hybrid/multi-cloud data analytics systems so challenging? What are some of the most common stumbling blocks?
Haoyuan Li: The first challenge we see a lot is data locality for hot and warm data, which data-driven applications need in order to perform well. In most hybrid or multi-cloud environments, the data is stored remotely from the cloud compute cluster, so both the latency and the bandwidth of reading or writing data can become the bottleneck for application performance.
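To make the latency and bandwidth point concrete, here is a back-of-the-envelope sketch. It assumes a deliberately simple model that is not from the interview: every request pays one round-trip latency, and all bytes share a single link. The function name and all the numbers are hypothetical.

```python
def estimated_read_seconds(size_bytes, latency_s, bandwidth_bytes_per_s, requests=1):
    """Rough read time: per-request latency plus transfer time over one link.

    Illustrative model only; real systems overlap requests and vary in throughput.
    """
    return requests * latency_s + size_bytes / bandwidth_bytes_per_s


# Hypothetical workload: 10 GB read in 1,000 requests.
size = 10e9

# Assumed remote store: 50 ms round trip, 125 MB/s (about 1 Gbit/s).
remote = estimated_read_seconds(size, latency_s=0.05, bandwidth_bytes_per_s=125e6, requests=1000)

# Assumed storage close to compute: 0.5 ms round trip, 1.25 GB/s.
local = estimated_read_seconds(size, latency_s=0.0005, bandwidth_bytes_per_s=1.25e9, requests=1000)

print(f"remote: {remote:.1f} s, local: {local:.1f} s")
```

Under these assumed numbers both terms matter: latency dominates when there are many small requests, and bandwidth dominates for large sequential reads.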
Second is data accessibility. Different applications may use different APIs to interact with data. For example, many analytics applications use HDFS-compatible APIs, while machine learning applications use the POSIX API. In a cloud environment, however, the most commonly provided API is the S3-compatible API. As a result, the applications either cannot access the data or need to be rewritten to access it.
insideBIGDATA: What is Data Orchestration and why do you claim it’s the next critical piece in the data evolution?
Haoyuan Li: A data orchestration platform abstracts data access across storage systems, virtualizes all the data, and presents the data via standardized APIs with a global namespace to data-driven applications.
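The idea of a global namespace over several storage systems can be sketched in a few lines. This is a minimal illustration, not Alluxio's implementation or API: the `GlobalNamespace` class, its `mount` and `resolve` methods, and the example URIs are all hypothetical. It only shows how one unified path can be mapped onto whichever backing store holds the data.

```python
from dataclasses import dataclass, field


@dataclass
class GlobalNamespace:
    """Hypothetical mount table: unified path prefix -> backing store URI."""

    mounts: dict = field(default_factory=dict)

    def mount(self, prefix: str, store_uri: str) -> None:
        # Normalize so "/cloud/" and "/cloud" name the same mount point.
        self.mounts[prefix.rstrip("/") or "/"] = store_uri.rstrip("/")

    def resolve(self, path: str) -> str:
        # Choose the longest mounted prefix that covers the path.
        best = None
        for prefix in self.mounts:
            covers = path == prefix or path.startswith(prefix.rstrip("/") + "/")
            if covers and (best is None or len(prefix) > len(best)):
                best = prefix
        if best is None:
            raise KeyError(f"no mount covers {path}")
        suffix = path if best == "/" else path[len(best):]
        return self.mounts[best] + suffix


# Applications see one namespace; the example stores below are made up.
ns = GlobalNamespace()
ns.mount("/onprem", "hdfs://namenode:8020/warehouse")
ns.mount("/cloud", "s3://example-bucket/")

print(ns.resolve("/onprem/sales/2019.parquet"))
print(ns.resolve("/cloud/logs/a.json"))
```

In this sketch the application only ever uses paths such as `/onprem/...` or `/cloud/...`; moving data between stores would mean changing a mount, not the application.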
We’re seeing an explosion in the use of data-driven compute frameworks and data storage systems to power today’s “data-driven” organization. Without these two pieces, you just can’t get the immediate insights that today’s organization requires. And that’s why Data Orchestration is the next critical piece of this data evolution. Organizations need a data platform that enables their employees to leverage data effectively and efficiently to help them make business decisions. This requires applications to be able to interact with the data they need across the organization with the least amount of human intervention. Existing storage systems and compute frameworks can’t do this, but a data orchestration platform can.
insideBIGDATA: What are some key components of a data orchestration layer?
Haoyuan Li: The key components are:
- Metadata management system
- Data management system
- Data orchestration policy engine
- Data operator
insideBIGDATA: Where do you see users succeed in their attempts to bring data closer to compute whether in the cloud, on-prem or hybrid cloud environments? What are some of the use cases?
Haoyuan Li: We have seen many leading companies in the financial services and telecommunications sectors do this successfully. For example, some users keep their data on-premises while running applications in the cloud. This is a typical hybrid environment: users want to intelligently burst their workloads from their HDFS clusters to the cloud, but this typically means managing data copies and making application changes. A data orchestration platform is a “zero-copy” burst solution, meaning that when compute capacity is limited you can intelligently burst to the cloud without needing those data copies or app changes.
insideBIGDATA: You talk about data orchestration as being analogous to container orchestration. What do you mean by this?
Haoyuan Li: A data orchestration system gives data-driven applications access to data sources by letting them read and write data without worrying about where and how the data is stored. In the same way, a container orchestration system gives applications compute resources by letting them run with the machine resources they need, without worrying about where and how those resources are allocated.
insideBIGDATA: Where can we learn more about data orchestration?
Haoyuan Li: Alluxio open source software is an implementation of a data orchestration system for analytics and machine learning workloads. You can learn more about data orchestration through Alluxio’s architecture and case studies. In addition, we are hosting the first Data Orchestration Summit on Nov. 7 in the San Francisco Bay Area. Many pioneers and practitioners will present at the conference about their experience building data orchestration platforms.