Machine learning doesn’t always have to be an abstruse technology. The multi-parameter and hyper-parameter methodology of complex deep neural networks, for example, is only one type of this cognitive computing manifestation.
There are other machine learning varieties (and even some involving deep neural networks) in which the results of models, how they were determined, and which intricacies influenced them, are much more transparent.
It all depends on how well organizations understand their data provenance.
Comprehending just about everything that happened to training data for models, as well as that for the production data models encounter, is integral to explaining, refining, and improving their results. Organizations that do so dramatically increase the business value such models produce.
More importantly, perhaps, they also further this technology’s fairness, accountability and transparency, making it more reliable and better for society as a whole.
“That’s why you need fine-grained understanding that is upstream and downstream of what’s happening with the data, to be able to do machine learning responsibly going forward,” acknowledged Joel Minnick, Databricks VP of Marketing.
Cataloging Data Lineage
Information about models’ training data and production data might involve data sources, transformation, specific integration techniques, and more. Accomplished data catalog solutions can capture this data—in real-time—so organizations can always look back at what took place to understand how models are performing. Thus, data scientists are “able to get the context around this set of data that I’m going to use in my model,” Minnick explained. “You know, where did this data come from? What other data did we get from it? When was it generated? So that I have a lot better understanding of how should I use this data.”
Data lineage is comprised of metadata, which data catalogs specialize in storing about datasets. Catalogs also enable users to impute tags and other descriptors as additional metadata, some of which is helpful for data provenance and establishing trust in data. Effective ones do so while connecting to a range of platforms, including those for data scientists, data engineers, and end users, via what Minnick characterized as an “API-driven service.”
Data Governance for Data Science
The improved traceability about how training data and operational data is impacting machine learning model results extends data provenance into the realm of data science. Consequently, this dimension of data governance naturally expands to numerous data science platforms for creating and deploying these models. “It’s about tables and files for sure, but also, being able to govern notebooks,” Minnick remarked. “Being able to govern dashboards. These are more modern ways that data gets produced and consumed.” This sentiment is particularly true for data scientists building models in notebooks, as well as those monitoring the results of their outputs via dashboards.
Still, simply garnering this data lineage in data catalogs that are connected to a robust data science toolset via APIs is only aspect of transparently utilizing machine learning. Employing this information to improve models’ outputs involves calibrating them according to what’s ascertained through data lineage. For example, detailed traceability of how data were manipulated for models enables data scientists to “be able to understand how I can separate out some data if some of it is problematic,” Minnick noted.
Logically, those employees could then apply this knowledge to see why specific data types were problematic, so they could either correct them or boost models’ accuracy by removing them altogether. According to Minnick, more organizations are realizing this benefit of applying data lineage to model results “partly because of the rise of machine learning and AI inside of every industry today. It’s becoming more and more prevalent. When we released our AutoML product last year, one of the terms we tended to use was glass box. It’s the same kind of idea [for data provenance].”
Regulatory Consequences and More
Organizations also enhance their regulatory compliance capabilities with the surer understanding of the results of adaptive cognitive computing models that data lineage provides. Industries like finance, healthcare, and others are highly regulated, requiring companies to clearly illustrate how they arrived at decisions for their clients. Data provenance creates a figurative road map of data’s journey for building machine learning models and comprehending their results—which is invaluable for actually proving compliance to regulators.
This information also aids internal audits, allowing companies to realize which regulatory areas they’re remiss in so they can redress them to prevent violations. “Being able to demonstrate to regulators with that very granular lineage information, and again, not just across tables, but across all the different places that I use my data inside of the wider organization, is very important,” Minnick asserted. When this advantage is combined with data provenance’s propensity for increasing machine learning model accuracy, it’s quite possible this method will soon become a best practice for deploying this technology.
About the Author
Jelani Harper is an editorial consultant servicing the information technology market. He specializes in data-driven applications focused on semantic technologies, data governance and analytics.
Sign up for the free insideBIGDATA newsletter.
Join us on Twitter: @InsideBigData1 – https://twitter.com/InsideBigData1