Why data governance is essential for enterprise AI
The recent success of artificial intelligence-based large language models has pushed the market to think more ambitiously about how AI could transform many enterprise processes. However, consumers and regulators have also become increasingly concerned with the safety of both their data and the AI models themselves. Safe, widespread AI adoption will require us to embrace AI governance across the data lifecycle in order to provide confidence to consumers, enterprises, and regulators. But what does this look like?
For the most part, artificial intelligence models are fairly simple: they take in data and then learn patterns from this data to generate an output. Complex large language models (LLMs) like ChatGPT and Google Bard are no different. Because of this, when we look to manage and govern the deployment of AI models, we must first focus on governing the data that the AI models are trained on. This data governance requires us to understand the origin, sensitivity, and lifecycle of all the data that we use. It is the foundation for any AI governance practice and is crucial in mitigating a number of enterprise risks.
Risks of training LLMs on sensitive data
Large language models can be trained on proprietary data to fulfill specific enterprise use cases. For example, a company could take ChatGPT and create a private model that is trained on the company’s CRM sales data. This model could be deployed as a Slack chatbot to help sales teams find answers to queries like “How many opportunities has product X won in the last year?” or “Update me on product Z’s opportunity with company Y”.
You could easily imagine these LLMs being tuned for any number of customer service, HR or marketing use cases. We might even see these augmenting legal and medical advice, turning LLMs into a first-line diagnostic tool used by healthcare providers. The problem is that these use cases require training LLMs on sensitive proprietary data. This is inherently risky. Some of these risks include:
1. Privacy and re-identification risk
AI models learn from training data, but what if that data is private or sensitive? A considerable amount of data can be directly or indirectly used to identify specific individuals. So, if we are training an LLM on proprietary data about an enterprise's customers, we can run into situations where use of that model could leak sensitive information.
2. In-model learning data
Many simple AI models have a training phase and then a deployment phase during which training is paused. LLMs are a bit different. They take the context of your conversation with them, learn from that, and then respond accordingly.
This makes the job of governing model input data far more complex, as we don't just have to worry about the initial training data; we must also worry about every time the model is queried. What if we feed the model sensitive information during conversation? Can we identify the sensitivity and prevent the model from using this in other contexts?
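One pragmatic mitigation is to screen prompts for sensitive content before they ever reach the model. The sketch below is a minimal, hypothetical example: the regex patterns and the `redact_prompt` helper are illustrative assumptions, and a real deployment would rely on a dedicated PII-detection service rather than hand-rolled patterns.

```python
import re

# Hypothetical patterns for two common PII types; a production system
# would use a proper PII-detection service, not hand-rolled regexes.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact_prompt(prompt: str) -> tuple[str, list[str]]:
    """Replace detected PII with placeholders and report what was found."""
    findings = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(prompt):
            findings.append(label)
            prompt = pattern.sub(f"[{label.upper()} REDACTED]", prompt)
    return prompt, findings

clean, found = redact_prompt("Contact jane.doe@example.com, SSN 123-45-6789.")
```

Logging `found` alongside the conversation also gives you the audit trail needed to show which prompts carried sensitive data.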
3. Security and access risk
To some extent, the sensitivity of the training data determines the sensitivity of the model. We have well-established mechanisms for controlling access to data, such as monitoring who is accessing what data and dynamically masking data based on the situation, but AI deployment security is still developing. Although solutions are popping up in this space, we still can't entirely control the sensitivity of model output based on the role of the person using the model (e.g., having the model identify that a particular output could be sensitive and then reliably change that output based on who is querying the LLM). Because of this, these models can easily leak any type of sensitive information involved in model training.
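The role-aware output control described above can be sketched as a post-processing filter between the model and the user. Everything here is an illustrative assumption (the role name, the term list, and the naive `classify_sensitivity` stand-in), not a real product API; a real classifier would be a trained model or policy engine.

```python
# Hypothetical list of phrases a policy would flag as sensitive.
SENSITIVE_TERMS = {"acme corp deal size", "salary"}

def classify_sensitivity(text: str) -> bool:
    """Naive stand-in for a real output-sensitivity classifier."""
    return any(term in text.lower() for term in SENSITIVE_TERMS)

def filter_output(model_output: str, role: str) -> str:
    """Withhold sensitive responses from callers outside an allowed role."""
    if role != "sales_manager" and classify_sensitivity(model_output):
        return "[Withheld: response may contain sensitive account data]"
    return model_output
```

The design point is that the check happens outside the model, where access-control decisions can be enforced and audited deterministically.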
4. Intellectual Property risk
What happens when we train a model on every song by Drake and then the model starts generating Drake rip-offs? Is the model infringing on Drake's rights? Can you prove whether the model is somehow copying your work?
This problem is still being figured out by regulators, but it could easily become a major issue for any form of generative AI that learns from artistic intellectual property. We expect this will lead to major lawsuits in the future, a risk that will have to be mitigated by sufficiently monitoring the IP status of any data used in training.
5. Consent and DSAR risk
One of the key ideas behind modern data privacy regulation is consent. Customers must consent to the use of their data, and they must be able to request that their data be deleted. This poses a unique problem for AI usage.
If you train an AI model on sensitive customer data, that model then becomes a possible exposure source for that sensitive data. If a customer were to revoke company usage of their data (a requirement for GDPR) and if that company had already trained a model on the data, the model would essentially need to be decommissioned and retrained without access to the revoked data.
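At a minimum, honoring a revocation means rebuilding the training set without the withdrawn records before retraining. A toy sketch of that filtering step, with assumed field names:

```python
# Sketch: exclude records whose consent was revoked before retraining.
# The "customer_id" field and record shape are illustrative assumptions.

def filter_training_set(records, revoked_customer_ids):
    """Return the records safe to train on, plus a count of what was dropped."""
    kept = [r for r in records if r["customer_id"] not in revoked_customer_ids]
    dropped = len(records) - len(kept)
    return kept, dropped

records = [
    {"customer_id": "c1", "spend": 120.0},
    {"customer_id": "c2", "spend": 75.5},
]
kept, dropped = filter_training_set(records, revoked_customer_ids={"c2"})
```

The hard part in practice is not the filter itself but knowing which records fed which model version, which is exactly what a data-lineage record provides.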
Making LLMs useful as enterprise software requires governing the training data so that companies can trust the safety of the data and have an audit trail for the LLM’s consumption of the data.
Data governance for LLMs
The best breakdown of LLM architecture I've seen comes from this article by a16z (image below). It is really well done, but as someone who spends all my time working on data governance and privacy, I'd argue that the top-left section of "contextual data → data pipelines" is missing something: data governance.
If you add in IBM data governance solutions, the top left will look a bit more like this:
The data governance solution powered by IBM Knowledge Catalog offers several capabilities to help facilitate advanced data discovery, automated data quality and data protection. You can:
- Automatically discover data and add business context for consistent understanding
- Create an auditable data inventory by cataloguing data to enable self-service data discovery
- Identify and proactively protect sensitive data to address data privacy and regulatory requirements
The last step above is one that is often overlooked: the implementation of privacy-enhancing techniques. How do we remove the sensitive content before feeding data to AI? You can break this into three steps:
1. Identify the sensitive components of the data that need to be taken out (hint: this is established during data discovery and is tied to the "context" of the data)
2. Remove the sensitive data in a way that still allows the data to be used (e.g., maintains referential integrity and keeps statistical distributions roughly equivalent)
3. Keep a log of what happened in steps 1 and 2 so this information follows the data as it is consumed by models. That tracking is useful for auditability.
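The steps above can be sketched in a few lines of Python: deterministic pseudonymization preserves referential integrity (the same input always maps to the same token, so joins across tables still work), while an audit log records what was protected and how. The salt value, field names, and log structure are illustrative assumptions, not a description of any particular product.

```python
import datetime
import hashlib

AUDIT_LOG = []  # in practice this would live alongside the data catalog

def pseudonymize(value: str, salt: str = "demo-salt") -> str:
    """Deterministic hash token: the same input always yields the same
    output, preserving referential integrity across tables. The salt
    here is a placeholder and would be a managed secret in practice."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

def protect(record: dict, sensitive_fields: list[str]) -> dict:
    """Replace sensitive fields with tokens and log the operation."""
    out = dict(record)
    for field in sensitive_fields:
        if field in out:
            out[field] = pseudonymize(str(out[field]))
    AUDIT_LOG.append({
        "fields_protected": sensitive_fields,
        "method": "sha256-pseudonymization",
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })
    return out

protected = protect({"name": "Jane Doe", "region": "EMEA"}, ["name"])
```

Because the hash is deterministic, two tables that both tokenize `"Jane Doe"` can still be joined on the tokenized column, which is what "still allows for the data to be used" means in step 2.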
Build a governed foundation for generative AI with IBM watsonx and data fabric
With IBM watsonx, IBM has made rapid advances to place the power of generative AI in the hands of ‘AI builders’. IBM watsonx.ai is an enterprise-ready studio, bringing together traditional machine learning (ML) and new generative AI capabilities powered by foundation models. Watsonx also includes watsonx.data — a fit-for-purpose data store built on an open lakehouse architecture. It is supported by querying, governance and open data formats to access and share data across the hybrid cloud.
A strong data foundation is critical for the success of AI implementations. With IBM data fabric, clients can build the right data infrastructure for AI using data integration and data governance capabilities to acquire, prepare and organize data before it can be readily accessed by AI builders using watsonx.ai and watsonx.data.
IBM offers a composable data fabric solution as part of an open and extensible data and AI platform that can be deployed on third party clouds. This solution includes data governance, data integration, data observability, data lineage, data quality, entity resolution and data privacy management capabilities.
Get started with data governance for enterprise AI
AI models, particularly LLMs, will be one of the most transformative technologies of the next decade. As new AI regulations impose guidelines around the use of AI, it is critical not just to manage and govern the AI models themselves but, equally importantly, to govern the data fed into them.