On the Necessity and Challenges of Safety Guardrails for Deep Learning Models
In recent years, deep learning models, especially transformer and diffusion models, have become the powerhouses of the AI world, demonstrating superhuman or near-superhuman performance in tasks such as natural-language processing and image generation. Yet that same expressive capacity brings trouble: the more capable these models become, and the more deeply they are woven into everyday systems, the more harm they can cause when they misbehave. In this article we look at why deep learning models need safety guardrails, and at the particular challenge of explaining their outputs.
My Favorite Human Analogy: Understanding Minds and Brains
You and I make decisions about each other all the time: I decide how to behave around you, and you predict how I might behave, even though neither of us has much insight into the other's inner workings. And we are often remarkably successful at building trust and safety that way, thanks to societal conventions, laws and governance, the 'rules of the game' that create robust norms for behaviour. The same dynamic applies to machine learning systems. Much of the time we cannot know for sure what another person is thinking or intending, and much of the time we do not know the internal logic by which an AI system arrived at its decision or output. Robust safety frameworks are how we manage, and even thrive with, those risks.
Let’s now talk about The Necessity of Safety Guardrails
Avoiding Toxic or Offensive Output: Constraining what a model is allowed to emit helps prevent biased, offensive or misleading outputs. This is one of the clearest motivations for safety guardrails around transformer models such as GPT-4 and around diffusion models that generate images.
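To make this concrete, a minimal guardrail can be a post-generation filter that screens output against a blocklist or a moderation classifier before returning it. The sketch below is exactly that, with `generate_text` and `toxicity_score` as hypothetical placeholders for a real generation model and a real moderation classifier.

```python
# Minimal sketch of an output guardrail: generate, then screen before returning.
# `generate_text` and `toxicity_score` are hypothetical placeholders for a real
# generation model and a real moderation classifier.

BLOCKLIST = {"blocked_term_one", "blocked_term_two"}
TOXICITY_THRESHOLD = 0.8

def toxicity_score(text: str) -> float:
    """Placeholder: a real system would call a trained moderation classifier."""
    return 0.0

def generate_text(prompt: str) -> str:
    """Placeholder: a real system would call the underlying language model."""
    return "model output for: " + prompt

def guarded_generate(prompt: str) -> str:
    output = generate_text(prompt)
    lowered = output.lower()
    if any(term in lowered for term in BLOCKLIST):
        return "[output withheld: blocked term detected]"
    if toxicity_score(output) >= TOXICITY_THRESHOLD:
        return "[output withheld: flagged by moderation classifier]"
    return output

print(guarded_generate("Tell me about safety guardrails."))
```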
Fairness and Equity: We should ensure that AI models are designed to mitigate, rather than preserve or amplify, the biases present in their training data. Safety frameworks provide a way of enforcing fairness, so that models do not produce unfair decisions that harm individuals on the basis of attributes such as race, gender, religion or sexual orientation. Fairness constraints and regular audits can help achieve this aim.
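One common audit is to compare the model's positive-decision rate across demographic groups (demographic parity). The sketch below computes that gap; the predictions and group labels are made-up illustration values.

```python
# Minimal fairness audit sketch: demographic parity difference.
# The predictions and group labels here are made-up illustration data.
from collections import defaultdict

def positive_rate_by_group(predictions, groups):
    """Fraction of positive (1) decisions per group."""
    counts, positives = defaultdict(int), defaultdict(int)
    for pred, group in zip(predictions, groups):
        counts[group] += 1
        positives[group] += pred
    return {g: positives[g] / counts[g] for g in counts}

predictions = [1, 0, 1, 1, 0, 1, 0, 0]          # hypothetical model decisions
groups      = ["A", "A", "A", "A", "B", "B", "B", "B"]

rates = positive_rate_by_group(predictions, groups)
gap = max(rates.values()) - min(rates.values())
print(rates, "demographic parity gap:", round(gap, 2))
# A large gap would trigger a deeper audit or retraining with fairness constraints.
```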
Trustworthiness and Reliability: If AI systems are to be broadly accepted, they need to function reliably and, where required, be comprehensible. Safety guardrails improve robustness by requiring model outputs to behave consistently, and that consistency helps users and other stakeholders build trust in the models.
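What 'consistent behaviour' means can be made operational, for example by asking the same question in slightly different ways and checking that the answers agree. A minimal sketch, assuming a hypothetical deterministic `answer()` call standing in for the deployed model:

```python
# Minimal consistency check sketch: the same question, asked in slightly
# different ways, should produce agreeing answers. `answer` is a hypothetical
# stand-in for a deterministic (temperature 0) call to the deployed model.

def answer(prompt: str) -> str:
    """Placeholder for a deterministic call to the deployed model."""
    return "42"

paraphrases = [
    "What is the answer to the question?",
    "Please state the answer to the question.",
    "Tell me the answer to the question.",
]

responses = {answer(p) for p in paraphrases}
if len(responses) == 1:
    print("Consistent answer:", responses.pop())
else:
    print("Inconsistent responses, flag for review:", responses)
```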
Legal and Ethical Compliance: As AI systems are used more and more in sensitive areas such as health care, finance, self-driving cars, national security and law enforcement, they must fit within the relevant legal and ethical frameworks. Safety frameworks help ensure compliance with regulations and ethical guidelines, sparing the system and its operators both legal trouble and ethical failures.
Now that we are clear about the requirements, let’s explore The Challenge of Explaining Model Outputs
While we cannot do without guardrails, explaining the outputs of deep learning models, especially transformers and diffusion models, is proving hard. A deep learning model is largely a black box: it is very difficult to understand how it arrives at its decisions, because we cannot easily look inside it. At least four factors make the problem harder.
Non-linearity and High Dimensionality: Modern deep learning models can have billions of parameters, and their complex, non-linear interactions in high-dimensional spaces make it very difficult to discern how specific inputs lead to specific outputs.
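To put a number on that, even a toy 6-layer encoder built from PyTorch's stock Transformer modules already has roughly 19 million parameters; production models repeat the same pattern until the count reaches the billions. A quick check, assuming PyTorch is installed:

```python
# Counting parameters of a small 6-layer Transformer encoder with PyTorch.
# Production-scale models repeat this pattern until the total reaches billions.
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)

total = sum(p.numel() for p in encoder.parameters())
print(f"{total:,} parameters")  # roughly 19 million for this toy configuration
```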
Lack of Inherent Interpretability: Unlike traditional machine learning models, which often expose features and how those features contribute to a prediction, deep learning methods, and transformer-based ones in particular, lack inherent interpretability. Attention weights can be visualised to show where the model attended, but they do not by themselves explain why a particular decision was made.
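Attention weights are easy to extract, which is precisely why they are tempting as explanations; they show where the model attended, not why it decided. A minimal PyTorch example, assuming the torch package is installed:

```python
# Extracting attention weights with PyTorch: the weights show which positions
# attend to which, but do not by themselves explain the model's decision.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=16, num_heads=2, batch_first=True)
x = torch.randn(1, 5, 16)   # batch of 1, sequence length 5, embedding dim 16

output, weights = attn(x, x, x)   # self-attention over the sequence
print(weights.shape)              # torch.Size([1, 5, 5]), averaged over heads
```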
Iterative Nature of Diffusion Models: Diffusion models, the generative models behind image synthesis and other data generation, including instruction-conditioned generation, start from random noise and repeatedly apply learned transformations to it, a procedure formulated as a 'diffusion process'. Because the intermediate steps and transformations are generally uninterpretable, they add another layer of opacity when trying to explain how the final output is formed.
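A toy version of that iterative structure is sketched below: start from pure noise and repeatedly apply a denoising step. The `denoise_step` here is a made-up placeholder; in a real diffusion model it would be a trained network predicting and removing noise at each timestep, and that is exactly where the opacity lives.

```python
# Toy sketch of the reverse diffusion loop: start from noise, refine step by step.
# `denoise_step` is a hypothetical placeholder; in a real model it is a trained
# network whose per-step behaviour is hard to interpret.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))      # start from pure noise (a tiny 8x8 "image")

def denoise_step(x: np.ndarray, t: int) -> np.ndarray:
    """Placeholder denoiser: nudge the sample toward a flat gray image."""
    target = np.zeros_like(x)
    alpha = 1.0 / (t + 1)            # smaller corrections early, larger near the end
    return (1 - alpha) * x + alpha * target

NUM_STEPS = 50
for t in range(NUM_STEPS, 0, -1):    # iterate from noisy to (nearly) clean
    x = denoise_step(x, t)

print("final sample std:", round(float(x.std()), 4))
```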
Contextual Dependence: Transformer models rely heavily on context to produce their outputs. Because of this, small changes to the input can lead to large changes in the output, and it is not obvious how to pin down which contextual cues actually drove the model's decision.
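This sensitivity is easy to observe directly: change one word in the prompt and the next-token prediction can shift completely. A small probe, assuming the Hugging Face transformers library and the public "gpt2" checkpoint are available:

```python
# Probe of contextual dependence: one changed word shifts GPT-2's next-token
# prediction. Assumes the transformers library and the public "gpt2" checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def top_next_token(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    next_id = int(logits[0, -1].argmax())   # most likely next token
    return tokenizer.decode(next_id)

print(top_next_token("The doctor said the patient was"))
print(top_next_token("The mechanic said the engine was"))
```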
So How Do We Build an AI Safety Framework?
Given these difficulties in explaining model outputs, the focus should be on building AI safety frameworks around the following elements:
1. Continual Audits and Monitoring: Building feedback loops so that AI models are continually monitored and audited, including channels for user reports so that unintended harmful or biased outputs can be caught and addressed.
2. Transparent Reporting: Publishing transparency reports that document the model's training data, known biases and limitations gives users the context they need to assess and manage the associated risks.
3. Ethical Governance and Use: Establishing ethical frameworks and governance structures, such as ethics boards or other review bodies, helps ensure that new AI models are developed and deployed responsibly.
4. Bias Mitigation Strategies: Techniques like adversarial training and fairness constraints can be used to mitigate the effect of biases in training data and model outputs.
5. Explainability Tools: Explainability tools, such as feature importance analysis, saliency maps and model-agnostic interpretability methods, can play an important role in understanding otherwise opaque deep learning models (a minimal saliency sketch follows this list).
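To illustrate the last point, below is a minimal sketch of a gradient-based saliency map: the gradient of the output score with respect to the input shows which input elements most influence the prediction. The tiny model here is a stand-in for a real network, and PyTorch is assumed to be installed.

```python
# Minimal gradient-based saliency sketch: how much does each input element
# influence the model's output score? The tiny model is a stand-in for a
# real network; the same idea carries over to images and token embeddings.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))

x = torch.randn(1, 10, requires_grad=True)   # one input with 10 features
score = model(x).sum()
score.backward()                              # gradients flow back to the input

saliency = x.grad.abs().squeeze()             # magnitude of influence per feature
print("most influential feature:", int(saliency.argmax()))
```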
In The End
We are going to need guardrails around deep learning models, because these models, especially large transformers and diffusion models, have grown so complex that ad hoc explanations of their behaviour are no longer feasible, and we do not yet have fully satisfying ways to produce such explanations. So how do we make safety and control mechanisms as robust as possible? To return to the human analogy: human institutions, for all their limitations, do manage to maintain reasonably robust safety and control mechanisms. We need the same emphasis on transparency, accountability and ethical compliance in AI technologies so that people can trust them to be deployed safely. As we become more AI-savvy, we can build a system of checks and balances around AI safety principles, so that we keep harnessing the raw power of these models while verifying that they are actually doing what we want, and are safe to keep doing it.