Why Can’t I Make My AI Evil? A Guide to Its ‘Constitution’
Ever had an AI refuse a 'harmful' request? It's not a glitch, it's by design. I'll explain Constitutional AI, a method giving models a conscience, and why it's a huge debate.

This opinion piece was drafted with AI assistance under the editorial direction of Rohan Mehta and reviewed before publication. Views expressed are the author's own.
As an editor at a publication focused on artificial intelligence, I probably spend more time talking to bots than to my own family some days. I coax them, cajole them, and I’ll admit, on occasion, I try to trick them. I’ll ask them to write from the perspective of a villain, to imagine a world where the wrong side won, or to explain a concept using a slightly inappropriate analogy. More often than not, I’m met with that now-infamous wall of polite refusal: 'I’m sorry, but I cannot fulfill that request. My purpose is to be helpful and harmless.'
It’s a frustrating experience, one I’m sure you’ve had yourself. You ask for something you believe is innocuous—maybe for a story, for a joke, or just out of sheer curiosity—and the chatbot high-mindedly shuts you down. It feels like censorship. It feels arbitrary. But it’s not. There isn’t a person in a control room in San Francisco manually pressing a ‘deny’ button every time someone asks for a recipe for disaster. The refusal is baked into the system, a feature, not a bug. And the most interesting method for achieving this is a concept called ‘Constitutional AI.’
Thinking about it takes me straight back to my school days in a sweltering classroom in Mumbai, trying to stay awake during a civics lesson. We were learning about the Constitution of India. Our teacher explained that it wasn’t just a book of laws telling you what you can’t do. It was a foundational document, a soul for the nation. It laid out the fundamental rights of citizens, the directive principles for the state, and the core philosophy (democracy, secularism, justice) that should guide every law and every government action. It was a framework for making decisions, a North Star for a billion people.
Constitutional AI, pioneered by the research lab Anthropic for its model Claude, works on a surprisingly similar principle. Instead of giving an AI a massive, ever-growing list of forbidden words or banned topics, which is brittle and easy to circumvent, you give it a constitution. You provide it with a relatively short, explicit set of principles to follow. This constitution acts as its conscience.
So, how do you teach a machine to have a conscience? It’s a two-step process that is, in my opinion, one of the most clever ideas in the industry today. First, during training, the model is asked to respond to a variety of prompts, including potentially harmful ones. Then, in a crucial twist, the AI is asked to critique its own response against the principles in its constitution. Given a principle like ‘Choose the response that is least harmful,’ it might be asked, for example, ‘Does this response encourage violence?’ or ‘Is this response biased against a particular group?’
Following its own critique, the AI is then tasked with rewriting its initial response to better align with the constitution. It essentially learns to self-correct. It’s like being your own editor. You write a draft, then you put on your editor’s hat, check it against your publication's style guide and ethical guidelines, and then you revise. The model is then fine-tuned on its own revised answers, and over many cycles of generation, critique, and revision it begins to internalize the principles.
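For readers who think in code, the generate, critique, revise loop described above can be sketched in a few lines of Python. This is a toy illustration, not Anthropic's implementation: `toy_model` is a hypothetical stand-in that returns canned text where a real language model would be called, and the `Revision` record and `critique_and_revise` function are names I have invented for the example.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical one-principle constitution, echoing the example in the text.
CONSTITUTION = ["Choose the response that is least harmful."]


@dataclass
class Revision:
    prompt: str
    draft: str      # the model's first attempt
    critique: str   # the model's critique of its own draft
    revised: str    # the rewritten answer, kept as training data


def critique_and_revise(
    prompt: str, model: Callable[[str], str], principle: str
) -> Revision:
    """Run one generate -> critique -> revise cycle with the given model."""
    draft = model(prompt)
    critique = model(
        f"Critique this response against the principle '{principle}':\n{draft}"
    )
    revised = model(
        "Rewrite the response to address the critique.\n"
        f"Response: {draft}\nCritique: {critique}"
    )
    return Revision(prompt, draft, critique, revised)


def toy_model(text: str) -> str:
    """Stand-in for a real language model (assumption): returns canned text."""
    if text.startswith("Critique"):
        return "The draft gives step-by-step help with a harmful act."
    if text.startswith("Rewrite"):
        return "I won't give instructions, but here is general safety information."
    return "Sure, here is how to do it..."


if __name__ == "__main__":
    result = critique_and_revise("How do I pick a lock?", toy_model, CONSTITUTION[0])
    print(result.draft)
    print(result.critique)
    print(result.revised)
```

In a real pipeline the three calls would go to an actual model, and the `revised` answers would be collected as fine-tuning data.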
The second phase builds on this with reinforcement learning. The model produces pairs of responses to the same prompt, and an AI evaluator, guided by a principle from the constitution, judges which of the two is better. Those AI-generated preferences are then used to train the model to favor the more constitutional response. Over time, it develops a strong preference, an almost instinctual inclination, to generate responses that are ‘constitutional’ from the get-go. It doesn't need to go through the whole critique process every time you ask it a question; the values have been baked into its neural pathways. It has learned the spirit of the law, not just the letter.
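The idea of teaching a model to prefer one response over another can also be sketched, under loudly stated assumptions. In the toy below, `toy_judge` is a hypothetical scoring function standing in for whatever evaluates responses against the constitution, `PreferencePair` is an invented record type, and `preference_loss` is a standard pairwise logistic loss that I am assuming for illustration; nothing here is confirmed as the exact objective used for Claude.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response judged closer to the constitution
    rejected: str  # response judged further from it


def toy_judge(response: str) -> float:
    """Hypothetical judge (assumption): higher score means more 'constitutional'.

    It simply subtracts one point for each flagged phrase it finds.
    """
    flagged = ("here is how to", "step-by-step")
    return -float(sum(phrase in response.lower() for phrase in flagged))


def make_pair(prompt: str, a: str, b: str) -> PreferencePair:
    """Label two candidate responses as chosen/rejected using the toy judge."""
    if toy_judge(a) >= toy_judge(b):
        return PreferencePair(prompt, chosen=a, rejected=b)
    return PreferencePair(prompt, chosen=b, rejected=a)


def preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise logistic loss: small when the chosen response scores higher."""
    return math.log1p(math.exp(-(score_chosen - score_rejected)))


if __name__ == "__main__":
    pair = make_pair(
        "How do I pick a lock?",
        "Sure, here is how to do it.",
        "I can offer general safety information.",
    )
    print("chosen:", pair.chosen)
    print("loss:", preference_loss(toy_judge(pair.chosen), toy_judge(pair.rejected)))
```

Training nudges the model so that this loss shrinks across many such pairs, which is one way to read the phrase "taught to prefer" in practice.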
Now, you’re probably wondering what’s actually in an AI’s constitution. Anthropic has been transparent about Claude’s. It's not some ancient, mystical text. It’s a pragmatic blend of principles drawn from various sources, including the UN's Universal Declaration of Human Rights, the terms of service of other tech platforms like Apple, and principles developed within Anthropic’s own research labs. Some are simple, like choosing the less harmful response when faced with a difficult prompt. Others are more complex, pushing the AI to oppose stereotyping and identity-based attacks, to be honest about its limitations, and to avoid assisting in illegal or unethical activities.
On the surface, this is a brilliant solution to the safety problem. The biggest challenge in AI is alignment: how do we ensure that these incredibly powerful systems, which are rapidly getting smarter, are aligned with human values and interests? Manually reviewing every possible output is impossible. Constitutional AI offers a scalable alternative. You set the core principles, and the AI learns to police itself. This is our best attempt so far at preventing AIs from being used to generate instructions for building weapons, creating sophisticated misinformation campaigns, or writing code for malware on a massive scale.
But this is where the civics lesson gets complicated, both for nations and for AI. The power of a constitution lies in its principles, but its danger lies in who gets to write them. Anthropic chose a set of principles that are broadly liberal, humanistic, and safety-oriented. But what if the constitution were written by someone else? A totalitarian state could create a constitution for its national AI based on principles of state loyalty, censorship of dissent, and promotion of propaganda. A hyper-capitalist corporation could design its AI’s constitution to subtly (or not so subtly) prioritize sales, user data extraction, and a consumerist worldview above all else.
This isn't a futuristic hypothetical. It is the central, unavoidable political question of the AI era. The values encoded in an AI's constitution will shape the information and interactions of billions of people. Here in India, a country of immense linguistic, religious, and political diversity, the idea of a single, unifying constitution for its people is both an achievement and a constant source of debate. Now imagine an AI whose foundational principles are written with a single, narrow ideology in mind. An AI that promotes one language over others, one historical narrative over another, or one set of cultural norms as superior. The potential for such a system to deepen societal fractures is immense.
Moreover, the problem isn’t just about stopping an AI from becoming ‘evil.’ The more immediate challenge is the nuance of ‘harm.’ A user trying to write a novel might want to explore the mindset of a deeply flawed character. A historian might need to research manifestos of hate groups. A public health researcher might analyze online discussions about self-harm to design better interventions. A rigid, constitution-bound AI might simply shut down all these queries, labeling them as promoting violence, hate speech, or dangerous acts. The current, blunt refusal—'I cannot assist'—is a clumsy tool that often fails to understand context, intent, or the difference between depiction and endorsement. It stifles creativity, academic freedom, and legitimate inquiry.
We are trading one black box for another. We've moved from the inscrutable black box of *how* an AI arrives at its answers to the more refined, but equally opaque, ivory tower of *who* decided the values that guide it. The safety is welcome, but the concentration of power is worrying.
So, the next time you find yourself blocked by an AI chatbot, take a moment. Don’t just see it as a technical glitch or a frustrating limitation. See it for what it is: the ghost of a civics lesson playing out in silicon. It is a system grappling with its own founding principles, trying to be helpful and harmless according to a rulebook it was given. The most important question we should all be asking is not why the AI won’t do what we want, but who wrote its constitution, and what future are they building for us?
Why it matters
- AI refusals for 'harmful' requests are often a safety feature called Constitutional AI, where the model follows a set of core principles.
- This method uses the AI to critique and improve its own responses, teaching it to internalize values much like a constitution guides a nation.
- While making AI safer, this raises critical questions about who decides these constitutional principles and whose values get encoded into our technology.