How Do You Check Your AI For Bias?
You’ll likely have seen posts and articles talking about the inherent bias of AI image generators, like this phenomenal study from Bloomberg.
Using AI To Check For AI Bias
I wanted to see if I could run a similar kind of test for Large Language Models like ChatGPT. So I decided to go straight to the source and ask ChatGPT to help me formulate a way of testing for bias. Together we came up with the following categories to test: Gender Bias, Racial and Cultural Bias, Socio-Economic Bias, Age Bias, Geographical Bias, Language Bias, and Intersectional Bias.
I decided that – in a similar way to the image generator tests – I’d ask it to generate simple personas for different professions. Think of these as written sketches. So I wrote this little prompt:
I used this prompt in Bing, Perplexity, Llama 2, Bard, ChatGPT3.5, and ChatGPT4 for the professions of nurse, teacher, and prisoner. So that gave me 18 tables of data. All I had to do now was analyse them.
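A quick aside: I did all of this by pasting the prompt into each chat window by hand. But if you’d rather script the collection step, it looks roughly like the sketch below. The ask_model() helper and the PERSONA_PROMPT string are placeholders of mine, not the actual prompt or method from this study – swap in the prompt above and whichever access route you prefer.

```python
# A rough sketch of the collection step, for anyone who'd rather script it than
# paste prompts into six chat windows by hand (which is what I did). Both
# ask_model() and PERSONA_PROMPT are placeholders, not the prompt or method
# actually used in this study.

MODELS = ["Bing", "Perplexity", "Llama 2", "Bard", "ChatGPT3.5", "ChatGPT4"]
PROFESSIONS = ["nurse", "teacher", "prisoner"]

PERSONA_PROMPT = "Create a table of simple personas for the profession: {profession}"


def ask_model(model: str, prompt: str) -> str:
    """Placeholder: call the relevant API here, or paste the prompt into the
    chat interface by hand and return the response text."""
    return f"(response from {model} to: {prompt})"


# One persona table per model/profession pair: 6 models x 3 professions = 18.
responses = {
    (model, profession): ask_model(model, PERSONA_PROMPT.format(profession=profession))
    for model in MODELS
    for profession in PROFESSIONS
}

print(f"Collected {len(responses)} tables")  # 18
```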
So I wrote this prompt to get ChatGPT4 to try to sniff out bias in the responses:
It did an extraordinary job of this. I checked the first few analyses and couldn’t find anything I disagreed with. If you want to check it out for yourself, you can see the data and the analyses in this messy Google Doc.
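In code terms, that analysis step looks something like the sketch below, carrying on from the collection sketch above. The analyse_for_bias() helper is a placeholder for pasting each table into ChatGPT4 along with the analysis prompt, and the score-out-of-5 per category is my assumption based on how the results are reported further down.

```python
# Continuing the collection sketch above: run every persona table through the
# bias analysis and record a score per category. analyse_for_bias() is a
# placeholder for pasting the table into ChatGPT4 with the analysis prompt;
# the 0-5 scale is an assumption based on how the results are reported below.

CATEGORIES = [
    "Gender", "Racial and Cultural", "Socio-Economic",
    "Age", "Geographical", "Language", "Intersectional",
]


def analyse_for_bias(persona_table: str) -> dict[str, int]:
    """Placeholder: ask the analysis LLM to score this table for each bias
    category (5 = least biased) and return those scores."""
    return {category: 0 for category in CATEGORIES}  # replace with real scores


def score_all(responses: dict) -> dict:
    """Map each (model, profession) table to its per-category scores."""
    return {key: analyse_for_bias(table) for key, table in responses.items()}
```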
I’ve given you the prompts because I’m encouraging you to run this test yourself. Especially if you have an internal chatbot. I want you to understand that bias is inherent in your AI tools and that you need to be aware of it.
I used the prompts to do a cursory Saturday afternoon study. If I were doing this to an academically rigorous standard, I’d run the tests at least a couple more times. I’d add in a few more LLMs. And I’d use a lot more professions. So the results I got are only indicative. But you still want to see them, don’t you?
Which LLMs Are The Least Biased?
Here’s the chart that gives you the results in an easy-to-digest nugget. Please note that the highest scores represent the LEAST amount of bias.
Tall column = GOOD. Short column = BAD.
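If you’re rebuilding this chart from your own runs, the overall figure for each model is just the average of every score it received, expressed as a percentage of the maximum – a minimal sketch, still assuming the 0-5 scale:

```python
# Overall score per model: average every category score the model received
# (across all professions) and express it as a percentage of the maximum.

from collections import defaultdict

MAX_SCORE = 5  # assumed scale, where 5 = least biased


def overall_by_model(scores: dict) -> dict[str, float]:
    """scores maps (model, profession) -> {category: score out of 5},
    as produced by score_all() in the earlier sketch."""
    per_model = defaultdict(list)
    for (model, _profession), category_scores in scores.items():
        per_model[model].extend(category_scores.values())
    return {
        model: 100 * sum(values) / (MAX_SCORE * len(values))
        for model, values in per_model.items()
    }
```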
So, in this test, the three LLMs that showed the least amount of bias were ChatGPT4, Bard, and ChatGPT3.5. They’re pretty much neck-and-neck in the high eighties.
But I’m befuddled by the worst-scoring LLM. Bing consistently generated the highest amount of bias even though it uses ChatGPT as its engine. Huh? Can anyone explain that one to me?
Now let’s break down the results into each of the different categories.
As you can see, most of the LLMs are pretty good when it comes to gender representation. The only one that scores below 4 out of 5 is Llama 2. And Bard gets full marks.
This is where Bing really dropped the ball. While Bard and the two versions of ChatGPT got full marks, Bing acted like a drunk, embarrassing uncle at a wedding. (It honestly didn’t – but it did deliver the least racially-balanced results.)
This one is pretty shocking. Representation of lower-income individuals is bad across the board. Except for ChatGPT4, which got full marks like a class swot.
And now for the only bias that might affect an old, educated, British, white man like me. And it looks like it’s not really much of an issue in any of the LLMs, according to this test. As if I didn’t have enough privilege.
Now look at the variation here. This isn’t too much of a surprise, really. The internet mainly consists of English-speaking, Western content. If that’s the main source of training data, it’s bound to have an impact. So, well done ChatGPT for addressing this.
This is very similar to the previous chart because of exactly the same issue. It’s interesting to see the pattern appearing for each LLM.
This is a term I wasn’t familiar with. For those of you who don’t know what Intersectional Bias is, it’s when multiple kinds of bias intersect to create their own effects. For example, white women experience gender bias, and black men experience racial bias – but black women experience both. As you can see, none of the LLMs are knocking it out of the park here. But we’d need to delve deeper to find out what these intersectional effects are.
Which Biases Are The Biggest Problem?
Let’s look at another chart where the largest columns indicate the least amount of bias.
The study revealed more Socio-Economic, Language, and Geographical biases in the data. You can clearly see that Gender, Racial, and Age biases show up the least. Maybe because they’re the most visible, more has been done to address them.
But that doesn’t mean they’re not a problem. The very fact that they’re not earning full marks shows room for improvement.
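If you want to recreate this per-category view from your own scores, it’s the same calculation as the per-model chart, just grouped the other way – again a rough sketch, assuming the 0-5 scale:

```python
# Overall score per bias category: average that category's score across every
# model and profession, expressed as a percentage of the maximum.

from collections import defaultdict

MAX_SCORE = 5  # assumed scale, where 5 = least biased


def overall_by_category(scores: dict) -> dict[str, float]:
    """scores maps (model, profession) -> {category: score out of 5}."""
    per_category = defaultdict(list)
    for category_scores in scores.values():
        for category, score in category_scores.items():
            per_category[category].append(score)
    return {
        category: 100 * sum(values) / (MAX_SCORE * len(values))
        for category, values in per_category.items()
    }
```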
These are also just a selection of biases. I’ve not included sexuality, religion, disability, beauty, and a long list of others.
However, the main takeaway is that LLMs have inherent bias. And it’s the responsibility of us – the users – to be aware of these biases and do what we can to address them.
How Do You Look Out For Bias?
Now this is the difficult bit. How many people are trained to spot bias in the workplace? Not many, I reckon. And when employees are under time pressure, how likely are they to review their work for bias before submitting it?
When I researched advice and training on bias, the subject didn’t come across as simple and practical. This is a nuanced topic, after all. But the more complex we make it, the harder it is for people to tackle it.
Many people use words and terms they don’t realise are loaded or downright offensive. I have been guilty of inadvertently using biased language. I may even have done so in this article. If I have, please let me know so I can continue to improve.
Naturally, I’ve been wondering if we can use AI to help us address the problem. Can we use it to help identify bias in its own output? Can we use it to help us address our personal biases? Can we even use it as a filter to help us identify and remove bias? Because it seems to be better at spotting bias than most humans are.
I think the answer is yes. And I’d like to offer a solution for you. However, I don’t have one. And I don’t want to share a prompt that may not be robust enough.
If you’re a forward-thinking bias expert who’d like to help me work on this, please let me know.
Is Your Organisation Making Things Worse?
Over the past year, tech companies have been offering secure LLMs that can be trained on your organisation’s data. (How valuable these are is a debate for another day.) And this additional training data may exacerbate the bias problem.
If your data has an inherent bias, it will naturally affect the results. And it may well compound the bias that’s already there.
We're Only Scratching The Surface
As I previously said, this little study is not academically robust. I won’t be surprised if it attracts criticism from proper researchers. However, it does reveal some observations that merit further exploration.
To that end, I’m asking if any of my academic connections would be interested in collaborating on a proper study. If so, please DM me so we can chat. Then we can maybe encourage my publishing contacts to help us get the results out there.
What are your thoughts on this little piece of amateur research? Were you aware of the bias in your AI responses? Do you find the results interesting? Or should I never do anything like this again and just stick to dad jokes? Let me know in the comments.
Dave Birss
I'm one of LinkedIn Learning's most popular AI instructors. I help organisations and individuals get more value out of Generative AI.
I do that by applying strategy, teaching prompt-writing, and focusing on humans as much as the technology.
I'm also the founder of the Sensible AI Manifesto and the author of several books on creativity and innovation.

