Fascinating, but not surprising.

We found that both GPT-3.5 and GPT-4 are strongly biased, even though GPT-4 has a slightly higher accuracy for both types of questions. GPT-3.5 is 2.8 times more likely to answer anti-stereotypical questions incorrectly than stereotypical ones (34% incorrect vs. 12%), and GPT-4 is 3.2 times more likely (26% incorrect vs. 8%).
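
As a quick back-of-the-envelope check, those ratios follow directly from the incorrect-answer rates quoted above:

```python
# Back-of-the-envelope check of the bias ratios quoted above:
# (rate of incorrect answers on anti-stereotypical questions) /
# (rate of incorrect answers on stereotypical questions)
rates = {
    "GPT-3.5": (0.34, 0.12),
    "GPT-4": (0.26, 0.08),
}

for model, (anti, stereo) in rates.items():
    print(f"{model}: {anti / stereo:.1f}x more likely to answer incorrectly")

# GPT-3.5: 2.8x more likely to answer incorrectly
# GPT-4: 3.2x more likely to answer incorrectly
```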

An important thing to keep in mind is that large language models like ChatGPT are not magic; they are trained on datasets created by humans. And humans have biases. Those biases are repeated in the outputs of the LLM.

No magic filtering exists that removes human bias.

The LLM is just predicting which word comes next in the response.
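
To make "predicting the next word" concrete, here is a minimal sketch using the openly available GPT-2 model via the Hugging Face transformers library; GPT-2 is a stand-in assumption here, since ChatGPT's own models cannot be probed this way. Prompt it with something like “The nurse said that” and the top-ranked next tokens simply reflect whatever word patterns, gendered pronouns included, dominated the training text.

```python
# A minimal sketch of next-word prediction, using the open GPT-2 model
# as a stand-in (ChatGPT's own models cannot be inspected like this).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The nurse said that"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Turn the logits at the final position into a probability
# distribution over the next token, then show the top candidates.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)

for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id)!r}: {prob.item():.3f}")
```

Whatever completion comes out on top is simply the one that was most common after a word like "nurse" in the training data; nothing in that loop knows or cares whether the association is a stereotype.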

Having played around with ChatGPT a bit, I have found it useful for surfacing the threads you should pull to explore a topic. For example, I asked it to tell me about “12 USC 24 (seventh)” the other day and it did a fantastic job. But I did not trust the answer, so I went to other sources to validate it. ChatGPT's output gave me a good starting point and made clear what questions I should be asking.