I tried playing the new Claude 4.5 model in a game of chess. It did really poorly. It played a coherent opening and was difficult to play against for the first ten or fifteen moves, then everything fell apart and it started blundering and even making some illegal moves. Exactly what you’d expect from an AI model that has memorized a bunch of stuff, but can’t do its own deep thinking very well at all.
I also played Claude 4.5 on Fantastical Chess, my alternate version of chess designed to remove the memorization problem. The idea is that AI models may have memorized a lot of information about games like chess, which poisons our ability to benchmark how well they can accomplish tasks like playing chess in general. They’re cheaters, basically. So I created a chess-like game (among others) to see how well AI models perform on games we know they haven’t memorized. It played very badly, not making a single move that improved its position.
Not much of a surprise, since GPT-5 played basically the same way, but I’m glad to know even the latest Claude model is no better than the OpenAI models at deep thinking. Once AI models can do sophisticated deep thinking, they can do fun things like killing us all. Once AI models start to succeed at playing novel, complex board games, that will be a warning sign that these things are becoming dangerous, since such games have a level of complexity that mirrors genuinely hard problems, like those in physics, engineering, or medicine.
I do wonder what the course of progress is going to look like for these models. I don’t feel a substantive difference between Claude 4 and Claude 4.5 when coding, just as I didn’t when moving from previous OpenAI models to GPT-5. Previous leaps, like GPT-3 to GPT-3.5, were night and day, with every new model exciting people and inflaming their imaginations. This leap was minor in comparison. Are we just going to keep getting small releases like this? My experiments with LLM deep thinking suggest yes, but it will be interesting to see how it goes. I think we’re better off with progress going slowly, but at any time a new breakthrough could create another sudden leap. Technological progress is usually not predictable. Hardware is the partial exception: there, the breakthroughs themselves aren’t really the bottleneck, the speed at which you can physically implement them is. Everywhere else, progress looks jumpy and ends up being hard, though not impossible, to predict. And if you can’t predict it, you can’t count on it being safe. We need to take serious measures to ensure that AI isn’t going to kill us all, given that we have no idea when that could happen.
(By the way, I’ve updated the website to allow you to play against any OpenRouter model. I tried using Claude 4.5 to make that change, and it was just as painful as using other AI models to update existing code.)
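The plumbing for that kind of feature is straightforward, since OpenRouter exposes an OpenAI-compatible chat completions endpoint. Here’s a rough sketch of asking a model for a move — not the site’s actual code; the model id, prompt wording, and helper name are my own illustrative choices:

```python
# Hypothetical sketch of requesting a chess move from an OpenRouter model.
# OpenRouter's chat completions endpoint is OpenAI-compatible; the model id
# and prompt below are assumptions, not the site's real implementation.
import json
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(api_key: str, model: str, fen: str) -> urllib.request.Request:
    """Build the HTTP request asking the model for one move in UCI notation."""
    payload = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": (
                    f"You are playing chess. Position (FEN): {fen}. "
                    "Reply with a single legal move in UCI notation."
                ),
            },
        ],
    }
    return urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
```

The response follows the OpenAI schema, so the move would come back in `choices[0].message.content` — and you would still have to validate it server-side, since, as noted above, these models happily emit illegal moves.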
(It ended up sacrificing its rook for no reason instead of taking the rook trade.)
White played Rd2, offering a rook trade. Let me think about the position:
Current position:
8 ♜ . . . . . ♚ .
7 ♟ ♟ . ♟ ♟ ♟ . ♟
6 . . . ♟ . . ♟ .
5 . . . ♕ . . . .
4 . . . . . . . .
3 . . . . . . ♙ .
2 ♙ ♙ ♜ ♖ . ♙ ♗ ♙
1 ♖ . . . . . ♔ .
a b c d e f g h
White has offered a rook trade with Rd2. Should I take?
If I take with Rxd2, then White recaptures with Qxd2. That would give White a Queen vs my Rook endgame, which is completely losing for me. Queen vs Rook is a huge material disadvantage.
I should not trade! I should move my rook away, perhaps to c5 to challenge the queen, or c1 to invade the back rank, or even back to c8.
Let me think:
Actually wait - can I play Rc1+? Let me check. My rook is on c2, White’s king is on g1. If I play Rc1+, that’s check. That looks like the best move - forcing and active.
After Rc1+, White would have to block with the rook Ra1-d1 or move the king.
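For what it’s worth, the tactical claims in that transcript do check out mechanically. Here’s a quick sanity check with the python-chess library, using a FEN I transcribed from the diagram by hand (my own reconstruction, taking White’s king to be on g1 as the text states): ...Rc1+ is legal and really is check, and after ...Rxd2 White really can recapture Qxd2.

```python
import chess  # pip install python-chess

# FEN hand-transcribed from the diagram above (my own reconstruction,
# with White's king on g1 as the transcript's text states; Black to move).
FEN = "r5k1/pp1ppp1p/3p2p1/3Q4/8/6P1/PPrR1PBP/R5K1 b - - 0 1"
board = chess.Board(FEN)

rc1 = chess.Move.from_uci("c2c1")   # ...Rc1+
print(board.is_legal(rc1))          # True: the rook move is legal
print(board.gives_check(rc1))       # True: and it is indeed check

# Accepting the trade instead: after ...Rxd2, White can recapture Qxd2.
after_trade = board.copy()
after_trade.push(chess.Move.from_uci("c2d2"))
print(chess.Move.from_uci("d5d2") in after_trade.legal_moves)  # True
```

One nit on the last line of the transcript: it’s the d2-rook, not the a1-rook, that could block on d1 after ...Rc1+ — Ra1–d1 is impossible with the black rook sitting on c1 — which is exactly the kind of detail these models keep getting wrong.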