I still use GPT-4 and Sonnet 3.5 because I haven't seen a significant advantage in the newer models that would pull me away from crafting careful prompts. Relying on a model without prompt engineering sets the expectation that it will understand everything automatically, but in my experience I still have to fall back on deliberate prompting to get accurate, useful responses. I'd rather see side-by-side comparisons of how these models perform on real-world tasks than trivial tests like counting the 'r's in 'strawberry,' which don't matter to most people.
Also, I definitely don't use it for coding. It's frustrating that it racks up billable tokens while it's 'thinking', $$$$$ like seriously, lol.
