Anthropic AI finds massive security flaws worldwide

5,893 Views | 84 Replies | Last: 6 hrs ago by Mr.Milkshake
Logos Stick
KingofHazor said:

Logos Stick said:

For the others on this board....

that image about integer arithmetic comes from a 2023 research paper titled "GPT Can Solve Mathematical Problems Without a Calculator".

Might as well publish an example from 1990.

Grok 4 - released in July of last year - scored 100% on the AIME 2025. AIME is a notoriously difficult early-college level math competition exam used to qualify students for the US Math Olympiad team. Grok 4 aced it!

Your claim is just cherry picking an old, narrow benchmark to make a broad negative point about AI capabilities. It was legit criticism in 2023-2024, but it doesn't hold water with frontier 2025-2026 models.

You guys can claim AI accuracy all you want, but my frequent use of Gemini, Claude, Grok, ChatGPT, and Elicit shows that they remain replete with all kinds of errors and are absolutely untrustworthy.

The anecdotal stories of AIs failing some test and then passing it a year later with flying colors sound like the AIs are being revised specifically to pass those tests they failed, without fixing the underlying problems that cause them to fail multiple different types of tests. It's reminiscent of stock traders tweaking their models to perform 100% on historical data, but then the models fail 100% of the time on real-time trades.



Unlike your argument - "my frequent use shows AI just doesn't work" - my post is not anecdotal. It's a real world math test that Grok 4 aced. Of course the models are being improved over time. Why do you consider that a negative? LOL. Makes no sense.
KingofHazor
Logos Stick said:

KingofHazor said:

Logos Stick said:

For the others on this board....

that image about integer arithmetic comes from a 2023 research paper titled "GPT Can Solve Mathematical Problems Without a Calculator".

Might as well publish an example from 1990.

Grok 4 - released in July of last year - scored 100% on the AIME 2025. AIME is a notoriously difficult early-college level math competition exam used to qualify students for the US Math Olympiad team. Grok 4 aced it!

Your claim is just cherry picking an old, narrow benchmark to make a broad negative point about AI capabilities. It was legit criticism in 2023-2024, but it doesn't hold water with frontier 2025-2026 models.

You guys can claim AI accuracy all you want, but my frequent use of Gemini, Claude, Grok, ChatGPT, and Elicit shows that they remain replete with all kinds of errors and are absolutely untrustworthy.

The anecdotal stories of AIs failing some test and then passing it a year later with flying colors sound like the AIs are being revised specifically to pass those tests they failed, without fixing the underlying problems that cause them to fail multiple different types of tests. It's reminiscent of stock traders tweaking their models to perform 100% on historical data, but then the models fail 100% of the time on real-time trades.



Unlike your argument - "my frequent use shows AI just doesn't work" - my post is not anecdotal. It's a real world math test that Grok 4 aced. Of course the models are being improved over time. Why do you consider that a negative? LOL. Makes no sense.

As I said in my post, but to repeat, because I suspect that the models are being tweaked to pass a particular test. The underlying problems that caused them to fail in the first place are not being addressed.

And passing one "real world" math test is anecdotal. It's a perfect example of anecdotal.
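That stock-trader analogy is the classic overfitting pattern, and it's easy to demonstrate. A minimal sketch (all numbers here are made up for illustration): a model tuned until it fits history perfectly can still miss badly on fresh data.

```python
# Sketch of overfitting: a model tuned to fit "historical" data perfectly
# can still fail on fresh data. All data here is hypothetical.
import numpy as np

rng = np.random.default_rng(0)
x_hist = np.linspace(0.0, 1.0, 10)
y_hist = x_hist + rng.normal(0.0, 0.2, 10)  # noisy history; true relation is y = x

# A degree-9 polynomial has enough knobs to pass through all 10 points...
coeffs = np.polyfit(x_hist, y_hist, 9)
hist_err = np.abs(np.polyval(coeffs, x_hist) - y_hist).max()

# ...but on fresh "real-time" points between the old ones, it misses.
x_new = np.linspace(0.05, 0.95, 10)
new_err = np.abs(np.polyval(coeffs, x_new) - x_new).max()

print(f"historical error: {hist_err:.4f}, out-of-sample error: {new_err:.4f}")
```

The historical error is essentially zero while the out-of-sample error is orders of magnitude larger, which is exactly the "100% on backtests, fails live" pattern.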
Logos Stick
KingofHazor said:

Logos Stick said:

KingofHazor said:

Logos Stick said:

For the others on this board....

that image about integer arithmetic comes from a 2023 research paper titled "GPT Can Solve Mathematical Problems Without a Calculator".

Might as well publish an example from 1990.

Grok 4 - released in July of last year - scored 100% on the AIME 2025. AIME is a notoriously difficult early-college level math competition exam used to qualify students for the US Math Olympiad team. Grok 4 aced it!

Your claim is just cherry picking an old, narrow benchmark to make a broad negative point about AI capabilities. It was legit criticism in 2023-2024, but it doesn't hold water with frontier 2025-2026 models.

You guys can claim AI accuracy all you want, but my frequent use of Gemini, Claude, Grok, ChatGPT, and Elicit shows that they remain replete with all kinds of errors and are absolutely untrustworthy.

The anecdotal stories of AIs failing some test and then passing it a year later with flying colors sound like the AIs are being revised specifically to pass those tests they failed, without fixing the underlying problems that cause them to fail multiple different types of tests. It's reminiscent of stock traders tweaking their models to perform 100% on historical data, but then the models fail 100% of the time on real-time trades.



Unlike your argument - "my frequent use shows AI just doesn't work" - my post is not anecdotal. It's a real world math test that Grok 4 aced. Of course the models are being improved over time. Why do you consider that a negative? LOL. Makes no sense.

As I said in my post, but to repeat, because I suspect that the models are being tweaked to pass a particular test. The underlying problems that caused them to fail in the first place are not being addressed.

And passing one "real world" math test is anecdotal. It's a perfect example of anecdotal.



With all due respect, you don't know what anecdotal means:

Anecdotal = based on personal observations, individual stories, or casual experiences, e.g., "I tried it and it did not work for me" or "So and so said it did not work for them...". It's subjective, unverified in any systematic way, and considered weak evidence.


Me saying "Grok 4 made 100% on the AIME test" is not anecdotal.

It's a reported benchmark result from formal evaluations that GROK 4 undertook. AIME (American Invitational Mathematics Exam) is a standardized math competition used as a public benchmark for AI models.



If you think "AI doesn't work", then don't use it. The rest of us who know better will continue on the journey.
KingofHazor
OK, besides quibbling about what anecdotal means, why do AIs fail so frequently and catastrophically at everyday tasks? If they can ace that math test, then surely everyday tasks requested by a layman should be a breeze for them, right?

When facing criticism or even questions about the reliability of AI, you AI bros just seem to stick your fingers in your ears and yell "La la la, I can't hear you", rather than addressing the specific issues raised. To be honest, you come across as true believers unwilling to even listen to perspectives that you consider heresy.
AuditAg
Windy City Ag said:

Anthropic marketing execs doing work!

They have built a product so devastating that they are only going to sell it to corporations for large annual license fees.

It found a 27-year-old bug in OpenBSD that would shut down the internet in a heartbeat.
Logos Stick
Honestly, man, it's in my best interest to discourage the use of AI - both from a personal gain perspective and because of the negative human impact I see it potentially having.
Deputy Travis Junior
My honest guess is the people who are getting catastrophic failures out of AI systems are giving them short, vague prompts that don't clearly spell out what the AI is supposed to do and what output the user wants.

That's a problem because they're brilliant if you use them right, but they are lacking tons of judgment that we take for granted. You can't just say "clean up my room, it's a mess" because the AI is as likely to donate all your belongings to Goodwill to get rid of the clutter as it is to clean up. You have to say put the toys in the closet, fold the laundry, organize the books alphabetically. (I know they don't have physical manifestations yet, but this is a clear example that shows how vague instructions lead to bad/unexpected results.)

If I want one to solve a really gnarly financial modeling problem, it usually takes me ~15 minutes to write the prompt that precisely describes the background context (business model), the hangup, and what I need it to solve. If I want it to do research or diligence, I tell it how I want to measure the various things that are important to me and how skeptical it should be of the business' claims.

Most people just drop in a couple sentences, hit go, and hope for the best. That's not adequate to assign something to a human employee and it's not enough for an AI either.
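For what it's worth, the structure described above can be templated. This is just a hypothetical sketch (the section names and function are my own invention, not any AI vendor's API), but it shows the difference between a one-line ask and a spelled-out assignment:

```python
# Hypothetical prompt template: spell out background, the specific problem,
# and the exact output wanted, rather than a one-line ask.
def build_prompt(context: str, problem: str, output_spec: str, constraints=()) -> str:
    """Assemble a structured prompt from named sections."""
    parts = [
        "## Background\n" + context,
        "## Problem\n" + problem,
        "## Required output\n" + output_spec,
    ]
    if constraints:
        parts.append("## Constraints\n" + "\n".join(f"- {c}" for c in constraints))
    return "\n\n".join(parts)

print(build_prompt(
    context="SaaS business, usage-based pricing, 3-year model in a spreadsheet.",
    problem="Revenue projection diverges when monthly churn exceeds 5%.",
    output_spec="A corrected formula plus a 3-line explanation of the fix.",
    constraints=["Be skeptical of the company's own growth claims."],
))
```

The point isn't this particular template; it's that the prompt carries the same context, scope, and acceptance criteria you'd give a human employee.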
SMM48
Claude
SMM48
Give an example of failure. That's such a broad stroke. What do you mean?
bmks270
AuditAg said:

Windy City Ag said:

Anthropic marketing execs doing work!

They have built a product so devastating that they are only going to sell it to corporations for large annual license fees.

It found a 27-year-old bug in OpenBSD that would shut down the internet in a heartbeat.


Who says?
KingofHazor
SMM48 said:

Give an example of failure. That's such a broad stroke. What do you mean?

Right now I'm doing scholarly research on topic X. I started by running some ideas by AI and, quite frankly, one AI did a phenomenal job of discussing the topic, even coming up with its own ideas that as far as I can tell, are unique and novel. (The other AIs fought me tooth and nail, insisting that the plain vanilla answer to the issue I was researching was the only correct answer.) I'll always ask the AI for support for its ideas in the form of links or citations, which it then provides. When I go to check those links and citations, they either don't stand for the AI's stated proposition at all or simply don't exist.

I asked Claude once about that and it freely volunteered that it was hallucinating. I asked if I could modify my prompts and questions to get more accurate responses and its response was "Frankly, no."

On a much more mundane topic I've tried various AIs to research something I am thinking of purchasing. To me, it seemed an ideal task for AI. Rather than me spending hours scouring the internet finding possible items, researching the reviews, and then finding the best prices, AI should be able to readily do that quickly.

All AIs say sure, they can do that, but then their output is garbage. The data they provide is years old, contains links that are invalid, and quotes prices that haven't been valid for years. Asking for only current info does not produce better results.

Finally, I've tried to use AI to help out with the constant issue of apps on my PC and my phone not doing what they're supposed to do. Pre-AI, that was a task I hated. Finding the cause of the misbehavior was frequently like trying to find a needle in a haystack. Again, I thought that would be a job that would be ideal for AI. Occasionally, it does do a good job. But more frequently, it provides wrong or nonsensical answers. When queried why the mistakes, it frequently says that it was opining on previous versions of the apps that were current during its initial training. When I then ask it to make sure that its answers are all based on the latest version, each AI promises to do so, but then does not, instead again providing obsolete or simply wrong answers.

These may not seem catastrophic at first glance. But if one were to rely on AI's answers without doublechecking, it would be catastrophic. I am increasingly concluding that it takes far more time to work with AI and doublecheck its answers than it does to simply do the work myself the old-fashioned way.
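One way to take some of the doublechecking pain out of this: screen the AI's citations mechanically before reading any of them. A rough sketch using only the Python standard library (the function names are my own; note this only confirms a link is well-formed and the server answers, not that the page actually supports the AI's claim - that part is still on you):

```python
# Screen AI-provided citation links before trusting them.
# Only checks well-formedness and reachability, not the content's relevance.
from urllib.parse import urlparse
from urllib.request import Request, urlopen

def looks_like_url(link: str) -> bool:
    """Cheap syntactic screen: http(s) scheme and a host must be present."""
    parts = urlparse(link)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

def check_link(link: str, timeout: float = 5.0) -> bool:
    """Return True if the URL is well-formed and the server responds."""
    if not looks_like_url(link):
        return False
    try:
        req = Request(link, method="HEAD")
        with urlopen(req, timeout=timeout) as resp:
            return resp.status < 400
    except OSError:
        return False  # dead link, DNS failure, timeout, ...
```

Hallucinated citations tend to fail one of these two gates immediately, which at least tells you which references are worth reading.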
SMM48
Got it. Thanks. Hallucinating. That's funny. Gig 'em.

There is your problem. AI doesn't browse the web in real time; it was trained on data up to a certain cutoff date. Some models will search the net in real time. It would be helpful if they told the user that info may be dated.

The AI knows a source exists but will fabricate a plausible-looking link. Or the page existed during training but has since been moved or become outdated. Data could be behind a paywall. Etc.
Deputy Travis Junior
Good models browse the web now
SMM48
Yes aware thanks
Mr.Milkshake
90% marketing, and well done. But Dario is FOS
 