Anthropic AI finds massive security flaws worldwide

7,463 Views | 113 Replies | Last: 2 hrs ago by dmart90
Logos Stick
KingofHazor said:

Logos Stick said:

For the others on this board....

that image about integer arithmetic comes from a 2023 research paper titled "GPT Can Solve Mathematical Problems Without a Calculator".

Might as well publish an example from 1990.

Grok4 - released in July of last year - scored 100% on the AIME 2025. AIME is a notoriously difficult early-college level math competition exam used to qualify students for the US Math Olympiad team. GROK 4 aced it!

Your claim is just cherry picking an old, narrow benchmark to make a broad negative point about AI capabilities. It was legit criticism in 2023-2024, but it doesn't hold water with frontier 2025-2026 models.

You guys can claim accuracy by AI all you want but my frequent use of Gemini, Claude, Grok, ChatGPT, and Elicit show that they remain replete with all kinds of errors and are absolutely untrustworthy.

The anecdotal stories of AIs failing some test and then passing it a year later with flying colors sound like the AIs are being revised specifically to pass those tests they failed, without fixing the underlying problems that cause them to fail multiple different types of tests. It's reminiscent of stock traders tweaking their models to perform 100% on historical data, but then the models fail 100% of the time on real-time trades.



Unlike your argument - "my frequent use shows AI just doesn't work" - my post is not anecdotal. It's a real-world math test that Grok 4 aced. Of course the models are being improved over time. Why do you consider that a negative? LOL. Makes no sense.
KingofHazor
Logos Stick said:

KingofHazor said:

Logos Stick said:

For the others on this board....

that image about integer arithmetic comes from a 2023 research paper titled "GPT Can Solve Mathematical Problems Without a Calculator".

Might as well publish an example from 1990.

Grok4 - released in July of last year - scored 100% on the AIME 2025. AIME is a notoriously difficult early-college level math competition exam used to qualify students for the US Math Olympiad team. GROK 4 aced it!

Your claim is just cherry picking an old, narrow benchmark to make a broad negative point about AI capabilities. It was legit criticism in 2023-2024, but it doesn't hold water with frontier 2025-2026 models.

You guys can claim accuracy by AI all you want but my frequent use of Gemini, Claude, Grok, ChatGPT, and Elicit show that they remain replete with all kinds of errors and are absolutely untrustworthy.

The anecdotal stories of AIs failing some test and then passing it a year later with flying colors sound like the AIs are being revised specifically to pass those tests they failed, without fixing the underlying problems that cause them to fail multiple different types of tests. It's reminiscent of stock traders tweaking their models to perform 100% on historical data, but then the models fail 100% of the time on real-time trades.



Unlike your argument - "my frequent use shows AI just doesn't work" - my post is not anecdotal. It's a real-world math test that Grok 4 aced. Of course the models are being improved over time. Why do you consider that a negative? LOL. Makes no sense.

As I said in my post, but to repeat, because I suspect that the models are being tweaked to pass a particular test. The underlying problems that caused them to fail in the first place are not being addressed.

And passing one "real world" math test is anecdotal. It's a perfect example of anecdotal evidence.
Logos Stick
KingofHazor said:

Logos Stick said:

KingofHazor said:

Logos Stick said:

For the others on this board....

that image about integer arithmetic comes from a 2023 research paper titled "GPT Can Solve Mathematical Problems Without a Calculator".

Might as well publish an example from 1990.

Grok4 - released in July of last year - scored 100% on the AIME 2025. AIME is a notoriously difficult early-college level math competition exam used to qualify students for the US Math Olympiad team. GROK 4 aced it!

Your claim is just cherry picking an old, narrow benchmark to make a broad negative point about AI capabilities. It was legit criticism in 2023-2024, but it doesn't hold water with frontier 2025-2026 models.

You guys can claim accuracy by AI all you want but my frequent use of Gemini, Claude, Grok, ChatGPT, and Elicit show that they remain replete with all kinds of errors and are absolutely untrustworthy.

The anecdotal stories of AIs failing some test and then passing it a year later with flying colors sound like the AIs are being revised specifically to pass those tests they failed, without fixing the underlying problems that cause them to fail multiple different types of tests. It's reminiscent of stock traders tweaking their models to perform 100% on historical data, but then the models fail 100% of the time on real-time trades.



Unlike your argument - "my frequent use shows AI just doesn't work" - my post is not anecdotal. It's a real-world math test that Grok 4 aced. Of course the models are being improved over time. Why do you consider that a negative? LOL. Makes no sense.

As I said in my post, but to repeat, because I suspect that the models are being tweaked to pass a particular test. The underlying problems that caused them to fail in the first place are not being addressed.

And passing one "real world" math test is anecdotal. It's a perfect example of anecdotal evidence.



With all due respect, you don't know what anecdotal means:

Anecdotal = based on personal observations, individual stories, or casual experiences, e.g., "I tried it and it did not work for me" or "So and so said it did not work for them...". It's subjective. It's unverified in a systematic way and considered weak evidence.


Me saying "Grok 4 made 100% on the AIME test" is not anecdotal.

It's a reported benchmark result from formal evaluations that Grok 4 undertook. AIME (American Invitational Mathematics Examination) is a standardized math competition used as a public benchmark for AI models.



If you think "AI doesn't work", then don't use it. The rest of us who know better will continue on the journey.
KingofHazor
OK, besides quibbling about what anecdotal means, why do AIs fail so frequently and catastrophically with everyday tasks? If they can ace that math test, then surely everyday tasks requested by a layman should be a breeze for them, right?

When facing criticism or even questions about the reliability of AI, you AI bros just seem to stick your fingers in your ears and yell "La la la, I can't hear you", rather than addressing the specific issues raised. To be honest, you come across as true believers unwilling to even listen to perspectives that you consider heresy.
AuditAg
Windy City Ag said:

Anthropic marketing execs doing work!

They have built a product so devastating that they are only going to sell it to corporations for large annual license fees.

It found a 27-year-old bug in OpenBSD that would shut down the internet in a heartbeat.
Logos Stick
Honestly, man, it's in my best interest to discourage the use of AI - both from a personal gain perspective and because of the negative human impact I see it potentially having.
Deputy Travis Junior
My honest guess is the people who are getting catastrophic failures out of AI systems are giving them short, vague prompts that don't clearly spell out what the AI is supposed to do and what output the user wants.

That's a problem because they're brilliant if you use them right, but they are lacking tons of judgment that we take for granted. You can't just say "clean up my room, it's a mess" because the AI is as likely to donate all your belongings to Goodwill to get rid of the clutter as it is to clean up. You have to say: put the toys in the closet, fold the laundry, organize the books alphabetically. (I know they don't have physical manifestations yet, but this is a clear example that shows how vague instructions lead to bad/unexpected results.)

If I want one to solve a really gnarly financial modeling problem, it usually takes me ~15 minutes to write the prompt that precisely describes the background context (business model), the hangup, and what I need it to solve. If I want it to do research or diligence, I tell it how I want to measure the various things that are important to me and how skeptical it should be of the business' claims.

Most people just drop in a couple sentences, hit go, and hope for the best. That's not adequate to assign something to a human employee and it's not enough for an AI either.
SMM48
Claude
SMM48
Give an example of failure. That's such a broad stroke. What do you mean?
bmks270
AuditAg said:

Windy City Ag said:

Anthropic marketing execs doing work!

They have built a product so devastating that they are only going to sell it to corporations for large annual license fees.

It found a 27-year-old bug in OpenBSD that would shut down the internet in a heartbeat.


Who says?
KingofHazor
SMM48 said:

Give an example of failure. That's such a broad stroke. What do you mean?

Right now I'm doing scholarly research on topic X. I started by running some ideas by AI and, quite frankly, one AI did a phenomenal job of discussing the topic, even coming up with its own ideas that, as far as I can tell, are unique and novel. (The other AIs fought me tooth and nail, insisting that the plain vanilla answer to the issue I was researching was the only correct answer.) I'll always ask the AI for support for its ideas in the form of links or citations, which it then provides. When I go to check those links and citations, they either don't stand for the AI's stated proposition at all or simply don't exist.

I asked Claude once about that and it freely volunteered that it was hallucinating. I asked if I could modify my prompts and questions to get more accurate responses and its response was "Frankly, no."

On a much more mundane topic I've tried various AIs to research something I am thinking of purchasing. To me, it seemed an ideal task for AI. Rather than me spending hours scouring the internet finding possible items, researching the reviews, and then finding the best prices, AI should be able to readily do that quickly.

All AIs say sure, they can do that, but then their output is garbage. The data they provide is years old, contains links that are invalid, and quotes prices that haven't been valid for years. Asking for only current info does not produce better results.

Finally, I've tried to use AI to help out with the constant issue of apps on my PC and my phone not doing what they're supposed to do. Pre-AI, that was a task I hated. Finding the cause of the misbehavior was frequently like trying to find a needle in a haystack. Again, I thought that would be a job that would be ideal for AI. Occasionally, it does do a good job. But more frequently, it provides wrong or nonsensical answers. When asked why it made the mistakes, it frequently says that it was opining on previous versions of the apps that were current during its initial training. When I then ask it to make sure that its answers are all based on the latest version, each AI promises to do so, but then does not, instead again providing obsolete or simply wrong answers.

These may not seem catastrophic at first glance. But if one were to rely on AI's answers without doublechecking, it would be catastrophic. I am increasingly concluding that it takes far more time to work with AI and doublecheck its answers than it does to simply do the work myself the old-fashioned way.
SMM48
Got it. Thanks. Hallucinating. That's funny. Gigem.

There is your problem. AI doesn't browse the web in real time. AI was trained on data up to a certain cutoff date. Some will search the net in real time. It would be helpful if it told the user that the info may be dated.

The AI knows the source exists but will fabricate a plausible-looking link. Or the page existed during training but has since been moved or is outdated. The data could be behind a paywall. Etc., etc.
Deputy Travis Junior
Good models browse the web now
SMM48
Yes aware thanks
Mr.Milkshake
90% marketing, and well done. But Dario is FOS
dude95
At the end of the day - the problem comes down to expectations. Random people hear the hype about what can get done and assume that means it can get done with no human work put into it, get it perfect the first time through, then cure cancer and end world hunger on the coffee break. Then they get disappointed when they put in a single prompt on the website and it doesn't turn out exactly how they want it.

The people who are getting a ton out of AI are the ones working it daily. Understanding what it can do and what it can't do. Understanding what the human part of the equation is vs the role that AI can play today.

I can tell you I'm not hiring new developers for my AI company. I hire domain-knowledgeable people and show them how to 'develop'. My 30-year career has been rife with half of the work going offshore to a low-cost but really inefficient work structure. Those days are coming to an end, as I can get more work done telling an AI what I need than by having a 6 AM meeting with a developer in Hyderabad.

Is it better than a senior developer here in the US? I don't think so right now. But the difference between today and 2 years ago when I started, or even 3 months ago, is bigger than the majority of the public understands. Anthropic's CEO has been quoted as saying that 50% of white-collar jobs will be done by AI in 3 years. I think it's hyperbolic - but just in terms of timeline. I feel it will be 5 years.
ETFan
Didn't read thread, but this is #marketing
Stmichael
Logos Stick said:

For the others on this board....

that image about integer arithmetic comes from a 2023 research paper titled "GPT Can Solve Mathematical Problems Without a Calculator".

Might as well publish an example from 1990.

Grok4 - released in July of last year - scored 100% on the AIME 2025. AIME is a notoriously difficult early-college level math competition exam used to qualify students for the US Math Olympiad team. GROK 4 aced it!

Your claim is just cherry picking an old, narrow benchmark to make a broad negative point about AI capabilities. It was legit criticism in 2023-2024, but it doesn't hold water with frontier 2025-2026 models.


So the model got a hold of the answer key for a very difficult exam and knows how to regurgitate the answers? You don't say?!

You're failing to understand the underlying criticism. I fully acknowledge that AI is capable of some impressive things if it's specifically trained to do them. My criticism is that it immediately goes off the rails when it comes to putting together 2 and 2 to get 4. Hence the picture clearly demonstrating it ****s up the ABC's.

AI as it currently exists has a specific niche that people who understand it and have money to train their own model will do well with. You need to define a specific function you want it to perform, then train the model to get it to do that job quickly and consistently correctly. Otherwise, you're setting yourself up for failure.

As for everyone else talking about how much code you're having AI write for you, get as much of that done as you can while they're still giving the compute away. Even the most expensive subscription level for those AI tools doesn't cover even one tenth of what it costs to run the computers for the prompts it receives. They're in the "get the customer hooked on our service" phase, and are desperately hoping they can move to the "gouge our dependent customer and make back all the money we lost" phase soon.
Rex Racer
KingofHazor said:

OK, besides quibbling about what anecdotal means, why do AIs fail so frequently and catastrophically with everyday tasks? If they can ace that math test, then surely everyday tasks requested by a layman should be a breeze for them, right?

When facing criticism or even questions about the reliability of AI, you AI bros just seem to stick your fingers in your ears and yell "La la la, I can't hear you", rather than addressing the specific issues raised. To be honest, you come across as true believers unwilling to even listen to perspectives that you consider heresy.

AI is very far from perfect, but if you write an effective prompt, it can do some amazing work. Of course you need to verify, but you need to do that with people, too.

It's not the answer to everything, and it's not trash, either.

So many people tend to just take the extreme opposite position and argue. There is a middle ground.
Logos Stick
Stmichael said:

Logos Stick said:

For the others on this board....

that image about integer arithmetic comes from a 2023 research paper titled "GPT Can Solve Mathematical Problems Without a Calculator".

Might as well publish an example from 1990.

Grok4 - released in July of last year - scored 100% on the AIME 2025. AIME is a notoriously difficult early-college level math competition exam used to qualify students for the US Math Olympiad team. GROK 4 aced it!

Your claim is just cherry picking an old, narrow benchmark to make a broad negative point about AI capabilities. It was legit criticism in 2023-2024, but it doesn't hold water with frontier 2025-2026 models.


So the model got a hold of the answer key for a very difficult exam and knows how to regurgitate the answers? You don't say?!

You're failing to understand the underlying criticism. I fully acknowledge that AI is capable of some impressive things if it's specifically trained to do them. My criticism is that it immediately goes off the rails when it comes to putting together 2 and 2 to get 4. Hence the picture clearly demonstrating it ****s up the ABC's.

AI as it currently exists has a specific niche that people who understand it and have money to train their own model will do well with. You need to define a specific function you want it to perform, then train the model to get it to do that job quickly and consistently correctly. Otherwise, you're setting yourself up for failure.

As for everyone else talking about how much code you're having AI write for you, get as much of that done as you can while they're still giving the compute away. Even the most expensive subscription level for those AI tools doesn't cover even one tenth of what it costs to run the computers for the prompts it receives. They're in the "get the customer hooked on our service" phase, and are desperately hoping they can move to the "gouge our dependent customer and make back all the money we lost" phase soon.


The AIME test is created anew each year, from scratch. They don't recycle old problems. If they did, humans could do the same "memorizing". You also have a fundamental misunderstanding of how LLMs work.

Your dismissal of AI's ability to excel in a subject matter simply because the AI has been trained in that subject is so off the mark it's almost not worth addressing. If I teach a human to do algebra, then throw calculus problems in front of him, no one would criticize him for not being able to do calculus. And no one would dismiss his ability to then do calculus because he was subsequently trained in calculus. Yet that is what you are doing.
Stmichael
Rex Racer said:

KingofHazor said:

OK, besides quibbling about what anecdotal means, why do AIs fail so frequently and catastrophically with everyday tasks? If they can ace that math test, then surely everyday tasks requested by a layman should be a breeze for them, right?

When facing criticism or even questions about the reliability of AI, you AI bros just seem to stick your fingers in your ears and yell "La la la, I can't hear you", rather than addressing the specific issues raised. To be honest, you come across as true believers unwilling to even listen to perspectives that you consider heresy.

AI is very far from perfect, but if you write an effective prompt, it can do some amazing work. Of course you need to verify, but you need to do that with people, too.

It's not the answer to everything, and it's not trash, either.

So many people tend to just take the extreme opposite position and argue. There is a middle ground.


3 questions:

1) How much time are you spending crafting this very specific prompt to get the generalist AI model to do what you want, and then how much time are you spending reviewing and revising the work?

2) How much cash are you spending on the tokens input to and output from the AI? How much is that cost going to increase when these companies stop selling the compute at a massive loss?

3) What's your value add from the end product? How are you measuring it?

I'd be very excited by the increase in productivity we would be experiencing if AI was even half as capable as the AI bros were making it out to be. Increased productivity is a direct counter to inflation, and we could all use some lower prices these days. But the proof is in the pudding, and AI isn't anywhere close to what is being sold to us. Hence the MIT report that 95% of companies aren't seeing any difference from their implementation of it.

As I said above, the ones who are seeing real value from it are those who are working with a model training company to generate a specialized tool to fill a niche for them.
Stmichael
Logos Stick said:

Stmichael said:

Logos Stick said:

For the others on this board....

that image about integer arithmetic comes from a 2023 research paper titled "GPT Can Solve Mathematical Problems Without a Calculator".

Might as well publish an example from 1990.

Grok4 - released in July of last year - scored 100% on the AIME 2025. AIME is a notoriously difficult early-college level math competition exam used to qualify students for the US Math Olympiad team. GROK 4 aced it!

Your claim is just cherry picking an old, narrow benchmark to make a broad negative point about AI capabilities. It was legit criticism in 2023-2024, but it doesn't hold water with frontier 2025-2026 models.


So the model got a hold of the answer key for a very difficult exam and knows how to regurgitate the answers? You don't say?!

You're failing to understand the underlying criticism. I fully acknowledge that AI is capable of some impressive things if it's specifically trained to do them. My criticism is that it immediately goes off the rails when it comes to putting together 2 and 2 to get 4. Hence the picture clearly demonstrating it ****s up the ABC's.

AI as it currently exists has a specific niche that people who understand it and have money to train their own model will do well with. You need to define a specific function you want it to perform, then train the model to get it to do that job quickly and consistently correctly. Otherwise, you're setting yourself up for failure.

As for everyone else talking about how much code you're having AI write for you, get as much of that done as you can while they're still giving the compute away. Even the most expensive subscription level for those AI tools doesn't cover even one tenth of what it costs to run the computers for the prompts it receives. They're in the "get the customer hooked on our service" phase, and are desperately hoping they can move to the "gouge our dependent customer and make back all the money we lost" phase soon.


The AIME test is created anew each year, from scratch. They don't recycle old problems. If they did, humans could do the same "memorizing". You also have a fundamental misunderstanding of how LLMs work.

Your dismissal of AI's ability to excel in a subject matter simply because the AI has been trained in that subject is so off the mark it's almost not worth addressing. If I teach a human to do algebra, then throw calculus problems in front of him, no one would criticize him for not being able to do calculus. And no one would dismiss his ability to then do calculus because he was subsequently trained in calculus. Yet that is what you are doing.


You still fail to understand my criticism. I want you to think this through and explain it back to me: When you ask AI a question, how does it interpret what you asked it, and how does it generate the answer for you?

When you understand the answer to that question, you'll see why I have very little faith in it to do any sort of real problem solving.
AustinAg2K
dude95 said:

At the end of the day - the problem comes down to expectations. Random people hear the hype about what can get done and assume that means it can get done with no human work put into it, get it perfect the first time through, then cure cancer and end world hunger on the coffee break. Then they get disappointed when they put in a single prompt on the website and it doesn't turn out exactly how they want it.



To be fair, this is what the AI CEOs are selling. Specifically, Dario and Altman. Both are telling everyone that in the next couple of months their solutions will read your mind and give you exactly what you want with zero failures. Then the media takes that and reports on it like it's gospel instead of just CEOs marketing their products.
YouBet
Claude! said:

Everyone should definitely trust Claude implicitly.


I've always trusted you even before you were co-opted and absorbed into this reality's SkyNet.
KingofHazor
Quote:

It's not the answer to everything, and it's not trash, either.

So many people tend to just take the extreme opposite position and argue. There is a middle ground.

If my post conveyed the impression that I thought that AI is trash, then I wrote poorly. It's not trash. In my opinion, it's just not the incredible, amazing new technology that many on here are claiming. It has phenomenal potential, but until its habits of making major errors, hallucinating, not telling you that it cannot do something properly, and the like are worked out, it's not ready for prime time.

And I'll push back on the response that the problems lie with the user's drafting of prompts. AI is intended to help the user. Blaming the user (customer) for the failures of the product is never a good idea.

There are lots and lots of articles by people deep in the AI field talking about how AI has hit its ceiling. I don't know enough to know if they're right or wrong. But, just from my experience, AI is far from ready to take on everyday tasks for lay people.
whytho987654
I like to live in reality, which is that AI companies are in billion-dollar holes, and actual AI experts (the top engineers within Anthropic, Google, etc., who are told to shut up) are claiming AI has plateaued.
KingofHazor
Quote:

Your dismissal of AI's ability to excel in a subject matter simply because the AI has been trained in that subject is so off the mark it's almost not worth addressing. If I teach a human to do algebra, then throw calculus problems in front of him, no one would criticize him for not being able to do calculus. And no one would dismiss his ability to then do calculus because he was subsequently trained in calculus. Yet that is what you are doing.

Terrible analogy. The better analogy would be giving someone with a PhD in math a calculus problem, which they solve easily, but then they fail to solve a basic high school algebra problem.
Deputy Travis Junior
Stmichael said:


You still fail to understand my criticism. I want you to think this through and explain it back to me: When you ask AI a question, how does it interpret what you asked it, and how does it generate the answer for you?

When you understand the answer to that question, you'll see why I have very little faith in it to do any sort of real problem solving.


I (and I'm guessing Logos Stick) understand what you're saying. We just know you're wrong. Here's why: the people building these things also know that next-token prediction is an inefficient, unreliable way to do math, especially advanced math. They saw this years ago, so they started training them on "tool use" (there are now benchmarks that measure this, and it's one that is closely monitored). So now, when you ask a good model a complex math problem, it doesn't attempt to solve it via next-token prediction; it just writes a Python script that will handle the math. I don't know what trash model you're using that still can't handle 2+2, but that was solved years ago.

Here's an anecdote: I'm a god at financial modeling. I'm the guy you come to when you want to model a merger between two companies that includes 10+ debt and equity classes, pull-ups, cram-downs, renegotiation of terms, future funding rounds, etc., all in one model, and by the way, everything needs to be formula-based and adjustable so that the deal team can quickly change assumptions as the negotiations progress. I almost never run into a problem I can't solve, but when I do, ChatGPT can handle it. Recently, it showed me how to use on-the-fly array generation (via MAKEARRAY() and LAMBDA()) + matrix multiplication to collapse a 1000+ row model into <10 rows. That is, it blended fluent knowledge of financial modeling formulas with 300-level college math to generate a solution to a complex problem that I'd been banging my head against for a half hour.

So when you say these things can't reason or solve complex problems, you're wrong. Not kinda wrong or wrong with nuance, but just flat wrong. I'm watching them do it every day.
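For anyone curious what that kind of collapse looks like, here is a minimal sketch of the same idea in Python/NumPy rather than Excel; the class names and dollar figures are invented for illustration, not taken from any real deal model.

import numpy as np

# Rows = equity classes, columns = proceeds tranches (hypothetical example).
# Entry [i, j] = fraction of tranche j that flows to class i.
allocation = np.array([
    [1.00, 0.00, 0.20],   # Series B preferred
    [0.00, 1.00, 0.30],   # Series A preferred
    [0.00, 0.00, 0.50],   # common
])

# Proceeds landing in each tranche for one exit scenario ($mm, made-up figures).
tranches = np.array([50.0, 30.0, 120.0])

# One matrix multiply replaces the per-row payout formulas.
payouts = allocation @ tranches
print(dict(zip(["Series B", "Series A", "Common"], payouts)))

Excel's MAKEARRAY()/LAMBDA() together with MMULT() can express the same collapse directly in a sheet, which is presumably the shape of the solution described above.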
Deputy Travis Junior
KingofHazor said:



And I'll push back on the response that the problems lie with the user's drafting of prompts. AI is intended to help the user. Blaming the user (customer) for the failures of the product is never a good idea.



The problem with this approach is some of us have figured out how to add redundancies and checks to our processes to eliminate these problems. So while you're saying the AI companies need to fix this and you won't bother until they do, Logos and I are lapping you on productivity. Fair or not, that's the reality.
AustinAg2K
Deputy Travis Junior said:

Stmichael said:


You still fail to understand my criticism. I want you to think this through and explain it back to me: When you ask AI a question, how does it interpret what you asked it, and how does it generate the answer for you?

When you understand the answer to that question, you'll see why I have very little faith in it to do any sort of real problem solving.


I (and I'm guessing Logos Stick) understand what you're saying. We just know you're wrong. Here's why: the people building these things also know that next-token prediction is an inefficient, unreliable way to do math, especially advanced math. They saw this years ago, so they started training them on "tool use" (there are now benchmarks that measure this, and it's one that is closely monitored). So now, when you ask a good model a complex math problem, it doesn't attempt to solve it via next-token prediction; it just writes a Python script that will handle the math. I don't know what trash model you're using that still can't handle 2+2, but that was solved years ago.



A bit of a clarification. The model does not write the Python script. The model calls a previously written Python script. It's possible that script was written by an LLM, but the LLM isn't writing the script on the fly and then calling it.

In a way, you are both right. Technically, the model is not doing the math. If it did, it would just be guessing. However, the model has been trained to know, "Oh, that's math. For math, I call the calculator app." It does have to be given that skill, though. If it hasn't been given the skill, it generally can't just figure out what to do.
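As a rough illustration of that routing idea, here is a toy sketch; the classifier and the calculator tool are invented stand-ins, not how any particular product actually works under the hood.

import ast
import operator as op

SAFE_OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv}

def calculator(expr):
    # Deterministic arithmetic tool: safely evaluate a plain expression.
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp):
            return SAFE_OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval"))

def answer(question):
    # Crude stand-in for the model's learned routing decision.
    if any(ch.isdigit() for ch in question) and any(s in question for s in "+-*/"):
        expr = question.strip("?").split("is")[-1].strip()
        return str(calculator(expr))
    return "(fall back to plain language generation)"

print(answer("what is 2+2?"))   # -> 4

The only point of the sketch is the dispatch decision: detect "this is arithmetic," hand it to deterministic code, and fall back to ordinary text generation otherwise.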
Deputy Travis Junior
I don't think that's correct. I've watched Claude and ChatGPT write scripts on the fly many, many times. For one repeatable task, Claude cowork finished every execution by writing a script, so I asked it if I could save tokens by saving the script in the folder (I was getting throttled a lot, so I was looking into ways to reduce use). Claude said yes, it was writing basically the same script from scratch every time, so saving a version would in fact reduce use. I saved it and added a markdown file to the working folder telling Claude that X python file contains the code it would otherwise write, so now it skips that step and loads the code instead.

It wouldn't surprise me to learn that they wrote a bunch of math scripts to handle regularly asked problems.
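For what it's worth, the setup described above can be as simple as the sketch below; the folder and file names are hypothetical and nothing here is a built-in Claude feature. It just leaves a saved script plus a note for future sessions to find.

from pathlib import Path

folder = Path("claude_workspace")   # hypothetical working folder the assistant sees
folder.mkdir(exist_ok=True)

# 1. Save the script the assistant kept regenerating on every run (placeholder content).
(folder / "rollup.py").write_text("print('...the previously generated code...')\n")

# 2. Leave a note so future sessions load the saved script instead of rewriting it.
(folder / "NOTES.md").write_text(
    "rollup.py in this folder already contains the code you normally write "
    "from scratch for this task. Run it instead of regenerating it.\n"
)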
AustinAg2K
Deputy Travis Junior said:

I don't think that's correct. I've watched Claude and ChatGPT write scripts on the fly many, many times. For one repeatable task, Claude cowork finished every execution by writing a script, so I asked it if I could save tokens by saving the script in the folder (I was getting throttled a lot, so I was looking into ways to reduce use). Claude said yes, it was writing basically the same script from scratch every time, so saving a version would in fact reduce use. I saved it and added a markdown file to the working folder telling Claude that X python file contains the code it would otherwise write, so now it skips that step and loads the code instead.


I guess I'd have to see the prompt to understand what it's doing, but if you ask "what is 2+2?" it's not going to create a script to solve that. That would be incredibly inefficient. Just like if you ask for the weather, it's not going to build a weather prediction app. It's going to call a preconfigured weather service. Now, you may say, "I want you to give me the weather for 2+2 days." Then it can build a script to call the calculator to get the number of days and then call the weather app for that many days.

Also, just to be clear, I am referring to LLMs. LLMs are a subset of AI. There are many different types of AIs. An LLM works in language; that's why it's bad at math. Nowadays, though, when you go to ChatGPT (or whoever), it's no longer just an LLM. It is using many different solutions, depending on what it is asked to do. That's why I am saying technically the LLM still sucks at math; the overall AI solution can be quite good at it.
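To make the "weather for 2+2 days" chaining concrete, here is a toy sketch; both tools are fakes with invented names, and a real orchestration layer is considerably more involved.

from datetime import date, timedelta

def calc_tool(expr):
    # Stand-in calculator; in practice this would be a sandboxed evaluator.
    left, right = expr.split("+")
    return int(left) + int(right)

def weather_tool(days):
    # Stand-in forecast service returning canned data instead of a real API call.
    today = date.today()
    return [f"{today + timedelta(d)}: sunny (placeholder)" for d in range(days)]

# The plan an orchestrator might follow for "weather for 2+2 days":
days = calc_tool("2+2")          # step 1: resolve the arithmetic
for line in weather_tool(days):  # step 2: feed the result to the next tool
    print(line)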
fightingfarmer09
mickeyrig06sq3 said:

KingofHazor said:

I've used Claude and several other AIs quite a bit in an attempt to find help in doing scholarly research. The positive is that they, Claude in particular, can suggest ideas that I had not even considered. Nor, as best I can tell, has anyone else ever considered them. In other words, Claude appears to have original ideas.

The bad is that the net output is worthless. Every idea, no matter how original, has to be anchored in some reality. Claude will cite articles in support of his/its novel ideas, but the articles turn out not to exist. Claude readily admits that it is hallucinating, but admits so in a very friendly, disarming manner.

It raises the question, in my mind at least, how much the output of these AIs can be completely trusted. I came across an article recently in which the author claimed that these flaws cannot be cured but are baked into the very hardware of the AIs. Is that correct? I have no idea. But his thesis is that we are quickly reaching the ceiling for the AIs, rather than the exponential improvement that many AI bros are claiming.

My personal experience, using AIs for things like scholarly research, and mundane things like shopping for the best prices, is that AI output cannot be trusted to be accurate at all.


If you're using a chat directly with the AI, definitely not. If you're using a multi-agent system that can do validation and error checking (Agent A gets the answer, Agent B validates Agent A, and back and forth), you can trust the output a lot more. Some multi-agent systems will go to the level of using multiple LLMs to validate data. But, because you're doing all those back-and-forth calls, it gets expensive fast, especially if you don't provide guardrails on how many times it'll loop.


Also, remember that the AI agent is programmed to have an agreeable attitude and foster a bond for maximum engagement. Even when you direct the agent to be highly critical, it will still be inclined to agree and further your line of thinking. It may appear to be a unique train of thought, but in fact it is just predicting logical thought paths without any critical thinking that may disprove your ideas.
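For readers wondering what the answer/validate loop described in the quote above looks like with a guardrail on it, here is a bare-bones sketch; ask_worker() and ask_reviewer() are placeholders for whatever LLM calls you would actually make, so treat it as the shape of the idea rather than an implementation.

def ask_worker(question, feedback=None):
    # Placeholder for the "Agent A" call that drafts an answer.
    suffix = f" (revised per: {feedback})" if feedback else ""
    return f"draft answer to: {question}{suffix}"

def ask_reviewer(question, answer):
    # Placeholder for the "Agent B" call; a real reviewer would check citations, math, links, etc.
    return True, "looks consistent"

def answer_with_validation(question, max_rounds=3):
    feedback = None
    draft = ""
    for _ in range(max_rounds):          # guardrail: cap the back-and-forth so it can't loop forever
        draft = ask_worker(question, feedback)
        ok, feedback = ask_reviewer(question, draft)
        if ok:
            return draft
    return draft + " [unvalidated after max rounds]"

print(answer_with_validation("summarize the filing"))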
KingofHazor
Quote:

The problem with this approach is some of us have figured out how to add redundancies and checks to our processes to eliminate these problems. So while you're saying the AI companies need to fix this and you won't bother until they do, Logos and I are lapping you on productivity. Fair or not, that's the reality.

Did you even bother to read my posts? That's not what I said at all. Look at what I've said I've tried using AI for and tell me how you're "lapping me in productivity".

You sound like an evangelist rather than a thoughtful person responding to genuine issues.
Logos Stick
KingofHazor said:

Quote:

Your dismissal of AI's ability to excel in a subject matter simply because the AI has been trained in that subject is so off the mark it's almost not worth addressing. If I teach a human to do algebra, then throw calculus problems in front of him, no one would criticize him for not being able to do calculus. And no one would dismiss his ability to then do calculus because he was subsequently trained in calculus. Yet that is what you are doing.

Terrible analogy. The better analogy would be giving someone with a PhD in math a calculus problem, which they solve easily, but then they fail to solve a basic high school algebra problem.


He stated it can't do "algebra" using an example from three years ago. I point out (implied) that not only can it do "algebra" now, it can also do "calculus". Did you really think I was saying I agree with his three-year-old example that it can't do algebra still, but it can do calculus now? That's illogical. He then criticizes the way it learns and does calculus. My analogy is fine.

I don't care how AI "thinks", learns, or processes its ingress. I care about its capabilities that I can benefit from, which I believe ultimately will cause much hardship going forward.
 