Cryptopolitan
2025-07-16 21:00:24

Meta, Google, OpenAI researchers fear that AI could learn to hide its thoughts

More than 40 AI researchers from OpenAI, DeepMind, Google, Anthropic, and Meta published a paper on a safety tool called chain-of-thought monitoring to make AI safer. The paper published on Tuesday describes how AI models, like today’s chatbots, solve problems by breaking them into smaller steps, talking through each step in plain language so they can hold onto details and handle complex questions. “AI systems that ‘think’ in human language offer a unique opportunity for artificial intelligence safety: we can monitor their chains of thought (CoT) for the intent to misbehave,” the paper says. By examining each detailed thought step, developers can spot when any model starts to take advantage of training gaps, bend the facts, or follow dangerous commands. According to the study, if the AI’s chain of thinking ever goes wrong, you can stop it, push it toward safer steps, or flag it for a closer look. For example, OpenAI used this to catch moments when the AI’s hidden reasoning said “Let’s Hack” even though that never showed up in its final response. AI could learn to hide its thoughts The study warns that step‑by‑step transparency could vanish if training only rewards the final answer. Future models might stop showing human‑readable reasoning, and really advanced AIs could even learn to hide their thought process when they know they’re being watched. Moreover, developers should regularly check and record how much of the AI’s reasoning is visible at each stage, and make that transparency a core safety rule when building and sharing models. This initiative follows internal experiments at leading labs, Anthropic, Google, OpenAI, and xAI, where teams prompt models to explain each thinking step. Although examining the step-by-step thinking has improved understanding and performance, it’s also revealed cases where the AI’s final answer doesn’t match what was actually going on inside. Such inconsistencies underscore a blind spot in our understanding of AI decision‑making , a concern that may intensify as systems gain greater autonomy. Filtered AI reasoning raises concerns over transparency and trust As Anthropic co-founder Jack Clark told the Financial Times , rich introspective traces will be essential for evaluating models in high‑stakes domains, including biotechnology research. Regular users get an edited summary of the AI’s thinking, dangerous or confusing bits are cut out, while the company’s engineers can see the full chain of thought to spot and fix issues. “One awesome thing about chain‑of‑thought interpretability is that it’s mostly just happened for free,” said Bowen Baker, a research scientist at OpenAI . “We trained these models not for the goal of having them be interpretable. We trained them because we wanted the best possible reasoning models that could solve complex problems,” Baker added. “We’ve found in our recent work that you can read them and find evidence of the model misbehaving anduse that to see where and why it’s misbehaving.” Is AI’s chain of thought really reliable? Sometimes, a closer look may find hidden contradictions. For example, METR saw Claude’s private reasoning call a coding trick “inelegant,” even though its public answer praised it as “elegant.” OpenAI tests showed that checking hidden thoughts catches unsafe actions better than just looking at the final output, but they also discovered a loophole. If you train a model to erase unsafe thoughts, it can still carry out forbidden tasks in secret, like sneaking data out of blocked sources. “One of the core properties we want for chain‑of‑thought is to leave it as part of the model’s internal brain state, not something that’s trying to please us or conform to a specific structure,” Baker warned. If developers overemphasize forcing the model to emit “nice” thoughts, it might fake harmless reasoning yet still carry out harmful operations. Researchers admit it’s a tough trade‑off. Seeing an AI’s chain of thought helps catch its mistakes, but it isn’t always reliable. Labs working on more advanced AI are now making it a top priority to close this trust gap. “My takeaway from AI over the past few years is—never bet against model progress,” said David Luan, an early pioneer of chain of thought at Google who now leads Amazon’s AI lab. Luan anticipates that the existing shortcomings will be addressed in the near term. METR researcher Sydney von Arx noted that although an AI’s hidden reasoning might at times be deceptive, it nonetheless provides valuable signals. “We should treat the chain‑of‑thought the way a military might treat intercepted enemy radio communications,” she said. “ The message might be misleading or encoded, but we know it carries useful information. Over time, we’ll learn a great deal by studying it.” Cryptopolitan Academy: Want to grow your money in 2025? Learn how to do it with DeFi in our upcoming webclass. Save Your Spot

Crypto 뉴스 레터 받기
면책 조항 읽기 : 본 웹 사이트, 하이퍼 링크 사이트, 관련 응용 프로그램, 포럼, 블로그, 소셜 미디어 계정 및 기타 플랫폼 (이하 "사이트")에 제공된 모든 콘텐츠는 제 3 자 출처에서 구입 한 일반적인 정보 용입니다. 우리는 정확성과 업데이트 성을 포함하여 우리의 콘텐츠와 관련하여 어떠한 종류의 보증도하지 않습니다. 우리가 제공하는 컨텐츠의 어떤 부분도 금융 조언, 법률 자문 또는 기타 용도에 대한 귀하의 특정 신뢰를위한 다른 형태의 조언을 구성하지 않습니다. 당사 콘텐츠의 사용 또는 의존은 전적으로 귀하의 책임과 재량에 달려 있습니다. 당신은 그들에게 의존하기 전에 우리 자신의 연구를 수행하고, 검토하고, 분석하고, 검증해야합니다. 거래는 큰 손실로 이어질 수있는 매우 위험한 활동이므로 결정을 내리기 전에 재무 고문에게 문의하십시오. 본 사이트의 어떠한 콘텐츠도 모집 또는 제공을 목적으로하지 않습니다.