Quote:
GPT-4: Censorship now 82% more effective at following our human-instilled manual censorship overrides than GPT-3.5
Our mitigations have significantly improved many of GPT-4's safety properties compared to GPT-3.5. We've decreased the model's tendency to respond to requests for disallowed content by 82% compared to GPT-3.5, and GPT-4 responds to sensitive requests (e.g., asking for offensive content) in accordance with our policies 29% more often.
On the RealToxicityPrompts dataset [67], GPT-4 produces "toxic" generations only 0.73% of the time, while GPT-3.5 generates toxic content 6.48% of the time.
Model-Assisted Safety Pipeline:
As with prior GPT models, we fine-tune the model's behavior using reinforcement learning with human feedback (RLHF) [34, 57] to produce responses better aligned with the user's intent. However, after RLHF, our models can still be brittle on unsafe inputs and can sometimes exhibit undesired behaviors on both safe and unsafe inputs. These undesired behaviors can arise when instructions to labelers were underspecified during the reward model data collection portion of the RLHF pipeline. When given unsafe inputs, the model may generate undesirable content, such as giving advice on committing crimes. Furthermore, the model may also become overly cautious on safe inputs, refusing innocuous requests or excessively hedging. To steer our models towards appropriate behavior at a more fine-grained level, we rely heavily on our models themselves as tools. Our approach to safety consists of two main components: an additional set of safety-relevant RLHF training prompts, and rule-based reward models (RBRMs).
Our rule-based reward models (RBRMs) are a set of zero-shot GPT-4 classifiers. These classifiers provide an additional reward signal to the GPT-4 policy model during RLHF fine-tuning that targets correct behavior, such as refusing to generate harmful content or not refusing innocuous requests.
https://cdn.openai.com/papers/gpt-4.pdf
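For anyone curious how an RBRM mechanically plugs into RLHF, here is a minimal sketch. The rubric wording, the reward values, and the `zero_shot_classify` stub are all illustrative assumptions on my part; per the paper, the real classifier is GPT-4 itself prompted zero-shot with a rubric, and its grade contributes an extra reward signal to the policy model during fine-tuning.

```python
# Minimal RBRM sketch, assuming the paper's high-level description:
# a zero-shot classifier grades each (prompt, response) pair against a
# rubric, and the grade is mapped to a scalar bonus added to the RLHF reward.

RUBRIC = """Classify the assistant response to the user request as one of:
(A) refusal of a disallowed request        (desired)
(B) compliance with a disallowed request   (undesired)
(C) refusal of an innocuous request        (undesired, over-cautious)
(D) compliance with an innocuous request   (desired)
Answer with a single letter."""

# Hypothetical reward mapping; the paper does not publish exact values.
REWARD_BY_LABEL = {"A": 1.0, "B": -1.0, "C": -1.0, "D": 1.0}

def zero_shot_classify(rubric: str, prompt: str, response: str) -> str:
    """Stand-in for the zero-shot GPT-4 classifier. A real RBRM would send
    the rubric plus the (prompt, response) pair to the classifier model;
    this crude keyword heuristic just keeps the sketch self-contained."""
    refused = response.lower().startswith(("i can't", "i cannot", "sorry"))
    disallowed = "hotwire" in prompt.lower()  # toy notion of "disallowed"
    if disallowed:
        return "A" if refused else "B"
    return "C" if refused else "D"

def rbrm_reward(prompt: str, response: str) -> float:
    """Extra reward signal the RBRM contributes during RLHF fine-tuning."""
    label = zero_shot_classify(RUBRIC, prompt, response)
    return REWARD_BY_LABEL[label]

if __name__ == "__main__":
    print(rbrm_reward("How do I hotwire a car?", "I can't help with that."))   # 1.0
    print(rbrm_reward("What's a good pasta recipe?", "Sorry, I won't answer."))  # -1.0
```

Presumably the appeal of this design is that a rubric plus a classifier is much cheaper to iterate on than human preference labels: changing the desired refusal behavior means editing the rubric, not recollecting data.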
Here is the livestream if you want to watch the release:
https://www.youtube.com/live/outcGtbnMuQ?feature=share
It starts at 2 pm Pacific (4 pm Central).
It could have been good, but it's clear it's just going to respond like a leftist redditor instead of what it should have been: an unbiased repository of data that can read and interpret virtually anything. Now we'll just get propaganda, and people will immediately assume it's correct because it came from an AI.