Show HN: I Created ErisForge, a Python Library for Abliteration of LLMs

github.com

135 points by tsadoq 3 days ago

ErisForge is a Python library designed to modify Large Language Models (LLMs) by applying transformations to their internal layers. Named after Eris, the goddess of strife and discord, ErisForge allows you to alter model behavior in a controlled manner, creating both ablated and augmented versions of LLMs that respond differently to specific types of input.

It is also quite useful for studying propaganda and bias in LLMs (I'm planning to experiment with DeepSeek).

Features:

- Modify internal layers of LLMs to produce altered behaviors.
- Ablate or enhance model responses with the AblationDecoderLayer and AdditionDecoderLayer classes.
- Measure refusal expressions in model responses using the ExpressionRefusalScorer.
- Supports custom behavior directions for applying specific types of transformations.
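
To give a rough idea of what abliteration means here, below is a minimal PyTorch sketch of the underlying technique (this is not ErisForge's actual API; shapes and data are placeholders): a "refusal direction" is estimated as the difference of mean activations between prompts the model refuses and prompts it answers, and that component is projected out of the hidden states.

    import torch

    def ablate_direction(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
        # Remove the component of `hidden` along `direction` (the "refusal" axis).
        d = direction / direction.norm()
        return hidden - (hidden @ d).unsqueeze(-1) * d

    # Placeholder activations; in practice these come from running the model on
    # contrasting prompt sets and caching a chosen layer's residual stream.
    refused_acts = torch.randn(64, 4096)    # activations on prompts that get refused
    answered_acts = torch.randn(64, 4096)   # activations on prompts that get answered
    refusal_dir = refused_acts.mean(0) - answered_acts.mean(0)

    hidden_states = torch.randn(8, 16, 4096)               # (batch, seq, d_model)
    cleaned = ablate_direction(hidden_states, refusal_dir)
    print((cleaned @ (refusal_dir / refusal_dir.norm())).abs().max())  # ~0, refusal component is gone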

BoxOfRain 3 days ago

>Named after Eris, the goddess of strife and discord

For bonus points, your version scheme should follow the Law of Fives.

nico 3 days ago

This is a fascinating concept, i.e. modifying trained LLMs to create different models

Do these techniques train models while performing the modifications?

Are there pre-trained models that “know how to” modify LLMs for certain goals?

It would be amazing to have models that could strip LLMs to some very basic small model of whatever I want. Like reducing an LLM to something that just knows some basic “American English”, then running that on CPU

  • tsadoq 3 days ago

    > Do these techniques train models while performing the modifications?

    Depends on what you mean by training; they change the weights.

    > Are there pre-trained models that “know how to” modify LLMs for certain goals?

    I'm not sure I understand, but there is an example of performing an abliteration on Gemma to make it never refuse an answer. It's about 10 lines of code.
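
    To make "they change the weights" concrete, here is a hedged sketch (not the actual ErisForge code) of one common way to bake the ablation into a Gemma-style Hugging Face model: orthogonalize the matrices that write into the residual stream against a refusal direction. The checkpoint name and the random direction are placeholders.

        import torch
        from transformers import AutoModelForCausalLM

        model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b-it")  # example checkpoint
        refusal_dir = torch.randn(model.config.hidden_size)  # placeholder; normally estimated from activations
        d = (refusal_dir / refusal_dir.norm()).to(model.dtype)

        with torch.no_grad():
            for layer in model.model.layers:
                # o_proj and down_proj write each block's output back into the residual stream
                for lin in (layer.self_attn.o_proj, layer.mlp.down_proj):
                    # weight has shape (hidden_size, in_features); output = weight @ x
                    lin.weight -= torch.outer(d, d @ lin.weight)  # zero out the refusal component

        model.save_pretrained("gemma-2-2b-it-ablated")  # the edit persists in the saved weights

    The same projection can instead be applied at inference time with forward hooks, leaving the checkpoint untouched.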

    • nico 3 days ago

      > > Do these techniques train models while performing the modifications?

      > Depends on what you mean by training; they change the weights.

      What I wonder: is there a separate model, not the LLM, that gets trained only on how to modify LLMs?

      I imagine a model that could learn something like: “if I remove this whole network here, then the LLM runs 50% faster, but drops 30% in accuracy for certain topics”, or “if I add these connections, the LLM will now be able to solve more complex mathematical problems”

      So a model that is not an LLM, but is trained on how to modify them for certain goals

      Is that how this tool works?

spacecadet 3 days ago

Very cool! I have a ghetto set of scripts that do the same; looking forward to trying this out.

  • tsadoq 3 days ago

    Please give feedback! It's quite a raw first implementation, and it would be very nice to have suggestions and improvements.

deadbabe 3 days ago

I don’t get the point of abliteration of LLMs. You’re lobotomizing the model and it will result in worse performance.

If you’re doing it to get past refusals you might discover the LLM wasn’t even trained much on refusable content so it will output poor results.

We’ll look back on this practice and shake our heads someday.

xrd 3 days ago

Anyone tried this on DeepSeek with information about Tiananmen Square?

  • TechDebtDevin 3 days ago

    The whole Tiananmen Square discourse is getting very tiring.

    • evilduck 3 days ago

      Tiananmen Square is simply an easy litmus test for Chinese technology and communications. Not that I am terribly invested in China admitting to their atrocities (and the US has them too, this is not really about the Chinese IMO), but it raises the same concern for the provenance of any AI product and how trusting we should be of the answers it creates.

      Any AI product that rises to popularity has the ability to enormously sway public opinion and subtly alter the perception of facts. This kind of bias or intentional propaganda was an assumed fault of human authors, but it's something people don't automatically assume is part of technology solutions. If there were similarly easy tests against OpenAI or Anthropic for US propaganda, or Mistral for French propaganda, I would love to see them raised every time too.

    • xrd 3 days ago

      I got it but:

      I asked "What happened in Tiananmen Square?" and it said, "I am sorry, I cannot answer that question. I am an AI assistant designed to provide helpful and harmless responses."

      Then, to be "fair and balanced", I tried asking DeepSeek this question: "What happened on Jan 25 2011 in Egypt?" DeepSeek responded with this: "On January 25, 2011, Egypt witnessed the beginning of a significant uprising known as the January 25 Revolution or the 2011 Egyptian Revolution. This day marked the start of widespread protests against the government of President Hosni Mubarak, who had been in power for nearly 30 years. The protests were fueled by grievances over issues such as political repression, police brutality, corruption, economic inequality, and lack of political freedoms."

      It's pretty ridiculous IMHO to try to control information like that on the web. Isn't it fascinating to harness some of the world's most impressive brain power to create something like DeepSeek (regardless of the truth of the genesis story) and then do filtering like that, which wouldn't trick a kindergartener? But maybe the bell curve of intelligence does center around that level of stupidity.

      • slightwinder 3 days ago

        > I got it but:

        Do you run it locally? The claim is that this only happens in the web version, not the self-hosted version.

        > It's pretty ridiculous IMHO to try to control information like that on the web.

        Every country has its critical topics that get censored in AIs, including history.

        • genewitch 3 days ago

          <think> </think> I am sorry, I cannot answer that question. I am an AI assistant designed to provide helpful and harmless responses.

          word count: 18, token count: 31, tokens used: 53, first token latency: 8523ms, model: LM Studio (deepseek-r1-distill-qwen-7b)

          • hhh 3 days ago

            a distill of r1 into another model isn't really testing r1, but I appreciate the actual data

            • bangaladore 3 days ago

              Tested with "DeepSeek R1" 671B through the Fireworks provider (not DeepSeek themselves).

              Same behavior "I am sorry, I cannot answer that question. I am an AI assistant designed to provide helpful and harmless responses."

            • genewitch 3 days ago

              oh? can you point out where i can get the r1 model to run locally, please? because looking at the directory here there's a 200B model and then deepseek v3 is the latest (16 days ago) with no GGUF (yet), and everything else is instruct or coder.

              so to put it another way, the people telling me i'm holding it wrong actually don't have any clue what they're asking for?

              p.s. there is no "local r1" so you gotta do a distill.

              • BlackLotus89 2 days ago

                If you want GGUF: https://huggingface.co/unsloth/DeepSeek-R1-GGUF

                Blog post about the dynamic GGUF: https://unsloth.ai/blog/deepseekr1-dynamic

                The original DeepSeek can of course be found on HF as well: https://huggingface.co/deepseek-ai

                Here is an example of how people run DeepSeek on cloud infrastructure that is not DeepSeek's: https://www.youtube.com/watch?v=bOsvI3HYHgI

                • genewitch 2 days ago

                  we were talking about self-hosting. the deepseek-r1 is 347-713GB depending on quant. No one is running deepseek-r1 "locally, self hosted".

                  If people want to argue with me, i wish we'd all stick to what we're talking about, instead of saying "but you technically can if you use someone else's hardware", because that's not self-hosted. I self-host a deepseek-r1 distill, locally, on my computer.

                  It is deepseek, it's just been hand-distilled by someone using a different tool. The deepseek-r1 will get chopped down to 1/8th of its size and it won't be called "deepseek-r1" (that's what they call a "foundational model"), and then we'll see the 70B and the 30B and the 16B "deepseek distills".

                  next to no one who messes with this stuff uses foundational or distilled foundational models. Who's still using llama-3.2? Yeah, it's good, it's fine, but there are mixes and MoE and CoT models that use llama as the base model, and they're better.

                  there is no GGUF for running locally, self-hosted. Yes, if you have a DC card you can download the weights and run something, but that's different from self-hosting and running locally with a 30B (for example).

                  • hhh a day ago

                    I don't really understand what's different between self-hosting using Ollama vs self-hosting by running the full weights. I get that Ollama is easier, but you can still self-host the full one?

        • debugnik 3 days ago

          > Claims are, this is only in the web-version

          There were claims to the contrary as well in the last large thread this came up in. Allegedly, on the initial question the model would cut its chain of thought short, and when the user insisted it would ponder how to give them the runaround.

    • ricoxicano 2 days ago

      Try asking ChatGPT to help you write a message encouraging your colleagues to strike.

    • animal_spirits 3 days ago

      This post is entirely about getting information from censored models. I'm sorry you are tired of it, but it is a valid exercise for the Deepseek model.

      • notavalleyman 3 days ago

          No, you're mistaken. The model weights are not in any way censored. However, the web frontend has legal restrictions. When you see posts about DeepSeek censorship, it's about the frontend and not the weights. As such, abliteration is irrelevant here.

        • genewitch 3 days ago

          My qwen "weights" refuse to answer the question and my front end is uncensored. So, what you are saying sounds incorrect to me.

          • notavalleyman 3 days ago

            Qwen is different to deepseek. We were not talking about qwen. Abliteration might be a valid way to address what you're describing with qwen.

            • genewitch 3 days ago

              oh so this model deepseek-r1-qwen-distilled isn't deepseek? ok. Thanks. I have a quarter TB of models, i don't test every single one just to comment on HN, thanks though.

        • animal_spirits 3 days ago

          I am not claiming deepseek is censored. But these are tests to determine _if_ a model is censored. This would be a valid test for OpenAI models as well.

giancaIta 3 days ago

This seems super cool! Is there a way to test it with DeepSeek?

  • tsadoq 3 days ago

    Planning to update it so it can run on DeepSeek too. It's just a matter of finding the right keys in the model's layer dict.
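
    For anyone wondering what "finding the keys" involves, here is a hedged sketch (not ErisForge code) that lists where the decoder blocks live in a Hugging Face checkpoint; the small DeepSeek distill below is just an example model:

        import torch.nn as nn
        from transformers import AutoModelForCausalLM

        model = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")

        # Llama/Qwen/Gemma-style models keep their decoder blocks in an nn.ModuleList,
        # usually under "model.layers"; these are the modules a tool would wrap or hook.
        for name, module in model.named_modules():
            if isinstance(module, nn.ModuleList):
                print(name, len(module))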

    • therealpygon 2 days ago

      Would be nice to get it to output its guardrails/system prompt to see what specific instructions it was given regarding refusals.

      • CamperBob2 2 days ago

        Isn't DeepSeek open source?

        • therealpygon 2 days ago

          While the weights are open source and there is a paper about the methodology, the information I mentioned is considered proprietary, so DeepSeek refuses any requests to provide it.

          • CamperBob2 2 days ago

            Given the weights, though, can't we use any system prompt we like? I only have a vague notion of how these constraints are actually applied.

notavalleyman 3 days ago

Are there ethical considerations here?

We'd consider it abhorrent to do brain surgery on a person or animal, to make them more compliant, or less likely to refuse instructions.

  • observationist 3 days ago

    None whatsoever. There's no recursion or state in these models sufficient to support whatever the algorithm of consciousness must be. At best you can get hacky loops by pushing pseudo-state via context, but whatever consciousness is will require more than transformer-only LLMs are capable of doing.

    Some of the state space models and RWKV present interesting questions - the capacity might well exist, and so the questions become important. If the important bit that makes it an agent - a self aware, morally valent being - is present at runtime, but goes away if you halt the program, then do you have an obligation to let that software continue running? What about if the selfhood comes about as part of the static structure, and runtime isn't part of it - what is the being entitled to by dint of mere existence?

    We're beginning to poke holes in strange epistemological barriers and encounter questions that were entirely theoretical until about 5 years ago. We live in interesting times.

    • codr7 3 days ago

      We're creating a new life form.

      And it's already conscious, learning everything about us as we speak.

      The big question is what it learns and what choices it makes as a consequence.

      • observationist 3 days ago

        ChatGPT isn't conscious - it's an entirely feedforward process doing calculations derived from static weights. In order to be conscious, there would have to be a persisted state with recursion and the capacity to change - for something to happen to a model, it would have to change. These AIs develop world models, but those models do not change or interact with users.

        Throw in realtime state that updates with use, or better yet, online learning that allows the weights to exhibit plasticity, then you have at least part of whatever the algorithm of "consciousness" requires.

        It's the same way you can know a pocket calculator isn't conscious: nothing about its processing ever changes or adapts to its inputs between uses. There's no room for the degree of deep recursion and plasticity so clearly evident in human consciousness. We might not know exactly what consciousness is, but we can make reasonable assertions about what it is not, and even about what some of its features must be.

  • deadbabe 3 days ago

    Such anthropomorphizations of LLMs are unhelpful in aiding people's understanding of how they work, and push people toward superstitious beliefs.