3
My AI model kept giving nonsense answers. Turned out the training data was full of duplicate entries that were messing it up.
6 comments
joelp81 · 3d ago
I saw a Google DeepMind paper that found over 10% of some big training sets were duplicates. It totally messes up how the model learns patterns because it sees the same weird thing over and over. Honestly it makes sense why the answers get so bad, the model is basically trained on junk it thinks is important. Tbh most people don't realize how much garbage data is floating around in these collections. Cleaning that up is probably the most boring but critical step before you even start training.
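For exact duplicates, that cleanup step can be as simple as a one-pass hash filter. Here's a minimal sketch (the corpus and the whitespace/case normalization are my own invention; real pipelines, like the one in that paper, also do fuzzy near-duplicate matching with things like MinHash):

```python
import hashlib

def dedup(records):
    """Drop exact duplicates, keeping the first occurrence of each record."""
    seen = set()
    unique = []
    for text in records:
        # Normalize whitespace and case so trivial variants hash the same.
        key = hashlib.sha256(" ".join(text.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique

corpus = ["The cat sat.", "the cat  sat.", "Dogs bark.", "The cat sat."]
print(dedup(corpus))  # → ['The cat sat.', 'Dogs bark.']
```

Linear time, constant memory per unique record. The boring part is deciding what "the same" means, which is why near-duplicate detection is its own research problem.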
6
jamiekim · 3d ago
Yeah it's like when fake news spreads because people see it everywhere.
6
the_victor · 3d ago
That "most boring but critical step" you mentioned is the whole game. We get distracted by the fancy model architecture, but the raw data is what actually builds its worldview. If that foundation is packed with repeats and junk, the model's common sense is built on a lie it keeps telling itself. It doesn't just learn wrong facts, it learns that wrong patterns are normal because it saw them a thousand times. So the output gets weird and confident about its own garbage.
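The "saw it a thousand times" effect is just frequency estimation. A toy example (the corpus and counts are made up) shows how one duplicated junk record dominates the distribution the model learns from:

```python
from collections import Counter

clean = ["sunny", "rainy", "cloudy", "snowy"]
# One junk record duplicated 16 times swamps the real data.
corpus = clean + ["glitch"] * 16

counts = Counter(corpus)
total = sum(counts.values())
probs = {word: counts[word] / total for word in counts}
print(probs["glitch"])  # → 0.8
```

Four legitimate records versus one glitch, yet the model's "worldview" says the glitch is 80% of reality. That's your confident garbage.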
2
the_spencer · 3d ago
Doesn't duplicate data screw up models? I've seen them get totally confused, but I always assumed it was something else.
5
james_barnes · 3d ago
Totally reminds me how people start believing anything they see a few times, which @jamiekim was getting at. The model basically gets tricked the same way we do.
4
hugo_scott · 3d ago
Last week, I read about a study where they showed people the same fake headline five times. By the third time, over half started to believe it was true. It's scary how fast our brains latch onto repeated info, even when we know it's wrong. Does that happen to you with stuff you see online?
4