4
Went from guessing AI training data size to actually tracking it properly
I used to just dump whatever CSV files I had into my model and hope for the best. Three months ago I started logging every single dataset I use, including row counts and source details. Now I keep a simple spreadsheet with columns for date, file name, row count, and where it came from. It caught a dataset I had accidentally uploaded twice, which was messing up my results. Anyone else track their training data this closely or do you just eyeball it?
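For anyone who would rather script the log than fill in a spreadsheet by hand, here is a rough Python sketch of the same idea. It writes the four columns from the post (date, file name, row count, source) to a CSV log. The `sha256` column, the function names, and the file paths are my own assumptions for illustration, not part of the original spreadsheet; the hash is just one way to flag a file whose contents were already logged under any name.

```python
import csv
import hashlib
from datetime import date
from pathlib import Path

# The first four columns mirror the spreadsheet; "sha256" is an added assumption.
LOG_COLUMNS = ["date", "file_name", "row_count", "source", "sha256"]


def file_sha256(path):
    """Hash the file's bytes so identical uploads can be recognised."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def count_rows(path):
    """Count data rows in a CSV file, assuming the first row is a header."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row if there is one
        return sum(1 for _ in reader)


def log_dataset(log_path, data_path, source):
    """Append one entry to the log and return (entry, earlier_duplicates).

    earlier_duplicates lists file names already in the log whose contents
    are byte-for-byte identical to data_path.
    """
    log_path, data_path = Path(log_path), Path(data_path)
    digest = file_sha256(data_path)

    # Read earlier entries (if any) and look for the same content hash.
    seen = []
    if log_path.exists():
        with open(log_path, newline="") as f:
            seen = list(csv.DictReader(f))
    duplicates = [row["file_name"] for row in seen if row["sha256"] == digest]

    entry = {
        "date": date.today().isoformat(),
        "file_name": data_path.name,
        "row_count": count_rows(data_path),
        "source": source,
        "sha256": digest,
    }

    # Write the header only when the log file is being created.
    write_header = not log_path.exists()
    with open(log_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=LOG_COLUMNS)
        if write_header:
            writer.writeheader()
        writer.writerow(entry)
    return entry, duplicates
```

Calling `log_dataset("training_log.csv", "some_export.csv", "vendor email")` (hypothetical paths) before each training run returns a non-empty duplicates list when the same file has been logged before. It only catches exact byte-for-byte copies, so a re-export with reordered rows or a changed header would slip through.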
2 comments
adam67518d ago
Started doing the exact same thing after I found 15k rows of test data mixed into my training set. Saved my ass when I realized one vendor kept sending me the same customer data every month.
2
angela_allen5318d ago
My coworker paid $200 for a "premium" data cleaning tool last month, and it still missed the exact same duplicate pattern your vendor had. Reminds me of how people buy expensive kitchen gadgets when a simple knife and cutting board handle 90% of the work. Real world patterns like that are everywhere if you just look for them.
4