Cleaning up 1000+ rows of bad real estate data

bernard

Suppose you had 1,000 rows of real estate data, each corresponding to a project with aggregate data (features, price ranges, square footage, etc.) that had errors in it: specifically, miscategorized bedroom counts, making aggregate data such as the price ranges for studios or 1-bedrooms faulty, and likewise the size ranges.

The data has been collected by scraping other real estate sites, who in turn have it from agents, who are the ones making the "mistakes", likely to get more views when people sort by lowest price and such.

Would you attempt to clean this up algorithmically using some kind of data engineering voodoo, or would you clean it up manually, perhaps building a backend for it with Retool, for example, and having some virtual assistants do it?
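
To make the algorithmic option concrete, I'm picturing something like this rough sketch (assuming the data sits in a pandas DataFrame; the column names here are made up, not my actual schema):

```python
import pandas as pd

# Hypothetical columns: project_id, unit_type ("studio", "1br", ...),
# min_price, min_sqft -- adjust to whatever the real schema is.
df = pd.read_csv("projects.csv")

flagged = []
for unit_type, group in df.groupby("unit_type"):
    # A studio priced far outside its peers is probably a mislabeled
    # bedroom count leaking into the aggregates.
    lo = group["min_price"].quantile(0.05)
    hi = group["min_price"].quantile(0.95)
    bad = group[(group["min_price"] < lo) | (group["min_price"] > hi)]
    flagged.append(bad.assign(reason=f"min_price outlier for {unit_type}"))

suspects = pd.concat(flagged)
suspects.to_csv("suspects.csv", index=False)
print(f"{len(suspects)} of {len(df)} rows flagged for review")
```

Something like that would catch the statistically obvious cases, but only those, which is the heart of my question.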

Since it's aggregate data and project features, it's unlikely to change much for a while, so I'm considering whether yearly manual updates would be worth it. No one else seems to be doing this, though, so I wonder if they just don't care. Either way, I need accurate data for my purposes.
 
For something like this, as much as I'm into "automating the planet", I'd consider a manual process, or at the very least something with a "human in the loop" for supervision. Unless you can define a list of everything that needs to happen to "fix" the data, plus every possible thing to look out for, and that list never changes, any attempt at porting it to a script isn't going to work well. This doesn't sound like something that could be put into a simple algo; you'd be constantly adding edge cases to cover where the algo doesn't know what to do. You could consider an AI agent that can help reason things out, but with something like 1,000 rows, it's most cost-effective in both time and money to let a good VA sort it out.
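
If any scripting does creep in, the human-in-the-loop version is "flag, don't fix": the script only surfaces suspects, and a person makes the actual correction. A minimal sketch of that round trip (the rule and the column names are assumptions, not your real schema):

```python
import pandas as pd

# Round trip: script flags suspects, a VA fills in corrections in a
# spreadsheet, then the script merges the answers back.
df = pd.read_csv("projects.csv")

# Hypothetical rule: a "1br" whose size floor is below the studio median
# probably has a miscategorized bedroom count.
studio_median_sqft = df.loc[df["unit_type"] == "studio", "min_sqft"].median()
suspects = df[(df["unit_type"] == "1br") & (df["min_sqft"] < studio_median_sqft)]

# Hand the VA a review sheet with an empty column to fill in.
suspects.assign(corrected_unit_type="").to_csv("review_queue.csv", index=False)

# Later: apply only the rows a human actually corrected.
reviewed = pd.read_csv("review_queue.csv").dropna(subset=["corrected_unit_type"])
df = df.merge(reviewed[["project_id", "corrected_unit_type"]],
              on="project_id", how="left")
df["unit_type"] = df["corrected_unit_type"].fillna(df["unit_type"])
df = df.drop(columns="corrected_unit_type")
df.to_csv("projects_clean.csv", index=False)
```

The point is that nothing gets written back to the dataset unless a human typed the correction.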
 
I would definitely attempt to automate it first (with a safe backup). Because doing it manually sounds like a total bitch.

I would use an agent for this by providing it the correct data and telling it to audit for mistakes.

If you have the data locally, it might be worth showing it to Claude Code or any agent in your IDE. Put it in a directory, open the agent in it, and tell it what you need done.

If the size of the data is a problem, you can always batch it out by splitting the 1000 rows into, say, 5 groups of 200 and doing them one at a time.
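
The batching itself is trivial to script if the data is in a file, something like this (assuming a CSV; the filenames are made up):

```python
import pandas as pd

df = pd.read_csv("projects.csv")

# Split 1000 rows into 5 chunks of 200 so each fits in a single agent session.
chunk_size = 200
for i in range(0, len(df), chunk_size):
    df.iloc[i:i + chunk_size].to_csv(f"batch_{i // chunk_size + 1}.csv", index=False)
```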

The hard part would be providing clear guidelines for which data entries count as incorrect; otherwise it is highly likely to make mistakes. And if it reaches the point where setting up a reliable automation system is too time-consuming, it may be faster to do it manually.
 