I've got a dataset of ~100K input-output pairs that I want to use for fine-tuning Llama. Unfortunately it's not the cleanest dataset, so I'm having to spend some time tidying it up. For example, I only want records in English, and I also only want to include records where the input has foul language (as that's what I need for my use-case). There are loads more checks like this that I want to run, and in general they can't be done deterministically because they require understanding natural language.

It's relatively straightforward to get GPT-4o to tell me (for a single record) whether or not it's in English, and whether or not it contains foul language. But if I want to run these checks over my entire dataset, I need to set up some async pipelines and it all becomes very tedious.
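
For concreteness, the single-record check looks roughly like this sketch (OpenAI Python SDK, JSON-mode output; the prompt wording and the `english`/`foul` field names are just what I picked):

```python
import json
from openai import OpenAI

client = OpenAI()

def check_record(text: str) -> dict:
    """Ask GPT-4o whether one record is English and contains foul language."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},  # force a valid JSON reply
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a data-cleaning classifier. Reply with JSON: "
                    '{"english": bool, "foul": bool}'
                ),
            },
            {"role": "user", "content": text},
        ],
    )
    return json.loads(resp.choices[0].message.content)

# check_record("some input text") -> {"english": True, "foul": False}
```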

Collectively this cleaning process is actually taking me ages. I'm wondering, what do y'all use for this? Are there solutions out there that could help me be faster? I expected there to be some nice product where I can upload my dataset and interact with it via prompts, e.g. 'remove all records without foul language in them', but I can't really find anything. Am I missing something super obvious?

PaulShin 1 hour ago

This is a fantastic question that gets to the heart of a huge pain point in applied AI. You're not missing something obvious; you've just discovered a gap in the market that we're obsessed with solving at my startup, Markhub.

Your problem of cleaning a dataset ("remove all records without foul language") is functionally identical to the problem our users face every day: cleaning up messy team conversations ("turn this chaotic chat into a clear list of tasks").

Our approach has been to build an AI agent, MAKi, that acts as an interface layer on top of unstructured data. Instead of writing complex scripts, our users simply talk to MAKi.

For example, they can highlight a long conversation and give it a prompt like, "Extract all action items from this, assign them to the relevant person, and set due dates for next Friday."

MAKi parses the request, understands the context, and generates structured To-Do items, effectively "cleaning" the conversational data into an actionable format. We call this a "Conversation-Driven Workflow."

While our use case is collaboration, the underlying technology to "interact with a dataset via prompts" is exactly what you're looking for. It seems like the next wave of AI tools won't just be models, but intuitive interfaces for manipulating data with natural language.

jonahbenton 1 day ago

+1. Have used AI to write code for various cleaning steps, but since iterating on data cleaning is usually a process of discovery (both about the data and about the requirements, especially once problems surface), I haven't found a conversational workflow tool that operates at the right level of abstraction to be useful. Curious if any folks have.

kevinherron 1 day ago

Submit for batch processing using the OpenAI batch API?
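
Roughly (a sketch assuming the `openai` Python SDK; the file name, ids, and prompt are illustrative):

```python
import json
from openai import OpenAI

client = OpenAI()
records = ["example input 1", "example input 2"]  # your ~100K inputs

# 1. Write one request per record into a JSONL file.
with open("checks.jsonl", "w") as f:
    for i, record in enumerate(records):
        f.write(json.dumps({
            "custom_id": f"record-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o",
                "response_format": {"type": "json_object"},
                "messages": [
                    {"role": "system",
                     "content": 'Reply with JSON: {"english": bool, "foul": bool}'},
                    {"role": "user", "content": record},
                ],
            },
        }) + "\n")

# 2. Upload the file and create the batch job.
batch_file = client.files.create(file=open("checks.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# 3. Poll client.batches.retrieve(batch.id) until status == "completed",
#    then download output_file_id and join results back on custom_id.
```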

sprobertson 1 day ago

This, combined with making the tool call / JSON response itself "batched", is a good pattern. Instead of returning a single `{english, foul}` object per record, pass in an array of records and have it return an array of `[{english, foul}]`. Adjust the inner batch size depending on your record size and spread the rest over the batched API.
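
A sketch of that inner batching (again with illustrative prompt and field names, operating on one chunk already sliced from the dataset):

```python
import json
from openai import OpenAI

client = OpenAI()

def check_chunk(records: list[str]) -> list[dict]:
    """Classify several records in one call; returns one result dict per record."""
    numbered = "\n".join(f"{i}: {r}" for i, r in enumerate(records))
    resp = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": (
                    "For each numbered record, decide if it is English and if "
                    "it contains foul language. Reply with JSON: "
                    '{"results": [{"i": int, "english": bool, "foul": bool}, ...]} '
                    "with one entry per record, in order."
                ),
            },
            {"role": "user", "content": numbered},
        ],
    )
    results = json.loads(resp.choices[0].message.content)["results"]
    assert len(results) == len(records)  # guard against dropped records
    return results

# Tune the chunk size (e.g. 20-50 records, depending on record length),
# then spread the chunks over the Batch API as suggested above.
```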

constantinum 1 day ago

There is https://www.visitran.com/ which is still in closed beta.