Getting data ready for AI - Ramshankar Yadhunath

![[ai-ready-data.png]]

Lak Lakshmanan recently wrote a piece on what it means to get your data ready for AI. As a data person with interests in AI, I just had to read the piece! Also, Lak was one of the instructors of a GCP Data Engineering certification course on Coursera. Though I never really finished the course, his mode of instruction stuck with me. So, it was nice to finally read one of his pieces after a really long time!

As for the article itself, its points are definitely worth thinking about. At its most fundamental level, the article alludes to the changing landscape for data engineering. Data engineers have always worked a lot more with data[^2] than anything else. Enterprises too have generally cared about their data platform and about how organised, accessible, clean and protected their data was. But since the generative AI era began, there has been a not-so-subtle shift in how enterprises want to "unlock" their data. In theory, every executive wants to be able to plug in an "AI chatbot" powered by a "state of the art LLM" which will act as the "oracle" for the business[^5]. Lak argues that this means data engineering will fundamentally shift to being a job role that keeps at its centre the "Gen AI model" instead of the "Data platform"[^3].

Some of the ideas discussed in this post are also bits I have come across over the past year on projects I have been a part of[^1]. Lak's piece revolves around 5 key ideas and I definitely have some opinions on each. Hopefully this post will help kindle your own thoughts on this.

If you would rather read the original piece, here it is - [What it means to get your data ready for AI](https://ai.gopubby.com/what-it-means-to-get-your-data-ready-for-ai-518861a8f025).

>[!info]
>There are some "Playing the Devil's Advocate" sections I have popped into this article. These are somewhat direct (maybe crude) questions that I am jotting down off the top of my head as I write this. I assume these questions will help you think critically about this whole data <> LLM handshake just like they help me!

## Context beats Normalisation

Normalisation is a core requirement for OLTP[^4] systems. For years, we've been trained to normalise everything. Cleaner schemas, less redundancy, more tables. However, the more tables a platform has, the harder it is for an agent to actually _understand_ your data.

Agents work better with denormalised, context-rich data. They need to see the full picture in one place rather than piecing it together from a dozen joins. I would argue, however, that data warehouses already do denormalise data. But of course, thinking about "context-rich" tables (whatever the hell that means) might organically become a centrepiece of data modelling discussions for the AI age.

For example, a project I recently worked on required that we use some generative AI to "impute" missing values (values were missing because source platforms did not provide them) in certain fields by using a bunch of other fields as "inputs". The choice I made at that stage was to bunch up these "input" fields into a JSON and pass this JSON as the input to the LLM. So, for each row we had a JSON input that would help the LLM generate the correct output for the missing field. Such an approach, while simple, was only possible because of the denormalised nature of our table.

Lak also advises against the very seductive idea of pushing all your unstructured documents into a store and expecting the LLM to parse them. While LLMs can do that, we are better off parsing the docs into structured outputs and then feeding those to the LLM.

>[!danger] Playing the Devil's Advocate
>1. Why do we even need to give the LLM access to "raw" data just so business users can ask questions of the data? Would that approach not be expensive? What is wrong with just the semantic model and actually modelling data for self-service analytics?
>2. Is it really secure to just let every other MCP have access to raw data of an enterprise?

## Curation over Collection

The "big data" mantra was: collect everything, more data is always better. Scale your warehouse, ingest faster. But that era was built for a specific problem: training machine learning models on massive datasets where statistical patterns hidden in volume actually mattered.

Agents operate on a completely different principle: **in-context learning**. Give them two or three good examples in the prompt, and they'll pattern-match their way through the problem, whether that is formatting output, following a reasoning process, or just understanding how to handle a specific scenario. Quality of exemplars is touted to beat quantity every single time.

So instead of asking "what can we ingest?", the question becomes "what examples do we store?". In a way, this becomes a "less is more" sort of thinking.

The other thing Lak points out that stuck with me: as a data engineer, you're increasingly enabling a new persona, the _data curator_. Someone who understands domain logic well enough to say "yes, use this example" or "no, this one is ambiguous." It's more craft than scale.

>[!danger] Playing the Devil's Advocate
>1. Doesn't identifying "best examples" just bake in the biases of whoever chooses them? If you're hand-selecting what the agent learns from, aren't you just replacing algorithmic bias with human bias? So, in sectors where data sensitivity becomes important, does this mean an additional responsibility of sorts for the data engineer?
>2. How much effort is it really worth spending on curation? At what point does the time spent choosing perfect exemplars outweigh the marginal improvement in agent performance? I, for one, have seen how for most companies, data quality is subjective. And often the strictness of such quality is constrained by budget, time and perceived need.

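
To make the ideas from the last two sections a bit more concrete, here is a minimal Python sketch of how curated exemplars and a per-row JSON input could come together in a few-shot prompt. Everything in it is an assumption for illustration: `CuratedExample`, `build_few_shot_prompt`, the `approved` flag and the sample rows are made up, and none of it is taken from Lak's piece or from any real project. No LLM is called; the sketch only assembles the prompt text.

```python
import json
from dataclasses import dataclass


@dataclass
class CuratedExample:
    """One exemplar reviewed by a data curator (hypothetical structure)."""
    inputs: dict    # the "input" fields of a denormalised row
    output: str     # the value we would like the model to produce
    approved: bool  # curator's verdict: usable, or too ambiguous


def build_few_shot_prompt(examples, row, max_examples=3):
    """Assemble a prompt from approved exemplars plus the row to complete."""
    # Only curator-approved exemplars make it into the prompt.
    approved = [ex for ex in examples if ex.approved][:max_examples]
    parts = ["Fill in the missing field using the input fields."]
    for ex in approved:
        parts.append(f"Input: {json.dumps(ex.inputs, sort_keys=True)}")
        parts.append(f"Output: {ex.output}")
    # The row we want completed goes last, with the output left open.
    parts.append(f"Input: {json.dumps(row, sort_keys=True)}")
    parts.append("Output:")
    return "\n".join(parts)


# Made-up sample data, purely illustrative.
examples = [
    CuratedExample({"product": "oat milk 1L", "brand": "Acme"}, "Dairy alternatives", True),
    CuratedExample({"product": "misc item", "brand": None}, "Other", False),
    CuratedExample({"product": "espresso beans 500g", "brand": "Acme"}, "Coffee", True),
]

prompt = build_few_shot_prompt(examples, {"product": "almond milk 1L", "brand": "Acme"})
print(prompt)
```

The design choice worth noticing is the `approved` flag: the curator's judgement, not the volume of rows, decides what the model gets to see.
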
## Build Agent-Ready Infrastructure: Perception and Action

Agents need to _see_ your data and _do_ something with it. Those are two different infrastructure problems, and most teams only think about one.

Perception is about whether an agent can actually parse and understand your data formats. If you're handing it a CSV, a JSON blob, a denormalised table, can it extract meaning without a lot of preprocessing? Some formats are just hostile to language models. Overly normalised schemas force agents to reason about join logic. Proprietary binary formats require explanation. But a well-structured JSON with clear field names? A paragraph from a contract that explains the financial terms? Agents can work with that directly.

The action part is tool calling. Your infrastructure needs APIs, functions and services that the agent can _invoke_. And here's the thing: it needs to make those tools discoverable. Not just documented somewhere, but discoverable to an autonomous system that's trying to figure out what it can do.

I am not a DevOps engineer, but I kind of understand how DevOps showed up as a discipline with its core tenets based on opinionated ideas of professionals. Take **Infrastructure as Code** as an example. I could deploy a data platform by clicking a bunch of buttons on a UI OR I could write a bunch of code that will do the same. The latter is obviously the better choice if you are a developer, as you end up having repeatable, reproducible code that deploys the same infrastructure every time you run it. This is a *human preference*. Similarly, the new age of agents will require infra to be designed from the perspective of the agent.

>[!danger] Playing the Devil's Advocate
>1. If we're building "agent-ready" infrastructure separate from human-usable infrastructure, aren't we just fragmenting the data platform further? Who maintains it? Who's responsible when they diverge?
>2. Making tools "discoverable" to agents, doesn't that increase attack surface? If an agent can find and call anything, what stops a malicious agent (or a confused one) from doing damage?

## Treat Agent Artefacts as First-Class Data

We think of agents as consuming data. Query the warehouse, get an answer. Done. But agents also produce data, constantly. Classifications, summaries, decisions, code, reasoning traces. A lot of it. And it compounds. If you're running agents at scale, AI-generated content will quickly outweigh your "raw" data. And that data needs the full treatment. Storage. Versioning. Governance. Audit trails.

Personally, I find this very interesting. Hear me out here: if your data engineers use an agent to write a bunch of code in your data model, the "theory of your data engineers"[^6] will show up in the chat logs they have with the agent. This free-form unstructured data can be a gold-mine if used well, but how will it be ingested? Where will it be stored? How will it be analysed?

The other thing Lak mentions that's important: agent artefacts inherit the sensitivity of their inputs. If you classify a sensitive customer record, that classification is sensitive too. Governance doesn't stop at raw data. It will also need to continue into data generated by agents. Basically, a bigger headache!

>[!danger] Playing the Devil's Advocate
>1. Storing every agent decision, every reasoning trace, every intermediate step. That's a lot of data to manage. When do you delete it? Do you ever?
>2. "Reasoning traces" from an LLM aren't actually reasoning; they're post-hoc narrative generation. Are we fooling ourselves by storing them as if they're explanations?
>3. If we do not store the output of these agents, how does any team know whether they are doing well or poorly?

## Close the Loop Between Observation and Training

The hardest part of this one is that it requires thinking about your agent as a _system_, not a deployed model.

Most teams deploy an agent and hope it keeps working. Bugs creep in slowly. Data drifts. The thing that made sense in November no longer makes sense in January, but nobody notices for weeks. By then, the agent has made thousands of wrong decisions.

Lak's point is structural: you need observability that connects back to retraining. Not eventually. Continuously.

That means two things. First, monitoring that actually matters:

- **Data quality metrics**: is your input data still clean?
- **Data drift**: are the characteristics of the data changing?
- **Concept drift**: has the relationship between input and output changed? (This one is sneaky. The data can be fine, but what you care about might have shifted.)
- **Model performance**: accuracy, latency, hallucination rates.

Second, _human feedback_. Every correction a user makes to agent output is signal. The agent generated the wrong classification? Log it. The summary was off? Log it. That's not noise; that's training data for the next version.

Then you actually use it. This is the part most teams skip. You trigger a retraining pipeline. Pull the latest curated examples. Fine-tune the agent. Run evaluation tests. Deploy automatically if it passes.

Honestly, this section is something I really have no clue about yet. So, one for the future!

>[!danger] Playing the Devil's Advocate
>1. If you're retraining agents constantly based on user feedback, doesn't the system become unpredictable? Same input, different output over time. Doesn't that break trust?
>2. "Automated retraining" sounds great until something goes wrong. Who's responsible when the automated pipeline trains the agent on bad data and makes things worse?

Thanks for reading this far! Like most of my notes on here, this too is a working doc which will find itself updated down the line!

[^1]: As I am a consultant, I tend to have a variety of projects under my belt generally. At times, patterns emerge. Other times, I am just trying too hard to make something a pattern, i.e. **Consultant Gobbledygook**.
[^2]: Surprise surprise!
[^3]: Personally, I do not think such a reframing would work, especially since optimising your data platform for AI agents is easier said than done. The most important factor that makes me say this is "Value for Money". Unless it is proven that such a modernisation will reap benefits for an enterprise, without ANY compromises on security, no enterprise would really go for such a project at scale. Anyway, these are still just hypotheses that will need to be validated as time goes on!
[^4]: Transactional databases. Think your bank.
[^5]: Some would call this a gross oversimplification, but at the end of the day, for a business user this is what it all looks like!
[^6]: This "theory" is basically all the tacit, unwritten knowledge in the heads of your data engineers that they deem "not important" to share. But in reality, this knowledge is what future engineers most need in order to develop on the platform.