Data Idempotence - Ramshankar Yadhunath

Inspired of this [article](https://newsletter.casewhen.xyz/p/data-explained-idempotence), I ask myself some questions on data idempotence and also attempt to answer them based on my experience. ### What is Idempotence? Idempotence is a property of an operation to produce the same results no matter how many times an operation is run. Idempotence is a part of our day to day lives too. Think about the last time you checked how much money you had left in your account. If you had £100 in your account, you would have the same no matter how many ever times you check your account (provided you have had no transactions in between). On the other hand, if you had bought yourself a nice coat for £60, your bank balance would have fallen to 40£. You can expect a similar change if you add money to your account too. In this case, the operations of addition / subtraction are not idempotent. Here is a cheeky four minute video on idempotence in REST APIs. <iframe width="853" height="480" src="https://www.youtube.com/embed/6dVNdFwqeKs" title="Idempotency in Cows and REST APIs" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe> In the context of data transformation, idempotence means a pipeline that runs more than once will have the same effect as the pipeline running once (provided the input data has not changed in between). #### When do you know something is probably not idempotent? - You see duplicate rows when you run the same transformation code on top of the same input data - You see stale data i.e data that no longer exists or needs to exist ### How to check if a pipeline / piece of transformation code is idempotent? The simplest way to check for this is counting the number of rows of the output. Let's say you have a piece of transformation code written in a `.sql` script. The script reads some data from table A and outputs table B. If you want to look for idempotence, the easiest thing to do would be to run the transformation code once, count the number of records in B. Then, run the transformation again (assuming that A has not changed) and counting records in B again. Are the counts the same? Or are they different? If they are the same, then the transformation code is "likely" idempotent. *Likely? Should it not be definite?* . As a matter of fact, no. Think about how B is populated? Is B always fully-refreshed or is data loaded into B in incremental fashion? > 🤓 ***Explainer*** > In a full-refresh load, all the contents of the target table are replaced with new contents. In an incremental load, only those records that are not present already in the target table are added into it. If table B is incrementally loaded, there is bound to be some logic lying in somewhere that helps *uniquely identify records*. This is useful for *deduplication*. Sometimes, deduplication is performed using window functions such as `ROW_NUMBER()`. Using a window function without an `ORDER BY` could lead to non-idempotent results. The deduplication might ensure that 10 duplicate records in table A would become a single record in table B. But, not having an `ORDER BY` would mean that each time the pipeline is run a different record might show up (and this could be a problem if the order of records is important). Thus it becomes important to also check if the records that are being returned are the same as previous runs of the pipeline in consideration. ### How to ensure a data pipeline is idempotent? The first requirement is to have a unique key in place. This unique key can either be a - Primary Key - Composite Key - New field generated by hashing together all other fields Having this unique key will help individually check for / update / insert records. With a unique key in place, the focus is on how incremental logic is handled. As discussed earlier, if a table is loaded in full-refresh mode, idempotence is easily achieved (of course, important to check the transformation logic to be sure of this). But, if incrementally loading data, one might need to consider more questions - Does the upstream data have timestamps? You need timestamp information to select only new data. Without this, an incremental load could end up adding in a lot of duplicate data - If performing de-duplication, is consistent ordering of the records followed? If not, you will have to deal with inconsistent states of data - Is the upsert (update and insert) logic working as expected? - `UPDATE` a record with new information if it already exists, but some information has changed - `INSERT` a new record if it does not already exist Apart from the above precautionary steps, it might also be useful writing some data checks in place to warn of possible idempotency issues. Some examples of this are - How to check is duplicate records are added in? - SQL-based checks - SaaS services checking for data quality - How to check if accidental deletion has not occurred? - SQL-based queries to check the "before" and "after" states of records after a transformation pipeline has run - Create a logging and monitoring system to check the above. Alert if there is a change. - How to check if the final state of the output table is consistent to what you would expect based of the state of the input table? - Data validation checks via SQL / SaaS tools - Look for usual suspects - Row counts - Summary statistics - Referential integrity ### Questions to Ask Yourself Depending on the use case you are working on, its worth asking some of these questions and getting to clear answers before implementing a code-based solution. #### What happens when a pipeline runs twice? - Does a table get deleted entirely and re-created again? - Does the pipeline run with no new records added into the table if the source has not changed? - Does the pipeline run with no new records added into the table even when the source has changed? 💀 - Does the pipeline add in duplicate records into the table? 💀 ### References 1. [How to make data pipelines idempotent](https://www.startdataengineering.com/post/why-how-idempotent-data-pipeline/) 2. https://www.fivetran.com/blog/10-data-pipeline-challenges-your-engineers-will-have-to-solve 3. https://www.fivetran.com/blog/idempotence-failure-proofs-data-pipeline 4. https://notesbylex.com/idempotent-data-pipelines 5. https://www.cockroachlabs.com/blog/idempotency-and-ordering-in-event-driven-systems/