Datastrophes: Il Buono, il Brutto, il Cattivo
In this post, I want to introduce you to something I have had to deal with whether as an engineer, a scientist, a leader, or a manager: datastrophes.
Nevertheless, because most events have an opportunistic component, I will cover not only the unpleasant side of datastrophes but also how you can benefit from them.
Here are the key takeaways in this article:
- The success of a data project depends on a series of assumptions being valid and robust.
- A datastrophe is an impactful event occurring in a (managed) data system.
- A datastrophe always has a good side!
Data projects: stacking up assumptions
There is a lot of literature about the phases of a data project. There is no global consensus yet, so I will use CRISP-DM (originally designed for data mining) as an illustrative example:
Note: In a (big) data science approach, we may swap the first two boxes, but as you can see, that wouldn’t change the process dramatically.
In each of these phases, assumptions are made by the different actors in order to create feasibility constraints. Here are some examples:
- Business understanding: students generally don’t subscribe easily to health insurance.
- Data understanding: the data received from the surveying company represents students adequately.
- Data preparation: the age of students is always greater than 18.
- Data modeling: the distribution of the students’ income is bimodal.
- Evaluation: testing the conversion rate is enough.
- Deployment: everything is gonna be fine (sorry for the troll).
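As a minimal sketch of what it looks like to make one of these assumptions explicit, the data preparation example above ("the age of students is always greater than 18") can be written as a check that runs against incoming data. The names `AssumptionResult` and `check_min_age` and the sample values are hypothetical, chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class AssumptionResult:
    name: str        # human-readable statement of the assumption
    holds: bool      # True when no record violates it
    violations: int  # number of records that violate it


def check_min_age(ages, min_age=18):
    """Check the assumption that every age is present and greater than min_age."""
    # A missing age (None) is counted as a violation as well.
    violations = sum(1 for age in ages if age is None or age <= min_age)
    return AssumptionResult(
        name=f"age > {min_age}",
        holds=violations == 0,
        violations=violations,
    )


# Illustrative values only, not real survey data.
sample_ages = [19, 22, 17, 25, None]
result = check_min_age(sample_ages)
print(result)
```

Once written down this way, the assumption stops being tacit knowledge: it can be logged, counted, and revisited when it no longer holds.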
Assumptions are generally drawn from experience acquired in the field (or gut feeling), or otherwise derived (manually or not) from data (e.g. dimension reduction, feature selection).
In a way, they are part of the knowledge existing or acquired by the actors along the process.
Of course, it is fine to make assumptions. However, they come with downsides: they might not be robust enough to hold when events like, well, COVID-19 occur.
Let’s take a couple of examples to describe further what assumptions can be made and their consequences.
The case of SQL
A simple case to consider is a SQL query: you assume that the data model it relies on will remain unchanged, and anyway, you are not responsible for it… so why bother?
Or maybe, in the event it would change, you assume “someone” will let you know.
Until the first ALTER is applied, your code blows up, and you’re on the hook for days figuring out what to do until production is back up.
What happens the next time? Well, as a matter of fact, you still don’t bother! Of course, it is still not your responsibility, yet you don’t feel comfortable, because you depend on something you don’t manage.
However, if you are accountable for the results you produce, there are a few best practices or habits you will develop, such as:
- Logging the schema: to find the cause rapidly, or to let a monitoring tool alert you.
- Testing the schema stability: to stop the application with a warning message, resulting in a comprehensible ticket.
- Reviewing the schema of the table periodically (or pinging your DBA): to put your mind at ease.
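The first two habits can be sketched in a few lines. This is only an illustration under stated assumptions: it uses an in-memory SQLite database and its `PRAGMA table_info` statement to read the schema, and the table `students`, the `EXPECTED_SCHEMA` mapping, and the helper names are hypothetical. Other databases expose their schema differently.

```python
import logging
import sqlite3

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("schema_check")

# Hypothetical expected schema: column name -> declared type.
EXPECTED_SCHEMA = {"id": "INTEGER", "name": "TEXT", "age": "INTEGER"}


class SchemaChangedError(RuntimeError):
    """Raised when the observed schema differs from the expected one."""


def read_schema(conn, table):
    """Return {column_name: declared_type} for a SQLite table."""
    # The table name is interpolated into the statement: pass trusted names only.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk).
    return {row[1]: row[2].upper() for row in rows}


def assert_schema(conn, table, expected):
    """Log the observed schema and stop with a readable message if it changed."""
    observed = read_schema(conn, table)
    log.info("schema of %s: %s", table, observed)
    if observed != expected:
        missing = sorted(set(expected) - set(observed))
        added = sorted(set(observed) - set(expected))
        retyped = sorted(
            col for col in set(expected) & set(observed)
            if expected[col] != observed[col]
        )
        raise SchemaChangedError(
            f"Schema of '{table}' changed: "
            f"missing={missing}, added={added}, retyped={retyped}"
        )
    return observed


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER, name TEXT, age INTEGER)")
assert_schema(conn, "students", EXPECTED_SCHEMA)  # schema matches, no error

# Someone else alters the table without telling you.
conn.execute("ALTER TABLE students ADD COLUMN email TEXT")
try:
    assert_schema(conn, "students", EXPECTED_SCHEMA)
except SchemaChangedError as err:
    print(err)  # the message names the columns that changed
```

Failing early with a message that names the changed columns is what turns days of investigation into a comprehensible ticket.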
Consequently, you start taking some responsibility for it, and therefore you dedicate time to it. Imagine the consequences when projects pile up and this dedication is noticed.
The case of “AI”
Considering now a more advanced case, most “AI”s assume that the world is close enough to the world described by the data used during their conception. From this, numerous opaque assumptions are derived and embedded in their prediction capabilities.
It is somewhat similar to the assumption made with the SQL query, with the significant difference that you have little (if any) idea of what those assumptions are!
I won’t dig into this much, as it is not the subject of this post; however, you’ll find at the end of this post a series of articles corroborating it.
Those assumptions are continuously used after the deployment of the project, which implies that “the health of your business depends on their robustness”.
Therefore, you can feel how uncomfortable it is for the project team to “let it go” live.
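To make the "close enough world" assumption a little more tangible, here is a deliberately crude sketch that compares live data with the data seen at conception. It is a heuristic for illustration, not a proper statistical test: the function names, the threshold of 2.0, and the income values are all hypothetical.

```python
from statistics import mean, stdev


def drift_score(reference, live):
    """Shift of the live mean from the reference mean, in reference standard deviations."""
    ref_sd = stdev(reference)
    if ref_sd == 0:
        raise ValueError("reference sample has zero variance")
    return abs(mean(live) - mean(reference)) / ref_sd


def has_drifted(reference, live, threshold=2.0):
    """Flag the feature when the shift exceeds an arbitrary, hypothetical threshold."""
    return drift_score(reference, live) > threshold


# Illustrative values only: incomes seen at training time vs. after deployment.
reference_income = [900, 1000, 1100, 950, 1050]
live_income = [1500, 1600, 1550, 1450, 1650]

print(has_drifted(reference_income, live_income))
```

A check like this only covers one visible assumption on one feature; the opaque assumptions embedded in the model itself remain out of reach, which is exactly the discomfort described above.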
Datastrophe, a definition
A datastrophe is a catastrophe with data.
A catastrophe is the denouement of a drama.
A datastrophe is the denouement of a DAMA (data management).
Managing data is a wide subject, often misleadingly reduced to data governance only. Data management is the sum of all the methods and processes needed to ensure that data is used appropriately, generating value and not risks. I wrote a short report on that subject, published by O’Reilly; check it here.
We can’t reduce management to the mere resolution of conflicts and issues, but datastrophes are literally the events supporting the need for better data management, because issues don’t show up in the data itself but in whatever uses it.
Above, I explained how data projects (projects using data) rely on superheroes, the “assumptions”. My point is their supervillains are the datastrophes.
Nevertheless, in the current unconventional era, it is no longer clear-cut that the superheroes are the good team (e.g. The Boys) and the supervillains are the bad one (e.g. Suicide Squad). That is the question…
Therefore, following the movie analogy, I tend to say, like most popular coins, that datastrophes have 3 sides: The Good (Il Buono), The Bad (Il Cattivo), and The Ugly (Il Brutto).
And yes the characters in the title are ordered differently per language. Heh, what did you assume? In fact, I always wondered how many non-Italian speakers assume that Brutto means Bad…
Actually, each datastrophe will have its own three-sided combo, which is why I’ll dedicate future posts to detailing them individually.
However, please keep in mind now and forever that when a datastrophe shows up, it is simultaneously:
- Bad for you: you’ll undoubtedly be pointed out as guilty or responsible even though you may not feel that way,
- Ugly for the group: data intelligence results are used to draw strategic or tactical moves resulting in disastrous events,
- Good for both: please focus on this one, because there is always a light you can seek:
  - You can both learn from it and anticipate similar events in the future,
  - You can discover the opportunities opened up by learning that those assumptions aren’t (always) right.
By the way, without commenting, I’ll let you look at the picture below… it may dismantle some of your, well, assumptions about the characters ;-).
Other names have been given to these events, depending on which side of the coin you look at, such as:
- Data issues: data monitoring.
- Data downtime: data observability.
- Dark data: coined by Gartner, meaning unused data.
- Dark data (my preferred): coined by David J. Hand, who lists 14 different DD-types:
  - DD-Type 1: Data We Know Are Missing.
  - DD-Type 2: Data We Don’t Know Are Missing.
  - DD-Type 3: Choosing Just Some Cases.
  - DD-Type 4: Self-Selection (a variant of DD-Type 3).
  - DD-Type 5: Missing What Matters.
  - DD-Type 6: Data Which Might Have Been.
  - DD-Type 7: Changes with Time.
  - DD-Type 8: Definitions of Data.
  - DD-Type 9: Summaries of Data.
  - DD-Type 10: Measurement Error and Uncertainty.
  - DD-Type 11: Feedback and Gaming.
  - DD-Type 12: Information Asymmetry.
  - DD-Type 13: Intentionally Darkened Data.
  - DD-Type 14: Fabricated and Synthetic Data.
Throughout this article, you have learned what datastrophes are, how they relate to assumptions (made by humans or not), and why managing them is crucial to ensure the viability of any data project.
In the following posts, I will introduce examples of datastrophes, with their definition, their context of occurrence, and how they can be handled with a dedicated management system (such as a Data Intelligence Management platform). So… stay tuned ;-).
Are you going to leave yourself without enhanced monitoring capabilities for your data applications, reports, and AI? If your answer is “no”, don’t hesitate to reach out to me or visit our website, Kensu.io.