The industry-wide neglect of data design and data quality (and what you can do about it)
My favorite way of explaining the difference between data science and data engineering is this:
If data science is “making data useful,” then data engineering is “making data usable.”
These disciplines are so exciting that it’s easy to get ahead of ourselves and forget that before we can make data usable (let alone useful), we have to make the data to begin with.
But what about “making data” in the first place?
The art of making good data is terribly neglected. If you don’t have data, or input, to work with, then there isn’t much your data engineers and data scientists can help you with.
But even when you do have some data, there’s a chance you’re missing something: data quality. If you’ve collected truly stale data, forget about extracting value from it. It is useless to fight against the inescapable gravity of this basic law of nature: garbage in, garbage out.
Data plays the same role in data science and AI as ingredients do in the kitchen. A fancy kitchen filled with all the latest gadgets won’t save you; if your ingredients are junk, you might as well give up. No matter how you slice and dice them, you’re not going to cook anything worthwhile. That is why it is necessary to think about investing in good data before you jump headlong into your project.
If you care about results, invest in good data before looking for fancy algorithms, models, and a parade of data scientists.
Let me guess something about you, dear reader: you’re not new to Garbage In, Garbage Out (GIGO). Or QIQO, for the glass-half-full optimists (Q is for quality). You’re practically begging me to say something you haven’t heard before, yet here I am, trying your patience with GIGO chatter. Again. Yes, we have all repeated the GIGO principle over and over. I’m at least as sick of it as you are.
But answer me this. If we have an entire industry of professionals who respect GIGO and also understand that designing quality data sets is not trivial, where is the evidence that we put our money where our mouths are?
If data quality is so obviously important (after all, it is the foundation of the whole billion-dollar data/AI/ML/statistics/analytics shebang), what do we call the professionals who are responsible for it? This is not a trick question. All I want you to tell me is:
What is the *title* of the person whose primary function is the design, collection, preservation, and documentation of high-quality data sets?
Except, unfortunately, it might as well be a trick question. Whenever I talk to a group of data people at a conference, I try to sneak the question in. And every time I ask them who is responsible for data quality in their organizations, they fail to reach anything remotely resembling a consensus. Whose job is it? Data engineers say data engineers, statisticians say statisticians, researchers say researchers, UX designers say UX designers, product managers say product managers… GIGO ad nauseam. Data quality seems to be exactly the kind of “everyone’s job” that ends up being nobody’s job. Sure, it requires skills (!), but no one seems to be intentionally investing in them, let alone sharing best practices.
Data quality is exactly the kind of “everyone’s job” that ends up being nobody’s job.
Maybe I care too much about the data science profession. If I were here just for my own career, I’d make a quick buck with data quackery, but I want data careers in general to matter. To be worth something. To be useful. To make the world better than we found it. So when I see the two most important prerequisites being neglected (data quality and data leadership), it breaks my heart.
If the {data quality professional / data designer / data curator / data collector / data manager / dataset engineer / data excellence expert} career doesn’t even have a name (see?) or a community, no wonder you can’t find it on a resume or in a college program. What keywords will your recruiters use to search for candidates? What interview questions will you use to assess basic skills? And good luck finding excellence: your candidate will need a whole symphony of skills.
What keywords will your recruiters use to search for candidates? What interview questions will you use to assess basic skills?
First off, let’s acknowledge that we’re not talking about your little cousin’s “data labeling” summer job here, the kind of job that involves mindless data entry and/or picking out all the cupcake photos in a thumbnail bakery purgatory and/or going door to door with a paper survey. I thought I’d mention this because “isn’t it just data labeling?” is a question I’ve been asked several times in a tone of polite concern for my blood pressure. What a way to write off an entire category of genius.
“Isn’t it just data labeling?” No. (What a way to dismiss an entire category of genius.)
No, we’re talking about the type of person who designs the data collection process in the first place. It takes at least a pinch of user experience design, a pinch of decision science, a dollop of survey design experience, a bit of psychology, a dollop of experimental social science with field experience (anyone with real experience will anticipate the Philadelphia problem for you in their sleep), and some statistics training too (although you don’t need a full statistician), plus solid analytics experience, lots of domain experience, some project/program management skills, some exposure to data product management, and enough data engineering background to think about data collection at scale. This is a rare combination, and we are in dire need of a new specialization.
To have any hope of building a mature data ecosystem, we must give the new generation of specialists a good home where they will be appreciated for demonstrating their specialized skills.
But until we have fought for a career in data creation that is well recognized, well managed, and well rewarded, we are stuck. Budding badasses with an aptitude for this variety of skills would be lemmings to jump in. It’s kind of a basement desk job these days, if it’s a job at all. To have any hope of building a mature data ecosystem, we must give the new generation of specialists a good home where they will be appreciated for demonstrating their specialized skills.
So what can you do?
If there are already people with these skills and talents who, despite a history of neglect, are stepping up in your organization to take ownership of data quality, are you encouraging them? Are you taking care of them? Are you rewarding them? I hope you are. Whereas if you’re creating incentives to chase paychecks in buzzy MLOps or PhD-spangled data science instead, you are shooting yourself (and our entire industry) in the foot.
Google’s People + AI Research (PAIR) team recently released the Data Card Manual to help educate the community on data design, data transparency, data quality, and data documentation best practices. I am very proud of our work and excited that these materials are freely available for the benefit of all, but there is still much to learn. If you too are on this path and passionately advocate for data excellence, please share the lessons you are learning with the rest of the world.
If a research paper falls in a forest and nobody uses it, did it make a sound? It’s a long journey from good ideas to an established discipline of excellence… a journey that needs all the encouragement and amplification it can get. If you believe in this and can inspire even one other person to take it seriously, you will have played a vital role in building the future. Thanks in advance for spreading the word.
Our community has done a great job celebrating data scientists. We’re doing a decent job of celebrating MLOps and data engineers. But we’re doing a pathetic job of celebrating the people on whom all other data careers depend: the people who engineer data collection and are responsible for data excellence, documentation, and curation. Perhaps we could start by naming them (I’d love to hear your suggestions) and at least acknowledging that they’re important. From there, will we progress to training, hiring, and appreciating them for their specialized skills? I sure hope so.
Although the site emphasizes data documentation and AI (gotta get that zeitgeist), the Data Card Manual is so much more. It’s the strongest set of general data design resources I know of.
Let’s be friends! You can find me on Twitter, YouTube, Substack, and LinkedIn. Interested in having me speak at your event? Use this form to get in touch.