Using data science for better manufacturing

This primer can help professionals make their numbers talk to improve decision-making

By Michel Baudin

Even though manufacturing depends on data to control quantities and quality, enterprises in this sector have never taken the lead in using their data effectively to support managerial or technical decisions. Today, manufacturers still face the challenge of finding simple answers to simple questions about what customers are buying, how those products vary over time in aggregate volume and mix, how well products meet the needs they are designed to fill, where problems are and whether the solutions have worked.

Undoubtedly, direct observation of the shop floor and personal communications with stakeholders provide information that data cannot, but the converse is true as well. What you see with your own eyes and what people tell you are not the complete story. They must be supplemented by data analysis, but the current prevalent practices don’t measure up. Pie charts, stacked bar charts, time series plots, safety crosses and other mainstays of performance boards are not sufficient.

From statistics to data science

For decades, statisticians in manufacturing companies have tried to introduce more sophisticated methods – without much success. In manufacturing, the title of “statistician” does not inspire confidence, to the point that Six Sigma, a 1980s attempt to modernize statistical process control, called its practitioners “black belts” rather than “staff statisticians.” Today, statistics itself, as a stand-alone discipline, is an endangered species that is subsumed under data science, even in academia.

Yale University established its Department of Statistics in 1963 and renamed it the Department of Statistics and Data Science in 2017. Columbia University created its Data Science Institute in 2012.

In the United States, 23 universities offer master’s degrees in data science, including, besides Yale and Columbia, Harvard, Stanford, the University of California-Berkeley, Cornell, Carnegie-Mellon and Georgia Tech. Some of these programs are run by current or former statistics departments, as at Yale, Cornell and Stanford. Others, like Carnegie-Mellon’s, are in the school of computer science. Others still, like those at Rutgers or Michigan State, are in the business school. This tells us that data science is taken seriously as a topic by the leaders of academia and that there are divergent perspectives on what it is and where it belongs.

So what is the difference between statistics and data science, and why should you care? If you study statistics, you learn mathematical methods to draw conclusions from data and that’s it. Data science has a broader scope: It starts with data acquisition and ends with finished data products that can be reports, dashboards or online visualization and query systems.

Once data is collected through internet of things (IoT) devices, cameras, transaction processing systems or manual processes, the information needs to be organized, cleaned, stored and retrieved using the technology of databases, data warehouses, data lakes and other variants. This is taught in computer science, not statistics.
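A minimal sketch of this storage-and-retrieval step, using Python’s built-in sqlite3 module and a hypothetical table of production units (the table and column names are invented for illustration):

```python
import sqlite3

# Hypothetical table of finished units; names are illustrative only,
# not taken from any particular plant system.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE units (
    product_id TEXT, serial_number TEXT,
    completion_date TEXT, color TEXT)""")
conn.executemany(
    "INSERT INTO units VALUES (?, ?, ?, ?)",
    [("P100", "S001", "2018-03-01", "blue"),
     ("P100", "S002", "2018-03-02", "green"),
     ("P200", "S003", "2018-03-02", "blue")])

# Select and filter: all blue units of product P100
rows = conn.execute(
    "SELECT serial_number FROM units "
    "WHERE product_id = 'P100' AND color = 'blue'").fetchall()
print(rows)  # [('S001',)]
```

Joining, summarizing and transforming such tables are variations on the same query mechanics.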

Visualization at all stages of analysis is also central to data science but gets little respect in statistics, which regards itself as a branch of math. While renowned statisticians like John Tukey have worked on it, the art of producing charts and maps is not at the core of the statistics curriculum and is dismissively called “descriptive statistics” because it does not usually involve deep math. However, in data science, visualization is the key to identifying patterns in data, validating models and communicating results. It is an integral part of the process – in other words, without visualization, data science is of little practical use.

Among the many terms currently applied to the art of analyzing data, data science is most descriptive of the field as a whole. You also hear of data mining, machine learning, deep learning or big data, all of which are often conflated but describe subsets or applications of data science. Strictly speaking:

  • Data mining is the analysis of data collected for a different purpose, as opposed to design of experiments (DOE), where the data is collected specifically for the purpose of supporting or refuting a hypothesis.
  • Machine learning is what is done by algorithms that become better at a task as new data accrues. For example, a neural network may be designed to recognize a handwritten “8” and to improve its performance with experience.
  • Deep learning doesn’t mean what it says – acquiring deep knowledge about a topic. It designates multiple layers of neural networks where each layer uses the output of the layer below.
  • Big data refers to the multiterabyte data sets generated daily in e-commerce, from click-throughs to buying and selling transactions. Manufacturing data sets don’t qualify as big data. True big data is so large that it requires special tools, like Apache’s Hadoop and Google’s MapReduce, and I have never heard either mentioned in a manufacturing setting.

Data science is a broader umbrella term that is, if anything, too broad. Taken literally, it could encompass all of information technology. As used in most publications, data science does not cover data acquisition technology but kicks in once data has been collected, and it produces human-readable output to support decisions by humans. Data science does not include the use of data to control a physical process, as in 3-D printing, self-driving cars or CNC (computer numerical control) machines. The exceptions include Li-Ping Chu’s book Data Science for Modern Manufacturing, which is all about how manufacturing should be and perhaps will be. Until then, it is the way it is, and data science, as understood here, is helping to make it better.

Data wrangling/munging

The analytical tools used in data science receive the most media attention but are not where data scientists spend most of their time. Instead, while estimates vary and are not precisely measured, the bulk of their effort goes into preparing data.

Ideally, this shouldn’t be happening; in reality, it does. The company’s systems should be able to produce tables with column headers like “Product ID,” “Serial number,” “Completion date,” “Color,” etc., followed by rows of values that an analyst can select, filter, join with other tables, summarize and transform to find answers.

The integrated system that would provide this is still in the future, and may stay there, for nontechnical reasons. To date, it has not been possible for any software supplier to develop a single system with modules for all manufacturing activities from engineering change control to maintenance and quality that could outperform specialized systems for each function.

There is no technical obstacle, but the human dynamics of the software industry have kept such systems out of existence. The dominant providers of enterprise resource planning (ERP) products all started by being successful at one function – like multicurrency accounting or human resource management – and expanded from there into domains in which they had neither expertise nor the ability to recruit the best experts, and their specialized modules are generally not competitive with stand-alone systems developed by domain experts.

Short of having a single, all-in-one system, you might configure different systems to play together well. This would require them to have the same names for the same objects in all systems, consistent groupings and consistent relationships for products, processes, operations, equipment, critical dimensions and people. The systems could then collaborate and feed usable extracts to analysts. The development of such a common information model, however, is not usually high on a manufacturing manager’s to-do list.

The prevailing reality is a multitude of legacy systems used by different departments, supplemented by individual spreadsheets. The same product goes by different names in engineering, production, marketing and accounting, and the products are grouped by technical similarity in engineering, volume class in production, market segment in marketing and business unit in accounting. Not only is the same object known by multiple names, but supposedly unique names are used for several different objects.

The names are “smart” numbers, a legacy of the paper-and-pencil age, where, for example, you know that the product is blue because the fifth character of the name is “1,” and green would be “2.” In addition, the most valuable information, like a technician’s diagnosis of a machine failure, is often only available in free-text comments.
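A minimal sketch of decoding such a “smart” number into plain-English fields – the number format and color-code table here are invented for illustration, not taken from any real numbering scheme:

```python
# Hypothetical "smart" number layout: 2-char family, 2-char plant,
# 1-char color code, then a sequence number. The code table below
# follows the example in the text: "1" means blue, "2" means green.
COLOR_CODES = {"1": "blue", "2": "green"}

def parse_smart_number(part_number):
    """Split a smart part number into named, human-readable fields."""
    return {
        "family": part_number[0:2],
        "plant": part_number[2:4],
        "color": COLOR_CODES.get(part_number[4], "unknown"),
        "sequence": part_number[5:],
    }

print(parse_smart_number("AB071234"))
```

Once such a parser exists, it can be applied to every legacy record, turning opaque identifiers into columns an analyst can filter and group on.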

And then there are missing values. In addition to the problems with the systems officially supported by the IT department, the individual spreadsheets contain tables with missing rows due to incomplete copy-and-paste operations and errors in formulas.

The most common management response is to declare defeat.

“It will be fixed when we implement a new system in two years,” they say.

Or they give up on this plant and promise to do better in the next one to be built. Not only does giving up fail to provide answers to today’s questions, it also fails to prepare the organization to specify, acquire and implement new systems in the future.

Just as continuous improvement is necessary in existing production line layouts, workstation design or logistics to learn how to design new ones, it is necessary with information systems. And this translates to an organized, sustained effort to make the existing systems useful in spite of all their flaws and low data quality.

When a machine is no longer a machine

Big data has the power to transform manufacturing’s concept of what a machine is, according to “Edge Powered Industrial Control: Concept for Combining Cloud and Automation Technologies,” authored by Christoph Pallasch of Aachen University and his colleagues.

As reported in technology and business media, the research posits that adding massive computing power to formerly self-contained industrial networks will enable “resource-intensive data processing and feedback control loops directly from the shop floor. Industrial software and control algorithms running on the field or edge level can be versioned and exchanged during operating time, allowing direct update of control logic or adaptation of production parameters.”

The resulting hyper-connectivity from this “machines-as-a-service” paradigm could allow manufacturers to synchronize different stakeholders along the value chain of production.

“The benefit would be a highly optimized and timely coordinated production chain with reduced downtimes,” Pallasch and his colleagues write.

The researchers’ demonstration project included low-cost commodity hardware linked to robotic arms and connected to the internet.

“Data obtained and saved in a storage is first processed by analytics apps running in the cloud,” they explained.

Interconnected and intelligent factories and production systems are a hallmark of manufacturing’s future, often dubbed Industry 4.0. As Manuel Grenacher, CEO of Coresystems, wrote in a recent Forbes Technology Council post, companies won’t purchase just nuts-and-bolts machinery. Instead, they will pay in part based on the machine’s output, and the contract will include subscriptions to cloud-based analytics that will ensure the machine continues to drive and enhance the business.

The query tools of relational databases are the workhorses of data wrangling, but they are not sufficient, as data do not always come in tables but sometimes in lists of name-value pairs in a variety of formats like JSON or XML that first must be parsed and cross-tabulated. You also need more powerful tools to split “smart” part numbers into their components, identify the meaning of each component and translate values into plain English. And you need even more sophisticated text mining tools to convert free-text comments into formal descriptions of events by category and key parameters.
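A minimal sketch of the parsing step for name-value-pair data, assuming hypothetical machine-event records in JSON; real records would be messier, but the flattening logic is the same:

```python
import json

# Invented event records as name-value pairs, as a machine controller
# might emit them; note that the keys vary from record to record.
raw = """[
  {"machine": "M1", "event": "failure", "cause": "bearing"},
  {"machine": "M2", "event": "setup"},
  {"machine": "M1", "event": "failure", "cause": "sensor"}
]"""

records = json.loads(raw)

# Cross-tabulate into rows with a fixed column set, filling gaps
columns = ["machine", "event", "cause"]
table = [[rec.get(col, "") for col in columns] for rec in records]
for row in table:
    print(row)
```

The output is a uniform table that relational query tools can then select, filter and join like any other.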

It doesn’t work perfectly. You may be able to recover only 90 percent or 95 percent of your data, but you then have not only a clean data set but also a set of wrangling tools that can be applied incrementally to new data to enrich this data set, which raises the question of where to keep it. A common approach is to use a special kind of database called a data warehouse, into which you load daily extracts from all the legacy systems after they have been cleaned and properly formatted. They can then be conveniently retrieved for analysis.

The part of the data warehouse that is actually used for analysis may be a small fraction of its content, but you don’t know ahead of time which fraction. As a result, most of the data that is prepared and stored in the warehouse is never used. This has motivated companies with very large data sets, as in e-commerce, to come up with another approach called the data lake, into which you throw data objects from multiple systems in their original formats and prepare them for analysis if and when you have established that they are needed.

Whether a data warehouse or data lake is preferable in an organization is a question of size. With small data sets, the penalty for preparing all data is small when weighed against the convenience of having it ready to use.

Analyzing the data

With clean data, you are finally at the statistician’s starting point. The first step is always to explore the data with simple summaries and plots of one or two variables at a time, and this is often sufficient to answer many questions. Being a good data scientist is about making the data talk, not about using a particular set of tools.
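This first exploratory pass can be as simple as one-variable summaries. A sketch on invented cycle-time and defect data, using only Python’s standard library:

```python
import statistics
from collections import Counter

# Toy data, invented for illustration: cycle times in minutes
# and defect categories logged at a workstation.
cycle_times = [4.1, 3.9, 4.3, 4.0, 7.8, 4.2, 4.1]
defects = ["scratch", "dent", "scratch", "misalignment", "scratch"]

print("mean cycle time:", round(statistics.mean(cycle_times), 2))
print("median cycle time:", statistics.median(cycle_times))
print("defects by category:", Counter(defects).most_common())
```

Even these basic summaries can make the data talk: here the mean is pulled well above the median by a single long cycle, flagging an outlier worth investigating, and one defect category dominates the count.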

Data science training leaves you with a box full of tools that you don’t necessarily know what to do with, bearing names that are not self-explanatory like k-means clustering, bagging, the kernel trick, random forests and many others. They were developed to solve problems but, to you, they are cures in search of a disease and answers to questions you don’t have. The topical literature fails to answer the three questions British consultant John Seddon recommended asking about any tool:

  1. Who invented it?
  2. What problem was he or she trying to solve?
  3. Do I have this problem?

In data science, when a tool was invented is also essential because its use requires information technology. The tools of the 1920s rely on assumptions about probability distributions to simplify calculations; the ones from the 1990s and later require fewer assumptions and involve multiple simulations.

You find out, for example, that logistic regression has nothing to do with moving goods and was invented in 1958 by David Cox to predict a categorical outcome from a linear combination of predictors that can be numbers or categories. In manufacturing, it will tell you how relevant the variables and attributes you collect in process are to a finished unit’s ability to pass its final test. If they are not relevant, you may stop collecting them and can look for better ones; if they are relevant, you can modify the final test process to leverage the information these variables provide. Logistic regression can also be used to improve binning operations.

That it’s from 1958 tells you that using it on a data set with 20,000 points and 15 predictors is unlikely to overtax a 2018 laptop or tablet. In this particular case, the name of David Cox does not add much information because he was a theoretician, as opposed to others who worked on specific applications, like W. Edwards Deming in manufacturing quality or Brad Efron in epidemiology.
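To make the technique concrete, here is a minimal single-predictor logistic regression fitted by gradient descent on invented pass/fail data – in practice you would use a statistics package rather than hand-rolled code:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Single-predictor logistic regression via stochastic gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)   # predicted probability of passing
            w -= lr * (p - y) * x    # gradient step on the log-loss
            b -= lr * (p - y)
    return w, b

# Invented data: an in-process measurement vs. final-test pass (1) or fail (0)
measurements = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
passed =       [0,   0,   0,   0,   1,   1,   1,   1]

w, b = fit_logistic(measurements, passed)
print("P(pass | x=0.2) =", round(sigmoid(w * 0.2 + b), 3))
print("P(pass | x=0.8) =", round(sigmoid(w * 0.8 + b), 3))
```

A strong fit, as in this toy example, says the in-process measurement is relevant to final-test outcomes; a flat one says you may be collecting it for nothing.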

You may ask what your problems have in common with epidemiology. Not only are you likely to find that you have no use for many of the tools in the published data science toolboxes but also that you have problems none of them address. Whether it is about demand, bookings and billings or technical product characteristics, manufacturing data come in the form of time series. There are many tools for visualizing, analyzing, modeling and controlling time series, but they are simply absent from the standard data science lists.
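One of the simplest such time-series tools is a trailing moving average, sketched here on an invented monthly demand series:

```python
def moving_average(series, window):
    """Trailing moving average; smooths noise to expose the trend."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]

# Toy monthly demand figures, invented for illustration:
# noisy observations around an upward trend
demand = [100, 104, 98, 110, 107, 115, 112, 120]
print(moving_average(demand, 3))
```

The smoothed series makes the underlying growth visible where the raw month-to-month numbers bounce around; the same idea underlies more sophisticated forecasting and control methods.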

Once you have established that a tool may be useful to you, you need to learn how to use it. You don’t need to plough through the underlying math any more than a car driver needs to understand the theory of engines. It can remain a black box to you, but you still need to know how to feed it data, what the various settings do and how to interpret the output. By itself, this is not a trivial investment in time and effort and needs to be done selectively.

Presenting results

The presentation of results to stakeholders who are not data scientists is past the statistician’s end point. The results are moot unless they can be communicated to decision-makers in a clear and compelling fashion.

The art of generating reports, slide sets, infographics and performance boards is not taught in statistics courses and not covered in statistics textbooks. It is often entrusted either to engineers who are poor communicators or to graphic artists who do not understand the technical content and produce charts that decorate rather than inform or persuade.

In business, the report, with a narrative in complete sentences and annotated charts, is a dying art, replaced by the slide set with bullet points that are not sentences and graphics that are limited to 3-D pie charts and stacked-bar charts. When reports are produced, they are expected to fit on a single A3 or 11-by-17-inch page.

This works for many activities, but data science isn’t one of them. With slides and A3s alone, you can gloss over gaps in logic that report writing would expose and prompt authors to fill. Slides and A3s are useful as visual aids for oral presentations and as summaries, respectively, but only as supplements to a fully baked, objective and rigorous statement of analysis and results, expressed in layman’s terms and with all appropriate nuances and caveats.

That executives are “too busy” to read reports is only true for reports that haven’t been designed to be read by busy executives. An executive always has the time to read a one-page summary – possibly an A3 – and spot-check the research behind the conclusions at three locations within the report. Reading it cover to cover is not usually necessary, particularly if the report has been designed with this use in mind.

The communication of data science is heavily graphic. Rather than limit themselves to a small set of standard charts that have been used in manufacturing for a century, engineers should expand their horizon, use more types of charts, embed them in infographics and leverage the insights of a researcher like U.S. statistician Edward Tufte. In addition, when a report is produced in electronic form, illustrations are not limited to still images. Swedish statistician Hans Rosling’s Trendalyzer, for example, has an animation that shows a scatterplot changing over time. A histogram can also come with a slider bar to allow the reader to instantly see the effect of changing bin sizes.

The reports that are vanishing in business live on in academic papers, with abstracts in place of executive summaries. In many fields, these papers are, in fact, data science reports, and they are not without challenges. First, academia’s review process does not always work. “Growth in a Time of Debt,” for example, an influential 2010 paper by Harvard economists, was exposed in 2013 by students as containing calculation errors.

Second, when an academic paper is cited, the conclusions are often amplified beyond recognition. This is how a lighting study conducted on just five women assembling relays at Western Electric’s Hawthorne plant in the late 1920s spawned the belief in a “Hawthorne effect” that makes all the workers of the world more productive when management pays attention to them.

Data scientists cannot prevent journalists, politicians or even work colleagues from oversimplifying and distorting their work, but it behooves them to speak up when it happens. They are responsible for the quality of the work, including not only sound analytics but effective communication as well.

Better tools, better data, a better future

The software toolkit of most engineers and managers in manufacturing is limited to Excel and PowerPoint, with the addition of Minitab for Six Sigma black belts.

These don’t cut it for data science, but there are plenty of options for all stages, from data wrangling to analysis and presentation. Some tools are free, powerful and reliable but demand a high level of skill from users. Others are “for everyone” and available for fees. Regardless of what data tools you choose, the main investment is in learning to apply them. In that respect, data science and its tools are analogous to the manufacturing sector’s production machinery.

Michel Baudin runs the Takt Time Group, a network of international lean consultants. Clients include Honda of America, Dell, Raytheon, AGCO and Schlumberger. He has written four books, numerous papers and has taught courses about lean manufacturing. He has a master’s degree from Mines-Paristech and has done graduate work at the University of Tokyo.