Archive / INF Seminars / INF_2019_05_09_Richard Torkar

Why do we encourage even more missingness when having missing data?


Host: Prof. Carlo A. Furia




USI Lugano Campus, room SI-006, Informatics building

Richard Torkar
Chalmers University of Technology and University of Gothenburg, Sweden


Richard Torkar is a professor at Chalmers and the University of Gothenburg, Sweden. He mainly conducts research with industry on a wide range of topics, most recently behavioral software engineering, applications of meta-heuristic algorithms, Markov Chain Monte Carlo diagnostics, and the use of Bayesian statistics as a foundation for machine learning applications.

Most would argue that estimation should not rely exclusively on expert opinion, but also on quantitative data gathered through unbiased data collection approaches. To this end, researchers have published studies that make use of, among others, the International Software Benchmarking Standards Group (ISBSG) data repository. One could argue that this data set, and similar ones, have several things in common with data collected in industry: missing data, disparate quality in data collection procedures, and a variety of data types collected are issues we also see in empirical software engineering research in general.

The prevailing strategy for handling missing data in empirical software engineering research is simply to remove cases with missing data (listwise deletion). We believe that this strategy is suboptimal and, generally speaking, bad for our research discipline. Even when data can be classified according to the quality of the data collection procedure, as is the case with the ISBSG data sets, our community often chooses to use only the subset classified as being of the highest quality. In short, we believe that low-quality data should be seen as better than no data at all, and the general rule of thumb should be never to throw away data.

We will present a case where we apply techniques for data imputation in addition to conducting Bayesian data analysis on effort estimation data using the ISBSG data set.
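
As a toy illustration of the contrast between listwise deletion and imputation, the sketch below uses a handful of made-up project records (not ISBSG data) with hypothetical field names `size_fp` and `effort_h`. Mean imputation is used only because it is the simplest thing to show; it is an assumption made for illustration and not the technique presented in the talk, which combines imputation with Bayesian data analysis.

```python
from statistics import mean

# Hypothetical project records (made-up values, not ISBSG data).
# None marks a missing measurement.
projects = [
    {"size_fp": 100, "effort_h": 800},
    {"size_fp": 250, "effort_h": None},
    {"size_fp": None, "effort_h": 1500},
    {"size_fp": 400, "effort_h": 3200},
]


def listwise_delete(rows):
    """Keep only rows in which every field is observed."""
    return [r for r in rows if all(v is not None for v in r.values())]


def mean_impute(rows):
    """Fill each missing field with the mean of that field's observed values.

    A deliberately crude stand-in for more principled imputation methods.
    """
    fields = list(rows[0].keys())
    means = {f: mean(r[f] for r in rows if r[f] is not None) for f in fields}
    return [
        {f: (r[f] if r[f] is not None else means[f]) for f in fields}
        for r in rows
    ]


complete_cases = listwise_delete(projects)
imputed = mean_impute(projects)

print(len(complete_cases), "of", len(projects), "rows survive listwise deletion")
print(len(imputed), "rows remain after mean imputation")
```

Listwise deletion discards every record with at least one missing value, including the values that were actually observed in it, whereas imputation keeps those observations available for analysis.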