5 Data Quality: Assessment, Implications & Improvements

As with any scientific endeavor, we should not take the quality of social network data for granted. While the principles laid out in the preceding chapters identify strategies that can allow researchers to capture the relationships of interest in an ethical way for a network study, we also have to consider how well those strategies actually work in practice.89 Luckily, this has been a concern of some social networks researchers seemingly from the outset of the field (Bernard and Killworth 1977). This chapter lays out the many approaches social network scholars have developed for evaluating data quality in network research. Given the myriad complexities within network data, that concern has given rise to a number of different strategies for assessing network data quality.

At the outset of this chapter, I would like to warn readers new to this field to (1) not assume that any one data quality assessment is sufficient to establish the utility of their own network data, and (2) be careful not to let themselves be too easily discouraged; some of the empirical studies that follow may at first appear more damning to the efforts of the field than is warranted on full consideration. That is to say, remember that virtually any data we use are approximations of the concepts we actually care about. And no approximation is error-free.

The important question is whether the errors we encounter along the way undermine the analytic aims we have for the data on hand. Some errors are mostly ignorable (e.g., those that are random), while others can be readily adjusted for (e.g., known systematic biases). In those cases, the fact that the data do not capture exactly what we want them to represent does not preclude us from reaching accurate conclusions.
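To make the second case concrete, consider a toy calculation; the 80 percent "capture rate" below is an invented figure used purely for illustration, not an estimate from any real study. When a systematic bias is known, a first-pass correction can be little more than arithmetic.

```python
# Toy illustration only: the 0.8 capture rate is an assumed figure,
# not drawn from any actual validation study.
observed_degree = 12     # ties a respondent actually named
capture_rate = 0.8       # suppose respondents report ~80% of their true ties

# If under-reporting is roughly uniform across ties, a naive correction
# simply inflates the observed count by the inverse of the capture rate.
adjusted_degree = observed_degree / capture_rate
print(adjusted_degree)   # 15.0 -- the degree estimate we would carry forward
```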

This chapter is organized around two primary types of data quality considerations: (a) missing data, and (b) data reliability or validity. Within each of these broad categories, subsections will examine:

  1. where network data is of especially good (or poor) quality,
  2. the implications of that quality for claims researchers want to make from their data, and
  3. ways we’ve come up with to improve network data with respect to (1) and (2).

Suppose our research question is about sexual partnering patterns among commercial sex workers. How closely can our data come to accurately capturing the population of interest, detailing which people they engage in sexual activity with, and how frequently? If I want data on international trade, how likely are the databases I can acquire to actually cover all of the types, frequency, and value of trade partnerships between the countries I’m interested in? We will first want to (1) descriptively examine how accurately the data available to us capture these relationships of interest.

While descriptive assessment is important, it is rarely the full aim of scientific research. We generally also want to use the data we obtain on any one phenomenon (here, social relationships) to offer explanations for—or make predictions about—some other variables (e.g., do networks help account for disparities in health outcomes like smoking or obesity), or vice versa (e.g., how religious participation accounts for patterns in social networks). The accuracy (or lack thereof) of descriptive features of network data does not necessarily translate directly into implications for their predictive utility. In fact, descriptive errors could undermine, be irrelevant to, or even bolster the explanatory aims of social networks research. As such, each section will draw on these accounts of descriptive patterns to (2) summarize how they have shaped the scientific uses to which network data have been applied.
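To illustrate why descriptive error need not doom explanatory use, here is a small simulated sketch in which all numbers are invented for illustration: a network measure contaminated with purely random reporting error still recovers much of its true association with an outcome it genuinely predicts.

```python
# Toy simulation (assumed parameters, no real data): random error in a
# network measure attenuates, but does not erase, its association with
# an outcome that depends on the true measure.
import random

random.seed(1)
n = 1000
true_degree = [random.randint(0, 20) for _ in range(n)]
# The outcome depends on the *true* degree plus unrelated noise.
outcome = [d + random.gauss(0, 4) for d in true_degree]
# The observed degree is the true degree plus random reporting error.
observed = [d + random.gauss(0, 4) for d in true_degree]

def corr(x, y):
    """Pearson correlation, computed directly to keep the example self-contained."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

print(round(corr(true_degree, outcome), 2))  # association with the error-free measure
print(round(corr(observed, outcome), 2))     # attenuated, but still clearly present
```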

Diagnosing the descriptive and analytic limitations of social network data has occupied scholars’ attention for several decades. Some studies raised cautionary tales about types of data that are especially hard to reliably collect (Bernard, Killworth, and Sailer 1979; Killworth and Bernard 1976). Others pointed out that even in the face of substantial noise, the predictive utility of many aspects of social networks research remained relatively unharmed (Leifer 1992). But as any user of large-scale survey data who wants to work with income measures can tell you, most researchers are no longer content merely to know and diagnose the potential pitfalls of data limitations.90 We increasingly seek out solutions—analytic or otherwise—to correct for the limitations that are known to plague even our best efforts at gathering valid data (An and Schramski 2015). Just as multiple imputation has become the coin of the realm for some data that are especially hard to gather in individual surveys, social network researchers have come up with a number of strategies for improving the utility of network data. The third subsection in each topic of this chapter therefore summarizes some of the ways we have come to leverage what we do know to improve our confidence in the analytic capacity of social network data. These include some imputation suggestions, some model-based approaches to “correcting” existing data, and analytic smoothing techniques that exploit the fact that network data are often multiply reported to produce estimates that counterbalance the likely sources of error in collected data.
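As a flavor of that last idea, the minimal sketch below pools the two reports on a single dyad into one tie estimate. It is not any particular published estimator; the names, values, and the simple pooling rules are all assumptions made for illustration.

```python
# Minimal sketch of reconciling multiply reported ties. Each directed
# report i -> j records whether i claims a tie to j; we combine the two
# reports on the same dyad into a single estimate.
reports = {
    ("ana", "ben"): 1,   # ana names ben
    ("ben", "ana"): 0,   # ben does not name ana
    ("ana", "cal"): 1,
    ("cal", "ana"): 1,
}

def dyad_estimate(reports, i, j, rule="average"):
    """Combine the two reports on the dyad {i, j} into one tie estimate."""
    r_ij = reports.get((i, j), 0)
    r_ji = reports.get((j, i), 0)
    if rule == "union":          # a tie exists if either party reports it
        return max(r_ij, r_ji)
    if rule == "intersection":   # a tie exists only if both parties report it
        return min(r_ij, r_ji)
    return (r_ij + r_ji) / 2     # average: splits the difference between
                                 # over- and under-reporting

print(dyad_estimate(reports, "ana", "ben"))                # 0.5
print(dyad_estimate(reports, "ana", "cal", rule="union"))  # 1
```

Which pooling rule is appropriate depends on whether over-reporting or under-reporting is the more plausible source of error in a given study, a question the later sections of this chapter take up.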