5.1 Missing Data

As with most scientific data, errors in what we can gather about social networks do not necessarily negate the usefulness of those data for examining how relational data are associated (dare I say, causally) with other concepts. In regression-based work, we regularly invoke the notion that data missing at random are essentially ignorable (Allison 2002). With qualitative interviews, once analytic saturation is reached, we don’t assume we need to continue interviewing until we’ve reached an exhaustive census of the population (Weller 2018). In the same way, we don’t simply jettison network data just because they are missing some, or even substantial amounts, of what we want to investigate (Kossinets 2006; Huisman 2009). Scholars have come up with a range of strategies for handling missing network data (Neal 2008; Koskinen et al. 2013; Handcock and Gile 2010).

Table 5.1: Missing Edge Data & Reports.
     FB  SG  GG  LE  GD  AM  BM  MB  PT
FB   .   1   1   0   0   m   m   1   1
SG   1   .   1   0   0   m   m   1   1
GG   1   m   .   1   1   1   m   m   1
LE   0   0   1   .   0   0   m   m   m
GD   0   0   1   0   .   0   0   0   0
AM   0   0   1   1   0   .   1   0   0
BM   m   m   1   1   m   1   .   m   m
MB   1   1   1   m   m   m   m   .   1
PT   1   1   1   m   m   m   m   1   .

NOTE: ‘1’ denotes tie presence, ‘0’ denotes absence, ‘m’ missing, and ‘.’ is a structural zero.

Let’s begin with a form of missing data that is unique to social networks. If you think about network data in the format in which they were most commonly stored for decades, the sociomatrix, you can see how many dyads comprehensive network data must assess (see Table 5.1). For a population of size N, measuring a single relationship involves N * (N − 1) potential ties (72 for the nine nodes in Table 5.1).91 In practice, very few studies explicitly address every potential relationship. Instead, we identify some maximal number of relationships for each node, which we then code as present, and assume all others are absent. That is, we have positive confirmation of the presence of ties, and infer the absence of ties from the absence of information about them. We generally don’t confirm the 0’s.92
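To make the scale of the problem concrete, the following is a minimal sketch that tallies the cells of Table 5.1. It assumes the columns follow the same order as the row labels (the table as printed gives only row labels), and the names `LABELS`, `ROWS`, `matrix`, and `counts` are illustrative rather than drawn from any particular network package.

```python
from collections import Counter

# Table 5.1 transcribed row by row. Assumption: columns are in the same
# order as the row labels, so the '.' entries fall on the diagonal.
LABELS = ["FB", "SG", "GG", "LE", "GD", "AM", "BM", "MB", "PT"]
ROWS = [
    ". 1 1 0 0 m m 1 1",  # FB
    "1 . 1 0 0 m m 1 1",  # SG
    "1 m . 1 1 1 m m 1",  # GG
    "0 0 1 . 0 0 m m m",  # LE
    "0 0 1 0 . 0 0 0 0",  # GD
    "0 0 1 1 0 . 1 0 0",  # AM
    "m m 1 1 m 1 . m m",  # BM
    "1 1 1 m m m m . 1",  # MB
    "1 1 1 m m m m 1 .",  # PT
]
matrix = [row.split() for row in ROWS]

n = len(LABELS)
potential_ties = n * (n - 1)  # every ordered pair of distinct nodes

# Tally the off-diagonal cells by their code: '1', '0', or 'm'.
counts = Counter(
    cell
    for i, row in enumerate(matrix)
    for j, cell in enumerate(row)
    if i != j
)

print(f"potential ties: {potential_ties}")
print(f"present: {counts['1']}, absent: {counts['0']}, missing: {counts['m']}")
print(f"share missing: {counts['m'] / potential_ties:.2f}")
```

Under that reading of the table, 23 of the 72 potential ties are coded as missing, which is roughly a third of the sociomatrix.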

This imbalance is important to keep in mind because many of our analytic techniques rely on the patterning of ties that do not exist as much as, or occasionally even more than, on the patterning of ties that are present. Any systematic pattern in which potential ties were never considered deserves particular attention. For example, in the survey context, ties are likely to go unconsidered at different rates for nodes of different degree. Suppose one of our analytic concerns is the tendency towards reciprocity in a network. If we’re focused on financial exchange data, the nodes that trade most actively with others will have a higher likelihood of receiving unreciprocated exchange reports. This could reflect an actual discrepancy in how members on opposite ends of the exchange perceive the relationship. Or the more active node may report a smaller share of its trading partners than less active nodes do, if, for example, some of those nominations fall outside the caps imposed by the data collection design. That is, we frequently have no way to distinguish potential partners who were explicitly considered and excluded from one’s nominations (i.e., actual non-partners) from those who were missed because they were never considered (whether through reporting caps, forgetfulness, or other reasons). We are forced to treat all non-reported ties as equivalently absent, when their absence may stem from very different sources.93
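The sensitivity of reciprocity to how unreported ties are coded can be illustrated with Table 5.1, which (unlike most survey data) does distinguish ‘m’ from ‘0’. The sketch below is only one possible illustration: it assumes the columns follow the row order, uses a simple dyad-level measure (mutual dyads divided by dyads with at least one reported tie) chosen here for convenience, and the helper `dyad_counts` is hypothetical. It compares treating every ‘m’ as an absent tie with restricting attention to dyads observed in both directions. Neither figure should be read as the “true” value.

```python
# Table 5.1 transcribed row by row. Assumption: columns are in the same
# order as the rows (FB, SG, GG, LE, GD, AM, BM, MB, PT).
ROWS = [
    ". 1 1 0 0 m m 1 1",
    "1 . 1 0 0 m m 1 1",
    "1 m . 1 1 1 m m 1",
    "0 0 1 . 0 0 m m m",
    "0 0 1 0 . 0 0 0 0",
    "0 0 1 1 0 . 1 0 0",
    "m m 1 1 m 1 . m m",
    "1 1 1 m m m m . 1",
    "1 1 1 m m m m 1 .",
]
matrix = [row.split() for row in ROWS]


def dyad_counts(matrix, missing_as_absent):
    """Return (mutual, asymmetric) dyad counts.

    If missing_as_absent is True, every 'm' is recoded as '0'.
    Otherwise, dyads with an 'm' in either direction are skipped.
    """
    mutual = asymmetric = 0
    n = len(matrix)
    for i in range(n):
        for j in range(i + 1, n):
            a, b = matrix[i][j], matrix[j][i]
            if missing_as_absent:
                a = "0" if a == "m" else a
                b = "0" if b == "m" else b
            elif "m" in (a, b):
                continue  # not observed in both directions
            if a == "1" and b == "1":
                mutual += 1
            elif a != b:
                asymmetric += 1  # exactly one direction reported
    return mutual, asymmetric


for label, flag in [("missing coded as absent", True),
                    ("fully observed dyads only", False)]:
    mutual, asymmetric = dyad_counts(matrix, missing_as_absent=flag)
    reciprocity = mutual / (mutual + asymmetric)
    print(f"{label}: mutual={mutual}, asymmetric={asymmetric}, "
          f"reciprocity={reciprocity:.3f}")
```

In this small example, coding the missing cells as absent adds four apparently unreciprocated dyads that may simply be unobserved, which is the kind of distortion the paragraph above describes.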

Beyond this unique feature of what truly “complete” data would demand in network-oriented studies, there are also network-specific versions of more common missing data questions. Let’s now turn to those. The remainder of this section takes up three questions in turn. (1) How common are missing data, and which types of data are more likely to be missing? (2) How do these patterns of missingness alter network analytic capabilities? (3) What are the available strategies for handling missing data?