5.1.0.1 What Data are Likely to be Missing?

Within any given network, data can be missing at multiple levels: nodes, dyads, or edges.94 As may be clear from the above, it is harder in social network research than with individual-level attribute data to assess how much data are actually missing.95 What the literature more commonly assesses is which aspects of the data collection process are likely to produce missing data, and which types of data are more (or less) likely to be missing than others. The components of network data collection described in Chapter 2 create the possibility of data missing through non-response, boundary-specification truncation, or censoring arising from degree patterns (Kossinets 2006).

Figure 5.1: Potential Sources of Data Missingness. These two adjacency matrices give a heuristic representation of potential sources of missing data (shaded regions) common in panel network data collection. (1) corresponds to dyad-wise missing data (node i failing to report on node j). (2) represents non-response, i.e., someone in the sample providing no alter information (node i provides no alter information). (3) corresponds to the inconsistency of reporting on out-of-sample alters: some respondents may include them among their alters, while others will not. (4) indicates that out-of-sample nodes (even if named as alters) will provide no information. NOTE: This figure follows the matrix notation convention in SNA that rows correspond with senders and columns with receivers of nominations.

When eliciting social network data, whether via survey or other methods, non-response is more common than in other data collection strategies, and is especially more likely to reflect false negatives than false positives (Marcum, Bevc, and Butts 2012). This can arise for any number of reasons, not least of which is the simple limitation of recall. Whatever the data collection strategy, some limit on the number of partners elicited is bound to exist, and that limit is likely to produce more missing data for some relationships (e.g., acquaintances) than for others (e.g., close confidants), which can lead to varying expectations of error depending on the size of the network neighborhood to be estimated (McCarty et al. 2001). To illustrate, see the ’1’s in Figure 5.1, noting that the respondent at the top of the figure has more of these than others, perhaps corresponding to their popularity as a named alter among the rest of the population. This can lead to simply missing some proportion of one’s network. For example, Brewer (2000) shows that in longitudinal studies examining network churn, a substantial proportion of the nominations dropped from one wave to the next simply result from respondents forgetting to name some partners consistently. Some of this limitation can be overcome by pre-loading previous waves of data for use as question prompts (e.g., “are you still in contact with NAME?”) in subsequent data collection rounds (Brewer 2000). Forgetting is also shaped by how people organize their understanding and recall of networks, which is at least partially in line with known features of network structure, including formal organizational principles (e.g., hierarchies) and structural features such as clustering; recall can suffer when the reality doesn’t match those patterns (Brashears 2013).
This can be particularly limiting if the number of ties elicited restricts the information collected about each node to a small selection of the domains their network partners span. This potential limitation is one reason some researchers have argued for eliciting large numbers of alters (e.g., 45+) from each respondent, to be sure that the results span our primary relational domains (McCarty 2002). Additionally, while individuals’ missing records in a random population-based survey can easily be excluded when certain conditions (e.g., missing at random) are met, in network studies person-level non-response (see the ’2’s in Figure 5.1) can be limiting in other ways. Missing all of a person’s data could alter estimates of any number of network statistics, at both the local (e.g., reciprocity or transitivity) and global (e.g., distance) levels.
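To make the effect of person-level non-response concrete, the following is a minimal sketch in plain Python. The network, the node labels, and the choice of an edge-based reciprocity measure (the share of directed ties that are returned) are all assumptions made for illustration; they do not come from any study cited here.

```python
def reciprocity(edges):
    """Share of directed ties (sender, receiver) that are returned."""
    edges = set(edges)
    if not edges:
        return 0.0
    mutual = sum(1 for (i, j) in edges if (j, i) in edges)
    return mutual / len(edges)


# Hypothetical fully observed network of directed nominations.
true_edges = {
    ("a", "b"), ("b", "a"),
    ("a", "c"), ("c", "a"),
    ("b", "c"), ("c", "b"),
    ("d", "a"),
}

# Assume "c" is a non-respondent: every tie c sends goes unrecorded,
# while nominations that others send to c are still in the data.
nonrespondents = {"c"}
observed_edges = {(i, j) for (i, j) in true_edges if i not in nonrespondents}

# Naive estimate: treat the non-respondent's missing row as all zeros.
naive = reciprocity(observed_edges)

# Alternative: keep only dyads in which both members responded.
complete_case_edges = {
    (i, j) for (i, j) in observed_edges if j not in nonrespondents
}
complete_case = reciprocity(complete_case_edges)

print(f"true:          {reciprocity(true_edges):.3f}")
print(f"naive:         {naive:.3f}")
print(f"complete-case: {complete_case:.3f}")
```

In this toy case both estimates fall below the true value, and they differ from each other, which is one way of seeing why a single non-respondent cannot simply be dropped as in an individual-level survey.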

How researchers define node- or tie-level restrictions pertaining to the boundary specification problem (BSP) can also substantially shape the missingness of resulting data. Straightforwardly, if the defined boundary is particularly porous in the context under study (e.g., a study of workplace influence where many innovative ideas come from outside the organization), restricting the focus to within the organization may exclude many of the ties of theoretical interest by design (see the ’3’s and ’4’s in Figure 5.1). However, even less clear demarcating lines can similarly influence how much data are likely to be missing. For example, in a study of sexual behaviors, basing the decision of which partners to include on either numerical (e.g., “tell me about your last 5 partners”) or temporal thresholds (e.g., “tell me about your partners from the last 6 weeks”) can bound the resulting data differently from person to person. This is why there are frequently no singular “rules of thumb” for how the BSP is resolved across all studies. Instead, we derive principles upon which to base such a decision, and allow the theoretical and empirical demands of each particular case to adjudicate within each study’s design.

The above also raises the possibility of missingness in network data that is particular to longitudinal studies. Fortunately, there is no reason to believe that any of the patterns described here (missingness due to non-response, BSP, or censoring) are more likely within any wave of data collection of a longitudinal project than in a corresponding cross-sectional design. However, given the more frequent focus within network studies on bounded populations, any turnover in the population is more likely to alter how these patterns shift over time. These patterns of churn can generate missingness that is not simply addressed with the methods for censoring-based adjustments in individual-level data. For example, node attrition (present in wave 1 but absent in wave 2) or in-migration (absent at wave 1 but present in wave 2) can lead to complicated patterns of edge-level missing data.96 Many of the strategies for protecting against these limitations are handled within the particular modeling strategy, rather than through universal recommendations for data collection itself (Snijders, Bunt, and Steglich 2010; Almquist and Butts 2014). One exception is that if people are likely to exit a population in structured ways (e.g., through graduation in school-based data), individuals would be more likely to still appear among nominees even after their removal from the sample precludes the possibility of them providing any nominations (Huisman and Steglich 2008). To illustrate this possibility, note the increase in the number of ’2’s from Time 1 to Time 2 in Figure 5.1, and also note the additional ’3’ column where these same people may also be excluded as potential alters in such a design.
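As a rough illustration of how attrition and in-migration translate into edge-level missingness, the sketch below counts, for a hypothetical two-wave roster, how many ordered dyads have a sender who reported in both waves. The rosters and labels are invented, and the sketch assumes that a tie is only recorded when its sender responds, so a leaver can still be named by others but cannot report any ties.

```python
from itertools import permutations

# Hypothetical respondent rosters at each wave (labels are illustrative).
wave1 = {"a", "b", "c", "d"}
wave2 = {"a", "b", "d", "e"}

leavers = wave1 - wave2   # node attrition: present at wave 1 only
joiners = wave2 - wave1   # in-migration: present at wave 2 only
stayers = wave1 & wave2   # respondents in both waves

roster = sorted(wave1 | wave2)


def report_pattern(sender):
    """In which waves could this sender have reported outgoing ties?"""
    if sender in stayers:
        return "both waves"
    if sender in leavers:
        return "wave 1 only"
    return "wave 2 only"


# Tally every ordered dyad (sender, receiver) by the sender's pattern.
counts = {}
for sender, receiver in permutations(roster, 2):
    pattern = report_pattern(sender)
    counts[pattern] = counts.get(pattern, 0) + 1

# Tie change between waves is only directly observable for these dyads.
share_with_change_observable = counts["both waves"] / sum(counts.values())

print(counts)
print(f"share of dyads observed at both waves: {share_with_change_observable:.2f}")
```

Even with one leaver and one joiner among five people, a sizable share of ordered dyads lacks a report in at least one wave, which is the kind of pattern the modeling strategies cited above are designed to handle.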

Additionally, any caps in data collection procedures can lead to patterning of missing data, especially when combined with observed differences in degree within the population. That is, high-degree actors are likely to have some aspects of their networks both over- and under-reported. Because a large number of alters can nominate and provide additional information about the “hubs” in a network,97 information about these nodes is more likely to be found in the data. However, if in turn we truncate the number of alters we gather information on for each ego, the corresponding level of information about each of their alters may be under-represented in the data. At the simplest level, if the number of people who nominate a high-degree actor exceeds the cap on outgoing ties that can be recorded for them, that actor will appear to reciprocate fewer of the nominations they receive, potentially (and artificially) reducing estimates of reciprocity at the population level. Similarly, data are more likely to be missing from those who are most central and those who are most peripheral in a network, a pattern that is more pronounced in denser networks (Borgatti, Carley, and Krackhardt 2006). Given this common pattern of missingness, scholars have regularly focused on the effects of missing data on centrality estimation, which will be discussed in more detail below.
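The reciprocity consequence of a nomination cap can be sketched with a small, entirely hypothetical example: one hub who is named by six peers and, in truth, names all six back. The cap value, the labels, and the edge-based reciprocity measure are assumptions for illustration only.

```python
def reciprocity(edges):
    """Share of directed ties (sender, receiver) that are returned."""
    edges = set(edges)
    if not edges:
        return 0.0
    mutual = sum(1 for (i, j) in edges if (j, i) in edges)
    return mutual / len(edges)


def cap_out_degree(nominations, cap):
    """Keep only the first `cap` alters each ego listed (fixed-choice design)."""
    return {ego: alters[:cap] for ego, alters in nominations.items()}


def to_edges(nominations):
    """Flatten {ego: [alters]} into a set of (sender, receiver) pairs."""
    return {(ego, alter) for ego, alters in nominations.items() for alter in alters}


# Hypothetical "true" network: the hub and six peers all name each other.
peers = [f"p{k}" for k in range(1, 7)]
nominations = {"hub": list(peers)}
nominations.update({p: ["hub"] for p in peers})

full = reciprocity(to_edges(nominations))

# Assume the instrument records at most three alters per respondent.
capped = reciprocity(to_edges(cap_out_degree(nominations, cap=3)))

print(f"reciprocity without cap: {full:.3f}")
print(f"reciprocity with cap=3:  {capped:.3f}")
```

Only the hub's row is affected by the cap here, yet the network-level estimate drops because three of the six incoming nominations can no longer be matched by a recorded outgoing tie.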

It would be inaccurate to assume that these sorts of limitations are only present in survey-based research. In an important study of the potential biases introduced by the various APIs available for gathering Twitter data, González-Bailón and colleagues (2014) compared a variety of sampling strategies. They found similar problems of truncation, which introduced tie- and network-level biases consistent with each of the patterns described above and were more common in “mention” than in “follower” networks.