5.1.1 Imputing & Modeling Missing Data

This leads us to consider the strategies that network scholars have devised for addressing missing data. For a number of years, the short answer has been that we take the data at hand as the best available representation of reality and proceed with whatever analytic aims we might have. This approach largely ignores missing data and the ways it may shape the interpretations stemming from our analyses. Obviously, this is sub-optimal, especially if levels of missingness are likely to be high.

Before turning to the solutions that have been developed for working with missing data analytically, it is important to highlight that, as with most data collection endeavors, one of the best solutions available lies in the data collection phase. Good, old-fashioned, shoe-leather-based approaches that build rapport between a research team and the members of their study population can go a long way toward reducing how often these limitations arise, even in settings where data would seemingly be especially difficult to gather (Potterat et al. 2004).

With data in hand, there are broadly three approaches that have been proposed for handling missing data to potentially minimize its analytic effects: simple imputation, model-based, and cognitive social structure approaches.

The first approach has been to take the imputation strategies available broadly within the social sciences (e.g., ) and adapt them to the social network context. These approaches rely on other attribute-based or simple network structural features (e.g., the degree distribution) to estimate unknown information from known features. The possibilities include simple mean substitution (e.g., for missing attribute or composition data), random draws from the observed distribution to fill in missing observations (e.g., filling in nodal degree from the global distribution), and "network reconstruction," which relies on a strategy for symmetrizing unreciprocated data (Huisman 2009). These approaches work best when the primary source of missing data is non-response.
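The three simple strategies just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure from any of the cited works: the function names, the use of `None` to mark missing values, and the choice to leave ties between two non-respondents unknown are all assumptions made for the example. The reconstruction step shown here is one common reading of the idea, in which a non-respondent's unobserved outgoing tie to a respondent is filled in with the tie that respondent reported in the other direction.

```python
import random
from statistics import mean


def mean_substitute(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def draw_from_observed(values, rng):
    """Replace None entries with random draws from the observed values."""
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]


def reconstruct(adj, respondents):
    """Fill non-respondents' outgoing ties by symmetrizing (hypothetical helper).

    adj[i][j] is 1 or 0 in rows reported by respondents and None in rows
    belonging to non-respondents. For a non-respondent i and respondent j,
    the missing tie i -> j is copied from the observed tie j -> i.
    Ties between two non-respondents are left as None (still unknown).
    """
    n = len(adj)
    filled = [row[:] for row in adj]
    for i in range(n):
        if i in respondents:
            continue
        for j in range(n):
            if i == j:
                filled[i][j] = 0  # no self-ties
            elif j in respondents:
                filled[i][j] = adj[j][i]
    return filled


# Toy example: actors 0 and 1 responded, actor 2 did not.
adj = [
    [0, 1, 0],
    [1, 0, 1],
    [None, None, None],
]
filled = reconstruct(adj, respondents={0, 1})
ages = mean_substitute([12, None, 14])
degrees = draw_from_observed([2, 3, None, 1], random.Random(0))
```

In the toy example, actor 2's row is filled in from what actors 0 and 1 reported about actor 2, which is why this kind of approach depends on the missingness coming from non-response rather than from, say, a boundary drawn around the study population.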

A second strategy, model-based approaches, aims to make use of the network aspects of network data to fill in where data are missing. These model-based approaches have variously drawn on exponential random graph models (ERGMs) and Bayesian-oriented approaches.

As one example, scholars have elaborated methods that allow ERGMs to incorporate nominations to alters outside the population (i.e., design effect missing data) into model estimates (Robins, Pattison, and Woolcock 2004). This work has shown how excluding out-of-sample nominations may lead to misinterpretation of select model parameters (Gile and Handcock 2017). A particularly interesting recent extension of these sorts of models has been to examine the possibilities of estimating global network structure from egocentric data (Smith 2012, 2015).

Bayesian approaches have a history of being used to assess inaccuracies in network data (Butts 2003). More recently, these have been adapted as a data augmentation strategy within statistical modeling frameworks (such as ERGMs). Data on suspected terrorists, for example, can reasonably be expected to be subject to high levels of missingness. Scholarship examining such cases shows that these Bayesian approaches can recover tie-level information even when it is substantially missing, while attribute-level data are also recoverable, but to a lesser degree. However, using these methods to recover missing attribute data requires the strong assumption that observed homophily operates in largely the same way for those with missing attribute data (Koskinen et al. 2013).

This work even explores the potential for identifying nodes that do not appear in the observed data. Similar extensions of such approaches have also been adapted for interval-level relational data (Jorgensen et al. 2018).

In addition to these model-based approaches, I want to highlight one further, decidedly network-based approach to estimating missing network data. In a classroom-based study, Neal (2008) was only able to gather information from 15 of the 23 students in one class. However, in addition to traditional network data elicitation approaches, she also employed a modified version of gathering information on the cognitive social structure from the participants (Krackhardt 1987). That is, she had information from each participant about their perception of the relational structure among all students in the classroom. She then demonstrated a number of potential rules for inferring the presence or absence of ties from those aggregate reports, and used the resulting data to supplement the interpretation of network structural features among all students in the classroom, which would have been impossible if relying only on the available self-report data (Neal 2008). This approach is a specific example of a more general strategy that treats edges as latent constructs that can be estimated from multiple reports (Koehly and Marcum 2018; Hlebec and Ferligoj 2002).
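To make the idea of inferring ties from multiple reports concrete, the sketch below applies one simple aggregation rule: a tie from i to j is treated as present when the share of perceivers who report it exceeds a threshold. This is an illustrative assumption, not a reproduction of the specific rules Neal (2008) evaluated; the function name `aggregate_css`, the dictionary-of-matrices input format, and the default threshold of 0.5 are all hypothetical choices made for the example.

```python
def aggregate_css(reports, threshold=0.5):
    """Infer a single network from several perceivers' reports.

    reports maps each perceiver to an n x n matrix of 0/1 values, where
    reports[p][i][j] == 1 means perceiver p believes i has a tie to j.
    A tie i -> j is inferred when the share of perceivers reporting it
    is strictly greater than `threshold` (an assumed, adjustable rule).
    """
    perceivers = list(reports)
    n = len(reports[perceivers[0]])
    inferred = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # ignore self-ties
            share = sum(reports[p][i][j] for p in perceivers) / len(perceivers)
            inferred[i][j] = 1 if share > threshold else 0
    return inferred


# Toy classroom: students 0 and 1 participated, student 2 did not.
# Each participant reports on ties among ALL three students,
# so student 2's outgoing ties can still be inferred.
reports = {
    0: [[0, 1, 1], [1, 0, 0], [1, 0, 0]],
    1: [[0, 1, 0], [1, 0, 0], [1, 1, 0]],
}
strict = aggregate_css(reports)                  # both perceivers must agree
lenient = aggregate_css(reports, threshold=0.4)  # one of two is enough
```

The choice of threshold governs the trade-off between inferring ties that do not exist and missing ties that do, which is one reason to compare several rules rather than rely on a single one.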