Of the potential analytic limitations that follow from the data considerations raised above, perhaps the most thoroughly assessed is how well various measures of network centrality stand up to missing or biased data.
As an example of this sort of work, Costenbader and Valente (2003) examined how centrality calculations in eight different studies were affected by sampling coverage. To assess how robust a variety of centrality measures are to various levels of missing data, they developed a bootstrap method that has been broadly adapted by many subsequent researchers. Essentially, they take the observed network as the “ground truth,” sample from those data at various thresholds, and observe how re-calculating the same metrics on the sampled data alters the respective estimates. The punchline (see Figure 5 in the original) is substantial variability in the various metrics’ ability to withstand increasing levels of missing data. More locally-oriented measures of centrality are more robust to missing data than are more globally-oriented measures, and the more data are missing, the worse measures tend to perform (Costenbader and Valente 2003). Some measures are differentially influenced depending on global features of the network (e.g., missing data are more detrimental to dense networks when examining radial measures of centrality than when examining other types). Additionally, these bootstrapped approaches suggest that measures are more robust in studies whose sampling coverage maps a more complete representation of the population’s network than in data with less complete coverage.
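The resampling logic described above can be illustrated with a minimal sketch. This is not Costenbader and Valente's own procedure or code: the choice of degree as the centrality measure, the Pearson correlation as the robustness score, the toy random graph, and every function name here (`random_graph`, `node_sampling_robustness`, and so on) are assumptions made purely for illustration.

```python
import random
from statistics import mean


def random_graph(n, p, seed=0):
    """Hypothetical toy network: each possible tie is present with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj


def degree(adj):
    """Degree of every node (a locally oriented centrality measure)."""
    return {v: len(nbrs) for v, nbrs in adj.items()}


def induced_subgraph(adj, kept):
    """Keep only the sampled nodes and the ties among them."""
    kept = set(kept)
    return {v: adj[v] & kept for v in kept}


def pearson(xs, ys):
    """Pearson correlation; returns None when either input is constant."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / (sxx * syy) ** 0.5


def node_sampling_robustness(adj, keep_fraction, n_draws=200, seed=0):
    """Treat adj as ground truth, repeatedly sample nodes, and average the
    correlation between true and re-calculated degree among sampled nodes."""
    rng = random.Random(seed)
    truth = degree(adj)
    nodes = sorted(adj)
    k = max(2, round(keep_fraction * len(nodes)))
    correlations = []
    for _ in range(n_draws):
        kept = rng.sample(nodes, k)
        observed = degree(induced_subgraph(adj, kept))
        r = pearson([truth[v] for v in kept], [observed[v] for v in kept])
        if r is not None:  # skip draws where the correlation is undefined
            correlations.append(r)
    return mean(correlations) if correlations else None


if __name__ == "__main__":
    toy = random_graph(40, 0.15, seed=1)
    for frac in (0.9, 0.7, 0.5, 0.3):
        print(frac, node_sampling_robustness(toy, frac))
```

Swapping `degree` for a more globally oriented measure, such as closeness or betweenness, would be the natural way to probe the local-versus-global contrast reported above.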
Later scholars took up and extended this work in a number of important ways. While Costenbader and Valente’s approach essentially captured what happens if particular nodes were missing from the data, Borgatti, Carley, and Krackhardt (2006) additionally examined the robustness of centrality measures to other forms of “imperfect data,” including the random deletion of edges (instead of nodes) and the potential effects of errors of commission that lead to falsely added node- or tie-level information. They concluded that “the accuracy of centrality measures declines smoothly and predictably with the amount of error” (Borgatti et al., 2006:124). Smith and Moody (2013) extended this work further by examining substantially larger networks and a number of statistics beyond centrality; they found similar patterns, with more design-based variability (e.g., transitivity is relatively robust, but measures of centralization are quite sensitive to missingness). Kossinets (2006) performed similar analyses on data generated as 2-mode (affiliation) networks, focusing on estimates of transitivity, and found that node-level missingness led to overestimates, while edge-level omissions increased measurement error. While these papers’ collective findings highlighted some of the limitations of less than complete data, they were also relatively encouraging in that most of the limitations identified across them were sufficiently predictable. This makes it possible for scholars to develop strategies to protect against and correct for these limitations statistically, or through targeted follow-ups of certain missing data in the field. It also points to the possibility of providing confidence intervals around many of our estimates, assuming data are missing at random.
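The edge-level errors discussed above (random omission of true ties and commission of false ones) can be simulated in the same spirit. The sketch below is an illustration under stated assumptions, not the design used by Borgatti, Carley, and Krackhardt: the error rates are expressed as fractions of the observed tie count, degree stands in for centrality, and all function names are hypothetical.

```python
import random
from statistics import mean


def random_graph(n, p, seed=0):
    """Hypothetical toy network: each possible tie is present with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj


def edge_list(adj):
    """Sorted list of undirected ties, each stored once as (u, v) with u < v."""
    return sorted((u, v) for u in adj for v in adj[u] if u < v)


def perturb_edges(adj, drop_frac, add_frac, rng):
    """Delete a random share of true ties and add a random share of false ones."""
    nodes = sorted(adj)
    edges = edge_list(adj)
    present = set(edges)
    non_edges = [(u, v) for i, u in enumerate(nodes)
                 for v in nodes[i + 1:] if (u, v) not in present]
    dropped = set(rng.sample(edges, round(drop_frac * len(edges))))
    n_add = min(round(add_frac * len(edges)), len(non_edges))
    added = rng.sample(non_edges, n_add)
    noisy = {v: set() for v in nodes}
    for u, v in [e for e in edges if e not in dropped] + added:
        noisy[u].add(v)
        noisy[v].add(u)
    return noisy


def pearson(xs, ys):
    """Pearson correlation; returns None when either input is constant."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / (sxx * syy) ** 0.5


def edge_error_robustness(adj, drop_frac, add_frac, n_draws=200, seed=0):
    """Average correlation between true degree and degree in the noisy network."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    truth = [len(adj[v]) for v in nodes]
    correlations = []
    for _ in range(n_draws):
        noisy = perturb_edges(adj, drop_frac, add_frac, rng)
        r = pearson(truth, [len(noisy[v]) for v in nodes])
        if r is not None:  # skip draws where the correlation is undefined
            correlations.append(r)
    return mean(correlations) if correlations else None


if __name__ == "__main__":
    toy = random_graph(40, 0.15, seed=1)
    for rate in (0.05, 0.10, 0.25, 0.50):
        print(rate, edge_error_robustness(toy, drop_frac=rate, add_frac=rate))
```

Running the loop over increasing error rates gives a rough sense, for one toy network, of how quickly accuracy degrades as error accumulates.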
Unfortunately, scholars are also increasingly aware that network data are rarely missing at random. Smith, Moody, and Morgan (2017) therefore took up a similar analysis, but allowed data to be missing in ways associated with a variety of local and global network measures. Focusing particularly on the empirical observation that data are more likely to be missing from highly central or peripheral actors, they report the potentially troubling finding that centrality-based error is more likely to introduce bias across a range of measures. They also find that smaller, undirected networks are more subject to measurement bias (see Table 12 in original). While this is less encouraging than the studies assuming data are missing at random, they provide a useful tool that allows researchers to estimate the potential bias in their measures given certain assumptions about network size, coverage, and types of errors expected.
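To make the contrast between random and centrality-related missingness concrete, the following sketch compares the two on a deliberately stylized network. It is not the tool provided by Smith, Moody, and Morgan; the core-periphery toy graph, the rule that a node's chance of being dropped is proportional to its degree plus one, the use of mean degree as the outcome, and all function names are assumptions chosen for illustration. Only missingness among highly central actors is modeled here.

```python
import random
from statistics import mean


def core_periphery_graph(n_core=5, n_periphery=45):
    """Toy graph: core nodes tie to everyone; periphery nodes tie only to the core."""
    n = n_core + n_periphery
    adj = {v: set() for v in range(n)}
    for u in range(n_core):
        for v in range(u + 1, n):
            adj[u].add(v)
            adj[v].add(u)
    return adj


def mean_degree(adj, kept=None):
    """Mean degree among kept nodes, counting only ties between kept nodes."""
    kept = set(adj) if kept is None else set(kept)
    return sum(len(adj[v] & kept) for v in kept) / len(kept)


def drop_nodes(adj, n_drop, rng, by_degree):
    """Return the surviving nodes after dropping n_drop nodes without replacement.

    by_degree=False drops uniformly at random (missing at random);
    by_degree=True weights each node by its true degree + 1, so central
    actors are more likely to be missing. n_drop must be below len(adj).
    """
    pool = sorted(adj)
    weights = [len(adj[v]) + 1 if by_degree else 1 for v in pool]
    for _ in range(n_drop):
        i = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        pool.pop(i)
        weights.pop(i)
    return pool


def relative_bias(adj, drop_fraction, by_degree, n_draws=300, seed=0):
    """(average estimate - true value) / true value for mean degree."""
    rng = random.Random(seed)
    truth = mean_degree(adj)
    n_drop = round(drop_fraction * len(adj))
    estimates = [mean_degree(adj, drop_nodes(adj, n_drop, rng, by_degree))
                 for _ in range(n_draws)]
    return (mean(estimates) - truth) / truth


if __name__ == "__main__":
    toy = core_periphery_graph()
    print("missing at random:      ", relative_bias(toy, 0.3, by_degree=False))
    print("missing by centrality:  ", relative_bias(toy, 0.3, by_degree=True))
```

In this stylized setting both mechanisms pull the estimate downward, but losing central actors does so far more severely, which is the kind of pattern that motivates assessing sensitivity under explicit assumptions about who is missing.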