Introduction
In the past decade, the Internet-of-Things (IoT) paradigm has seen an explosion in its adoption by businesses across continents and industries^{1}. The number of IoT devices worldwide is forecast to almost triple from 9.7 billion in 2020 to more than 29 billion in 2030^{2}. This burgeoning success has been made possible by the increasingly affordable and accessible low-power compute platforms. These platforms have fueled the growth of edge-AI, bringing computationally-expensive AI methods to the network edge^{3,4,5}. A major driving force behind training/inference of deep neural network (DNN) models on the network edge is the advantages they provide in latency, bandwidth, energy efficiency, privacy, and security, relative to traditional cloud-based approaches^{6}. The edge computing paradigm primarily requires collecting data from various sensors. Cyber-physical systems (CPS) also involve sending actuation signals to multiple devices in a physical environment. Other applications, where edge computing has made significant strides, include smart healthcare^{7}, nuclear power plants^{8}, smart grids^{9}, and autonomous vehicles^{10}, to name a few. However, corrupted sensor data or partially-procured/missing data plague these applications. Recently, DNN-based approaches have shown promise in effectively imputing missing data^{11}. However, as we show in this work, even state-of-the-art DNN-based methods become ineffective when edge-specific corruptions are present (e.g., where output labels may be missing even when all input feature values are available, or when some feature values may be missing). We propose a novel interleaved training-and-imputation approach, leveraging a DNN-based surrogate model to reliably impute the corrupted data (this includes missing data). We also propose unconventional methods to mimic data corruption, going beyond traditional techniques, to be more in accordance with corrupted data found in edge applications.
We show that our imputation framework outperforms baseline methods on corrupted data synthesized through traditional and proposed corruption techniques.
Challenges
Imputing corrupted/missing data is a challenging problem. We use the words corrupted and missing synonymously in this article: not-a-number (NaN) values are often used to report missing data in the literature, and in the context of edge applications, we assume that which data are corrupted is known a priori through signal processing or other methods^{11}. Missing data may be out-of-distribution relative to observed data, making it hard to predict the missing values^{12}. This calls for generalizable models that can reliably impute the missing data. The imputation algorithm should be able to learn the underlying data-generation process (thus forming a surrogate model for this process) to effectively predict what data would be observed if they were not missing. Traditional methods typically implement interpolations on observed data^{13,14}. Recent DNN-based approaches have shown substantial gains, but are restricted to either input feature imputation or output label prediction, limiting them to only specific scenarios^{15,16}. In multi-input/multi-output regression datasets, it is possible that both the input and output features are corrupted, and thus only partially available. In this context, we need to impute not only the input but also the output features.
Motivation
Corrupted data are commonplace in edge applications. Data can get corrupted in a variety of ways. In a distributed compute setting, network congestion can cause some data to reach late, resulting in some data becoming stale. Sensors may die due to a multitude of reasons: malfunctioning hardware, intermittent power supply, and even human-in-the-loop accidents^{12}. Sensors and other edge devices are also prone to security attacks that may cause parts of the network to shut down or transmit malicious or corrupted data. Mission-critical edge deployments exacerbate this problem, as data corruption could hamper their operation. Consider the following examples.
The first example is a chemical plant. There have been more than 50,000 reported hazardous chemical incidents in the last decade in the USA^{17}. In chemical plants, where the formation of combustible gases is highly likely, it is important to quickly and reliably detect the appearance of such gases so that relevant action can be taken to alleviate their ill effects. For this application, we use the ‘Gas’ dataset^{18}, which involves a mixture of different gases. The second example is a water distribution system that may be used in a nuclear power plant. As such facilities get smarter, it is important to quickly detect attacks on them to reduce the chances of large-scale calamities. The number of attacks on CPS is increasing by the day. Just in the first half of 2021, there were 1.5 billion IoT/CPS breaches reported^{2}. These could adversely affect high-stakes organizations and facilities like nuclear plants. Thus, it is crucial to detect whether an attack has occurred so that corresponding mitigating mechanisms can be invoked. For this application, we use the ‘smart water treatment’ (SWaT) dataset^{19}. Finally, Internet of Medical Things (IoMT) is a growing industry with a current market size of $42 billion. In applications like the smart detection of COVID^{20}, some data may either be corrupted or simply unavailable. Even under these circumstances, it may be of interest to reliably detect disease onset in a secure and private (in terms of inference on the network edge) manner. Since data may be scarce in such critical applications, simply throwing away corrupted data may not be a viable option.
Contributions
In this work, we aim to address the challenge of data imputation by proposing a DNN-based surrogate modeling approach: data imputation using neural inversion (DINI). We leverage gradient-based optimization using backpropagation to the input (GOBI)^{21}, implemented through neural inversion^{22}. DINI implements interleaved training (of the surrogate model) and imputation (of the data). As the surrogate model is trained, it can impute the corrupted data better, and the better-imputed data in turn make a superior model available in the next training iteration. We hypothesize that an interactive dynamic between imputation and training ensures more informed data generation and surrogate modeling. DINI can handle variegated data types, including multi-input/multi-output datasets. Input data can be continuous or categorical; the output may also have categorical labels or continuous values. Unlike previous works^{15,16}, DINI can work with diverse types of DNN models, from fully-connected neural networks (FCNNs) to advanced architectures like Transformers^{23}, whichever model works best for the given data distribution and model setting. Finally, like recent works^{15}, DINI can output the uncertainty in predicted values.
Figure 1 shows a high-level working schematic of the DINI framework. Tabular input (with features \(\text{F}_1\)–\(\text{F}_4\)) and output data (with features \(\text{Y}_1\)–\(\text{Y}_4\)) support both continuous and categorical features, along with their combinations. Figure 1a and b show these, respectively. We show only the first three observations (rows \(\text{O}_1\)–\(\text{O}_3\)). NaN values represent corrupted data. Output features \(\text{Y}_1\) and \(\text{Y}_2\) are categorical (may or may not be one-hot encoded). Previous works often refer to categorical one-hot encoded output features as output labels. Since we support an expanded set of output formats, like the inputs, we refer to them as output features instead. Figure 1c shows DINI leveraging a DNN-based surrogate model (here, an FCNN) to map the input to the output and vice versa. During training, we backpropagate the gradients (from an appropriate loss function) to the weights (shown in red). During imputation, we freeze the model weights and backpropagate the gradients to the input/output features to predict the missing values (shown in blue). Figure 1d shows a Transformer-based surrogate model for time-series data, supported by DINI. Only one encoder layer is shown (it can be repeated N times) with four self-attention (SA) heads followed by an FCNN. Figure 1e shows a high-level schematic of the DINI pipeline. We first impute the corrupted data (with NaN values) with an initial imputation method (details in section “Methodology”) and then forward them to the DINI framework. DINI implements an interleaved training-and-imputation pipeline, which iteratively trains the surrogate model and imputes the data based on the updated model in a repeated fashion. This not only outputs an imputed dataset with no corruptions, but also a superior surrogate model that better represents the data distribution.
DINI outperforms baseline methods by at least 10.7% in reducing average error across diverse datasets. We further demonstrate the effectiveness of DINI in three case studies involving mission-critical edge applications. Moreover, we propose novel corruption techniques motivated by the distribution of corrupted data found in edge-AI settings. We show that DINI outperforms baseline approaches, giving much higher prediction performance for the required label.
Outline
The rest of the article is organized as follows. Section “Background and related work” discusses background material on data corruption strategies, related works on imputation, and their critique. Section “Methodology” presents the DINI framework in detail. Section “Experimental setup” describes the experimental setup and presents the datasets used and baseline approaches for comparison. We validate our proposed framework and discuss the results in section “Results”. Section “Discussion” discusses limitations and future work directions. Finally, section “Conclusions” concludes the article.
Background and related work
Various synthetic corruption methods have been widely used in the literature. We give a brief overview of these methods in this section. We then describe related works on data imputation and highlight their limitations.
Synthetic corruption methods
As pointed out before, corrupted data are inherently assumed to be missing. Mathematically, let the data be denoted by a matrix-valued random variable \(\mathbf{X} \in \mathbb{R}^{n \times d}\), where n is the number of observations (rows) and d is the data dimension (columns). Now, \(\mathbf{x}\) denotes a realization of \(\mathbf{X}\) and \(\tilde{\mathbf{x}}\) denotes its observation. Note the difference between realized and observed values of the data^{24}. The observed value is a function of the instantiation of the random variable for the data and its missingness. More concretely, let \(\mathbf{M}\) denote the missingness in input data (it has the same dimensions as \(\mathbf{X}\)). The \((i, j)^{\mathrm{th}}\) element of \(\mathbf{M}\) is 1 if the corresponding element of \(\mathbf{X}\) is observed and 0 if it is missing. In summary, \(\mathbf{x} \sim \mathbf{X}\) and its observation is a function of \(\mathbf{x}\) and \(\mathbf{m}\), i.e., \(\tilde{\mathbf{x}} = o(\mathbf{x}, \mathbf{m})\), where \(\mathbf{m} \sim \mathbf{M}\), such that:
$$\begin{aligned} \tilde{\mathbf{x}}_{ij} = {\left\{ \begin{array}{ll} \mathbf{x}_{ij}, &\text{if } \mathbf{m}_{ij} = 1\\ \texttt{NaN}, &\text{otherwise} \end{array}\right. } \end{aligned}$$
For the purpose of surrogate modeling, \(\tilde{\mathbf{x}}\) is divided based on input and output feature columns as \(\tilde{\mathbf{x}} = \left[ \tilde{\mathbf{x}}_{in} \ \tilde{\mathbf{x}}_{out}\right]\), where \([ \cdot ]\) denotes concatenation of matrices in block notation. Here, \(\tilde{\mathbf{x}}_{in} \in \mathbb{R}^{n \times d_{in}}\) and \(\tilde{\mathbf{x}}_{out} \in \mathbb{R}^{n \times d_{out}}\). The observed data can be further categorized into correctly observed (denoted by \(\tilde{\mathbf{x}}^{o}\)) or corrupted (denoted by \(\tilde{\mathbf{x}}^{c}\)) values. Table 1 summarizes the notations used in this work.
Here, the reader may notice that our definition of observed values differs from those used in the literature^{24}. Realized data are the data we would get when there is no source of corruption. Observed data are the complete data that we see with the corruption (i.e., with NaN values). Unlike in previous works, the part of the observed data that is correct is called correctly observed data (\(\tilde{\mathbf{x}}^o\)); the part that is corrupted/missing is simply called corrupted data (\(\tilde{\mathbf{x}}^c\)). The slight change in notation is motivated by the need to unify previous inconsistencies^{12,15,24,25} and bind our formulation to the context of data corruption.
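The observation function and the input/output partition can be illustrated with a minimal NumPy sketch. The function names `observe` and `split_in_out` are hypothetical and chosen only for this illustration; the mask convention (1 = observed, 0 = missing) follows the definition in the text.

```python
import numpy as np

def observe(x, m):
    """Return the observation of x: x where m == 1, NaN where m == 0."""
    x = np.asarray(x, dtype=float)
    m = np.asarray(m)
    return np.where(m == 1, x, np.nan)

def split_in_out(x_obs, d_in):
    """Partition the columns into input features (first d_in) and output features."""
    return x_obs[:, :d_in], x_obs[:, d_in:]

# Two observations, two input features and one output feature (toy values).
x = np.array([[0.2, 0.5, 1.0],
              [0.7, 0.1, 0.0]])
m = np.array([[1, 0, 1],
              [1, 1, 0]])
x_obs = observe(x, m)
x_in, x_out = split_in_out(x_obs, d_in=2)
```

Placing the input columns first is an assumption of this sketch; any fixed column ordering would serve equally well.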
Rubin^{26} has defined a widely used, yet controversial^{24}, nomenclature for synthetic corruption (or missing value) mechanisms. We present these next.
Missing completely at random
The first is missing completely at random (MCAR). In MCAR, the data are corrupted entirely at random, i.e., there is no dependency on the data. Consider a hypothesized missingness model \(\phi\). Then, as per the MCAR scheme:
$$\begin{aligned} P_\phi (\mathbf{M} \mid \tilde{\mathbf{x}}^{o}, \tilde{\mathbf{x}}^{c}) = P_\phi (\mathbf{M}) \end{aligned}$$
In other words, the missing values do not depend on either the correctly observed or the corrupted values, which constitute the observed data \(\tilde{\mathbf{x}}\). Here, \(\phi\) is a uniform sampling model that corrupts data completely randomly.
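As an illustration, MCAR corruption can be sketched by dropping every entry independently with the same probability. This is one plausible reading of the uniform sampling model; the function name `mcar_mask` and the parameter `missing_frac` are assumptions of this sketch, not names used by the original implementation.

```python
import numpy as np

def mcar_mask(shape, missing_frac, rng):
    """Mask with 1 = observed and 0 = missing; entries are dropped independently."""
    return (rng.random(shape) >= missing_frac).astype(int)

rng = np.random.default_rng(0)
m = mcar_mask((1000, 8), missing_frac=0.2, rng=rng)
observed_fraction = m.mean()  # close to 0.8 for a large matrix
```

Because the mask is drawn without looking at the data, the missingness is independent of both the correctly observed and the corrupted values.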
Missing at random
The term missing at random (MAR) is a misnomer. Basically, MAR corruption refers to the missingness depending solely on the correctly observed data, or:
$$\begin{aligned} P_\phi (\mathbf{M} \mid \tilde{\mathbf{x}}^{o}, \tilde{\mathbf{x}}^{c}) = P_\phi (\mathbf{M} \mid \tilde{\mathbf{x}}^{o}) \end{aligned}$$
Here, \(\phi\) is a logistic missingness model^{25}. First, a subset of variables (columns) with no missing values is randomly selected. The remaining variables have missing values based on a logistic model with random weights, depending on the correctly observed data, rescaled to attain the desired proportion of missing values for those variables.
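The logistic MAR mechanism described above can be sketched as follows. This is a simplified illustration, not the reference implementation: the function name `mar_mask`, the parameter `frac_obs_cols`, and the use of bisection on the intercept to hit the target missing proportion are all assumptions of this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mar_mask(x, missing_frac, frac_obs_cols, rng):
    """Mask (1 = observed) whose missingness depends only on fully observed columns."""
    n, d = x.shape
    n_obs = max(1, int(round(frac_obs_cols * d)))
    obs_cols = rng.choice(d, size=n_obs, replace=False)   # columns that never go missing
    mis_cols = np.setdiff1d(np.arange(d), obs_cols)
    m = np.ones((n, d), dtype=int)
    for j in mis_cols:
        w = rng.normal(size=n_obs)                        # random logistic weights
        logits = x[:, obs_cols] @ w
        lo, hi = -50.0, 50.0
        for _ in range(60):                               # bisection on the intercept
            mid = 0.5 * (lo + hi)
            if sigmoid(logits + mid).mean() > missing_frac:
                hi = mid
            else:
                lo = mid
        p_missing = sigmoid(logits + 0.5 * (lo + hi))     # mean is close to missing_frac
        m[:, j] = (rng.random(n) >= p_missing).astype(int)
    return m, obs_cols

rng = np.random.default_rng(0)
x = rng.random((2000, 6))
m, obs_cols = mar_mask(x, missing_frac=0.3, frac_obs_cols=0.5, rng=rng)
```

The intercept search plays the role of the rescaling step: it shifts the logistic model so that the average missing probability in each corrupted column matches the requested proportion.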
Missing not at random
Data are said to be missing not at random (MNAR) if the missingness is neither MCAR nor MAR. More specifically, data are MNAR if the missingness depends on the correctly observed and potentially even the corrupted values. In this context, the missingness cannot be fully accounted for by the correctly observed values. Here, we implement \(\phi\) as a self-masking logistic model^{25}. The values are masked based on a probability given by the logistic model with random weights, having the entire data matrix \(\mathbf{x}\) as input.
Missing streams at random
To go beyond traditional corruption schemes, we propose two corruption techniques inspired by the distribution of corrupted data in diverse edge deployments^{11,12}. Sensor data from various sources in a distributed IoT network can get corrupted, and once corrupted, likely stay corrupted for extended periods of time before being reset. To account for such scenarios, we propose the missing streams at random (MSAR) corruption technique. In this case, the missingness model \(\phi\) chooses points in the data matrix at random and, unlike MCAR, corrupts a stream (of length 10 in our experiments) of datapoints through that column. This model is especially relevant to time-series data.
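A minimal sketch of MSAR corruption is given below. It assumes that a stream starts at the chosen point and runs down the column for `stream_len` rows, clipped at the end of the matrix; the direction of the stream, the function name `msar_mask`, and the parameter `n_streams` are assumptions of this illustration.

```python
import numpy as np

def msar_mask(shape, n_streams, stream_len, rng):
    """Mask (1 = observed) with contiguous missing streams along single columns."""
    n, d = shape
    m = np.ones((n, d), dtype=int)
    rows = rng.integers(0, n, size=n_streams)   # random starting rows
    cols = rng.integers(0, d, size=n_streams)   # random columns (sensors)
    for i, j in zip(rows, cols):
        m[i:i + stream_len, j] = 0              # slicing clips at the last row
    return m

rng = np.random.default_rng(0)
m = msar_mask((50, 4), n_streams=1, stream_len=10, rng=rng)
```

With several streams, overlaps are possible, so the realized missing fraction can be slightly lower than `n_streams * stream_len / (n * d)`.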
Missing patches at random
To account for spatiotemporal correlation in the corruption process, we further propose the missing patches at random (MPAR) corruption mechanism. In a distributed environment, sensors are often closely placed in groups (to implement redundancy in some cases). For example, some sensors might be placed in one part of a smart facility and others in another. If one sensor in a group fails, several other sensors in that group are likely to fail as well. Thus, rather than streams (involving a single column), patches of data will get corrupted. Here, \(\phi\) chooses points in the data matrix randomly, then corrupts a patch (of size \(5 \times 5\) in our experiments) around that point.
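MPAR corruption can be sketched in the same style. The sketch assumes that a patch is centered on the chosen point and clipped at the matrix borders, and that adjacent columns correspond to co-located sensors; the function name `mpar_mask` and its parameters are hypothetical.

```python
import numpy as np

def mpar_mask(shape, n_patches, patch, rng):
    """Mask (1 = observed) with square missing patches centered on random points."""
    n, d = shape
    m = np.ones((n, d), dtype=int)
    half = patch // 2
    for _ in range(n_patches):
        i = rng.integers(0, n)
        j = rng.integers(0, d)
        r0, r1 = max(0, i - half), min(n, i + half + 1)   # clip rows to the matrix
        c0, c1 = max(0, j - half), min(d, j + half + 1)   # clip columns to the matrix
        m[r0:r1, c0:c1] = 0
    return m

rng = np.random.default_rng(0)
m = mpar_mask((40, 12), n_patches=1, patch=5, rng=rng)
```

A patch that falls on a border is smaller than \(5 \times 5\), which is a design choice of this sketch rather than something stated in the text.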
Data imputation methods
We can categorize previously proposed imputation methods as either discriminative or generative. Discriminative methods include multivariate imputation by chained equations (MICE)^{14}, matrix completion^{27}, spectral regularization^{28}, iterative singular value decomposition (SVD)^{29}, and k-nearest neighbors (kNN)^{13}. Generative models include algorithms based on expectation maximization, such as those using Gaussian mixture models (GMMs)^{30} and approaches based on modern deep learning, like denoising autoencoders (DAEs)^{31,32} and generative adversarial networks (GANs). One state-of-the-art GAN-based imputation method is GAIN^{15}, which forgoes the assumptions made in previous generative imputation models, namely restrictions on the underlying data distribution and types of datasets (categorical or continuous). GRAPE^{16} is yet another DNN-based approach that converts the data into a bipartite graph and then uses a graph neural network (GNN) for imputation.
Traditional statistical methods for imputation provide useful theoretical bounds but exhibit notable shortcomings. First, they tend to make strong assumptions about the data distribution. Second, they lack flexibility for handling mixed data types that include both continuous and categorical variables. Finally, matrix-completion-based approaches do not generalize to unseen samples (thus performing poorly on out-of-distribution data) and require retraining when new data samples are encountered^{13,27,28,29}. Recent DNN-based approaches try to address these shortcomings but are still limited in their application. GAIN only implements input feature imputation and assumes that all output labels are available^{15}. GRAPE does either input feature imputation or output label prediction, but not both^{16}. It also does not support uncertainties in prediction and only models the expectation of the data distribution. Other recent works that use these methods, or their combination, are only applicable to specific applications^{33}. In many applications, especially in the context of edge deployments, both input features and output labels may be missing^{12}. Further, the output features in previous works are only one-dimensional (only one continuous feature or categorical label). These restrictions prevent their application to many tasks, including multi-input/multi-output regression datasets. In the case of such datasets and under some corruption strategies (e.g., when the output can also be corrupted), even state-of-the-art DNN-based approaches become ineffective, as we demonstrate later. DINI, on the other hand, can support mixed continuous and categorical features not only in the input but also in the output. Lastly, previous DNN-based works are restricted to adversarial networks^{15}, autoencoders^{31}, or GNNs^{16}. However, different DNN models may be suitable for different data distributions. DINI, being a model-agnostic framework, can be applied to diverse DNN architectures, including FCNNs, convolutional neural networks (CNNs)^{34}, long short-term memories (LSTMs)^{35}, and even Transformers^{23}.
Methodology
We now discuss the DINI framework in detail.
Problem formulation
As noted previously, we consider the imputation (via surrogate modeling) of the observed dataset \(\tilde{\mathbf{x}} \in \mathbb{R}^{n \times d}\), partitioned into input and output columns as \(\tilde{\mathbf{x}}_{in} \in \mathbb{R}^{n \times d_{in}}\) and \(\tilde{\mathbf{x}}_{out} \in \mathbb{R}^{n \times d_{out}}\). For better-posed modeling, we first scale the input data to [0, 1] with a MinMax scaler^{36}. The task at hand is to output an imputed dataset \(\hat{\mathbf{x}}\) that is as close as possible to the real dataset \(\mathbf{x}\), had there not been any corruption. The goal is to achieve the least error between the imputed and real data. The two error metrics are the root mean square error (RMSE) and mean absolute error (MAE)^{37}, defined as follows:
$$\begin{aligned} \text{RMSE}(\mathbf{x}, \hat{\mathbf{x}})&= \sqrt{\frac{1}{nd} \sum_{ij} \left( x_{ij} - \hat{x}_{ij} \right)^2}, \quad \forall x_{ij} \in \mathbf{x}, \hat{x}_{ij} \in \hat{\mathbf{x}}\\ \text{MAE}(\mathbf{x}, \hat{\mathbf{x}})&= \frac{1}{nd} \sum_{ij} \left| x_{ij} - \hat{x}_{ij} \right|, \quad \forall x_{ij} \in \mathbf{x}, \hat{x}_{ij} \in \hat{\mathbf{x}} \end{aligned}$$
Note that from the \(\texttt{NaN}\) values in \(\tilde{\mathbf{x}}\), the missingness mask \(\mathbf{m} \in \mathbb{R}^{n \times d}\) is recoverable and can also be similarly partitioned into \(\mathbf{m}_{in} \in \mathbb{R}^{n \times d_{in}}\) and \(\mathbf{m}_{out} \in \mathbb{R}^{n \times d_{out}}\).
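The two error metrics and the mask recovery can be written in a few lines of NumPy. The helper names `rmse`, `mae`, and `recover_mask` are chosen for this illustration; both metrics average over all \(nd\) entries of the matrices.

```python
import numpy as np

def rmse(x, x_hat):
    """Root mean square error over all n*d entries."""
    return float(np.sqrt(np.mean((x - x_hat) ** 2)))

def mae(x, x_hat):
    """Mean absolute error over all n*d entries."""
    return float(np.mean(np.abs(x - x_hat)))

def recover_mask(x_obs):
    """Missingness mask from the NaN pattern: 1 = observed, 0 = missing."""
    return (~np.isnan(x_obs)).astype(int)

x = np.array([[0.0, 1.0], [1.0, 0.0]])          # real data (toy values)
x_hat = np.array([[0.0, 0.5], [1.0, 0.5]])      # imputed data (toy values)
err_rmse = rmse(x, x_hat)                       # sqrt(0.125), about 0.354
err_mae = mae(x, x_hat)                         # 0.25
m = recover_mask(np.array([[1.0, np.nan], [np.nan, 0.3]]))
```

Since correctly observed entries are copied unchanged into the imputed matrix, only the corrupted entries contribute to either error.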
The DINI framework
DINI comprises two DNNs that act as surrogate models for the data distribution. Each DNN models one side (input-to-output or output-to-input) of the dataset and runs GOBI for imputation. Thus, the surrogate model of DINI is given by \(\mathcal{F}\) that comprises two functions, one being the forward model \(f_{\theta_1}: [0, 1]^{d_{in}} \rightarrow [0, 1]^{d_{out}}\) and the other the backward model \(b_{\theta_2}: [0, 1]^{d_{out}} \rightarrow [0, 1]^{d_{in}}\). Here, \(\theta_1\) and \(\theta_2\) are the parameters, or weights of the DNNs, for the forward and backward models, respectively. DINI involves interleaved training of the surrogate model \(\mathcal{F}\) (where the neural network parameters \(\theta_1\) and \(\theta_2\) are updated) and imputation (where the \(\hat{\mathbf{x}}\) data are updated).
Algorithm 1 summarizes this interleaved training-and-imputation pipeline. First, isNaN() recovers the missingness masks in the input and output data (line 17). Then, initImpute() takes the observed data and outputs them after running an initial imputation on the \(\texttt{NaN}\) values so that the data are amenable to training the surrogate model (line 18). This could be either mean, random, or zero imputation. Based on our tests, zero imputation performs the best. This could be attributed to the high gradient of the logistic function at zero, leading to faster convergence for the corrupted values. Then, we run interleaved training and imputation until convergence (lines 22-23). Here, when the new imputed data gets close enough to the old data based on a threshold \(\epsilon_{\texttt{DINI}}\) (line 24), the algorithm reaches convergence. During training, the forward and backward models are trained by backpropagating the gradients of an appropriate loss function to their respective parameters (\(\theta_1\) and \(\theta_2\); line 5). The red color shows the operation of gradients towards the weights. Here, we show stochastic gradient descent for simplicity, although we used the Adam optimizer^{38} in our experiments. To account for both continuous and categorical values in the input and output features, we consider the loss function as a sum of the RMSE and the MAE between the predicted and actual data matrices. Mathematically,
$$\begin{aligned} \mathcal{L}^f(\mathbf{x}, \hat{\mathbf{x}}) = \mathcal{L}^b(\mathbf{x}, \hat{\mathbf{x}}) = \text{RMSE}(\mathbf{x}, \hat{\mathbf{x}}) + \text{MAE}(\mathbf{x}, \hat{\mathbf{x}}) \end{aligned}$$
The loss function could also have leveraged the categorical cross-entropy loss, where the variables are known to be categorical and one-hot encoded. During imputation, the model weights are frozen and gradients are computed towards the respective inputs, i.e., \(\hat{\mathbf{x}}_{in}^p\) and \(\hat{\mathbf{x}}_{out}^p\) (line 12). Again, the blue color represents the operation for gradients towards the features. We only impute that part of the data that is known to be corrupted, using the masks \(\mathbf{m}_{in}\) and \(\mathbf{m}_{out}\). Leveraging Monte Carlo (MC) dropout^{39}, the forward and backward models output the data distribution, whose standard deviation gives the uncertainty. Partial imputation can be performed based on the least uncertain predictions. This is implemented by the maskedUpdate() function (line 13). If some variables are categorical, this function also forces the corresponding imputed values to 0 or 1 based on a threshold (set to 0.5). Training or imputation reaches convergence when the \(L_1\)-norm of the respective gradients falls below a threshold. Finally, the DINI() function outputs the trained surrogate model \(\mathcal{F}\) along with the imputed data matrix \(\hat{\mathbf{x}}\) (line 25). Note that, unlike what Figure 1c shows, we implement the surrogate model as a set of two functions (\(f_{\theta_1}\) and \(b_{\theta_2}\)) that we train in tandem. This aids the implementation of GOBI in a conserved manner. We defer the implementation of DINI using weight-shared models, or even a single model, to future work.
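The interleaved pipeline can be illustrated with a heavily simplified NumPy sketch. It is not Algorithm 1 itself: it assumes a plain squared-error loss instead of RMSE + MAE, uses stochastic gradient descent with fixed step counts instead of Adam with convergence thresholds, and omits MC dropout and the categorical thresholding. The names `TinyNet` and `dini_sketch` and all default values are hypothetical.

```python
import numpy as np

class TinyNet:
    """One hidden layer: leaky ReLU inside, sigmoid at the output."""

    def __init__(self, d_in, d_out, hidden, rng):
        self.W1 = rng.normal(0.0, 0.5, (d_in, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.5, (hidden, d_out))
        self.b2 = np.zeros(d_out)

    def forward(self, a):
        self.a = a
        self.z1 = a @ self.W1 + self.b1
        self.h = np.where(self.z1 > 0, self.z1, 0.01 * self.z1)
        self.y = 1.0 / (1.0 + np.exp(-(self.h @ self.W2 + self.b2)))
        return self.y

    def backward(self, target):
        """Squared-error gradients w.r.t. the weights (row-averaged) and the input (per row)."""
        n = target.shape[0]
        dz2 = 2.0 * (self.y - target) * self.y * (1.0 - self.y)
        dz1 = (dz2 @ self.W2.T) * np.where(self.z1 > 0, 1.0, 0.01)
        grads_w = (self.a.T @ dz1 / n, dz1.mean(axis=0),
                   self.h.T @ dz2 / n, dz2.mean(axis=0))
        return grads_w, dz1 @ self.W1.T

    def step(self, grads_w, lr):
        for p, g in zip((self.W1, self.b1, self.W2, self.b2), grads_w):
            p -= lr * g  # in-place SGD update


def dini_sketch(x_in_obs, x_out_obs, rng, hidden=16, rounds=5,
                train_steps=100, impute_steps=20, lr_w=0.1, lr_x=0.1):
    m_in, m_out = ~np.isnan(x_in_obs), ~np.isnan(x_out_obs)  # True = observed
    x_in = np.where(m_in, x_in_obs, 0.0)       # zero initial imputation
    x_out = np.where(m_out, x_out_obs, 0.0)
    f = TinyNet(x_in.shape[1], x_out.shape[1], hidden, rng)  # forward model
    b = TinyNet(x_out.shape[1], x_in.shape[1], hidden, rng)  # backward model
    for _ in range(rounds):
        for _ in range(train_steps):           # training: gradients go to the weights
            f.forward(x_in)
            f.step(f.backward(x_out)[0], lr_w)
            b.forward(x_out)
            b.step(b.backward(x_in)[0], lr_w)
        for _ in range(impute_steps):          # imputation: weights stay frozen
            f.forward(x_in)
            g_in = f.backward(x_out)[1]
            b.forward(x_out)
            g_out = b.backward(x_in)[1]
            # masked update: only entries known to be corrupted may move
            x_in = np.where(m_in, x_in, np.clip(x_in - lr_x * g_in, 0.0, 1.0))
            x_out = np.where(m_out, x_out, np.clip(x_out - lr_x * g_out, 0.0, 1.0))
    return f, b, x_in, x_out


rng = np.random.default_rng(0)
x_in_true = rng.random((64, 3))
x_out_true = x_in_true.mean(axis=1, keepdims=True)
x_in_obs = np.where(rng.random(x_in_true.shape) < 0.2, np.nan, x_in_true)
x_out_obs = np.where(rng.random(x_out_true.shape) < 0.2, np.nan, x_out_true)
f, b, x_in_hat, x_out_hat = dini_sketch(x_in_obs, x_out_obs, rng)
```

The forward model refines missing inputs through gradients with respect to its own input, and the backward model does the same for missing outputs, which mirrors the two-function design described above. The toy data and step counts are only meant to make the sketch runnable, not to reproduce the reported results.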
Experimental setup
In this section, we discuss details of the experimental setup. First, we present the model architecture and training hyperparameters. We then discuss the datasets used for the imputation problem and the surrogate modeling tasks for three mission-critical edge applications. Finally, we briefly discuss the baselines used for comparison with the DINI model.
The model architecture
As explained in section “The DINI framework”, we implemented the forward and backward models as two DNNs. For our experiments, we chose the DNNs to be FCNNs with the input and output number of neurons equal to the corresponding data dimensions. More concretely, for the forward model f (backward model b), we set the number of input neurons to \(d_{in}\) (\(d_{out}\)) and the number of output neurons to \(d_{out}\) (\(d_{in}\)). We ran a grid search over the number of hidden layers and the dimension of each hidden layer. We found that the smallest architecture that achieved a reasonable RMSE (\(<1 \times 10^{-3}\)) on the uncorrupted data (for all considered datasets) needs only one hidden layer with 512 neurons. We use leaky ReLU as the activation function for each layer except for the output layer, where we use the sigmoid activation function. Any DNN-based surrogate model can leverage DINI. Hence, for time-series datasets, we further tested LSTM-based^{35} architectures and Transformers^{23} as well. Figure 1d shows how a Transformer-based surrogate model employs DINI. However, we found that for the datasets considered, FCNNs were the simplest architectures that also performed the best in imputation performance (see section “Ablation analysis”). We leave other applications with more complex data distributions that require DINI with advanced deep learning models for future exploration. We set the hyperparameters for the DINI pipeline as follows. We set the learning rates to \(\eta_1 = \eta_2 = 1 \times 10^{-4}\), \(\eta_{in} = \eta_{out} = 5 \times 10^{-4}\). We use a weight decay of \(1 \times 10^{-3}\). We set Adam optimizer’s parameters to \(\beta_1 = 0.9\), \(\beta_2 = 0.999\). Finally, we set all convergence thresholds to \(1 \times 10^{-3}\).
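For reference, the hyperparameters listed above can be collected into a small configuration object. The class name `DiniConfig`, its field names, and the helper `fcnn_param_count` are hypothetical; the exponents of the learning rates, weight decay, and thresholds are read here as negative, which is an assumption of this sketch. The dimensions \(d_{in} = 8\) and \(d_{out} = 2\) are illustrative only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DiniConfig:
    hidden_units: int = 512               # one hidden layer
    lr_weights: float = 1e-4              # eta_1 = eta_2 (negative exponent assumed)
    lr_features: float = 5e-4             # eta_in = eta_out (negative exponent assumed)
    weight_decay: float = 1e-3
    adam_betas: tuple = (0.9, 0.999)
    convergence_threshold: float = 1e-3

def fcnn_param_count(d_in, d_out, hidden):
    """Weights plus biases of a d_in -> hidden -> d_out fully-connected network."""
    return d_in * hidden + hidden + hidden * d_out + d_out

cfg = DiniConfig()
n_forward = fcnn_param_count(8, 2, cfg.hidden_units)    # forward model f: 5634 parameters
n_backward = fcnn_param_count(2, 8, cfg.hidden_units)   # backward model b: 5640 parameters
```

The two counts differ slightly because the forward and backward models swap their input and output widths while sharing the same hidden width.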
Imputation datasets
To measure the imputation performance (in terms of RMSE and MAE), we consider a diverse set of popular machine learning datasets, including those used by previous works^{15,16}. These datasets include ones from the popular UCI repository^{40}: breast cancer Wisconsin prognostic dataset (Breast), energy efficiency dataset (Energy), and the yacht hydrodynamics dataset (Yacht). Since DINI can also tackle multi-output datasets, we consider such datasets as well. For this, we consider two prediction outputs in the Energy dataset: separate heating and cooling loads, which previous works do not^{16}. We also consider other popular datasets like the Diabetes dataset^{41} (with six blood serum estimates and the responses of interest as continuous-valued outputs), the Diamonds dataset^{42} (with carat and price as two continuous-valued prediction outputs), and the Flights dataset^{43} (with two categorical outputs, namely whether the flight was diverted or canceled, and three continuous outputs: departure and arrival delays along with the estimated flying time). Further, unlike previous works, we carry out corruption not only on input features but also on the output features.
Case studies
For case studies related to mission-critical edge applications, we consider three datasets, as described in section “Motivation”. The first is the Gas dataset^{18} that is from the UCI repository^{40}. It contains mixtures of gases at different concentrations. In the context of detecting flammable gases, we take measurements from 15 sensors as input and set the detection label for flammable gases as the categorical output. The second is the SWaT dataset^{19} with a diverse set of categorical and continuous input features, and detection of attack as the prediction label. Finally, we consider the smart-COVID detection dataset^{20} that considers age, sex, offset of days since symptoms appeared, type of pneumonia, and features extracted from chest X-rays^{44}.
Baselines
To validate DINI’s imputation and surrogate modeling performance, we compare it against various baselines, as mentioned in section “Data imputation methods”. For completeness, we present these commonly used imputation methods below:

Mean/median imputation: The method imputes the corrupted values \(\tilde{\mathbf{x}}_{ij}\) with the mean/median of all correctly observed samples along column j.

kNN imputation: The method imputes the corrupted rows i in \(\tilde{\mathbf{x}}_{ij}\) based on the kNN along column j with the weights based on the Euclidean distance to the row.

SVD imputation: The method imputes missing values based on matrix completion with iterative low-rank SVD decomposition.

MICE imputation: The method runs multiple regressions where each missing value is modeled conditioned on the observed non-missing values.

Spectral imputation: This matrix completion model uses the nuclear norm as a regularizer and imputes missing values with iterative soft-thresholded SVD.

Matrix imputation: This method finds the matrix with the minimum nuclear norm that fits the correctly observed data.

GMM imputation: This approach fits a GMM on the observed data using the expectation-maximization algorithm and imputes the missing values based on the model.

GAIN imputation: A generative-adversarial-network-based input feature imputation strategy.

GRAPE imputation: A state-of-the-art imputation method that converts data into a bipartite graph and uses a GNN model for imputation.
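As a concrete example of the simplest baselines in the list above, mean and median imputation can be sketched with NumPy's NaN-aware reductions. The function name `column_impute` is hypothetical, and the sketch assumes that every column has at least one correctly observed value.

```python
import numpy as np

def column_impute(x_obs, stat="mean"):
    """Fill NaN entries with the mean or median of the observed values in their column."""
    if stat == "mean":
        fill = np.nanmean(x_obs, axis=0)
    else:
        fill = np.nanmedian(x_obs, axis=0)
    return np.where(np.isnan(x_obs), fill, x_obs)   # fill broadcasts across rows

x_obs = np.array([[1.0, np.nan],
                  [3.0, 4.0],
                  [8.0, 8.0],
                  [np.nan, 6.0]])
x_mean = column_impute(x_obs, "mean")      # missing entries become 4.0 and 6.0
x_median = column_impute(x_obs, "median")  # missing entries become 3.0 and 6.0
```

Both variants ignore any relationship between columns, which is why they serve as lower-bound baselines for the model-based methods.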
GAIN only does input feature imputation. GRAPE either implements input feature imputation or output label prediction, but not both simultaneously. We adapt these models, based on the new formulation of DINI, as a forward and a backward model. We then apply these methods to the input and output features based on these models. We call these adaptations GAIN\(^*\) and GRAPE\(^*\).
The time complexity of the proposed DINI algorithm (see Algorithm 1) is \(\mathcal{O}(n d^2)\) for one iteration of imputation of the entire dataset. This is because the forward pass of an FCNN (and even backpropagation) implements matrix multiplication operations in practice. For the considered architecture (\(d_{in} < d\) input neurons, 512 hidden neurons, \(d_{out} < d\) output neurons), this is implemented in \(\mathcal{O}(n d^2)\) time. The same is true for both training the surrogate model and imputation. Here, training and imputation are assumed to be for a fixed number of epochs. Classical approaches like Mean and Median have \(\mathcal{O}(nd)\) time complexity. kNN has a time complexity of \(\mathcal{O}(knd)\). On the other hand, state-of-the-art DNN-based methods, GAIN and GRAPE, have time complexities \(\mathcal{O}(nd^2)\) and \(\mathcal{O}(rnh^2)\), respectively, where r is the number of neighbors sampled for each node and h is the node hidden feature dimension^{45}. The number of hidden layers is assumed to be one for both these methods. This implies that DINI is comparable to previous DNN-based methods in computational complexity.
Results
This section presents performance comparisons for DINI with baseline imputation methods. Since DINI inherently works with a DNN-based surrogate model, we subsequently present its modeling performance by testing the corresponding label detection performance on three mission-critical edge applications. Finally, we present ablation studies.
Imputation performance
We compare DINI with the baseline imputation methods described in section “Baselines”. For this comparison, we test the RMSE and MAE of the imputed data relative to the actual data when subjected to different corruption strategies (including the two newly proposed ones). Table 2 compares the imputation performance of DINI across six datasets and five corruption strategies against the considered baselines. DINI outperforms the baselines for most tasks (46 out of 60 rows). Spectral imputation performs the worst on most datasets. GAIN\(^*\) does not perform well on the Yacht dataset when subjected to corruption in both the input and output features. On average, DINI outperforms the next best imputation method, i.e., MICE, by 10.7% in terms of imputation error. Even though MICE inherently assumes the corruption to be either MCAR or MAR, DINI achieves a lower error even under these strategies for most datasets. Unlike the results presented in previous works^{15,16}, as we see here, even state-of-the-art DNN-based methods are not that effective when subjected to simultaneous input/output corruption. DINI outperforms GAIN\(^*\) and GRAPE\(^*\) by 36.8% and 33.9%, respectively.
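As a minimal sketch of the two error metrics used in this comparison, the code below computes RMSE and MAE between imputed and actual values. It assumes that errors are measured only over the entries flagged as corrupted; the text does not state whether uncorrupted entries are included, so this restriction and the function name are illustrative assumptions.

```python
import math


def imputation_errors(actual, imputed, mask):
    """Return (RMSE, MAE) over entries where mask[i][j] is True.

    actual, imputed, and mask are equally shaped lists of rows. The mask
    marks entries that were corrupted and therefore imputed (assumption).
    """
    diffs = [
        imputed[i][j] - actual[i][j]
        for i in range(len(actual))
        for j in range(len(actual[i]))
        if mask[i][j]
    ]
    if not diffs:
        raise ValueError("mask selects no entries")
    rmse = math.sqrt(sum(e * e for e in diffs) / len(diffs))
    mae = sum(abs(e) for e in diffs) / len(diffs)
    return rmse, mae


# Toy data: two entries were corrupted and imputed with errors 0.5 and -1.0.
actual = [[1.0, 2.0], [3.0, 4.0]]
imputed = [[1.0, 2.5], [2.0, 4.0]]
mask = [[False, True], [True, False]]
rmse, mae = imputation_errors(actual, imputed, mask)
```

Because RMSE squares each error before averaging, it penalizes the single large error more heavily than MAE does, which is why both metrics are reported.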
Surrogate modeling performance
Since DINI is more than an imputation method, we can leverage the implicit surrogate training for tasks beyond filling missing values. Previous works have widely used surrogate training and inference; however, seamless exploitation of corrupted data (using interleaved imputation) is novel and is broadly applicable to edge applications where corrupted sensor data are commonplace. Hence, we leverage this extra capability of DINI to obtain better surrogate models for such applications. We use three mission-critical applications as case studies. We formulate the comparison experiments as follows. For each dataset, we split the data three ways: 40%-40%-20%. We assume 40% of the data is heavily corrupted (every row contains at least one corrupted value). For this, we use MSAR or MPAR corruption with close to 100% corruption ratio. The first 40% of the data (uncorrupted) and the 40% corrupted data comprise the 80% training set for imputation and surrogate model training. We use the final 20% of the data as the test set. For like-to-like comparisons, with each imputation strategy, we use the same architecture for the surrogate model trained on the imputed data: FCNN with one hidden layer having 512 hidden neurons. Figures 2 and 3 show the modeling performance on the three datasets for imputed data from DINI and all the baseline methods. Note that we do not consider Mean imputation because it imputes categorical columns with an intermediate value that is not allowed (if the mean value is forced to 0 or 1 based on a threshold, the performance becomes close to that of Median imputation). GRAPE\(^*\) is also not considered in these comparisons since it only outputs RMSE/MAE in imputations in its graph format and does not convert the imputed data back to the tabular format for surrogate modeling. For the Gas dataset, we need to detect whether the flammable gas is observed or not. For the smart water plant (SWaT dataset), we need to detect if the system has been attacked.
Finally, for smart COVID detection, we need to detect if the patient has the disease. Since all these datasets have a single categorical output, we train the forward model in DINI with binary cross-entropy loss. In all these tasks, we not only wish to leverage the corrupted, partially observed data, but also need a high true positive rate since false negatives would incur high risks in such applications. At the same time, we also need a low false positive rate since invoking mitigating mechanisms could be costly, and performing them needlessly could result in large system overheads. Hence, we plot the F1 score along with the test accuracy.
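The three-way split described above can be sketched as follows. This is a minimal illustration under stated assumptions: rows are split contiguously in their stored order, and `drop_one_per_row` is a hypothetical stand-in that only guarantees the property that no fully observed row remains. It is not an implementation of the MSAR or MPAR corruption strategies.

```python
import random


def three_way_split(n_rows):
    """Split row indices 40%/40%/20% into clean, to-corrupt, and test sets.

    Uses integer arithmetic so the sizes are exact; any remainder rows go
    to the test set. The contiguous ordering is an assumption.
    """
    n_part = n_rows * 2 // 5
    idx = list(range(n_rows))
    clean = idx[:n_part]
    to_corrupt = idx[n_part:2 * n_part]
    test = idx[2 * n_part:]
    return clean, to_corrupt, test


def drop_one_per_row(rows, seed=0):
    """Stand-in corruption: blank exactly one random entry in every row.

    None plays the role of NaN. This is NOT MSAR/MPAR; it only ensures
    that every returned row has at least one missing value.
    """
    rng = random.Random(seed)
    corrupted = []
    for row in rows:
        row = list(row)  # copy so the caller's data is left intact
        row[rng.randrange(len(row))] = None
        corrupted.append(row)
    return corrupted


clean_idx, corrupt_idx, test_idx = three_way_split(10)
train_idx = clean_idx + corrupt_idx  # 80% used for imputation and training
```

The training set for both imputation and surrogate modeling is the union of the clean and corrupted parts, while the test part is never corrupted.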
DINI consistently outperforms the baseline imputation methods with a high test accuracy and F1 score. For example, DINI attains around 99% test accuracy and 0.99 F1 score on the Gas dataset, implying that almost all cases where a flammable gas is present are correctly detected. No other imputation strategy approaches this performance. For the SWaT and COVID datasets, DINI reaches around 96% (0.95) and 97% (0.96) average test accuracy (F1 score) across the two corruption strategies, respectively. However, for some imputation strategies, like Median imputation with the Gas dataset under MSAR corruption, the F1 score is very low even when the test accuracy is reasonable. This is because the surrogate model is heavily biased toward negative labels (since the model has not generalized well), having a high number of true negatives but few true positives. This results in a low F1 score. DINI does not suffer from this problem.
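The gap between accuracy and F1 score for a model biased toward negative labels can be checked with a small worked example. The counts below are hypothetical and chosen only to illustrate the effect; they are not results from the paper.

```python
def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1 score for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0  # define F1 as 0 with no positives
    return accuracy, f1


# Hypothetical test set: 10 positives, 90 negatives. The model predicts
# "negative" almost everywhere and recovers only one of the positives.
y_true = [1] * 10 + [0] * 90
y_pred = [1] + [0] * 9 + [0] * 90
accuracy, f1 = accuracy_and_f1(y_true, y_pred)
```

Here the accuracy is 91/100 because the many true negatives dominate, while the F1 score is only 2/11 because nine of the ten positives are missed.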
Figure 4 shows how we passed the corrupted data to the imputation models in their training set. We observe that the data imputed by DINI, shown in Fig. 4c, are very similar to the original data, shown in Fig. 4a. This striking similarity shows that DINI can reproduce the underlying data distribution even in the presence of high levels of corruption. Figure 5 compares the imputation methods under different corruption ratios and the MCAR corruption strategy on the Breast dataset. DINI consistently outperforms baselines by achieving a lower RMSE and MAE for the different corruption ratios.
Ablation analysis
To test the efficacy of our interleaved training-and-imputation strategy, we modify the DINI framework as follows. First, we train the surrogate model on the correctly observed subset and use this model for imputation (using GOBI for the input and output features from the forward and backward models). Second, we pretrain the surrogate model on the correct subset and run interleaved training and imputation on the corrupted subset. Note that we impute all the data at every iteration. Third, we run interleaved training and imputation from scratch (i.e., with no pretraining) on the entire dataset, as described in the DINI pipeline above. However, we attempt to leverage the uncertainty in prediction through the MC dropout layer. We thus only impute part of the data, where the model is the least uncertain. Based on the uncertainty values for the entire data matrix, we start at the 25th, 50th, or the 75th percentile of the uncertainties and impute only part of the data accordingly. To account for the surrogate model getting better towards the end of training, we linearly increase the imputation ratio to 100%. Table 3 shows the results on the Breast dataset with MCAR corruption (other datasets showed similar results). We observe that the method involving interleaved training and (complete) imputation from scratch outperforms previous approaches. Here, by complete imputation, we mean that 100% of the data are imputed at every iteration, regardless of the uncertainties. We explain this as follows. In the first approach, we do not leverage imputed data to improve the surrogate model further. In the second approach, after pretraining the surrogate model, training on the imputed data causes a distribution shift, as the model cannot train along with the correctly observed data. Finally, partial imputation adds to the bias present in the surrogate model initially, resulting in a higher imputation error.
However, certain tasks requiring multiple solutions could benefit from the uncertainties in predictions. Taking inspiration from some recent works^{46} that leverage GOBI, we also tested second-order gradients using the AdaHessian optimizer^{38} in DINI’s surrogate model. This only provides marginal gains (reduction in MAE by 0.001) that are not statistically significant. Due to the high overhead of calculating these gradients, we stayed with first-order gradients in our experiments.
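The uncertainty-guided partial imputation variant in the ablation can be sketched as below. This is a minimal illustration, not the authors' implementation: it assumes the uncertainties are supplied as a flat list (one value per corrupted entry, e.g., obtained from MC dropout), that starting at the 25th percentile corresponds to a starting ratio of 0.25, and that the ramp is linear in the iteration index. Both function names are hypothetical.

```python
def imputation_ratio(step, total_steps, start_ratio):
    """Linearly ramp the imputed fraction from start_ratio to 1.0."""
    if total_steps <= 1:
        return 1.0
    frac = step / (total_steps - 1)
    return start_ratio + (1.0 - start_ratio) * frac


def select_entries(uncertainties, ratio):
    """Indices of the `ratio` fraction of entries with lowest uncertainty."""
    n = len(uncertainties)
    order = sorted(range(n), key=lambda i: uncertainties[i])
    k = min(n, max(1, int(ratio * n + 0.5)))  # round to nearest, keep >= 1
    return set(order[:k])


# Toy uncertainties for four corrupted entries and a five-iteration schedule.
uncertainties = [0.9, 0.1, 0.5, 0.3]
schedule = [imputation_ratio(t, 5, start_ratio=0.5) for t in range(5)]
first = select_entries(uncertainties, schedule[0])   # least uncertain half
last = select_entries(uncertainties, schedule[-1])   # every entry
```

Setting `start_ratio=1.0` reduces this schedule to complete imputation at every iteration, which is the configuration reported to perform best.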
DINI supports diverse DNN-based surrogate models, including advanced architectures like LSTMs and Transformers. Table 4 compares these architectures with the FCNN used in our experiments for time-series datasets. FCNN performs slightly better than a Transformer with six encoder layers in most cases while being 24,603\(\times\) smaller on average. This could be due to the FCNN having enough capacity for the chosen datasets, while the Transformer overfits on the training data, resulting in lower performance.
Discussion
As discussed in section “Results”, DINI outperforms baseline methods in various experimental settings. The interleaved training-and-imputation pipeline enables high gains compared to the state-of-the-art methods. Further, it directly incorporates heterogeneous input and output feature formats (continuous, categorical, or a combination thereof). These advancements make it better at imputing data compared to traditional approaches. Unlike previous works, it is a unified framework that supports diverse DNN-based model architectures.
However, DINI has several limitations. For instance, it only imputes data that are known to be corrupted. One could also encounter adversarial data with fraudulent input feature values and noisy labeled data, where the corrupted data are not in the NaN form. Detecting such data falls under the scope of adversarial attack detection^{47} and confident learning^{48}, respectively. One could extend the DINI model by incorporating aleatoric loss^{49} to account for such corruptions. We can also prune or correct the input or output entries with high uncertainties^{50} (after conversion to NaN values and subsequent imputation). We defer this to future work.
Conclusions
In this article, we presented DINI, a pipeline for interleaved training of a surrogate model and imputation of data, leveraging gradients towards the input and output features in the model. DINI tackles corruption in both the input and output values, along with mixed continuous and categorical features in either. For better-posed problem formulation in edge-AI settings, we proposed novel corruption strategies that model the distribution of corrupted data in such applications more closely. We showed that DINI outperforms all baseline imputation methods, including state-of-the-art DNN-based models, achieving 10.7% lower imputation error relative to the next best baseline. Finally, we tested the modeling performance of DINI on mission-critical edge applications and showed that it can reach up to 99% test accuracy and 0.99 F1 score when detecting labels in such settings.
Data availability
All data and code are available in the supplementary files. The code and relevant testing scripts are made publicly available on GitHub under the BSD-3 license at https://github.com/jhalab/dini.
References

Gill, S. S. et al. Transformative effects of IoT, blockchain and artificial intelligence on cloud computing: Evolution, vision, trends and open challenges. Internet Things 8, 100–118 (2019).
Vailshery, L. S. IoT connected devices worldwide 2019-2030. https://www.statista.com/statistics/1183457/iotconnecteddevicesworldwide/. Accessed 14 June 2022 (2022).

Ding, A. Y. et al. Roadmap for edge AI: A Dagstuhl perspective. ACM SIGCOMM Comput. Commun. Rev. 52, 28–33 (2022).
Rausch, T. & Dustdar, S. Edge intelligence: The convergence of humans, things, and AI. Proc. Int. Conf. Cloud Eng. 2019, 86–96 (2019).

Dustdar, S., Casamajor Pujol, V. & Donta, P. K. On distributed computing continuum systems. IEEE Trans. Knowl. Data Eng. 2022, 156 (2022).

Zhang, K., Leng, S., He, Y., Maharjan, S. & Zhang, Y. Mobile edge computing and networking for green and low-latency Internet of Things. IEEE Commun. Mag. 56, 39–45 (2018).
Akmandor, A. O. & Jha, N. K. Smart health care: An edge-side computing perspective. IEEE Consumer Electron. Mag. 7, 29–37 (2017).
El-Sefy, M., Yosri, A., El-Dakhakhni, W., Nagasaki, S. & Wiebe, L. Artificial neural network for predicting nuclear power plant dynamic behaviors. Nucl. Eng. Technol. 53, 3275–3285 (2021).
Yun, M. & Yuxin, B. Research on the architecture and key technology of Internet of Things (IoT) applied on smart grid. In Proc. Int. Conf. Advances in Energy Engineering 69–72 (2010).

Datta, S. K., Da Costa, R. P. F., Härri, J. & Bonnet, C. Integrating connected vehicles in Internet of Things ecosystems: Challenges and solutions. In Proc. Int. Symp. World of Wireless, Mobile and Multimedia Networks 1–6 (2016).

Gaddam, A., Wilkin, T., Angelova, M. & Gaddam, J. Detecting sensor faults, anomalies and outliers in the Internet of Things: A survey on the challenges and solutions. Electronics 9, 511 (2020).
Emmanuel, T. et al. A survey on missing data in machine learning. J. Big Data 8, 1–37 (2021).
Malarvizhi, R. & Thanamani, A. S. K-nearest neighbor in missing data imputation. Int. J. Eng. Res. Dev. 5, 5–7 (2012).

van Buuren, S. & Groothuis-Oudshoorn, K. mice: Multivariate imputation by chained equations in R. J. Stat. Softw. 45, 1–67 (2011).
Yoon, J., Jordon, J. & van der Schaar, M. GAIN: Missing data imputation using generative adversarial nets. Proc. Int. Conf. Mach. Learn. 80, 5689–5698 (2018).

You, J., Ma, X., Ding, D. Y., Kochenderfer, M. & Leskovec, J. Handling missing data with graph representation learning. In Proc. Int. Conf. Neural Information Processing Syst. 19075–19087 (2020).

Duncan, M. A., Wu, J., Neu, M. C. & Orr, M. F. Persons injured during acute chemical incidents: Hazardous substances emergency events surveillance, 1999–2008. Morb. Mort. Wkly. Rep.: Surveill. Summ. 64, 18–24 (2015).

Fonollosa, J., Sheik, S., Huerta, R. & Marco, S. Reservoir computing compensates slow response of chemosensor arrays exposed to fast varying gas concentrations in continuous monitoring. Sens. Actuat. B Chem. 215, 618–629 (2015).
Goh, J., Adepu, S., Junejo, K. N. & Mathur, A. A dataset to support research in the design of secure water treatment systems. In Proc. Critical Information Infrastructures Security 88–99 (2017).

Cohen, J. P. et al. COVID-19 image data collection: Prospective predictions are the future. Mach. Learn. Biomed. Imaging 1, 55 (2020).

Tuli, S., Poojara, S. R., Srirama, S. N., Casale, G. & Jennings, N. R. COSCO: Container orchestration using co-simulation and gradient based optimization for fog computing environments. IEEE Trans. Parallel Distrib. Syst. 33, 101–116 (2021).
Kindermann, J. & Linden, A. Inversion of neural networks by gradient descent. Parallel Comput. 14, 277–286 (1990).
Vaswani, A. et al. Attention is all you need. Proc. Int. Conf. Neural Inf. Process. Syst. 30, 5998–6008 (2017).

Seaman, S., Galati, J., Jackson, D. & Carlin, J. What is meant by “Missing at Random”? Stat. Sci. 28, 257–268 (2013).
Muzellec, B., Josse, J., Boyer, C. & Cuturi, M. Missing data imputation using optimal transport. Proc. Int. Conf. Mach. Learn. 119, 7130–7140 (2020).

Rubin, D. B. Inference and missing data. Biometrika 63, 581–592 (1976).
Candès, E. J. & Recht, B. Exact matrix completion via convex optimization. Found. Comput. Math. 9, 717–772 (2009).
Mazumder, R., Hastie, T. & Tibshirani, R. Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res. 11, 2287–2322 (2010).
Troyanskaya, O. et al. Missing value estimation methods for DNA microarrays. Bioinformatics 17, 520–525 (2001).
García-Laencina, P. J., Sancho-Gómez, J.-L. & Figueiras-Vidal, A. R. Pattern classification with missing data: A review. Neural Comput. Appl. 19, 263–282 (2010).
Gondara, L. & Wang, K. MIDA: Multiple imputation using denoising autoencoders. Proc. Knowl. Discov. Data Min. 1, 260–272 (2018).
Pan, Z. et al. Imputation of missing values in time series using an adaptive-learned median-filled deep autoencoder. IEEE Trans. Cybern. 2022, 1–12 (2022).

Xu, D., Peng, H., Wei, C., Shang, X. & Li, H. Traffic state data imputation: An efficient generating method based on the graph aggregator. IEEE Trans. Intell. Transp. Syst. 23, 13084–13093 (2022).
Lecun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998).
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
Leskovec, J., Rajaraman, A. & Ullman, J. D. Mining of Massive Datasets (Cambridge University Press, 2014).

Chai, T. & Draxler, R. R. Root mean square error (RMSE) or mean absolute error (MAE)?—arguments against avoiding RMSE in the literature. Geosci. Model Dev. 7, 1247–1250 (2014).
Yao, Z. et al. Adahessian: An adaptive second order optimizer for machine learning. Proc. AAAI Conf. Artif. Intell. 35, 10665–10673 (2021).

Gal, Y. & Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. Proc. Int. Conf. Mach. Learn. 48, 1050–1059 (2016).

Dua, D. & Graff, C. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml. Accessed 14 June 2022 (2017).

Efron, B., Hastie, T., Johnstone, I. & Tibshirani, R. Least angle regression. Ann. Stat. 32, 407–451 (2004).
Wickham, H. ggplot2: Elegant Graphics for Data Analysis (Springer, 2016).
U.S. Department of Transportation. https://transtats.bts.gov/Homepage.asp. Accessed 14 June 2022 (2022).

Wang, L., Lin, Z. Q. & Wong, A. COVID-Net: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images. Sci. Rep. 10, 1–12 (2020).
Wu, Z. et al. A comprehensive survey on graph neural networks. IEEE Trans. Neural Netw. Learn. Syst. 32, 4–24 (2021).
Tuli, S., Dedhia, B., Tuli, S. & Jha, N. K. FlexiBERT: Are Current Transformer Architectures Too Homogeneous and Rigid? (2022). arXiv: 2205.11656.

Pang, T., Du, C., Dong, Y. & Zhu, J. Towards robust detection of adversarial examples. In Proc. Int. Conf. Neural Information Processing Syst. vol. 31 (2018).

Northcutt, C., Jiang, L. & Chuang, I. Confident learning: Estimating uncertainty in dataset labels. J. Artif. Intell. Res. 70, 1373–1411 (2021).
Wang, H., Shi, X. & Yeung, D.-Y. Natural-parameter networks: A class of probabilistic neural networks. In Proc. Int. Conf. Neural Information Processing Syst. 118–126 (2016).

Abdellatif, A. A., Chiasserini, C. F., Malandrino, F., Mohamed, A. & Erbad, A. Active learning with noisy labelers for improving classification accuracy of connected vehicles. IEEE Trans. Veh. Technol. 70, 3059–3070 (2021).
Acknowledgements
This work was supported by NSF under Grant No. CNS-1907831. We also acknowledge discussions and support from Shreshth Tuli.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Tuli, S., Jha, N.K. DINI: data imputation using neural inversion for edge applications.
Sci Rep 12, 20210 (2022). https://doi.org/10.1038/s41598022243691
DOI: https://doi.org/10.1038/s41598022243691