1 Introduction
Consider a deep neural network model with parameters $\theta$, which we train using the following update rule,

$$\theta_{k+1} = \theta_k - \eta\,\nabla_\theta\,\mathbb{E}_x\!\left[\ell(\theta_k, x)\right], \qquad (1)$$

where $x$ is a random variable representing the data and $\ell$ represents the loss function. Now, consider a slightly more general form of the same update formula,

$$\theta_{k+1} = \theta_k - \eta\,\nabla_\theta\,\mathbb{E}_x\!\left[\ell(\theta_k, Px)\right]. \qquad (2)$$

The only change here is the introduction of $P$, which can be assumed to be some data transformation matrix. If $P = I$, we see that Stochastic Gradient Descent (SGD) is a special case of (2), under the assumption that we approximate the expectation in (2) with finitely many iid samples (a minibatch).

Let us unpack the data transformation notation a bit to check what it offers. If a set of transformations is chosen beforehand and applied to the data samples before training commences, $Px$ simply represents data samples derived via data augmentation. On the other hand, $P$ need not be explicitly instantiated as above. For example, spherical CNNs [16] show that when point-cloud-type data are embedded on the sphere with spherical convolutional operators, it is possible to learn representations of the data that are equivariant to the group action of rotations with no explicit data augmentation procedure. In particular, these approaches register each data point on a standard template (like the sphere) on which efficient convolutions can be defined based on differential geometric constructions. In other words, by utilizing the properties of the transformations of interest and how they relate the data points, such a treatment enables the updates to implicitly take into account the loss on $Px$. Conceptually, many results [16, 48, 42] on equivariance show that by considering the entire orbit of each sample (a 3D point cloud) during training, for special types of $P$, it is possible to avoid explicit data augmentation.
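To make the special-case relationship concrete, here is a minimal numpy sketch (our own illustration, not the paper's code) of the update rule (2) on a toy loss $\ell(\theta, x) = \tfrac{1}{2}\|\theta - x\|^2$; setting the transformation $P$ to the identity recovers plain SGD as in (1):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, size=(256, 3))   # synthetic data samples

def sgd_step(w, xb, lr, P):
    # Update rule (2): each sample is transformed by P before the gradient step.
    # For the toy loss l(w, x) = 0.5 * ||w - x||^2, the minibatch gradient
    # is w - mean(P x) over the batch.
    xb = xb @ P.T
    return w - lr * (w - xb.mean(axis=0))

w = np.zeros(3)
P = np.eye(3)                             # P = I recovers plain SGD, as in (1)
for i in range(0, 256, 32):
    w = sgd_step(w, X[i:i+32], lr=0.5, P=P)
# w is pulled toward the (transformed) data mean
```

Replacing `P` with, say, a rotation matrix implements the fixed-transformation view of data augmentation discussed above.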
We can take a more expanded view of the above idea. Repeated application of a transformation $P$ to a data point $x$ produces a discrete sequence $x, Px, P^2x, \ldots$. In general, the transformation matrix at the $s$th step, denoted by $P_s$, need not even be generated from a fixed matrix. Indeed, in practice $P_s$ is selected from a set of appropriate transformations, such as rotation, blur and so on, with some ordering, which could even be stochastic. At a high level, approaches such as [16, 12] can be seen as a special case of (2). Making this argument precise needs adding an appropriate number of auxiliary variables and averaging over all possible realizable $P_s$'s; the specific steps are not particularly relevant since, apart from helping set up the intuition we just described, algorithms for equivariance to specific group actions do not directly inform our development. For the sake of convenience, we will primarily focus on the continuous time system since, under the same initial conditions, the trajectories of both (continuous and discrete) systems coincide at all integer times $s$.
What does $P$ actually represent? There are two interpretations of $P$: (i) it formalizes on-the-fly or instantaneous (smooth) data augmentation, which is often used to accelerate training by exploiting symmetries in the landscape of the loss $\ell$, and (ii) a data-dependent $P$ can be designed for invariance-like requirements, useful for downstream applications. In fact, learning data-dependent transformations has also been explored by [14]. The starting point of this work is to exploit the view that the data sample provided to us is merely a snapshot of an underlying process, which we will discuss shortly. Nonetheless, the key hypothesis is that specifying this process to our deep neural network model will be beneficial and will provide a fresh perspective on some strategies that are already in use in the literature.
Main ideas.
The foregoing use of "process" to describe the data sample hints at the potential use of ordinary differential equations (ODEs). While ODE-type constructions can be used to characterize simple processes, they are insufficient for modeling more complex processes that better reflect practical considerations; stochastic differential equations (SDEs) offer a richer alternative. The key challenge in directly instantiating the "$Px$" idea for SDEs is that it is clearly infeasible: there are infinitely many possible trajectories for the same initial conditions. Our main insight is that recent results in the SDE literature show that (under some technical conditions) the dynamics can be completely characterized by (functions of) the infinitesimal generator of the process, which can be efficiently estimated using finite data. We exploit this result via a simple modification to the estimation procedure, one that can be directly used within any backpropagation-based training scheme. Specifically, we exploit the result from [2], where the authors call the generator estimator the Target Measure Diffusion map (TMDmap). This leads to our TMD-layer, which can be conveniently dropped into a network and used as a plug-and-play module with just a few additional parameters. When utilized within standard deep learning pipelines, our layer allows incorporating much richer domain information if available, or can serve as a regularizer, an augmentation scheme, or a substitute for an existing layer. We find this is beneficial to the overall performance of the model.
Our contributions. Models such as Neural ODEs [10] and Neural SDEs [34] usually parameterize the dynamical system as a standalone model and show how gradients can be efficiently backpropagated through this module. We take a different line of approach: we propose a stochastic process inspired layer which, in its most rudimentary form, can be thought of as an augmentation scheme that works with existing layers in deep neural networks. But different from explicit data augmentation (rotation, flipping) that happens in the input image space, our layer can be utilized in the feature space and is fully adaptive to the input. It is, however, more than another augmentation scheme. Our layer allows modeling the time-varying/stochastic properties of the data/features, and controls them via a proper parameterization which is highly parameter efficient. We show that this stochasticity is not only mathematically interesting, but can be exploited in applications including point cloud transformers, object segmentation and few-shot recognition.
1.1 Related Work.
Early work in vision has made extensive use of differential equations [7, 36, 45, 6], especially for segmentation. In machine learning, differential equations are useful for manifold learning [3, 4, 38], among others. Recently, a number of strategies combine differential equations with deep neural networks (DNNs) for solving vision problems. For example, [9] utilizes a conditional random field after the CNN encoder to refine semantic segmentation results, whose update rules can be viewed as a differential equation, and [37, 22] use a CNN to extract visual features before feeding them to an active contour model which iteratively refines the contour according to a differential equation. Separately, the literature includes strategies for solving differential equations with DNNs [24, 39, 32]. Over the last few years, a number of formulations including neural ODEs [10], neural SDEs [34] and augmented neural ODEs [15] have been proposed, motivated by the need to solve differential equation modules within DNNs. Note that [34] proposes to stabilize the neural ODE network with stochastic noise, which leads to a neural SDE, a setting quite different from the one studied here. Finally, we note that SDEs as a tool have also been used for stochastic analysis of DNNs [8].

2 Preliminaries
Background. A partial differential equation (PDE) is a functional equation in which the solution satisfies given relations between its various partial derivatives, interpreted as multivariable functions. Consider a commonly used PDE model for segmentation, the heat equation $\partial_t u = \Delta u$, where the solution $u(x, t)$ depends on both space $x$ and time $t$. By the celebrated Feynman-Kac formula, we know that the solution can be equivalently written as a conditional expectation with respect to a continuous-time stochastic process $X_t$. This means that the solution (segmentation) can be obtained by averaging a sequence of stochastic integration problems. For prediction, we need an algebraic concept called the "generator" of a function (like a neural network) since we are more interested in the pushforward mappings $f(X_t)$.
Given a time-invariant stochastic process $X_t$, the (infinitesimal) generator $\mathcal{L}$ of a function $f$ is defined as,

$$\mathcal{L}f(x) = \lim_{t \downarrow 0} \frac{\mathbb{E}\!\left[f(X_t)\mid X_0 = x\right] - f(x)}{t}. \qquad (3)$$
If the process is deterministic, the expectation operator becomes the identity, and so the generator simply measures the instantaneous rate of change of $f$ with respect to $t$. In addition, say that $X_t$ can also be expressed as an (Itô) Stochastic Differential Equation (SDE), i.e., $X_t$ satisfies:

$$dX_t = b(X_t)\,dt + \sigma(X_t)\,dW_t, \qquad (4)$$

where $W_t$ is a (multidimensional) Brownian motion with covariance $\Sigma$, and $b$ and $\sigma$ represent the drift and diffusion functions. Then, it turns out that $\mathcal{L}$ can be written in closed form (without the limit) as,

$$\mathcal{L}f = \sum_i b_i\,\frac{\partial f}{\partial x_i} + \frac{1}{2}\sum_{i,j}\left(\sigma\sigma^{\top}\right)_{ij}\frac{\partial^2 f}{\partial x_i\,\partial x_j}, \qquad (5)$$

where $\mathcal{L}$ acts as a linear operator on functions $f$; see [29]. We will shortly explain how to estimate and utilize $\mathcal{L}$.
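The definition (3) and the closed form (5) can be checked numerically. The following sketch (our own, for intuition only) estimates the generator of standard Brownian motion by Monte Carlo and compares it against $\frac{1}{2}\Delta f$:

```python
import numpy as np

rng = np.random.default_rng(0)

def generator_mc(f, x, t=1e-2, n=100_000):
    # Monte Carlo estimate of (3) for standard Brownian motion started at x:
    # (E[f(X_t) | X_0 = x] - f(x)) / t, with X_t = x + sqrt(t) * N(0, I).
    # Antithetic pairs +/-Z cancel the first-order term and reduce variance.
    Z = rng.normal(size=(n, x.size))
    pert = np.sqrt(t) * Z
    samples = np.concatenate([x + pert, x - pert])
    return (f(samples).mean() - f(x[None, :]).item()) / t

f = lambda X: (X ** 2).sum(axis=1)   # f(x) = ||x||^2
x = np.array([1.0, -2.0, 0.5])
est = generator_mc(f, x)
# For standard Brownian motion (b = 0, sigma = I), (5) gives
# Lf = (1/2) * Laplacian(f) = d, so est should be close to 3 here.
```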
Setup. Consider the setting where $x$ represents our input features (say, an image as a 3D array for the RGB channels) and $f$ is a network with $m$ layers. Let the data be in the form of $n$ points $x_1, \ldots, x_n$ with $x_i \in \mathbb{R}^D$, which lie on a compact $d$-dimensional differentiable submanifold $\mathcal{M}$ which is assumed to be unknown. We assume that in our case $\mathcal{M}$ is defined implicitly using the samples $x_i$, and so it is impossible to obtain closed form expressions for the operators in (5). In such cases, recall that, when $X_t$ is simply Brownian motion on $\mathcal{M}$, Diffusion maps [13] uncovers the geometric structure by using $x_1, \ldots, x_n$ to construct an $n \times n$ matrix as an approximation of the linear operator $\mathcal{L}$.
Interpreting the SDE. Recall that when (4) is used on the input space, it can model stochastic transformations of the input image (rotation and clipping are special cases). When (4) is used on the feature space (e.g., in an intermediate layer of a DNN), it can then model stochastic transformations of the features, where it is hard to hand-design augmentation methods. Moreover, it enables us to parameterize and learn the underlying stochastic changes/SDE of the features.
Roadmap. In the next section, we describe the estimation of the differential operator $\mathcal{L}$ within deep network training pipelines. Based on this estimate, we define the TMD-layer as an approximation to $f(X_{t+\delta t})$ for a small time interval $\delta t$ using Taylor's theorem. In §4, we discuss four different applications of the TMD-layer, where the pushforward measure under the flow of features (interpreted as a vector field) may be a reasonable choice.

3 Approximating $\mathcal{L}$ in Feedforward Networks
We now discuss a recently proposed nonparametric procedure to estimate $\mathcal{L}$ given finite samples. This is an important ingredient because in our setup, we often do not have a meaningful model of the minibatch samples, especially in the high dimensional setting (e.g., images).
Constructing $\hat{\mathcal{L}}$ in DNN training. The definition in (3), while intuitive, is not immediately useful for computational purposes. Under some technical conditions, such as smoothness of $b$ and $\sigma$ and a rank condition on $\sigma$, [2] recently showed that for processes that satisfy (4), it is indeed possible to construct finite sample estimators of $\mathcal{L}$. In [2], the approach is called Target Measure Diffusion map (TMDmap), so we call our proposed layer a TMD-layer.
To construct the differential operator, we first need to compute a kernel matrix from the data. For problems involving a graph or a set of points as input, we can simply use the given data points ($n$ would be the number of nodes in the graph, or the number of points in the set), while for problems with a single input (e.g., standard image classification), we may not have access to $n$ data points directly. In this case, we can construct the kernel matrix by sampling a batch of $n$ samples from the dataset and processing them together, because we can often assume that the entire dataset is, in fact, sampled from some underlying distribution.
After getting the set of $n$ data samples, we first project the data into a latent space of a suitable dimension $k$ using a learnable linear layer, obtaining $z_1, \ldots, z_n$, before evaluating them with an appropriate kernel function such as,

$$K_{ij} = \exp\!\left(-\frac{\|z_i - z_j\|^2}{2\epsilon}\right). \qquad (6)$$

We then follow [2] to construct the differential operator as follows: we compute the kernel density estimate $q_i = \sum_j K_{ij}$. Then, we form the diagonal matrix $\Pi$ with components $\Pi_{ii} = \pi^{1/2}(z_i)/q_i$, where $\pi$ is the target density. Here, we allow the network to learn $\pi$ by

$$\pi(z_i) = g_{\phi}(z_i), \qquad (7)$$

where $g_{\phi}$ can be a linear layer or an MLP depending on the specific application. Next, we use $\Pi$ to right-normalize the kernel matrix, $\tilde{K} = K\Pi$, and use $D$, which is the diagonal matrix of row sums of $\tilde{K}$, to left-normalize $\tilde{K}$. Then we can build the TMDmap operator as

$$\hat{\mathcal{L}} = \frac{1}{\epsilon}\left(D^{-1}\tilde{K} - I\right). \qquad (8)$$
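For concreteness, a minimal numpy sketch of the construction in (6)-(8) follows (our own illustration, not the released code; the learned density $\pi$ is replaced by a uniform placeholder, and the additional bandwidth-dependent scalings of [2] are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def tmd_operator(X, pi, eps=0.5):
    # Finite-sample TMDmap estimator of the generator, following (6)-(8).
    # X: (n, k) latent features; pi: (n,) target density evaluated per point.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2 * eps))               # Gaussian kernel, as in (6)
    q = K.sum(axis=1)                         # kernel density estimate
    K_tilde = K * (np.sqrt(pi) / q)[None, :]  # right-normalize by sqrt(pi)/q
    D = K_tilde.sum(axis=1)                   # row sums for left normalization
    L_hat = (K_tilde / D[:, None] - np.eye(len(X))) / eps
    return L_hat

X = rng.normal(size=(32, 16))
pi = np.ones(32)                              # uniform stand-in for the learned pi of (7)
L_hat = tmd_operator(X, pi)
# Each row of L_hat sums to zero: constants lie in the kernel of a generator.
```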
We will use (8) to form our TMD-layer, as described next.
3.1 TMD-layer: A Transductive Correction via $\hat{\mathcal{L}}$
Observe that (4) is very general and can represent many computer vision tasks, where the density $\pi$ could be defined using a problem-specific energy function, and the Brownian motion $W_t$ is the source of noise. In other words, we aim to capture the underlying structure of the so-called image manifold [61] by using its corresponding differential operator (5). Intuitively, this means that if we are provided a network $f$ with parameters $\theta$, then by Taylor's theorem, the infinitesimal generator estimate $\hat{\mathcal{L}}$ can be used to approximate the change of $f$ as follows:

$$F_{t+\delta t} \approx F_t + \delta t\,\hat{\mathcal{L}}\,F_t, \qquad (9)$$

where $F_t \in \mathbb{R}^{n \times d}$ such that the $i$th row is $f(x_i)$, and $\delta t$ is interpreted as a hyperparameter in our use cases, see Algorithm 1.

Inference using $\hat{\mathcal{L}}$. In the ERM framework, typically, each test sample is used independently and identically, i.e., the network $f$ (at optimal parameters $\theta^*$) is used in a sequential manner for predictive purposes. Our framework allows us to further use relationships between the test samples for prediction. In particular, we can design custom choices of the process tailored for downstream applications. For example, in applications that require robustness to small and structured perturbations, it may be natural to consider low-bias diffusion processes, i.e., we can prescribe the magnitude using $\|b\| \le c$ almost everywhere for some small constant $c$ (akin to a radius of perturbation) and the structure using the diffusion functions $\sigma$. Inference can then be performed using generators derived from the corresponding process.
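In code, the correction in (9) is a one-line addition to any batched layer. A minimal sketch (our own, with a placeholder operator) is:

```python
import numpy as np

rng = np.random.default_rng(0)

def tmd_layer(f, X, L_hat, t=0.1):
    # TMD-layer forward pass, a sketch of (9): stack the outputs of f on the
    # batch into F, then add the first-order Taylor correction t * L_hat @ F.
    F = f(X)
    return F + t * (L_hat @ F)

# Toy usage: a linear "layer" f. With L_hat = 0 the layer reduces to plain f,
# so the TMD term acts as a transductive correction on top of an existing layer.
X = rng.normal(size=(8, 4))
W = rng.normal(size=(4, 4))
f = lambda Z: Z @ W
L_hat = np.zeros((8, 8))
out = tmd_layer(f, X, L_hat)
```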
Layerwise $\hat{\mathcal{L}}$ for improved estimation of $f(X_{t+\delta t})$. While (9) allows us to use $\hat{\mathcal{L}}$ for any network $f$ with no modifications, using it naively can be unsatisfactory in practice. For example, we often find that features from input layers might not be too informative for the task and may hinder training, especially in the early stages. We suggest a simple adjustment: instead of applying the approximation in (9) to the entire network, we do it layerwise. This could be every intermediate layer or several layers of interest. It means that $f$ can in principle be any layer (e.g., a layer in a graph neural network or a layer in a Resnet), as shown in Fig. 1.
Justification. Recall that most feedforward neural networks can be completely defined by a finite sequence of linear transformations followed by activation functions (along with intermediate normalization layers), $f = f_m \circ \cdots \circ f_1$. One option is to estimate $f(X_{t+\delta t})$ by directly applying the Taylor-series-like expansion in (9) to the full composition, where $m$ represents the number of layers. However, from (9) we can see that the variance of such an estimate will be high due to the well-known propagation of uncertainty phenomenon (across the $f_l$'s). To avoid this, we can estimate in a sequential manner, i.e., use the corrected output of layer $l$ to estimate that of layer $l+1$. We will show in §4 that this parameterization can be useful in various applications.

Synopsis. We briefly summarize the benefits of our TMD-layer. Our TMD-layer can parameterize the underlying stochastic transformations of features, providing a way to augment features at any layer. The stochasticity/randomness in our TMD-layer is a stability-inducing operation for robust predictive purposes [20]. Our TMD-layer is parameter efficient: all we need is a projection linear layer, a linear layer parameterizing the density $\pi$, and a scalar parameter $\delta t$. In practice, we can work with a small latent dimension (e.g., $k = 16$) when constructing $\hat{\mathcal{L}}$, so the total number of parameters in the TMD-layer is very small when compared with the layer function $f$ in most deep learning applications. But the reader will see that a mild limitation of the SDE perspective in practice is that, in principle, the dynamics may eventually get stuck in a metastable state. In this case, the estimate $\hat{\mathcal{L}}f$ will not be very informative in the forward pass, and so the gradient estimates might be biased. In such cases, it may be useful to add points by sampling on the orbit if needed. We will now describe four different vision settings where our TMD-layer can be instantiated in a plug-and-play manner.
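The sequential layerwise scheme above can be sketched as follows (again our own illustration): the correction of (9) is applied after each layer, so errors do not compound across the composition:

```python
import numpy as np

rng = np.random.default_rng(0)

def tmd_forward(layers, X, L_hat, t=0.1):
    # Sequential layerwise estimation: apply the correction of (9) after each
    # layer f_l and feed the corrected output into layer l+1, instead of
    # expanding (9) once on the full composition f_m o ... o f_1.
    Z = X
    for f in layers:
        Z = f(Z)
        Z = Z + t * (L_hat @ Z)   # per-layer first-order correction
    return Z

# Two toy linear layers; a zero L_hat recovers the ordinary forward pass.
W1, W2 = rng.normal(size=(4, 4)), rng.normal(size=(4, 2))
layers = [lambda Z: Z @ W1, lambda Z: Z @ W2]
X = rng.normal(size=(8, 4))
out = tmd_forward(layers, X, np.zeros((8, 8)))
```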
4 Applications
In this section, we evaluate our TMD-layer in the context of different applications. As a warm-up, in §4.1, we demonstrate the use of the TMD-layer on a simple image classification task. We study its properties in both inductive and transductive settings. Then, in §4.2, we move to learning with point cloud datasets. Here, we see that the data type naturally offers a suitable object for leveraging the features of the TMD-layer. In this case, we conduct experiments in an inductive setting. Next, in §4.3, we explore the use of the TMD-layer on a segmentation task (also in an inductive setting). We propose a novel deep active contour model which can be viewed as a dynamical process within a neural network. We demonstrate the use of our TMD-layer on top of such a dynamical process. Finally, in §4.4, we investigate few-shot learning. Here, the problem setup natively provides the graph needed for computing our $\hat{\mathcal{L}}$ and allows transductive inference.
4.1 A Simple Sanity check on Resnet
We start with a simple example of image classification on CIFAR-10 [28] using a Resnet [23], to demonstrate the applicability of our TMD-layer and evaluate its behavior.
4.1.1 Role of TMD-layer: Finetuning/Robustifying Resnet
We choose Resnet-18 as the backbone network and simply treat each of its three residual blocks Res as the layer function $f$ (see [23] for details of a residual block) in the TMD-layer as follows,

$$z_{l+1} = \mathrm{Res}(z_l) + \delta t\,\hat{\mathcal{L}}\,\mathrm{Res}(z_l),$$

where $z_l$ is the feature at the $l$th layer and $\hat{\mathcal{L}}$ is constructed from a minibatch of samples.
4.1.2 Experimental results
During training, we first sample $b$ data points in a batch and use them as the input so that we can construct $\hat{\mathcal{L}}$. During test time, an input batch also contains $b$ samples (similar to training time), where $b$ increases from 1 to 200. We can see from Table 1 that $b$ does have an influence on the test accuracy, where a larger $b$ performs better than a smaller one. A key reason is that using a larger $b$ can better capture the geometric structure of the data.

We also test whether our TMD-layer can help improve the robustness of the network. We assess this property by adding random noise to the input image and evaluating the test accuracy (see Table 2). With our TMD-layer, the network is more noise resilient. This can be partly attributed to the use of our parameterized $\pi$, which allows the network to control the stochastic process in the TMD-layer adaptively, dependent on the input. In summary, the performance profile is similar (Tab. 1) with small improvements in robustness (Tab. 2).
$b$ | Inference w/ TMD-layer | Accuracy (%)
--- | --- | ---
1 | No | 75.15
1 | Yes | 87.35
10 | Yes | 87.65
50 | Yes | 88.14
100 | Yes | 88.52
150 | Yes | 88.55
200 | Yes | 88.25
Noise magnitude | 0.01 | 0.02 | 0.03 | 0.05 | 0.1
--- | --- | --- | --- | --- | ---
Resnet-18 | 87.54 | 83.90 | 75.85 | 53.87 | 17.27
Ours | 87.79 | 84.37 | 77.96 | 56.18 | 19.18
4.2 Point cloud transformer
Tasks involving learning with point cloud data are important within 3D vision. The input here is usually a 3D point cloud represented by a set of points, each associated with its own feature descriptor. These points can be naturally thought of as samples from an underlying distribution which captures the geometric structure of the object. The problem provides an ideal sandbox to study the effect of our TMD-layer. But before we do so, we provide some context for where and how the TMD-layer will be instantiated. Recently, [19] proposed a transformer-based model for point cloud learning which achieves state-of-the-art performance on this task, corresponding to an effective and creative use of transformers in this setting. Nonetheless, transformer models are known to be parameter costly (e.g., see [5, 56, 59] for cheaper approximations effective in NLP settings) and it is sensible to check to what extent our TMD-layer operating on a simple linear layer can be competitive with the transformer layer proposed in [19]. Our goal will be to check if significant parameter efficiency is possible.
4.2.1 Problem formulation
Denote an input point cloud $\mathcal{P} \in \mathbb{R}^{n \times d}$ with $n$ points, each with a $d$-dimensional feature descriptor. The classification task is to predict a class or label for the entire point cloud.
4.2.2 Role of TMD-layer: Replacing the transformer layer
The point cloud transformer layer in [19] is constructed as,

$$z_{l+1} = \mathrm{FF}(\mathrm{SA}(z_l)), \qquad (10)$$

where FF refers to their feed-forward layer (a combination of Linear, BatchNorm and ReLU layers), and $\mathrm{SA}(z_l)$ is the output of the self-attention module which takes $z_l$ as input (we refer the reader to [19] for more details of their network design, also included in our appendix).

A transformer layer is effective for point clouds because it simultaneously captures the relations between the features of all points. Since our TMD-layer can be viewed as a diffusion operator which captures the structure of the underlying data manifold from the data, we can check to what extent its ability suffices. We use the TMD-layer on a single feed-forward layer to replace the transformer layer in (10):

$$z_{l+1} = \mathrm{FF}(z_l) + \delta t\,\hat{\mathcal{L}}\,\mathrm{FF}(z_l). \qquad (11)$$

Surprisingly, it turns out that this simple layer can perform comparably with the carefully designed transformer layer in (10) while offering a much more favorable parameter efficiency profile. Here, $\hat{\mathcal{L}}$ is constructed using the $n$ points of the same point cloud (the setting is identical to the baselines).
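A sketch of the replacement in (11) follows (our own illustration; a plain linear + ReLU stands in for the paper's Linear/BatchNorm/ReLU block, and the operator is a placeholder):

```python
import numpy as np

rng = np.random.default_rng(0)

def tmd_ff_layer(X, W, b, L_hat, t=0.1):
    # Sketch of (11): a single feed-forward layer (linear + ReLU here, a
    # stand-in for the full FF block of [19]) followed by the TMD correction,
    # in place of the self-attention layer of (10).
    F = np.maximum(X @ W + b, 0.0)
    return F + t * (L_hat @ F)

# One point cloud: n points with d-dimensional descriptors. L_hat would be
# built from the n points themselves (as in Section 3), so the correction
# mixes information across the whole cloud, loosely mimicking attention.
n, d = 64, 16
X = rng.normal(size=(n, d))
W = rng.normal(size=(d, d)) / np.sqrt(d)
b = np.zeros(d)
out = tmd_ff_layer(X, W, b, L_hat=np.zeros((n, n)))
```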
4.2.3 Experimental results
Dataset. We follow [19] to conduct a point cloud classification experiment on ModelNet40 [54]. The dataset contains 12,311 CAD models from 40 object categories, and is widely used in benchmarking point cloud shape classification methods. We use the official splits for training/evaluation.
Network architecture and training details. We use the same network as [19] except that we replace each point cloud transformer layer with a TMD-layer built on a single feed-forward layer. We follow [19] and use the same strategy to uniformly sample points from each object, as well as the same data augmentation strategy during training. We also follow [19] for the minibatch size, number of epochs, and SGD settings (momentum, initial learning rate, cosine annealing schedule). The hidden dimension for the whole network follows [19], and a small latent dimension $k$ is used for constructing $\hat{\mathcal{L}}$ (in the TMD-layer).
Results. We see from Table 3 that our approach achieves comparable performance with [19]. In terms of the number of parameters, for the hidden dimension used in this experiment, one self-attention layer contains many more parameters than a single linear layer, while the TMD-layer module itself needs only a small number of additional parameters.
Method | Input | #Points | Accuracy (%)
--- | --- | --- | ---
PointNet [41] | P | 1k | 89.2
A-SCN [55] | P | 1k | 89.8
SO-Net [31] | P, N | 2k | 90.9
Kd-Net [26] | P | 32k | 91.8
PointNet++ [41] | P | 1k | 90.7
PointNet++ [41] | P, N | 5k | 91.9
PointGrid [30] | P | 1k | 92.0
PCNN [1] | P | 1k | 92.3
PointConv [53] | P, N | 1k | 92.5
A-CNN [27] | P, N | 1k | 92.6
DGCNN [52] | P | 1k | 92.9
PCT [19] | P | 1k | 93.2
Ours | P | 1k | 93.0
4.3 Object segmentation
Here, we show that our TMD-layer (a dynamical system) can also be built on top of another dynamical system. We do so through experiments on object segmentation.
Recall that active contour models are a family of effective segmentation models which evolve a contour iteratively until a final result is obtained. Among the many options available in the literature (e.g., [44, 49, 57]), the widely used Chan-Vese model [7] evolves the contour based on a variational functional. Here, we propose to combine the Chan-Vese functional with a deep network by parameterizing the iterative evolution steps, and we build our TMD-layer on top of it. We see that this simple idea leads to improved results. The appendix includes more details of our model.
4.3.1 Problem formulation
Let $\Omega$ be a bounded open subset of $\mathbb{R}^2$, where $\partial\Omega$ is its boundary. Let $I$ be an image on $\Omega$; object segmentation involves predicting a dense map in $\{0, 1\}$, where $1$ (and $0$) indicates the object (and the background). In our formulation, we parameterize the object contour by a level set function $\phi$ and evolve it within the DNN. We note that hybrid approaches using level sets together with DNNs are not unique to our work, see [37, 58].
4.3.2 Role of TMD-layer: in a deep active contour model
Our proposed deep active contour model evolves the contour in the form of a level set function $\phi$ within the network, and the update scheme is,

$$\phi_{l+1} = h(\phi_l), \qquad (12)$$

where $\phi_l$ is the level set function at layer $l$ and the update $h$ is derived from our proposed deep variational functional. The appendix includes more details of our model, the variational functional, and the derivation of the update equation.
Using the update function $h$ in (12), our TMD-layer forward pass can be written as,

$$\phi_{l+1} = h(\phi_l) + \delta t\,\hat{\mathcal{L}}\,h(\phi_l). \qquad (13)$$
Remark 1
Remark 2. Note that our proposed segmentation model is different from [58], which uses the variational energy function directly as the final loss, whereas we parameterize the update steps within our network so that the final output already satisfies low variational energy.
4.3.3 Experimental results
Dataset. The Vaihingen buildings dataset consists of 168 building images extracted from the training set of ISPRS “2D semantic labeling contest” with a resolution of 9cm. We use only 100 images to train the model and the remaining 68 serve as the test set.
Network Architecture and Experiment Setup. We use an encoder CNN with an architecture similar to [21] and [37]. The input is the original image. The network is trained for 300 epochs. We set up our baseline using the same CNN architecture to predict the segmentation mask without our Chan-Vese update module. Previous works combining active contour models and deep learning [37, 33] can often only provide segmentations of a single building based on manual initialization or another initialization (based on a separate algorithm), whereas our model can be used to segment multiple buildings in the image without any initialization. So, the results cannot be meaningfully compared. See our appendix for more details about the setup.
Results and Discussion. We use the average Intersection over Union (IoU) to evaluate performance on the Vaihingen dataset: the baseline yields 68.9, our model without the TMD-layer achieves 73.5, and our complete model with the TMD-layer achieves 74.6, a significant improvement in terms of IoU. This experiment shows that our TMD-layer can be built on top of another dynamical system and can provide additional benefits. Qualitative results of the baseline and our model are shown in Fig. 2. Our method tends to predict a more precise shape/boundary, and also fixes some flaws/errors relative to the baseline results.
4.4 Fewshot learning
In $N$-way $K$-shot few-shot learning, the input is a set of samples which naturally forms a fully connected graph. This serves to construct the differential operator $\hat{\mathcal{L}}$. To provide context for where and how our TMD-layer will be instantiated, we note that [25] proposed a GNN approach (EGNN) for few-shot learning, and this model achieves state-of-the-art performance. We show that by adding our TMD-layer, the performance increases by a clear margin.
4.4.1 Problem formulation
Few-shot classification seeks to learn a classifier given only a few training samples for every class. Each few-shot classification task $\mathcal{T}$ contains a support set $\mathcal{S}$, which is a set of labeled input-label pairs, and a query set $\mathcal{Q}$ (an unlabeled set on which the learned classifier is evaluated). Given $K$ labeled samples for each of the $N$ classes in the support set $\mathcal{S}$, the problem is an $N$-way $K$-shot classification problem.

4.4.2 Role of TMD-layer: Use in a graph neural network
Let $G = (V, E)$ be the graph formed by the samples from the task $\mathcal{T}$, with nodes denoted as $v_i$. The node feature update equation is designed as (we refer readers to [25] or our appendix for more details about the network)

$$v_i^{l+1} = u_{\theta}\big(v_i^{l}, \{e_{ij}^{l}\}_j\big), \qquad (14)$$

where $v_i^{l}$ is the feature of node $i$ at the $l$th layer, $e_{ij}^{l}$ is the edge feature between node $i$ and node $j$, and $\theta$ refers to the parameters in the update function. We abstract (14) as $v^{l+1} = u(v^{l})$ and use our TMD-layer as,

$$v^{l+1} = u(v^{l}) + \delta t\,\hat{\mathcal{L}}\,u(v^{l}). \qquad (15)$$
Remark 3. In (15), the operator $\hat{\mathcal{L}}$ is constructed using the samples from the same episode, and $u$ is a GNN module updating the node features using all node features and edge features.
4.4.3 Experimental results
Dataset. We follow [25] to conduct experiments on miniImageNet, proposed by [51] and derived from the ILSVRC-12 dataset [46]. Images are sampled from 100 different classes with 600 samples per class (of size 84×84 pixels). We use the same splits as in [43, 25]: 64, 16 and 20 classes for training, validation and testing respectively.
Network architecture and training details. We use the same graph neural network architecture and follow the training strategy of [25], utilizing the code provided by the authors. We add our TMD-layer as shown in (15) to each node update layer in the graph neural network, with a small latent dimension for constructing $\hat{\mathcal{L}}$. We follow [25] to conduct experiments for 5-way 5-shot learning, in both transductive and non-transductive settings, as well as in both supervised and semi-supervised settings. The network is trained with the Adam optimizer, keeping the initial learning rate, weight decay, and learning rate schedule of [25]. For evaluation, each test episode is formed by randomly sampling queries for each of the 5 classes, and the performance is averaged over randomly generated episodes from the test set. Note that the feature embedding module is a convolutional neural network consisting of four blocks (following [25]), as used in most few-shot learning models, without any skip connections; thus, Resnet-based models are excluded from the table for a fair comparison. We refer the reader to [25] or the appendix for more training and evaluation details.

Results. The performance for supervised and semi-supervised 5-way 5-shot learning is given in Tables 4 and 5 respectively. Our TMD-layer leads to consistent and clear improvements in both supervised and semi-supervised settings (and also in the transductive/non-transductive settings).
Model | Trans. | Accuracy (%)
--- | --- | ---
Matching Networks [51] | No | 55.30
Reptile [40] | No | 62.74
Prototypical Net [47] | No | 65.77
GNN [18] | No | 66.41
EGNN [25] | No | 66.85
Ours | No | 68.35
MAML [17] | BN | 63.11
Reptile + BN [40] | BN | 65.99
Relation Net [50] | BN | 67.07
MAML + Transduction [17] | Yes | 66.19
TPN [35] | Yes | 69.43
TPN (Higher K) [35] | Yes | 69.86
EGNN + Transduction [25] | Yes | 76.37
Ours + Transduction | Yes | 77.78
Labeled ratio (5-way 5-shot)

Training method | 20% | 40% | 60% | 100%
--- | --- | --- | --- | ---
GNN-semi [18] | 52.45 | 58.76 | – | 66.41
EGNN-semi [25] | 61.88 | 62.52 | 63.53 | 66.85
Ours | 63.14 | 64.32 | 64.83 | 68.35
EGNN-semi(T) [25] | 63.62 | 64.32 | 66.37 | 76.37
Ours(T) | 64.84 | 66.43 | 68.62 | 77.78
4.5 Runtime overhead/Relation with augmentation
Runtime overhead. Our construction does involve some training-time overhead because of computing the kernel matrix; the overhead varies across the use cases in §4.2, §4.3 and §4.4.
Relationship with data augmentation. Data augmentation and the TMD-layer are complementary, not mutually exclusive. In all our experiments, the baselines use data augmentation (e.g., random rotation or cropping). Our TMD-layer offers benefits above and beyond augmentation.
5 Discussion and Conclusions
We proposed an SDE based framework that allows a unified view of several different learning tasks in vision. Our framework is beneficial where data generation (or the data itself) can be described using stochastic processes, or more specifically, diffusion operators. This is particularly useful in settings where obtaining a deterministic model of the image manifold or learning density functions is impossible or challenging due to high sample complexity requirements. Our TMD-layer does not require explicit generation of diffused samples, especially during training, making it computationally efficient. The "process" of which the provided data sample is a snapshot, and whose characterization is enabled by our TMD-layer, also appears to have implications for robust learning. Indeed, if the parameters that define the process are explicitly optimized, we should be able to establish an analogy between the resultant model and a stochastic/simpler version of recent results for certified margin radius maximization [60], which often require access to Monte Carlo sampling oracles [11]. We believe that periodicity in SDEs for data augmentation is an important missing ingredient; for instance, it may help automatically model seasonal patterns in disease progression studies for prediction. For this purpose, tools from Floquet theory may allow us to consider transformed versions of the process, potentially with simplified generators. Our code is available at https://github.com/zihangm/neuraltmdlayer .
Acknowledgments
This work was supported by NIH grants RF1 AG059312 and RF1 AG062336. SNR was supported by UIC startup funds. We thank Baba Vemuri for providing many important suggestions on formulating the Chan-Vese model within deep networks.
References
 [1] (2018) Point convolutional neural networks by extension operators. arXiv preprint arXiv:1803.10091. Cited by: Table 3.
 [2] (2020) Diffusion maps tailored to arbitrary nondegenerate Itô processes. Applied and Computational Harmonic Analysis 48 (1), pp. 242–265. Cited by: §1, §3, §3.
 [3] (2006) Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. Journal of machine learning research 7 (11). Cited by: §1.1.
 [4] (2003) Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation 15 (6), pp. 1373–1396. Cited by: §1.1.
 [5] (2020) Longformer: the long-document transformer. arXiv preprint arXiv:2004.05150. Cited by: §4.2.
 [6] (1997) Geodesic active contours. International journal of computer vision 22 (1), pp. 61–79. Cited by: §1.1.
 [7] (2001) Active contours without edges. IEEE Transactions on image processing 10 (2), pp. 266–277. Cited by: §1.1, §4.3.
 [8] (2018) Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. In 2018 Information Theory and Applications Workshop (ITA), pp. 1–10. Cited by: §1.1.
 [9] (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848. Cited by: §1.1.
 [10] (2018) Neural ordinary differential equations. arXiv preprint arXiv:1806.07366. Cited by: §1.1, §1.
 [11] (2019) Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pp. 1310–1320. Cited by: §5.
 [12] (2018) Spherical cnns. arXiv preprint arXiv:1801.10130. Cited by: §1.
 [13] (2006) Diffusion maps. Applied and computational harmonic analysis 21 (1), pp. 5–30. Cited by: §2.
 [14] (2019) AutoAugment: learning augmentation strategies from data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 113–123. Cited by: item ii.
 [15] (2019) Augmented neural odes. arXiv preprint arXiv:1904.01681. Cited by: §1.1.
 [16] (2017) Learning so(3) equivariant representations with spherical cnns. CoRR. External Links: Link, 1711.06721 Cited by: §1, §1.
 [17] (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pp. 1126–1135. Cited by: Table 4.
 [18] (2017) Few-shot learning with graph neural networks. arXiv preprint arXiv:1711.04043. Cited by: Table 4, Table 5.
 [19] (2020) PCT: point cloud transformer. arXiv preprint arXiv:2012.09688. Cited by: §4.2.2, §4.2.3, §4.2.3, §4.2.3, §4.2, Table 3.
 [20] (2016) Train faster, generalize better: stability of stochastic gradient descent. In International Conference on Machine Learning, pp. 1225–1234. Cited by: item 2.
 [21] (2015) Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 447–456. Cited by: §4.3.3.
 [22] (2019) End-to-end deep convolutional active contours for image segmentation. arXiv preprint arXiv:1909.13359. Cited by: §1.1.
 [23] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.1.1, §4.1.
 [24] (2019) Variational physics-informed neural networks for solving partial differential equations. arXiv preprint arXiv:1912.00873. Cited by: §1.1.
 [25] (2019) Edge-labeling graph neural network for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11–20. Cited by: Figure 1, §4.4.2, §4.4.3, §4.4.3, §4.4, Table 4, Table 5.
 [26] (2017) Escape from cells: deep kd-networks for the recognition of 3d point cloud models. In Proceedings of the IEEE International Conference on Computer Vision, pp. 863–872. Cited by: Table 3.
 [27] (2019) A-CNN: annularly convolutional neural networks on point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7421–7430. Cited by: Table 3.
 [28] (2009) Learning multiple layers of features from tiny images. Cited by: §4.1.
 [29] (1997) Stochastic flows and stochastic differential equations. Vol. 24, Cambridge university press. Cited by: §2.
 [30] (2018) Pointgrid: a deep network for 3d shape understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 9204–9214. Cited by: Table 3.
 [31] (2018) SO-Net: self-organizing network for point cloud analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 9397–9406. Cited by: Table 3.
 [32] (2020) Fourier neural operator for parametric partial differential equations. arXiv preprint arXiv:2010.08895. Cited by: §1.1.
 [33] (2019) Fast interactive object annotation with Curve-GCN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5257–5266. Cited by: §4.3.3.
 [34] (2019) Neural sde: stabilizing neural ode networks with stochastic noise. arXiv preprint arXiv:1906.02355. Cited by: §1.1, §1.
 [35] (2019) Learning to propagate labels: transductive propagation network for few-shot learning. In 7th International Conference on Learning Representations, ICLR 2019. Cited by: Table 4.
 [36] (1995) Shape modeling with front propagation: a level set approach. IEEE transactions on pattern analysis and machine intelligence 17 (2), pp. 158–175. Cited by: §1.1.
 [37] (2018) Learning deep structured active contours end-to-end. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8877–8885. Cited by: §1.1, §4.3.1, §4.3.3.
 [38] (2020) The geometry of semi-supervised learning. Ph.D. Thesis, Harvard University, Cambridge, Massachusetts. Cited by: §1.1.
 [39] (2020) Solving differential equations using deep neural networks. Neurocomputing 399, pp. 193–212. Cited by: §1.1.
 [40] (2018) On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999. Cited by: Table 4.
 [41] (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 652–660. Cited by: Table 3.
 [42] (2020) Learning generalized transformation equivariant representations via autoencoding transformations. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1.
 [43] (2017) Optimization as a model for few-shot learning. ICLR 2017. Cited by: §4.4.3.
 [44] (1994) Region-based strategies for active contour models. International journal of computer vision 13 (2), pp. 229–251. Cited by: §4.3.
 [45] (1994) Total variation based image restoration with free local constraints. In Proceedings of 1st International Conference on Image Processing, Vol. 1, pp. 31–35. Cited by: §1.1.
 [46] (2015) Imagenet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252. Cited by: §4.4.3.
 [47] (2017) Prototypical networks for few-shot learning. arXiv preprint arXiv:1703.05175. Cited by: Table 4.
 [48] (2019) Learning an effective equivariant 3d descriptor without supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6401–6410. Cited by: §1.
 [49] (2011) A quaternion framework for color image smoothing and segmentation. International Journal of Computer Vision 91 (3), pp. 233–250. Cited by: §4.3.
 [50] (2018) Learning to compare: relation network for few-shot learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1199–1208. Cited by: Table 4.
 [51] (2016) Matching networks for one shot learning. arXiv preprint arXiv:1606.04080. Cited by: §4.4.3, Table 4.
 [52] (2019) Dynamic graph cnn for learning on point clouds. Acm Transactions On Graphics (tog) 38 (5), pp. 1–12. Cited by: Table 3.
 [53] (2019) Pointconv: deep convolutional networks on 3d point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9621–9630. Cited by: Table 3.
 [54] (2015) 3d shapenets: a deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1912–1920. Cited by: §4.2.3.
 [55] (2018) Attentional shapecontextnet for point cloud recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4606–4615. Cited by: Table 3.
 [56] (2021) Nyströmformer: a Nyström-based algorithm for approximating self-attention. arXiv preprint arXiv:2102.03902. Cited by: §4.2.
 [57] (2000) Image segmentation using deformable models. Handbook of medical imaging 2 (20), pp. 0. Cited by: §4.3.
 [58] (2020) Deep variational instance segmentation. NeurIPS 2020. Cited by: §4.3.1, Remark 2.
 [59] (2021) You only sample (almost) once: linear cost self-attention via Bernoulli sampling. In International Conference on Machine Learning, pp. 12321–12332. Cited by: §4.2.
 [60] (2021) Simpler certified radius maximization by propagating covariances. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7292–7301. Cited by: §5.
 [61] (2016) Generative visual manipulation on the natural image manifold. In European conference on computer vision, pp. 597–613. Cited by: §3.1.