
An Efficient Approach to Escalate the Speed of Training Convolution Neural Networks

China Communications, 2024, Issue 2

P. Pabitha, Anusha Jayasimhan

Department of Computer Technology, Anna University, Madras Institute of Technology Campus, Chennai 600044, India

Abstract: Deep neural networks excel at image identification and computer vision applications such as visual product search, facial recognition, medical image analysis, object detection, semantic segmentation, instance segmentation, and many others. In image and video recognition applications, convolutional neural networks (CNNs) are widely employed. These networks provide better performance but at a higher computational cost. With the advent of big data, the growing scale of datasets has made processing and model training a time-consuming operation, resulting in longer training times. Moreover, these large-scale datasets contain redundant data points that have minimal impact on the final outcome of the model. To address these issues, an accelerated CNN system is proposed that speeds up training by eliminating non-critical data points during training, along with a model compression method. Furthermore, critical input data are identified by aggregating the data points at two levels of granularity, which are then used to evaluate their impact on the model output. Extensive experiments with the proposed method on the CIFAR-10 dataset and ResNet models give a 40% reduction in the number of FLOPs with an accuracy degradation of just 0.11%.

Keywords: CNN; deep learning; image classification; model compression

I.INTRODUCTION

In the last decade, the massive generation of data through the internet and smart devices has led to an ever-increasing need for data analysis. Digital data is expanding exponentially and is widely available; however, using conventional software tools and technology, it is difficult, if not impossible, to manage and analyze this data. Not only does the amount of data processed each day range up to several petabytes, but within the next few years the total data on Earth will be approximately trillions of gigabytes [1-3]. Deep neural networks have achieved state-of-the-art performance in a number of fields such as computer vision when compared to earlier techniques based on hand-crafted visual features. Large-scale datasets, cutting-edge contemporary Graphical Processing Units (GPUs), and novel network topologies enable the creation of previously unimaginably large CNN models. Various neural network models, such as AlexNet, VGGNet, and ResNets, have grown from 8 layers to more than 100 layers.

While this massive amount of data has the potential to revolutionize every part of our civilization, extracting useful knowledge from it is not an easy undertaking. The massive and rapidly growing body of knowledge concealed in vast quantities of non-traditional data calls for the creation of new technologies as well as the close cooperation of interdisciplinary teams. While deep learning has proven effective in analyzing large datasets, it comes with high processing costs and memory requirements, which is a major problem for devices with limited resources. An embedded device, an IoT device, or a mobile phone all have restricted resources, making it challenging to deploy a standard deep model despite its broad popularity. For instance, the VGG-16 model uses more than 500 MB of storage and has 138.34 million parameters, requiring 30.94 billion floating-point operations to identify a single image. A model of this size can easily surpass the computational capacity of small devices.

The use of neural networks like CNNs in real-world applications is hampered by several factors. 1) Model size: CNNs' strong representation capacity stems from their millions of learnable parameters. Those parameters, as well as network structure data, must be saved on disk and read into memory during inference. For instance, storing a conventional CNN trained on ImageNet requires more than 300 MB of capacity, placing a heavy burden on the capabilities of embedded systems. 2) Run-time memory: CNN intermediate activations may consume more memory during inference than the model parameters themselves, even with batch size 1. This is not an issue for high-end GPUs, but it is prohibitively expensive for many low-power applications. 3) Number of computing operations: convolution operations are computationally demanding on high-resolution images. On a mobile device, a big CNN could take several minutes to process a single image, making it unsuitable for real-world applications.

As a result, a model training approach is required that addresses the scalability of vast datasets and model parameters while also resolving the critical issues in deploying larger models on low-resource devices. The following are the primary contributions of this work.

• Critical dataset identification is a pre-training step that removes redundant data points from a large dataset to reduce processing time. During the iterative training phase, the significance and criticality of data instances are identified, with the goal of retaining only critical data and eliminating redundant data points. A granularity-based approach is used to group the redundant data points, and each group is represented by a single aggregated data point. Unimportant aggregated points are eliminated based on their effect value.

• As discussed before, model compression has become the need of the hour in order to train and deploy very deep CNN models on end devices. After the removal of redundant data elements, the efficiency of training CNN models is further improved by model compression. The filter pruning technique used maintains the model capacity and speeds up model training and inference with negligible performance impact.

• The effectiveness and efficiency of the proposed two-fold acceleration and compression approach is demonstrated through experiments on the CIFAR-10 dataset.

The rest of the paper is organized as follows. Section II presents a brief look at related work in the field of neural network acceleration and model compression strategies. The proposed methodology is described in depth in Section III. In Section IV, the datasets used in this study as well as the specifics of the implementation are discussed. Section V discusses the results and analysis. Section VI summarises the conclusion and other discussions.

II.RELATED WORK

2.1 Removal of Redundant Data Points

In the past, many researchers have devised techniques for removing redundant or unimportant data points from a large dataset to speed up the convergence of a model. The authors of [4] prove that there exists at least 10% redundancy in the ImageNet and CIFAR-10 datasets. By applying a hierarchical clustering technique, the redundant samples are removed with no major impact on test accuracy. [5] and [6] conducted experiments proving that different training examples play a crucial role in the convergence of training parameters at different phases of the training process. To speed up the convergence of the model, [7] dynamically filters out training data from a given mini-batch. Using reinforcement learning, the algorithm learns to achieve the maximum validation accuracy in the least possible time with just half the number of original training examples; it learns a policy function to select the best training examples from a given mini-batch. [8] implements an importance sampling scheme which identifies only those samples that cause the maximum amount of change in the parameter updates. By reducing the variance of the gradient norm, the method achieves faster convergence and lower loss for image classification as well as sequence modeling problems.

2.2 Model Compression

The attention-based pruning strategy suggested by Hacene et al. [9] turns a convolution layer into a shift layer. For every channel, just one weight remains at the end, identified as the most important connection. This improves memory usage and computation time while preserving high accuracy, at the expense of needing more parameters for training but fewer for inference. H. Tessier et al. invented Selective Weight Decay, which carries out effective and continuous pruning during training [10]. It takes advantage of weight decay in order to attain network sparsity and eliminates the need for fine-tuning once the network has been trimmed. It can be coupled with any pruning criterion or pruning structure, providing a framework for many different permutations. P. Jorge et al. [11] implement pruning at initialization, a technique that avoids fine-tuning and precludes any changes to architecture during or after training. A novel approach was devised to calculate connection sensitivity post-pruning, alongside two approximation techniques for gradually enhancing connection sensitivity. These methods enable the pruning of networks at initialization, allowing initially unimportant parameters to become crucial as the pruning process progresses. This technique performs better when pruning at high sparsity levels, but worse at moderate sparsity levels.

N. Lee et al. [12] propose a method called SNIP (Single-shot Network Pruning based on Connection Sensitivity). It prunes irrelevant connections for a specific task in a single step prior to training and works with a wide range of neural network models. They show that the connection sensitivity measure is particularly beneficial in identifying significant connections in a completely untrained network, as this results in extremely sparse models. Renda et al. [13] presented learning-rate rewinding as a means of retraining after pruning instead of fine-tuning. Unlike fine-tuning, which is done at the lowest learning rate, the network is retrained after pruning using the same learning-rate schedule as the original training. This retraining has been demonstrated to produce better results than fine-tuning, albeit at a much higher cost. ThiNet, a framework described by Luo et al. [14], focuses on developing more compact and compressed CNN models, resulting in acceleration. It is a filter pruning method in which the output channels of layer i+1 decide and guide the removal of filters of layer i. AutoPrune is an autonomous network pruning model created by Xiao et al. [15] to eliminate network redundancies for easier deployment. Rather than using original weights, it prunes the network by optimizing a set of trainable auxiliary parameters. The auxiliary parameters are less susceptible to hyperparameters and more resistant to noise throughout training, and the gradient update rules for these auxiliary parameters are designed to stay consistent with the pruning tasks, removing network redundancy and improving recoverability automatically.

AutoCompress is an automatic structured weight pruning system presented by N. Liu et al. [16], which compresses models to speed up inference. It uses an automated technique for determining hyperparameters, such as the per-layer weight pruning rate and the structured weight pruning scheme. The goal is to reduce the number of parameters or FLOPs as much as possible while maintaining precision. H. Li et al. [17] implement a CNN compression strategy in which they trim the filters that have minimal impact on output accuracy. They prune the filters by computing the absolute weight sum of each filter as its importance and then retraining the trimmed network to regain accuracy. A deep compression approach for compressing deep learning models was introduced by S. Han et al. [18]. The compression is accomplished by a three-stage pipeline that includes network pruning, learned quantization of weights with weight sharing, followed by Huffman coding, resulting in a model with minimal accuracy loss. This decreases both the amount of storage required by neural networks and the amount of energy required to run inference on large networks. To learn more compact CNNs, Liu et al. [19] presented a network slimming technique. With sparsity regularization, it adds a scaling factor for each channel. By applying an L1 regularization on the scaling factors during the batch normalization step, some scaling factors are suppressed to zero, which consequently zeroes out the corresponding weaker channels of a layer. He et al. [20] proposed a new channel-level pruning strategy for very deep CNNs that can be used to speed them up. They propose a two-step iterative approach based on LASSO regression and least-squares reconstruction to successfully trim each layer. The algorithm can be used in multi-layer and multi-branch settings. The simplified CNNs are inference-efficient networks that achieve improved speed-ups while preserving comparable accuracy.

P. Singh et al. [21] use adaptive filter pruning to accelerate deep convolutional neural networks. It decreases both the total number of parameters and the total computation time. Two modules execute alternately: one maximizes the number of pruned filters and the other minimizes the drop in accuracy. A. G. Howard et al. [22] created a new class of efficient models known as MobileNets, which are used for mobile and smaller embedded vision applications that are resource-constrained. To produce a lightweight deep neural network, they developed a streamlined architecture that leverages depthwise separable convolution operations. Based on the constraints of the problem, two global hyperparameters are used to pick the proper-sized model for the application, and these parameters allow a reasonable accuracy trade-off to reduce size and latency. Hubara et al. [23] introduced the Quantized Neural Network (QNN) as an enhanced technique for quantizing neurons and weights during inference and training. All multiply-accumulate operations (MACs) are substituted with XNOR operations in this network. QNNs reduce memory capacity and accesses during the forward pass and substitute most mathematical operations with bitwise operations, lowering power consumption and enhancing calculation speed. A Binarized Neural Network is created when QNNs are employed with extremely low precision, such as when just one bit is used for weights and activations. Table 1 summarizes the related work in the field of network compression. In the two-fold accelerated CNN system, the network compression technique, i.e., soft pruning, is combined with the reduction of large datasets, which further speeds up the training process.

Table 1. A comparison of related work on the acceleration of CNNs.

III.PROPOSED TWO-FOLD ACCELERATED CNN SYSTEM

Considering the scalability of enormous datasets and model parameters, the criticality of data instances is calculated during the iterative training process using their effect values. There is a substantial number of repetitive or non-critical input data points in real-world datasets, which have no effect on model parameter updates [24]. As a result, the non-essential data points are recognized and eliminated, leaving only the major determining data for training. Integration with model compression methods is the next step. We use a filter-level pruning strategy to remove redundant model parameters in a CNN that have a minor impact on model accuracy and output. We combine this approach of removing redundant data points with a soft filter pruning technique [25, 26] to study the improvement in efficiency of training CNN models on large-scale datasets. Figure 1 represents the block diagram and the flow of data between the various modules.

Figure 1. Block diagram for the two-fold accelerated CNN system.

3.1 Critical Data Identification and Selection

This is further divided into two steps: Critical input data identification via aggregation and Data selection via data removal.

3.1.1 Aggregation of Similar Data

During the pre-training phase, the critical dataset identifier converts the dataset into a structure of aggregated input data samples. Every aggregated data point represents a portion of the initial input data points with similar properties. The aggregated data needs to be generated only once. First, the input data points with similar attributes are clustered together into a group. Every group is then represented by a representative aggregated data point. To provide both fast processing and accurate results, groups are created at two granularities, i.e., fine-grained and coarse-grained, using a stack of accumulated data points. First, subgroups of related data points are created from the raw data. Then, a matching number of fine-grained aggregated data points $(x_F, y_F)$ are produced from the original input data points with a specific compression ratio. Next, based on the fine-grained aggregated data points, coarse-grained aggregated data points $(x_C, y_C)$ are generated. There are two reasons for this type of aggregated design. First, using aggregated data points, computation over high-dimensional and massive data can be completed quickly. Second, it ensures that updating model parameters based on an aggregated data point has an impact similar to updating them based on its associated input data points. The coarse-grained data points are fewer in number than the fine-grained data points, so they can be evaluated faster.

Therefore, two steps are used to combine input data points with similar attribute information into an aggregated data point. The input data is divided across $n$ subgroups ($n > 1$), each of which contains several input data points with comparable values. The class label of each input data point is also compared when forming a group. Every aggregated data point is linked to the set of corresponding original input data points. The following step summarises the original input data points in each segment of the input data.
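The paper does not give pseudo-code for this aggregation step, so the following is only a minimal sketch under stated assumptions: subsets of similar images are formed per class using mean pixel intensity as a stand-in similarity measure, each subset yields fine-grained points by chunk-wise averaging at the compression ratio, and one coarse-grained point summarises each subset. The function and variable names (`aggregate`, `subset_size`, `compression_ratio`) are illustrative, not from the paper.

```python
import numpy as np

def aggregate(features, labels, subset_size=100, compression_ratio=0.3):
    """Two-level aggregation of similar data points (illustrative sketch).

    features: (N, D) array of flattened input samples
    labels:   (N,) array of class labels
    Returns fine-grained points, coarse-grained points (one per subset),
    their labels, and the subset membership for later effect propagation.
    """
    fine_x, fine_y, coarse_x, coarse_y, members = [], [], [], [], []
    n_fine_per_subset = max(1, int(subset_size * compression_ratio))
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        # order samples of one class by a simple similarity proxy (mean intensity)
        idx = idx[np.argsort(features[idx].mean(axis=1))]
        for s in range(0, len(idx), subset_size):
            subset = idx[s:s + subset_size]
            # fine-grained points: mean of each small chunk of the subset
            chunks = np.array_split(subset, n_fine_per_subset)
            fine = [features[ch].mean(axis=0) for ch in chunks if len(ch) > 0]
            fine_x.extend(fine)
            fine_y.extend([c] * len(fine))
            # coarse-grained point: one summary per subset
            coarse_x.append(np.mean(fine, axis=0))
            coarse_y.append(c)
            members.append(subset)
    return (np.stack(fine_x), np.array(fine_y),
            np.stack(coarse_x), np.array(coarse_y), members)
```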

3.1.2 Redundant Data Removal

The last step of the pre-training phase, before iterative model training, entails removing input data prior to training the model. Its primary concept is to obtain the vital data by eliminating superfluous data based on the aggregated data points. The effect score of a coarse-grained/fine-grained aggregated point is calculated using the cross-entropy loss. In a multi-class classification problem, let the $P$-dimensional vector $y^P$ denote the output of the trained model, where $P$ is the total number of class labels. If $y_{\max} = \max(y^P)$ is the maximum value of $y^P$, the effect value of the $i$-th aggregated data point is evaluated using Eq. (1),

Here, $y_{t_i}$ corresponds to the true category of the $i$-th input data point.

Different features also have different effects on model parameter updates. To train any deep learning model, the usual method requires estimating the influence of a feature over all training data in a batch. Aggregated points can be used to approximate such training samples more rapidly, speeding up this estimation.
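Eq. (1) is not reproduced here, but the text states that the effect score of an aggregated point is derived from the cross-entropy loss and that points below the lower threshold (one-fifth of the mean effect value, per Section IV) are treated as non-critical. A hedged sketch of that filtering step, with assumed function names, is shown below; it is not the paper's exact formula.

```python
import torch
import torch.nn.functional as F

def effect_values(model, agg_x, agg_y):
    """Per-aggregated-point effect scores based on cross-entropy loss.

    agg_x: (M, C, H, W) tensor of aggregated data points
    agg_y: (M,) tensor of their class labels
    """
    model.eval()
    with torch.no_grad():
        logits = model(agg_x)
        # reduction='none' keeps one loss value per aggregated point
        return F.cross_entropy(logits, agg_y, reduction='none')

def select_critical(effects, lower_ratio=0.2):
    """Keep points whose effect exceeds the lower bound (mean / 5, per Section IV).
    The paper also uses an upper bound equal to the mean for clearly critical points."""
    lower = effects.mean() * lower_ratio
    return (effects > lower).nonzero(as_tuple=True)[0]   # indices of critical points
```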

3.2 Compression of Model Using Soft Pruning

In the second phase, the critical data selected by the previous module is fed into the CNN model, which is coupled with the filter pruning strategy. This process is done iteratively for several epochs; different data points are selected in each epoch and different filters are pruned in different epochs. In general, a CNN is parameterized by $\theta^{(i)}$, $1 \le i \le L$, where $\theta^{(i)}$ represents the weight tensor of the $i$-th convolutional layer and $L$ denotes the total number of layers in the model. The input tensor is denoted by $X$ with shape $n_{i-1} \times w_{i-1} \times h_{i-1}$, where $w_{i-1}$ and $h_{i-1}$ are the width and height of the input from the previous layer and $n_{i-1}$ is the number of channels from the previous layer. Similarly, the output tensor is denoted by $Y$ with shape $n_i \times w_i \times h_i$. Thus, the convolution operation performed in the $i$-th layer is denoted by Eq. (2),

$Y_{i,k}$ denotes the $k$-th feature map generated by the $i$-th layer and $f_{i,k}$ denotes the $k$-th filter in the $i$-th layer. $\theta^{(i)}$ is parameterized as a four-dimensional tensor of shape $n_i \times x \times x \times n_{i-1}$, where $x$ is the height and width of the filters, and $n_i$ and $n_{i-1}$ are the numbers of output and input channels, respectively.

Pruning filters in a convolutional layer removes the corresponding output feature maps. If the pruning rate is $P_i$ for the $i$-th layer, then the number of filters of that layer is reduced from $n_i$ to $n_i(1-P_i)$, such that the shape of the output tensor changes to $n_i(1-P_i) \times w_i \times h_i$.
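As a worked illustration with assumed numbers (not taken from the paper's tables): for a layer with $n_i = 64$ filters and a pruning rate $P_i = 0.3$,

$$n_i(1 - P_i) = 64 \times 0.7 = 44.8 \approx 45 \text{ retained filters}, \qquad Y \in \mathbb{R}^{45 \times w_i \times h_i},$$

with the exact rounding being an implementation choice.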

Previous filter pruning efforts had a hard time compressing deep CNNs. The term "hard pruning" in this context refers to the practice of removing filters from a single layer of a pre-trained model and afterwards fine-tuning the trimmed model to make up for the performance loss. After that, the following layers are pruned and the model is fine-tuned once more, until the final layer is trimmed. Once filters are pruned in such schemes, they are never updated again. As a result of the removed filters, the model's capacity is substantially reduced, leaving a smaller optimization space for the model to learn from and degrading the compressed model's performance.

In the case of soft pruning, the approach dynamically eliminates filters while also continuing to update the pruned filters. In this methodology, we continuously update the pruned filters during the training phase throughout all the epochs, as shown in Figure 2. There are numerous advantages to updating the pruned filters. In addition to maintaining the capacity of the compressed neural network model on par with the original model, the method prunes all layers at once, which makes it very time-efficient. In particular, the soft pruning strategy can prune a model being trained from scratch as well as one that has already been trained. In every epoch, the model is fully trained and optimized over the training data. The $\ell_2$-norm of all convolutional filters in every convolutional layer is evaluated after each epoch and is used as the criterion in the filter selection step. The weights of the selected filters are then set to zero to prune them, followed by another training epoch. After all the filter selection and elimination steps, the deep CNN is pruned into an efficient and compact model. The filter pruning strategy is divided into four stages:

Figure 2. A soft pruning model compression technique.

Filter selection: To determine the relevance of each filter, the $\ell_p$-norm is utilized as shown in Eq. (3). Basically, a norm measures the magnitude of a vector; here, it can be taken as a proxy for the activation values produced by each filter. If the norm value is smaller, the activation values after the convolution operation are smaller, resulting in less impact on the deep CNN model's final prediction. If the norm value is higher, the filter produces larger activation values, which contribute more to the model prediction. Thus, filters with smaller $\ell_p$-norms are pruned first, followed by those with greater $\ell_p$-norms. In the proposed method, the $\ell_2$-norm is used as the filter selection criterion, which provides better results than the $\ell_1$-norm.

$F_{i,k}$ represents the $k$-th filter in the $i$-th layer, $n_i$ represents the number of filters in the $i$-th layer, and $w_i$ and $h_i$ represent the dimensions of the filters in the $i$-th layer.
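A minimal sketch of the $\ell_p$-norm selection criterion of Eq. (3), written against PyTorch convolution weights; the function name and the rounding of the number of pruned filters are assumptions.

```python
import torch

def select_filters_to_prune(conv_weight, pruning_rate=0.3, p=2):
    """Rank the filters of one conv layer by their l_p norm and pick the smallest.

    conv_weight: tensor of shape (n_i, n_{i-1}, x, x)
    Returns the indices of the filters to prune.
    """
    n_filters = conv_weight.shape[0]
    # one norm value per filter, computed over its n_{i-1} * x * x weights
    norms = conv_weight.reshape(n_filters, -1).norm(p=p, dim=1)
    n_prune = int(n_filters * pruning_rate)
    return torch.argsort(norms)[:n_prune]   # smallest-norm filters are pruned first
```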

Filter pruning: To perform filter pruning, all the weights of the selected $n_i P_i$ filters are set to zero. As a result, the filter's contribution to the model output is temporarily eliminated. However, the pruned filters are still updated during back-propagation, thereby maintaining the model's capacity to learn and its high performance. The filters of all the convolutional layers are pruned at once. Thus, pruning is done in a parallel manner, making its computational cost negligible. Also, a constant pruning rate is used for all layers, to maintain the trade-off between accuracy and acceleration and to avoid the complexity of maintaining too many hyper-parameters.

Reconstruction: After the pruning step, forward propagation is performed on the model with some of the filters set to zero. During the back-propagation stage, however, all the pruned filters are updated back to non-zero values, thereby maintaining a larger model capacity and a better optimization space.

Obtaining the compressed model: Filter selection, filter pruning and reconstruction are performed iteratively during each training epoch. After the training process, all the weights of a pruned filter are reset to zero; the model can be used as-is for inference, or the pruned filters can be removed entirely to create a compact, smaller model. Removing a filter from one layer also affects the input channels of the next layer. In particular, with a pruning rate of $P_i$ in the $i$-th layer, only the values of $n_i(1-P_i)$ filters are non-zero and contribute to the final prediction. Because of the pruning of the previous layer, the number of input channels of the $i$-th layer is reduced from $n_{i-1}$ to $n_{i-1}(1-P_{i-1})$. As a result, the $i$-th layer can be reconstructed into a smaller structure. Figure 3 represents the architecture of the proposed accelerated methodology in detail.

Figure 3. Detailed module-wise architecture of the various phases of the two-fold accelerated CNN system.
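Putting the four stages together, one epoch of the soft pruning cycle might be sketched as follows; the function names are assumptions, `select_filters_to_prune` is the selection sketch from above, and the data loader is assumed to yield only the selected critical points.

```python
import torch
import torch.nn as nn

def soft_prune_epoch(model, loader, optimizer, criterion, pruning_rate=0.3):
    """One epoch of soft filter pruning: train normally, then zero the
    smallest-norm filters of every conv layer. The pruned filters remain in
    the network, so back-propagation can update them again in later epochs."""
    model.train()
    for images, targets in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()

    # prune all convolutional layers in parallel at the end of the epoch
    with torch.no_grad():
        for module in model.modules():
            if isinstance(module, nn.Conv2d):
                idx = select_filters_to_prune(module.weight.data, pruning_rate)
                module.weight.data[idx] = 0.0   # zero out, do not remove
```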

3.3 Computational Complexity

The aggregation of data points is performed by creating groups of similar elements. Considering that there are $L$ ways to split the dataset, where $L$ denotes the number of features of the dataset and $N$ denotes the total number of data points, we split the dataset recursively into $2^L$ equal partitions. At every split, the dataset is sorted on the chosen attribute $a_i$, where $i = 1 \ldots L$. For example, if attribute $a_1$ is the first split attribute, the entire dataset is sorted on this attribute and split into two equal halves. Then, the two subsets are further recursively partitioned in the same way for every attribute $a_2$ to $a_L$. So the time complexity for grouping and creating aggregated data points is $O(L \times N)$. This method works much faster than clustering techniques, which take $O(N \times k \times i)$ time, where $k$ is the number of clusters and $i$ is the number of iterations. It is worth noting that the grouping and aggregation are performed only once as a pre-processing step. Even though the removal of non-critical data points takes place iteratively as part of training, its cost is negligible compared to the aggregation.
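The recursive sort-and-halve procedure can be sketched as follows; the names are illustrative, not from the paper.

```python
import numpy as np

def recursive_split(indices, features, attrs, depth=0):
    """Sort the current subset on one attribute per level and split it into two
    halves, yielding 2^L groups after L attributes. (A comparison sort adds a
    log factor; a bucket/median split would match the stated O(L x N) bound.)"""
    if depth == len(attrs) or len(indices) <= 1:
        return [indices]
    a = attrs[depth]
    order = indices[np.argsort(features[indices, a])]
    mid = len(order) // 2
    return (recursive_split(order[:mid], features, attrs, depth + 1) +
            recursive_split(order[mid:], features, attrs, depth + 1))

# example: split 1,000 points on 3 attributes into 2^3 = 8 groups
X = np.random.rand(1000, 3)
groups = recursive_split(np.arange(len(X)), X, attrs=[0, 1, 2])
print(len(groups))   # 8
```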

The most time-consuming step during training is the pruning of filters. To prune the filters of a layer $i$, the norm of every filter needs to be computed. As seen from Eq. (3), computing the norms of all filters in layer $i$ costs $O(n_i \times n_{i-1} \times w_i \times h_i)$.

IV.IMPLEMENTATION

4.1 Dataset Description

A public dataset for common image classification, CIFAR-10, is used to assess the proposed method. The CIFAR-10 dataset consists of 60,000 32×32 color images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images. The dataset is divided into five training batches and one test batch, each with 10,000 images. The test batch contains exactly 1,000 randomly selected images from each class, and the training batches contain 5,000 images from each class.
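For reference, the dataset can be obtained through torchvision as below; the paper does not specify its data pipeline, so the transform here is only a placeholder.

```python
import torchvision
import torchvision.transforms as T

# CIFAR-10: 60,000 32x32 colour images in 10 classes,
# split into 50,000 training and 10,000 test images
transform = T.Compose([T.ToTensor()])
train_set = torchvision.datasets.CIFAR10(root='./data', train=True,
                                         download=True, transform=transform)
test_set = torchvision.datasets.CIFAR10(root='./data', train=False,
                                        download=True, transform=transform)
print(len(train_set), len(test_set))   # 50000 10000
```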

4.2 Experimental Setup

The CNN architecture used to implement our acceleration strategy is ResNet. For the CIFAR-10 dataset, we test the proposed methodology on ResNet-20, ResNet-32, ResNet-56 and ResNet-110. The models are trained from scratch without using pre-trained weights. The pruning rate is the same for all layers and is set to 30%. Without any extra fine-tuning, a regular training cycle is used to prune the model from scratch. The $\ell_2$-norm is the pruning criterion used for filter selection. The batch size is set to 128 and the categorical cross-entropy loss function is employed; the ResNet-110 model is trained for 200 epochs. Stochastic Gradient Descent optimization is applied, with a weight decay of 0.0005 and a momentum of 0.9. The learning rate is set to 0.01 at the beginning and is gradually decreased over a specified number of training epochs. For critical data identification and selection, the number of subsets must be decided. There are 50,000 training images, with 5,000 images per class. The number of images in each subset is set to 100, so 500 subsets of similar images are generated. The compression rate used to create the fine-grained data points is 30%, and from the fine-grained points a single coarse-grained point is generated for each subset. These optimization settings are chosen based on the existing literature on training the original baseline models and the existing state-of-the-art filter pruning methods. A pruning ratio of 30% is considered optimal since it gives a good balance between the reduction of FLOPs and a negligible accuracy drop. The upper and lower threshold values for critical and non-critical data are set to the average of the effect values and that average divided by 5, respectively. A summary of all the parameter settings is shown in Table 2.

Table 2. A summary of optimization settings.
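The optimization settings summarised above can be expressed in PyTorch roughly as follows; the learning-rate milestones and the stand-in model are assumptions, while the other values are those stated in the text.

```python
import torch
import torchvision

# stand-in model: the paper uses CIFAR-style ResNet-20/32/56/110,
# which are not shipped with torchvision; resnet18 is only a placeholder
model = torchvision.models.resnet18(num_classes=10)

# settings from Section 4.2 / Table 2: SGD, momentum 0.9,
# weight decay 5e-4, initial learning rate 0.01, batch size 128, 200 epochs
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
criterion = torch.nn.CrossEntropyLoss()

# the paper only states that the rate is "gradually decreased"; the milestones
# below are an assumed schedule, not taken from the paper
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[100, 150], gamma=0.1)
```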

V.RESULT AND ANALYSIS

The results of model training using the proposed methodology are evaluated and presented in this section using line graphs showing the variation of accuracy versus the number of epochs for the training and validation sets; the variation of cross-entropy loss versus the number of epochs is depicted in Figure 4. It indicates that the model's accuracy and loss converge within 200 epochs and do not improve in subsequent epochs.

Figure 4. Epochs vs. loss graphs for the training and validation sets during model training.

The training and validation accuracies are both high, which shows that the model has been trained properly for the image classification task and its weights predict accurately, as depicted in Figure 5.

Figure 5. Epochs vs. accuracy graphs for the training and validation sets during model training.

Table 3 compares the trade-off between the drop in accuracy and the reduction in the number of FLOPs. This is done for the ResNet-110 model trained from scratch at three different pruning rates: 10%, 20% and 30%. From the results, the 30% pruning rate produces a better accuracy of 93.57% with a smaller accuracy drop of 0.11%, while the reduction in the number of FLOPs is higher, around 40.7%, compared with the results at the 10% and 20% pruning rates. The pruning rate is the same for all layers, and for simplification the projection shortcuts in ResNet are not pruned.

Table 3. Comparison of performance metrics on CIFAR-10 over the ResNet-110 architecture.

Table 4 shows the total number of FLOPs performed by different ResNet architectures with and without pruning. The total number of FLOPs is shown for ResNet-110, ResNet-56, ResNet-32 and ResNet-20. Varying pruning rates are also used to measure how the pruning rate affects the FLOPs reduction rate.

Table 4. Comparison of FLOPs and FLOPs reduction rate over different pruning rates on different ResNet architectures.
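The per-layer effect of pruning on FLOPs can be estimated with the usual multiply-accumulate count; the helper below is a sketch with assumed names, and the whole-network reductions in Table 4 are lower because unpruned parts (the first layer and the projection shortcuts) still contribute their full cost.

```python
def conv_flops(n_out, n_in, k, h_out, w_out):
    """Multiply-accumulate count of one conv layer (a common FLOPs estimate)."""
    return n_out * n_in * k * k * h_out * w_out

def pruned_conv_flops(n_out, n_in, k, h_out, w_out, p_rate):
    """Both this layer's output filters and the input channels coming from the
    (pruned) previous layer shrink by the pruning rate."""
    kept_out = int(n_out * (1 - p_rate))
    kept_in = int(n_in * (1 - p_rate))
    return conv_flops(kept_out, kept_in, k, h_out, w_out)

# example: a 3x3, 64->64 conv on a 32x32 feature map, 30% pruning
base = conv_flops(64, 64, 3, 32, 32)
pruned = pruned_conv_flops(64, 64, 3, 32, 32, 0.3)
print(1 - pruned / base)   # about 0.53 reduction for an intermediate layer
```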

Table 5 evaluates the performance of the model for different filter selection strategies used as the pruning criterion. Pruning rates of 10%, 20% and 30% are applied with the $\ell_1$-norm and $\ell_2$-norm as filter selection methods. From the results obtained, the $\ell_2$-norm produces only slightly better accuracy than the $\ell_1$-norm.

Table 5. Comparison of overall model accuracy over different pruning rates and pruning criteria.

Table 6 compares the overall performance, including the accuracy drop relative to the complete baseline model and the FLOPs reduction. To ensure that the proposed model's performance on classification tasks is comparable to or better than the state of the art, it is compared with existing methods such as PFEC [17], Network Slimming [19], GAL [27], Partial Least-Squares [28], NISP [29] and MIL [1]. For a fair comparison, we only consider the accuracy drop, as the optimization settings for different methods may vary. The comparison shows that the proposed method reduces the FLOPs by 40.7% with an accuracy drop of 0.11%, which is better in terms of accuracy and FLOPs reduction rate than the existing approaches: PFEC [17] reduces the FLOPs by 38.6% with an accuracy drop of 0.61%, and MIL [1] reduces the FLOPs by 34.2% with an accuracy drop of 0.19%. The FLOPs reduction of Network Slimming [19] is 27.6% with a much larger accuracy drop of 1.73%. Even though [27] and [28] have slightly better FLOPs reduction ratios, our method outperforms them in terms of accuracy drop.

Table 6. Comparison of model performance of the proposed and existing methods.

The time taken to train the model for 200 epochs without pruning is 12935.50 seconds (3.59 hrs), and the time taken to train it for 200 epochs with a pruning rate of 30% is 12285.84 seconds (3.41 hrs). This shows that the pruned model trains faster than the unpruned model, with a negligible accuracy drop. For huge datasets and bigger network architectures, this difference can grow substantially. In our training, filters are pruned at the end of every epoch. The interval at which filters are pruned, for instance after every 2 epochs or after every 5 epochs, can be treated as a hyper-parameter, but its best value varies from dataset to dataset and across model architectures. Compared with other state-of-the-art methods such as [17], [19] and [1], the proposed two-fold accelerated CNN system performs much better, with a very minimal loss relative to the baseline model. Moreover, its FLOPs reduction ratio is the highest, which not only makes it energy-efficient but also ensures that it can be deployed on resource-constrained devices for inference. After compressing the original 50,000 data points to 500 coarse-grained aggregated points, a difference of approximately 6 seconds was observed in the training time. However, not much improvement was observed in the training accuracy with coarse-grained points.

VI.CONCLUSION

In this work, we implement a two-fold approach to speed up the training of deep learning models through redundant data removal and filter pruning. In the first stage, we identify and select only the critical data for model training, thereby reducing the computational cost spent on unimportant data points. The model is then trained on the selected critical data using a filter-level pruning strategy that allows the pruned filters to be updated again, enforcing sparsity during training while retaining a better optimization space and model capacity to learn from, and thereby achieving better performance than existing methods. Without using pre-trained weights, this strategy allows models to be trained from scratch and attain performance superior to state-of-the-art approaches. The results show that our proposed method gives a minimum of 1.08× speed-up compared to the baseline models. Our proposed work has only been tested on the CIFAR-10 dataset, and its impact on other benchmark as well as real-life datasets is yet to be studied. In the future, this method can be implemented in a distributed deep learning setup to overcome the communication overhead that is a severe challenge in that domain. When models are trained on multiple worker nodes simultaneously, the parameter exchange causes a severe network bottleneck, which can be significantly reduced through our proposed acceleration method.
