Shi, Javen Qinfeng
Zhang, Miao (Harbin Institute of Technology, Shenzhen)
Zhang, Zhen
Cheng, Hongrong
2025-05-28
2025
https://hdl.handle.net/2440/144761

Neural network pruning is a widely used model compression technique. However, achieving minimal performance degradation makes pruning resource-intensive, requiring substantial computational and memory overhead; this becomes particularly challenging when resources are limited. This thesis develops effective and resource-efficient pruning methods for convolutional neural networks and large language models. We begin with an in-depth comparative analysis of contemporary neural network pruning methods, evaluating their effectiveness and efficiency. Our study includes qualitative and quantitative assessments across various pruning settings, such as unstructured vs. structured and one-shot vs. iterative, and covers models ranging from small and medium-sized to large-scale. This comprehensive analysis provides valuable guidance for developing efficient and effective pruning methods. Subsequently, we propose a computationally efficient channel pruning method called Influence Function based Second-Order (IFSO). Our findings show that the change in loss after pruning differs significantly with and without retraining, with pruning without retraining resulting in worse Top-1 accuracy. To address this issue efficiently, we propose a closed-form estimator, inspired by influence functions, of the true loss change per mask alteration without retraining; it allows the importance of all channels to be evaluated simultaneously. Extensive experiments validate the effectiveness of our method. Finally, we propose MINI-LLM, a Memory-effIcieNt structured prunIng procedure for LLMs. Specifically, we develop a hybrid pruning criterion called Feature Map Sensitivity (FMS) that integrates weight magnitude, activation, and gradient for scoring, and we employ zeroth-order optimization to estimate gradients solely through forward passes.
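The forward-pass-only gradient estimation mentioned above can be illustrated with a minimal sketch. This is not the thesis's implementation: it uses a generic two-point (SPSA-style) zeroth-order estimator on a toy quadratic loss in NumPy, and the function name and hyperparameters are assumptions for illustration only.

```python
import numpy as np

def zeroth_order_grad(loss_fn, w, eps=1e-3, n_samples=2000, seed=0):
    """Estimate the gradient of loss_fn at w using only forward passes.

    Two-point estimator: for a random direction z,
    (loss(w + eps*z) - loss(w - eps*z)) / (2 * eps) approximates the
    directional derivative along z. Averaging these z-weighted
    directional derivatives over many random z (E[z z^T] = I for
    standard normal z) approximates the full gradient without any
    back-propagation.
    """
    rng = np.random.default_rng(seed)
    grad = np.zeros_like(w)
    for _ in range(n_samples):
        z = rng.standard_normal(w.shape)
        directional = (loss_fn(w + eps * z) - loss_fn(w - eps * z)) / (2 * eps)
        grad += directional * z
    return grad / n_samples

# Toy quadratic loss; its true gradient at w is 2 * (w - 1).
loss = lambda w: np.sum((w - 1.0) ** 2)
w = np.array([0.0, 2.0, 3.0])
g_est = zeroth_order_grad(loss, w)
print(g_est)  # close to the true gradient [-2., 2., 4.]
```

Because each sample costs only two forward passes, the memory footprint stays that of inference; the trade-off is estimator variance, which shrinks as the number of sampled directions grows.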
Extensive experiments validate that MINI-LLM consistently outperforms gradient-free pruning methods and matches or surpasses back-propagation-based methods, while keeping GPU memory usage comparable to gradient-free methods.

Language: en
Keywords: deep neural network pruning; influence function; model compression; convolutional networks; large language models
Title: Efficient Deep Neural Network Pruning: From Convolutional Networks to Large Language Models
Type: Thesis