
BERT Model Compression via Knowledge Distillation

2019-10-14

Reposted by 大数据文摘 (Big Data Digest) with permission from 数据派 (Datapi)

Compiled by: Siqi Sun, Yu Cheng, Zhe Gan, Jingjing Liu

Over the past year, language-model research has seen many breakthrough results. For example, sentences generated by GPT are good enough to pass for human writing [1], while BERT, XLNet, and RoBERTa [2, 3, 4] have swept the major NLP leaderboards as feature extractors. However, these models also come with a staggering number of parameters: BERT-base has 109 million parameters, and BERT-large has as many as 340 million, which makes the models slow to run. To speed up inference, this work proposes a new knowledge distillation [5] approach for compressing the model, saving run time and memory without sacrificing much accuracy. The paper was published at EMNLP 2019.

" 耐烦的常识蒸馏 " 模型

Specifically, for sentence-classification tasks, vanilla knowledge distillation usually loses quite a bit of accuracy when used to compress a model. The reason is that the student model only learns the probability distribution that the teacher model finally predicts, and completely ignores the representations in the intermediate hidden layers (the standard distillation objective from [5] is sketched below for reference).
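As a reminder, the vanilla objective from [5] only matches the student's softened output distribution to the teacher's (the sketch below is a generic form of that objective, not a verbatim reproduction of this paper's loss; in practice it is typically combined with a cross-entropy term on the ground-truth labels):

$$
\mathcal{L}_{\mathrm{KD}} \;=\; -\sum_{c} P^{t}\!\left(y=c \mid x;\, T\right)\,\log P^{s}\!\left(y=c \mid x;\, T\right)
$$

where $P^{t}$ and $P^{s}$ are the teacher's and student's class probabilities computed with softmax temperature $T$. Nothing in this objective touches the hidden layers.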

It is as if, when a teacher instructs a student, the student only memorizes the final answer and learns nothing from the intermediate steps; faced with a new problem, such a student is more likely to make mistakes. Based on this intuition, the paper proposes a loss function that pushes the student model's hidden-layer representations toward those of the teacher model, giving the student better generalization. The paper calls this the "Patient Knowledge Distillation" model (PKD).

For sentence-classification problems, the model's prediction is built on top of the feature representation of the [CLS] token, for example by adding two fully connected layers on top of it. The researchers therefore propose a new loss function that lets the student model also learn the teacher's [CLS] representations:
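Written out, the patient-distillation term looks roughly as follows (a reconstruction based on the description in this article; the paper's exact notation may differ, and the sum over training examples is omitted for brevity):

$$
\mathcal{L}_{\mathrm{PT}} \;=\; \sum_{j=1}^{M-1}\left\lVert \frac{\mathbf{h}^{s}_{j}}{\lVert \mathbf{h}^{s}_{j}\rVert_{2}} \;-\; \frac{\mathbf{h}^{t}_{I(j)}}{\lVert \mathbf{h}^{t}_{I(j)}\rVert_{2}} \right\rVert_{2}^{2},
\qquad
I(j)=\begin{cases}
\lfloor N/M \rfloor\, j & \text{(PKD-skip)}\\
N-M+j & \text{(PKD-last)}
\end{cases}
$$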

where M is the number of student layers (e.g., 3 or 6), N is the number of teacher layers (e.g., 12 or 24), h is the hidden representation of [CLS] in each model, and i, j index the paired teacher and student layers (the mapping I(j) above), as illustrated in the figure below. For example, a 6-layer student learning from a 12-layer teacher can learn from the teacher's hidden representations at layers (2, 4, 6, 8, 10) (PKD-skip, left) or from the teacher's last few layers (7, 8, 9, 10, 11) (PKD-last, right). Since the student's last layer directly learns the teacher's predicted probabilities, its last hidden layer is left out of this patient learning.

[Figure: student-to-teacher layer mapping, PKD-skip (left) vs. PKD-last (right)]
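To make the layer mapping and the normalized-distance loss concrete, here is a minimal PyTorch sketch (our own illustration under the assumptions above; the function names pkd_layer_map and patience_loss are hypothetical and not taken from the authors' released code):

```python
import torch.nn.functional as F

def pkd_layer_map(n_student, n_teacher, mode="skip"):
    """Map each student hidden layer (except the last) to a teacher layer.

    For a 6-layer student and a 12-layer teacher this returns
    [2, 4, 6, 8, 10] for "skip" and [7, 8, 9, 10, 11] for "last",
    matching the example in the text (teacher layers are 1-indexed).
    """
    m = n_student - 1  # the student's top layer mimics the teacher's output probabilities instead
    if mode == "skip":
        step = n_teacher // n_student
        return [step * (j + 1) for j in range(m)]
    if mode == "last":
        return [n_teacher - m + j for j in range(m)]
    raise ValueError("mode must be 'skip' or 'last'")

def patience_loss(student_cls, teacher_cls, mapping):
    """Squared distance between L2-normalized [CLS] hidden states.

    student_cls[j] -- [CLS] vector from student layer j+1, shape (batch, hidden)
    teacher_cls[i] -- [CLS] vector from teacher layer i+1, shape (batch, hidden)
    mapping        -- output of pkd_layer_map (1-indexed teacher layers)
    """
    loss = 0.0
    for j, i in enumerate(mapping):
        s = F.normalize(student_cls[j], dim=-1)      # h^s / ||h^s||_2
        t = F.normalize(teacher_cls[i - 1], dim=-1)  # h^t / ||h^t||_2
        loss = loss + ((s - t) ** 2).sum(dim=-1).mean()
    return loss

# Example: 6-layer student distilled from a 12-layer teacher
print(pkd_layer_map(6, 12, "skip"))  # [2, 4, 6, 8, 10]
print(pkd_layer_map(6, 12, "last"))  # [7, 8, 9, 10, 11]
```

In training, this term would be added to the usual distillation and classification losses; the hidden states themselves can be obtained from any BERT implementation that exposes per-layer outputs.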

Verifying the Hypothesis

The researchers compare the proposed model against fine-tuning and vanilla knowledge distillation on seven standard sentence-classification datasets. When a 12-layer teacher is distilled into a 6-layer or 3-layer student, PKD outperforms both baselines in the vast majority of cases. Moreover, on five datasets its performance comes close to the teacher's: SST-2 (2.3% lower accuracy than the teacher), QQP (-0.1%), MNLI-m (-2.2%), MNLI-mm (-1.8%), and QNLI (-1.4%). See Table 1 for details. This further confirms the researchers' hypothesis: a student that learns the hidden-layer representations outperforms one that only learns the teacher's predicted probabilities.


Table 1

In terms of speed, the 6-layer transformer model nearly doubles inference speed while cutting the total number of parameters by a factor of 1.64; the 3-layer model is 3.73 times faster with 2.4 times fewer parameters. See Table 2 for details.

Table 2
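A rough back-of-the-envelope check (our own estimate, assuming standard BERT-base dimensions of roughly a 24M-parameter embedding table plus about 7.1M parameters per transformer layer) shows why the parameter reduction is smaller than the reduction in layers: the embedding table is kept in full.

$$
\frac{24 + 12\times 7.1}{24 + 6\times 7.1} = \frac{109.2}{66.6} \approx 1.64,
\qquad
\frac{24 + 12\times 7.1}{24 + 3\times 7.1} = \frac{109.2}{45.3} \approx 2.4
$$

(all counts in millions of parameters)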

References

[1] Radford, Alec, et al. "Language models are unsupervised multitask learners." OpenAI Blog 1.8 (2019).

[2] Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).

[3] Yang, Zhilin, et al. "XLNet: Generalized autoregressive pretraining for language understanding." arXiv preprint arXiv:1906.08237 (2019).

[4] Liu, Yinhan, et al. "RoBERTa: A robustly optimized BERT pretraining approach." arXiv preprint arXiv:1907.11692 (2019).

[5] Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural network." arXiv preprint arXiv:1503.02531 (2015).

About the authors

Siqi Sun is a Research SDE at Microsoft. He is currently working on commonsense reasoning and knowledge-graph-related projects. Prior to joining Microsoft, he was a PhD student in computer science at TTI Chicago, and before that an undergraduate in the School of Mathematics at Fudan University.

Yu Cheng is a senior researcher at Microsoft. His research covers deep learning in general, with specific interests in model compression, deep generative models, and adversarial learning. He is also interested in solving real-world problems in computer vision and natural language processing. Yu received his Ph.D. from Northwestern University in 2015 and his bachelor's degree from Tsinghua University in 2010. Before joining Microsoft, he spent three years as a Research Staff Member at IBM Research/MIT-IBM Watson AI Lab.

Zhe Gan is a senior researcher at Microsoft, primarily working on generative models, visual QA/dialog, machine reading comprehension (MRC), and natural language generation (NLG). He also has broad interests in various machine learning and NLP topics. Zhe received his PhD degree from Duke University in Spring 2018. Before that, he received his Master's and Bachelor's degrees from Peking University in 2013 and 2010, respectively.

Jingjing (JJ) Liu is a Principal Research Manager at Microsoft, leading a research team in NLP and computer vision. Her current research interests include machine reading comprehension, commonsense reasoning, visual QA/dialog, and text-to-image generation. She received her PhD degree in Computer Science from MIT EECS in 2011. She also holds an MBA degree from the Judge Business School at the University of Cambridge. Before joining MSR, Dr. Liu was the Director of Product at Mobvoi Inc. and a Research Scientist at MIT CSAIL.

The code has been open-sourced at:

https://github.com/intersun/PKD-for-BERT-Model-Compression
