
CAAI Transactions on Intelligent Systems (智能系统学报), 2019, Vol. 14, Issue (4): 635–641. DOI: 10.11992/tis.201806006

### Cite this article

QU Zhaowei, WU Chunye, WANG Xiaoru. Aspects extraction based on semi-supervised self-training[J]. CAAI Transactions on Intelligent Systems, 2019, 14(4): 635-641. DOI: 10.11992/tis.201806006.


Aspects extraction based on semi-supervised self-training
QU Zhaowei 1, WU Chunye 1, WANG Xiaoru 2
1. Institute of Network Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China;
2. College of Computer Science, Beijing University of Posts and Telecommunications, Beijing 100876, China
Abstract: Aspect extraction is a key step in opinion mining and sentiment analysis. With the development of social networks, users increasingly rely on review information to make decisions and pay more attention to the fine-grained content of comments. It is therefore important to help users make these decisions by quickly mining information from massive numbers of comments. Most topic models and clustering methods do not perform well in terms of consistency in aspect extraction. Traditional supervised learning works well, but it requires a large amount of annotated text as training data, and labeling text incurs high labor costs. To address these issues, this paper proposes a method for aspect extraction based on semi-supervised self-training (AESS). The method takes full advantage of the large amount of unlabeled data available on the web: a word vector model is used to find words similar to seed words in the unlabeled datasets, and multiple aspect word sets most relevant to the dataset are constructed. The approach avoids large-scale text annotation, makes full use of the value of unlabeled data, and performs well on both Chinese and English datasets.
Key words: aspect extraction; word vector; semi-supervised; self-training; unlabeled data; opinion mining; seed words; similar words

1) A study of the review data shows that reviews are highly targeted: users basically give their own experience and suggestions about a specific product or service. The data also carry distinct product characteristics; sentences are short with clear opinions, and explicit aspect-indicating words are frequently used to express those opinions.

2) A review often involves one or more aspects. A simple example from the food review data of Meituan (http://bj.meituan.com/meishi/) illustrates the significance of this research. For example: "口味清淡，服务员态度很好，就是价格有点贵" ("The taste is light and the waiters are very friendly, but the price is a bit high"). This review evaluates the restaurant on three aspects, taste, service, and price, and gives a different opinion on each. Aspect representation vectors are used to represent the aspects involved. As the first step of opinion mining, aspect extraction determines the multiple aspects a review involves.

3) Implicit expressions in reviews must also be considered, for example: "还挺好吃的，排队等了半小时，不过还是很好吃" ("It's quite tasty; we queued for half an hour, but it was still delicious"). The sentence contains no explicit aspect-indicating word, yet the keyword "好吃" ("tasty") identifies it as an opinion on the food aspect. For such sentences without explicit aspect nouns, aspect adjectives are extracted to identify the aspect.

1 Related work

2 Construction of the semi-supervised self-training model

2.1 Semi-supervised self-training model

2.2 Gold aspect determination method

 $\mathrm{tf}_{i,j} = \dfrac{n_{i,j}}{\sum\limits_k n_{k,j}}$ (1)

 $\mathrm{idf}_i = \log \dfrac{\left| D \right|}{\left| \left\{ j : t_i \in d_j \right\} \right| + 1}$ (2)

 $\mathrm{TF\_IDF}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i$ (3)
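Equations (1)–(3) can be sketched in a few lines of Python. The tiny corpus and tokenization below are illustrative assumptions, not the paper's data.

```python
import math
from collections import Counter

# Toy corpus: each document is a list of tokens (illustrative only).
docs = [
    ["taste", "light", "service", "good"],
    ["price", "high", "taste", "good"],
    ["service", "slow", "price", "high"],
]

def tf(term, doc):
    # Eq. (1): count of the term divided by the total tokens in the document.
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    # Eq. (2): log of |D| over (number of documents containing the term + 1).
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / (df + 1))

def tf_idf(term, doc, docs):
    # Eq. (3): product of tf and idf.
    return tf(term, doc) * idf(term, docs)

print(tf("taste", docs[0]))   # 0.25
print(idf("price", docs))     # log(3/3) = 0.0
```

The `+ 1` in the idf denominator matches Eq. (2) and prevents division by zero for unseen terms.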
2.3 Construction of aspect representation word sets

 $p\left( {w_t}|{w_1},{w_2}, \cdots ,{w_{t - 1}} \right) \approx f\left( {w_t},{w_{t - 1}}, \cdots ,{w_{t - n + 1}} \right) = g\left( {w_t},c\left( {w_{t - n + 1}} \right), \cdots ,c\left( {w_{t - 1}} \right) \right)$ (4)

 $L\left( \theta \right) = \dfrac{1}{T}\sum\limits_t {\log f\left( {w_t},{w_{t - 1}}, \cdots ,{w_{t - n + 1}} \right)} + R\left( \theta \right)$ (5)

 $p\left( {w_o}|{w_i} \right) = \dfrac{{\rm{e}}^{U_o \cdot V_i}}{\sum\nolimits_j {{\rm{e}}^{U_j \cdot V_i}}}$ (6)

 $p\left( {w_t} \in D_1|{\rm{context}} \right) = \dfrac{1}{1 + {\rm{e}}^{-U_{D_{\rm{root}}} \cdot V_{w_t}}}$ (7)

 $p\left( {w_t}|{\rm{context}} \right) = p\left( D_1 = 1|{\rm{context}} \right)p\left( D_2 = 0|D_1 = 1 \right) \cdots p\left( {w_k}|D_k = 1 \right)$ (8)
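The core of Section 2.3, finding words similar to seed words in the vector space (cf. Eq. (6)), reduces to a cosine-similarity ranking over word vectors. A minimal sketch with made-up 3-dimensional vectors; in the actual method the embeddings come from a word2vec model trained on the large unlabeled review corpus:

```python
import numpy as np

# Hypothetical embeddings for illustration only; real vectors are learned
# from the unlabeled data with a word vector model.
vectors = {
    "taste":     np.array([0.9, 0.1, 0.0]),
    "flavor":    np.array([0.8, 0.2, 0.1]),
    "delicious": np.array([0.7, 0.3, 0.0]),
    "price":     np.array([0.0, 0.9, 0.4]),
    "cheap":     np.array([0.1, 0.8, 0.5]),
}

def cosine(u, v):
    # Cosine similarity between two word vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def expand_seed(seed, vectors, topn=2):
    """Rank all other words by cosine similarity to the seed word,
    yielding candidates for the aspect representation word set."""
    ranked = sorted(
        ((w, cosine(vectors[seed], v)) for w, v in vectors.items() if w != seed),
        key=lambda x: x[1],
        reverse=True,
    )
    return ranked[:topn]

print(expand_seed("taste", vectors))  # "flavor" and "delicious" rank highest
```

In self-training, the highest-ranked candidates are added to the aspect word set and can in turn serve as new seeds for the next round.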

3 Experiments

3.1 Experimental setup

LocLDA[23]: this method uses the standard implementation of LDA. To prevent the extraction of global topics and steer the model toward ratable aspects, each review is treated as a separate document. The model outputs an aspect distribution for every review in the data.

SAS[19]: this method is a hybrid topic model that, given a few seed words for the categories the user is interested in, automatically extracts category aspect terms. Among known topic models, it is highly competitive for aspect extraction.

3.2 Evaluation and results

 Fig. 3 F1 results of the three methods in determining the three gold aspects on the same English dataset
 Fig. 4 F1 results of the three methods in determining the four gold aspects on the same Chinese dataset

2) On the Chinese dataset, the proposed method (AESS) achieves higher F1 scores than the other methods on the food, price, and environment aspects, and higher recall on all four aspects. It clearly outperforms the other two methods on the Chinese dataset. Possible reasons: the Chinese dataset consists of distinctive food reviews; Chinese differs from English in grammatical expression, with short sentences and sometimes no fixed grammar, which makes topic extraction difficult; building a lexicon from the dataset itself avoids these problems, hence the better performance.
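The per-aspect F1 reported in Figs. 3 and 4 is the standard harmonic mean of precision and recall. A sketch over invented sentence-level gold and predicted aspect labels (toy data, not the paper's annotations):

```python
def f1_for_aspect(gold, pred, aspect):
    """Precision, recall, and F1 for one aspect.
    gold/pred are lists of aspect-label sets, one set per sentence."""
    tp = sum(1 for g, p in zip(gold, pred) if aspect in g and aspect in p)
    fp = sum(1 for g, p in zip(gold, pred) if aspect not in g and aspect in p)
    fn = sum(1 for g, p in zip(gold, pred) if aspect in g and aspect not in p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels: each set holds the aspects mentioned in one sentence.
gold = [{"food"}, {"food", "price"}, {"service"}, {"price"}]
pred = [{"food"}, {"price"}, {"service", "food"}, {"price"}]

print(f1_for_aspect(gold, pred, "food"))   # (0.5, 0.5, 0.5)
print(f1_for_aspect(gold, pred, "price"))  # (1.0, 1.0, 1.0)
```

Computing the three scores per aspect, rather than micro-averaged, is what allows the aspect-by-aspect comparison shown in the figures.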

4 Conclusion

[1] LIU Bing. Sentiment analysis and opinion mining[C]//Proceedings of the Synthesis Lectures on Human Language Technologies. Toronto, Canada, 2012: 152–153.
[2] LIU Qian. Research on approaches to opinion target extraction in opinion mining[D]. Nanjing: Southeast University, 2016.
[3] BLEI D M, NG A Y, JORDAN M I. Latent Dirichlet allocation[J]. The Journal of Machine Learning Research, 2003, 3: 993–1022.
[4] TITOV I, MCDONALD R. Modeling online reviews with multi-grain topic models[C]//Proceedings of the 17th International Conference on World Wide Web. Beijing, China, 2008: 111–120.
[5] BRODY S, ELHADAD N. An unsupervised aspect-sentiment model for online reviews[C]//Proceedings of Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Los Angeles, USA, 2010: 804–812.
[6] COLLOBERT R, WESTON J, BOTTOU L, et al. Natural language processing (almost) from scratch[J]. The Journal of Machine Learning Research, 2011, 12: 2493–2537.
[7] PORIA S, CAMBRIA E, GELBUKH A. Deep convolutional neural network textual features and multiple kernel learning for utterance-level multimodal sentiment analysis[C]//Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal, 2015: 2539–2544.
[8] PORIA S, CAMBRIA E, GELBUKH A. Aspect extraction for opinion mining with a deep convolutional neural network[J]. Knowledge-Based Systems, 2016, 108: 42–49. DOI: 10.1016/j.knosys.2016.06.009.
[9] HE Ruidan, LEE W S, NG H T, et al. An unsupervised neural attention model for aspect extraction[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics. Vancouver, Canada, 2017: 388–397.
[10] HAN Zhongming, LI Mengqi, LIU Wen, et al. Survey of studies on aspect-based opinion mining of internet[J]. Journal of Software, 2018, 29(2): 417–441.
[11] JIN Wei, HO H H. A novel lexicalized HMM-based learning framework for web opinion mining[C]//Proceedings of the 26th Annual International Conference on Machine Learning. Montreal, Canada, 2009: 465–472.
[12] LI Fangtao, HAN Chao, HUANG Minlie, et al. Structure-aware review mining and summarization[C]//Proceedings of the 23rd International Conference on Computational Linguistics. Beijing, China, 2010: 653–661.
[13] JIN Wei, HO H H, SRIHARI R K. OpinionMiner: a novel machine learning system for web opinion mining and extraction[C]//Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Paris, France, 2009: 1195–1204.
[14] WANG Wenya, PAN S J, DAHLMEIER D, et al. Recursive neural conditional random fields for aspect-based sentiment analysis[J]. arXiv preprint arXiv:1603.06679, 2016.
[15] CHEN Huimin, SUN Maosong, TU Cunchao, et al. Neural sentiment classification with user and product attention[C]//Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, USA, 2016: 1650–1659.
[16] CHINSHA T C, JOSEPH S. A syntactic approach for aspect based opinion mining[C]//Proceedings of the 2015 IEEE International Conference on Semantic Computing. Anaheim, USA, 2015: 24–31.
[17] YAN Xiaohui, GUO Jiafeng, LAN Yanyan, et al. A biterm topic model for short texts[C]//Proceedings of the 22nd International Conference on World Wide Web. Rio de Janeiro, Brazil, 2013: 1445–1456.
[18] MAAS A L, DALY R E, PHAM P T, et al. Learning word vectors for sentiment analysis[C]//Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Portland, USA, 2011: 142–150.
[19] WANG Linlin, LIU Kang, CAO Zhu, et al. Sentiment-aspect extraction based on restricted Boltzmann machines[C]//Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. Beijing, China, 2015.
[20] MIKOLOV T, CHEN Kai, CORRADO G, et al. Efficient estimation of word representations in vector space[J]. arXiv preprint arXiv:1301.3781, 2013.
[21] MIKOLOV T, SUTSKEVER I, CHEN Kai, et al. Distributed representations of words and phrases and their compositionality[C]//Proceedings of the 26th International Conference on Neural Information Processing Systems. Lake Tahoe, USA, 2013: 3111–3119.
[22] GANU G, ELHADAD N, MARIAN A. Beyond the stars: improving rating predictions using review text content[C]//Proceedings of the 12th International Workshop on the Web and Databases. Rhode Island, USA, 2009.
[23] ZHAO W X, JIANG Jing, YAN Hongfei, et al. Jointly modeling aspects and opinions with a MaxEnt-LDA hybrid[C]//Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. Cambridge, Massachusetts, USA, 2010: 56–65.
[24] MUKHERJEE A, LIU Bing. Aspect extraction through semi-supervised modeling[C]//Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers. Jeju Island, Korea, 2012: 339–348.