
Open Access 18.03.2023 | Original Article

An enterprise adaptive tag extraction method based on multi-feature dynamic portrait

Authors: Xiang Li, Xingshuo Ding, Qian Xie, Shangbing Gao, Quanyin Zhu

Published in: Complex & Intelligent Systems | Issue 5/2023


Abstract

User portrait has become a research hot spot in the field of knowledge graph in recent years, and the rationality of tag extraction directly affects the quality of a user portrait. However, most current tag extraction methods for portraits rely only on word frequency statistics and semantic clustering. These methods have some drawbacks: they cannot effectively discover the preferred topics of an enterprise, dynamically update portrait tags, or adapt to the needs of the enterprise. In this paper, we propose an enterprise adaptive tag extraction method based on multi-feature dynamic portrait (ATEMDP). In preference division, ATEMDP first uses K-means to measure the similarity between enterprise texts and converts the clustering of similar enterprise texts into tag feature clusters, obtaining a point cluster structure that contains the distribution of tag preference topics. In multi-feature selection, a professional domain thesaurus is introduced for feature expansion, and the topic texts are fed into the Bert model as a sample set to discover the potential features of the enterprise texts. Finally, in dynamic tag extraction, BiLSTM and CNN are used to extract features, and dynamic preference tags are obtained by updating the enterprise texts. The THUCNews and Ente-pku data sets are used for simulation, and seven other methods are compared. The experimental results indicate that ATEMDP is not only superior to conventional methods in accuracy and F1-score, but also effectively solves the dynamic tagging problem of enterprise portraits.

Introduction

In recent years, with the rapid development of knowledge graph and big data technology, user portrait technology has attracted wide interest from both academia and industry [1]. Currently popular enterprise profile technology has been widely used in e-commerce, risk assessment, market supervision and other fields [2]. Websites containing enterprise portraits not only have the information service functions of traditional websites, but also provide various tagging-related services, such as hot spot analysis and enterprise recommendations [3]. As a new application of portrait technology, an enterprise portrait not only includes multi-modal entity tags such as the name, location and keywords of the enterprise, but also includes many interest and preference topic tags, such as the R&D direction and business scope of the enterprise. These tags mix together to form very complex structural features; in particular, the various types of enterprise-centric relationship networks have become an important extension of traditional enterprise portraits. Enterprises in such networks are connected by similar goals, needs, behaviors and other relationships [4], which helps enterprises gain a deep understanding of the group structure and development rules of their industries.
At present, there have been some research results on the tag construction of portraits [5], which are mainly divided into three dimensions: ontology-based statistical tag construction [6], behavior-based rule tag construction [7–9], and data mining-based tag construction [10–12]. On the one hand, although these tag construction methods of different dimensions can discover text tags, they all have the problem of being unable to effectively discover text information with complex structural features, especially for enterprise texts, which contain not only multi-modal entities such as the name and location of the enterprise, but also the relationship network formed by the connections between entities [13]. This type of enterprise text is more likely to lose significant feature information, which leads to problems such as poor tag generalization ability, difficulty in dealing with fuzzy tags, and unreasonable tag modeling. Although some studies have tried to use integrated methods to fuse multi-dimensional information, these methods have limitations: some ignore the interaction and relevance of multi-source texts, while others cannot truly and accurately reflect the characteristics of enterprises. On the other hand, tag extraction and interest transfer of user preferences in portrait technology have traditionally been challenging problems. In the task of preference tag extraction, most current tag extraction methods only consider word frequency statistics and semantic clustering [14, 15]. These methods have achieved significant results in tasks with prominent features, but they cannot effectively discover preferred topics when dealing with the complex feature texts of enterprise information. In the task of interest transfer, enterprise preferences change under external influences, presenting distinct long-term or short-term preferences [16]. For example, enterprise preferences are related not only to the short-term trends formed by active users in the behavior space, but also to the long-term trends formed by the current environment in the structural space. Therefore, it is necessary to realize the dynamic update of portraits by strengthening the analysis of the interest features of enterprise texts. In this paper, we propose an enterprise adaptive tag extraction method based on multi-feature dynamic portrait (ATEMDP). First, ATEMDP uses K-means to measure the similarity between enterprise texts in preference division and obtains the point cluster structure of enterprise preference topics. In addition, in multi-feature selection, ATEMDP establishes a multi-feature extraction model to obtain the enterprise preference tags and solves the problem of dynamically updating portraits by introducing new enterprise texts.
The contributions of this paper are summarized as follows:
(1)
We propose an enterprise adaptive tag extraction method based on multi-feature dynamic portraits (ATEMDP). ATEMDP can make adaptive adjustments according to the type of input enterprise information data and extract enterprise preference tags.
 
(2)
We propose a solution to update the portrait tag. The trained model is used for tag updating. When an enterprise's interests change, ATEMDP can immediately divide its preferences without updating any models or structures.
 
(3)
ATEMDP uses the large-scale pre-trained language model Bert together with deep learning models. Compared with other traditional methods, ATEMDP is more efficient and more suitable for enterprise data partitioning.
 
The remainder of this paper is organized as follows. We first discuss some related work in the section “Related work”. In the section “Proposed scheme”, we describe the enterprise adaptive tag extraction method based on multi-feature dynamic portraits. In the section “Experimental analysis and discussion”, we present the experimental results and analysis. Finally, we summarize our findings and give a brief overview of future work in the section “Conclusion and future work”.

Related work

Existing research on portrait technology mainly includes studies of portrait applications and the construction of portrait tag systems.
In the research of portrait applications, most portrait applications are based on user portraits. Alan Cooper proposed the concept of Persona, a target user model based on attribute data. Chun et al. [17] proposed the concept of the Corporate Character Scale and discussed the five main dimensions of Agreeableness, Enterprise, Competence, Chic and Ruthlessness, and two secondary dimensions of Informality and Machismo, which were used to evaluate the impact of enterprise reputation on employees and customers. Matova et al. [18] extended the concept of the Corporate Character Scale and used the seven dimensions of tags to investigate two well-known retail enterprises and analyze and reveal their corporate image profiles. Zhang et al. [19] mainly explored the causal and outcome-based factors contributing to mobile social media fatigue by constructing a mobile social media burnout theory model and analyzing user portraits, providing guidance for companies to understand the development trend of mobile social media fatigue. Li et al. [20] proposed a library user portrait model oriented to users' cognitive needs by understanding user needs and the actual application of user portraits in libraries, which promoted the construction and improvement of user portraits. At present, with the rapid development of user portraits, more application research is gradually being carried out on the basis of user portraits.
Portrait construction methods for tag system construction mainly include ontology-based and concept-based methods, topic-based and preference-based methods, behavior-based and log-based methods, and multi-dimensional fusion-based methods. Ontology-based and concept-based methods are widely used to describe users through defined structured information and relationships. Calegari et al. [21] argued that ontology is a powerful way of representing knowledge and defined user portraits through the process of extracting knowledge from existing ontologies to achieve a rich semantic representation of user interests. Leung et al. [22] used a ranking learning method to construct user portraits based on positive and negative concepts according to web query logs and click records. Topic-based and preference-based methods use the user's browsing and attention information to portray the user's image. Tang et al. [23] used topic models to model text information and build user interest models from three dimensions: profile extraction, profile integration and user discovery. Behavior-based and log-based methods use personalized data to portray user characteristics, which play an important role in supplementing the portrait. Zhu et al. [24] proposed a context-aware user portrait construction method by mining user context logs. Multi-dimensional fusion-based methods can model complex information tags and significantly improve portrait quality. Jung et al. [25] proposed a methodology for persona generation using real-time social media data for the distribution of products via online platforms, presenting the overall methodological approach, data analysis process, and system development. In addition, some researchers have proposed methods based on dynamically updating user models. Nasraoui et al. [26] proposed a web usage mining framework for mining evolving user profiles in dynamic web sites. Iglesias et al. [27] studied the dynamic user portrait problem from users' command logs on the Unix shell to update user behavior information. The above methods have achieved remarkable results in dealing with portrait problems, but there are still some drawbacks. For example, relationships between texts are not considered, and simple models cannot effectively deal with texts with complex features. To solve the problem of dynamic portraits in complex situations, deep models provide a new idea for portrait tag modeling.
Another line of research uses machine learning and deep learning for tag extraction and dynamic updating [28]. Wang et al. [29] proposed a generative model, DpRank, within a non-parametric Bayesian framework. By postulating generative assumptions about a user's search behaviors, DpRank identifies each individual user's latent search interests and distinct result preferences in a cooperative manner. To extract the tags of dynamic portraits more accurately, language models are widely used in feature extraction, and neural networks have become a common method for portrait tag modeling owing to their excellent adaptive and real-time learning characteristics [30]. Among language models, Mikolov et al. [31] proposed Word2Vec to prove the validity of word representation in vector space. Pennington et al. [32] and Peters et al. [33] proposed Glove and Elmo, respectively, which achieved great success and solved the problem that Word2Vec only considers the local features of words. Google released the large-scale pre-trained language model Bert [34], based on a bidirectional transformer, in 2018, which has achieved good results in tasks such as tag extraction. Among neural networks, Kim [35] systematically elaborated the TextCNN network and first proposed using TextCNN to extract text tags. Johnson et al. [36] proposed a deep pyramid CNN model, which has low computational complexity and can efficiently represent long-range associations in text and thus more global information. Although the deep pyramid CNN model uses methods such as fixed equal-length convolution and a compressed pooling layer to reduce model training time, it still increases the time complexity of model training. Wang et al. [37] fused a bidirectional RNN and a pooling layer to form the RCNN model to further discover long-distance dependencies in text. This method uses both CNN, which has the advantage of extracting local features, and BiLSTM [38] to discover contextual information. Li et al. [39] also proved that contextual information can accurately describe user characteristics, and models with contextual features have better prediction accuracy. In addition, many researchers have constructed large-scale corpora for feature extension to improve tag accuracy. Zhang et al. [40] combined CRF with BiGRU [41] for entity extraction. This method fuses word embedding features and boundary features to obtain feature vectors.
In summary, current portrait research mainly focuses on system construction and tag extraction. However, most methods only focus on static preferences, ignoring the dynamic updating of preference tags. When faced with complex corporate texts that have both multi-modal entities and heterogeneous relationships, the accuracy of dynamic tag extraction decreases. Although some studies have tried to use integrated methods to fuse multi-dimensional information, these methods also have limitations. Some methods ignore the interaction or relevance of the enterprise texts, while others cannot fully discover the characteristics of the corporate texts, making it easy to lose important information. In view of the problem of enterprise tag modeling, especially the adaptive tag extraction of dynamic portraits, we propose an enterprise adaptive tag extraction method based on multi-feature dynamic portrait (ATEMDP). ATEMDP first uses a clustering algorithm to measure the similarity between enterprise texts. ATEMDP then converts the problem of clustering similar enterprise texts into tag feature clusters to obtain the point cluster structure containing the distribution of tag preference topics. ATEMDP finally uses a neural network-based multi-feature extraction model to fully mine enterprise text information and achieve dynamic tag extraction.

Proposed scheme

Tag extraction is the basic task of tag system construction and portrait construction. To solve the problems of traditional tag extraction models, such as sparse feature vectors, difficulty in dynamically updating tags, and the inability to adaptively adjust tags, we propose an enterprise adaptive tag extraction method based on multi-feature dynamic portrait (ATEMDP). The model structure of ATEMDP is shown in Fig. 1. First, ATEMDP constructs an enterprise thesaurus for feature expansion and uses K-means for topic clustering and preference division of enterprise texts. Then, the divided enterprise texts are fed into the Bert model to extract the hierarchical features, word embedding features, and sentence features of the enterprise texts. Finally, ATEMDP uses BiLSTM and CNN in series to further discover the local and contextual features of the enterprise texts. ATEMDP uses the deep learning model to extract the multi-feature information of the enterprise texts and trains on the sample set with the preference topics obtained by adaptive clustering, which effectively improves the construction efficiency and accuracy of dynamic portrait tags.

Topic clustering and preference division

In the topic clustering and preference division layer of ATEMDP, we first carry out data cleaning on the enterprise network texts. Data cleaning mainly includes the detection and elimination of duplicate data, the counting and filling of missing data, and the screening and removal of abnormal data. In addition, we perform word segmentation, part-of-speech tagging and feature construction, and introduce a professional domain thesaurus to expand the features. Then, we use K-means to cluster the enterprise texts to obtain the enterprise preference topic clusters. K-means is a cluster analysis algorithm based on iterative distance minimization. Compared with other clustering methods, K-means is simple in principle, convenient to implement and fast to converge, and can be widely applied to different types of enterprise text classification, such as enterprise scope and enterprise profile. Finally, we map all enterprises to the generated preference topics to obtain the structured text tags of the enterprises. The process of preference division is shown in Fig. 2.
To divide an appropriate preference topic cluster structure, we calculate the Silhouette Coefficient corresponding to different values of K by combining cohesion and separation, and take the K with the largest Silhouette Coefficient as the number of clusters of the model. We can also adjust the parameter K according to the characteristics of the text and the needs of the enterprise. The Silhouette Coefficient is defined as
$$\begin{aligned} S = \frac{{b - a}}{{\max (a,b)}}, \end{aligned}$$
(1)
where a refers to the average distance between a sample and the other samples in its own cluster (cohesion), and b refers to the average distance between the sample and the samples in the nearest other cluster (separation). The pseudocode of the enterprise preference topic clustering algorithm is described in Algorithm 1.
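As an illustration of this step, the following minimal Python sketch clusters TF-IDF vectors of enterprise texts with K-means and selects K by the largest silhouette coefficient. The vectorizer settings and the candidate range of K are illustrative assumptions, not the exact configuration used in the paper, which additionally expands features with the professional domain thesaurus.

```python
# A minimal sketch of preference division: K-means over TF-IDF vectors with
# silhouette-based selection of K. Names and parameters are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_preference_topics(texts, k_candidates=range(2, 15)):
    # Feature construction; thesaurus-based feature expansion is not shown here.
    X = TfidfVectorizer(max_features=5000).fit_transform(texts)

    best_k, best_score, best_labels = None, -1.0, None
    for k in k_candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        score = silhouette_score(X, labels)  # mean of S = (b - a) / max(a, b)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels
```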

Feature extraction

Feature extraction is the basic task in tag modeling, and the way features are extracted directly affects the quality of ATEMDP. When faced with complex enterprise texts, we build a professional domain thesaurus to effectively handle word embedding and ensure the quality of word segmentation. The Bert model is introduced to extract the hierarchical features of the enterprise texts and obtain the vector sequence \(S = \{ {T_1},{T_2}, \cdots ,{T_i}, \cdots ,{T_{{L_{\max }}}}\}\), which integrates the full-text semantics, where \({T_i}\) is the ith vector integrating semantic information and \({L_{\max }}\) is the length of the vector sequence. Many studies have shown that adding features can effectively improve model performance [42, 43]. BiLSTM is a network that can effectively discover text context information, and CNN can focus on local features of the text. After the Bert model is used to obtain the hierarchical features of the enterprise text, the contextual and local features of the text are still missing, so BiLSTM and CNN are needed to obtain features rich in contextual and local information. Since ATEMDP has rich feature information, it can achieve better results in the face of complex enterprise texts and is more suitable for tag extraction from complex text.
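A minimal sketch of obtaining such a vector sequence from a pre-trained Bert encoder with the Hugging Face transformers library is shown below; the bert-base-chinese checkpoint and the maximum sequence length are assumptions made for illustration, not the authors' exact configuration.

```python
# Sketch: extract the context-aware token vectors T_1..T_Lmax from Bert.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
encoder = BertModel.from_pretrained("bert-base-chinese")

def bert_features(text, max_len=128):
    inputs = tokenizer(text, truncation=True, padding="max_length",
                       max_length=max_len, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)
    # last_hidden_state: (1, L_max, 768), one semantic vector T_i per token
    return outputs.last_hidden_state
```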
Bert is a language model based on a bi-directional transformer, which can discover the relationships between words. The bi-directional process is shown in Fig. 3. The transformer is a stack of six encoders and six decoders. It receives a vector sequence and outputs the processed data sequence: the vector sequence is processed by the six encoders and then transmitted to the six decoders for decoding. The core of the encoder and decoder is the attention mechanism, which can understand the overall meaning of a sentence according to the key points in the sentence. By calculating the attention distribution over the keys and integrating it into the values, the attention value is calculated, as shown in Eq. (2).
$$\begin{aligned} \mathrm {Attention}(Q,K,V)=\mathrm {softmax}\left( \frac{Q \cdot K^T}{\sqrt{d_k}}\right) V, \end{aligned}$$
(2)
where Q, K, V are the query, key and value matrices derived from the input word vectors, and \({d_k}\) is the dimension of the input vector. In the Bert calculation process, the transformer encoder directly connects any two words in a sentence through a one-step calculation and performs a weighted summation over the representations of all words. In this way, the distance between long-distance dependencies is shortened, and the effective utilization of features is improved.
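The following short PyTorch sketch implements the scaled dot-product attention of Eq. (2) on toy tensors, purely to make the computation concrete.

```python
# Scaled dot-product attention, Eq. (2), for a single sequence.
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # pairwise similarity scores
    weights = F.softmax(scores, dim=-1)            # attention distribution over keys
    return weights @ V                             # weighted sum of value vectors

# Example: a sequence of 4 tokens with 8-dimensional representations.
Q = K = V = torch.randn(4, 8)
out = scaled_dot_product_attention(Q, K, V)  # shape (4, 8)
```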

Dynamically extract adaptive tags

We designed a method for dynamically extracting enterprise adaptive tags. On the basis of K-means clustering and Bert language model, the context features are discovered through the BiLSTM network, and then the local features are mined by the CNN network. The vector output sequence is generated after the maximum pooling, which can effectively integrate the context and local features.
RNN is a kind of neural network used to process sequence data. Compared with general neural network models, RNN can effectively process data with sequential changes. LSTM is a special RNN that solves the vanishing gradient and exploding gradient problems in the training of ordinary RNNs. Through a simple network structure, LSTM uses a gating mechanism to retain long-term information in the sequence. LSTM realizes effective training of the model by introducing three gate control units: the forget gate, the input gate, and the output gate. The LSTM structure is shown in Fig. 4.
In Fig. 4, \(\sigma \) refers to the sigmoid function; f, i and o represent the forget gate, input gate and output gate, respectively; W and b represent the weight matrix and bias, respectively; C represents the cell state; \(\overline{{h}} \) represents the hidden layer state. In LSTM, the model first calculates its forgetting degree: the memory state at time t-1 is multiplied by a memory decay coefficient \({f_t}\), which is determined by the network input \({x_t}\) at time t and \({h_{t - 1}}\) at time t-1. \({f_t}\) is shown as
$$\begin{aligned} {f_t} = \sigma ({W_f}[{h_{t - 1}},{x_t}] + {b_f}). \end{aligned}$$
(3)
When there is a new input, it is necessary to calculate whether the input gate \({i_t}\) is activated. \({i_t}\) is determined by \({h_{t - 1}}\) at time t-1 and \({x_t}\) in the network input at time t. \({i_t}\) is shown as
$$\begin{aligned} {i_t} = \sigma ({W_i}[{h_{t - 1}},{x_t}] + {b_i}). \end{aligned}$$
(4)
The new time memory \(\widetilde{{C_t}}\) is obtained through a linear transformation, and it is determined by \({h_{t - 1}}\) at time t-1 and \({x_t}\) in the network input at time t. \(\widetilde{{C_t}}\) is shown as
$$\begin{aligned} \widetilde{{C_t}} = \tanh ({W_C}[{h_{t - 1}},{x_t}] + {b_C}). \end{aligned}$$
(5)
After getting the new candidate memory, multiply \(\widetilde{{C_t}}\) by the input gate \({i_t}\) to get the newly learned memory. Then calculate the memory from time t-1 that needs to be retained, \({f_t}\times {C_{t - 1}}\). The two are added together as the memory state at time t, which is shown as
$$\begin{aligned} {C_t} = {f_t}\times {C_{t - 1}} + {i_t}\times \widetilde{{C_t}}. \end{aligned}$$
(6)
Use \({h_{t - 1}}\) at time t-1 and \({x_t}\) in the network input at time t to calculate the output gate \({o_t}\), which can filter the output data and is shown as
$$\begin{aligned} {o_t} = \sigma ({W_o}[{h_{t - 1}},{x_t}] + {b_o}). \end{aligned}$$
(7)
Finally, the model outputs the result after passing through the tanh function, which is shown as
$$\begin{aligned} \overline{{h_t}} = {o_t}\times \tanh ({C_t}). \end{aligned}$$
(8)
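To make the gate computations concrete, the following NumPy sketch performs a single LSTM step following Eqs. (3)-(8); the weight matrices acting on the concatenation [h_{t-1}, x_t] are assumed to be given.

```python
# One LSTM step following Eqs. (3)-(8); weights act on [h_{t-1}, x_t].
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)        # forget gate, Eq. (3)
    i_t = sigmoid(W_i @ z + b_i)        # input gate, Eq. (4)
    c_tilde = np.tanh(W_C @ z + b_C)    # candidate memory, Eq. (5)
    c_t = f_t * c_prev + i_t * c_tilde  # new cell state, Eq. (6)
    o_t = sigmoid(W_o @ z + b_o)        # output gate, Eq. (7)
    h_t = o_t * np.tanh(c_t)            # hidden state, Eq. (8)
    return h_t, c_t
```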
In text tasks, a one-way LSTM can capture the semantics of the preceding text well, but cannot retain the information of the following text. To compensate for the missing context information, BiLSTM learns the text sequence in both the forward and reverse directions, which makes it easier to find the long-distance dependencies of the text sequence. The forward output \(\overrightarrow{{h_t}} \) and reverse output \(\overleftarrow{{h_t}} \) of BiLSTM are defined in Eqs. (9) and (10), and the final output \({h_t}\) is shown in Eq. (11).
$$\begin{aligned}&\overrightarrow{{h_t}} = \mathrm{{LSTM}}({x_t},\overrightarrow{{h_{t - 1}}} ) ,\end{aligned}$$
(9)
$$\begin{aligned}&\overleftarrow{{h_t}} = \mathrm{{LSTM}}({x_t},\overleftarrow{{h_{t - 1}}} ) ,\end{aligned}$$
(10)
$$\begin{aligned}&{h_t} = \omega \overrightarrow{{h_t}} + \vartheta \overleftarrow{{h_t}} + {b_t} ,\end{aligned}$$
(11)
where \(\omega \), \(\vartheta \), b are the forward output weight matrix, the reverse output weight matrix and the bias, respectively.
After BiLSTM extracts contextual features, they are passed to CNN for maximum pooling processing. This process is shown in Fig. 5. In the pooling layer, the maximum value \({d_i}\) is extracted by the maximum pooling method, where \({d_i}=\max (C)\) and C is the context feature vector extracted by BiLSTM. Compared with average pooling, this process can not only reduce the size of the feature vectors, but also retain the local features of the text. Finally, all the vectors obtained after the maximum pooling are passed to the fully connected layer, and the tag extraction results are output. The detailed process is shown in Algorithm 2.
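A condensed PyTorch sketch of this BiLSTM-CNN head is given below; the hidden sizes, kernel width and number of tags are illustrative assumptions rather than the authors' exact hyperparameters.

```python
# Sketch of the tag-extraction head: Bert features -> BiLSTM -> 1-D convolution
# -> max pooling -> fully connected layer producing tag scores.
import torch
import torch.nn as nn

class TagExtractionHead(nn.Module):
    def __init__(self, bert_dim=768, lstm_hidden=128, conv_channels=100,
                 kernel_size=3, num_tags=10):
        super().__init__()
        self.bilstm = nn.LSTM(bert_dim, lstm_hidden, batch_first=True,
                              bidirectional=True)
        self.conv = nn.Conv1d(2 * lstm_hidden, conv_channels, kernel_size)
        self.fc = nn.Linear(conv_channels, num_tags)

    def forward(self, bert_features):             # (batch, L_max, bert_dim)
        context, _ = self.bilstm(bert_features)   # contextual features
        local = torch.relu(self.conv(context.transpose(1, 2)))
        pooled = local.max(dim=-1).values         # max pooling keeps salient local features
        return self.fc(pooled)                    # scores over preference tags
```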

Experimental analysis and discussion

In this section, we investigate the performances of ATEMDP. First, we describe two real-world data sets. Then we describe the evaluation metrics. Finally, we conduct our experiments on the two data sets and analyze experimental results.

Data set description

The THUCNews data set and the customized enterprise text data set Ente-pku are selected for the experiments. In the experiments, we extracted ten categories and 100,000 news headlines from the THUCNews data set to verify the extraction effect on short texts, covering finance and economics, estate, stocks, education, science, society, politics, sports, games and entertainment, with 10,000 entries in each category. Ente-pku is a data set obtained after crawling, cleaning and filtering the text information of enterprise websites. The Ente-pku data set contains 146,934 pieces of unstructured data and 80,000 pieces of structured data. The unstructured data in Ente-pku are unlabeled data containing the characteristics of enterprises, and the structured data are labeled data with business scope and business preferences, covering four categories of retail, wholesale, metal and construction, with 20,000 entries in each category. We divide the two data sets into training, validation and test sets at a ratio of 8:1:1. We select KNN, SVM, Transformer, CNN, BiLSTM, BiLSTM-CNN and BiLSTM-Attention for comparison to verify the superiority of ATEMDP.
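A minimal sketch of the 8:1:1 stratified split, assuming the samples and labels have already been loaded into Python lists or arrays, could look like this:

```python
# Split data into training/validation/test sets at a ratio of 8:1:1.
from sklearn.model_selection import train_test_split

def split_811(samples, labels, seed=0):
    X_train, X_rest, y_train, y_rest = train_test_split(
        samples, labels, test_size=0.2, stratify=labels, random_state=seed)
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```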

Evaluation metrics

As four widely used evaluation metrics in text classification, we select accuracy, precision, recall and F1-score to evaluate the effect of the model. The accuracy is the ratio of the correct data amount to the total data amount, as shown in Eq. (12). The precision is the ratio of the amount of data predicted to be correct to the amount of data predicted to be positive, as shown in Eq. (13). The recall rate is the ratio of the amount of data that is predicted to be correct to the amount of data that is actually positive, as shown in Eq. (14). F1-score is the harmonic average of precision and recall, as shown in Eq. (15).
$$\begin{aligned}&Accuracy = \frac{{{T_p} + {T_n}}}{{{T_p} + {F_p} + {T_n} + {F_n}}} ,\end{aligned}$$
(12)
$$\begin{aligned}&Precision = \frac{{{T_p}}}{{{T_p} + {F_p}}} ,\end{aligned}$$
(13)
$$\begin{aligned}&Recall = \frac{{{T_p}}}{{{T_p} + {F_n}}} ,\end{aligned}$$
(14)
$$\begin{aligned}&F1 - score = \frac{{2Precision \times Recall}}{{Precision + Recall}} .\end{aligned}$$
(15)
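These four metrics can be computed, for example, with scikit-learn as in the following sketch; macro averaging over categories is our assumption about how the per-class scores are aggregated.

```python
# Compute Eqs. (12)-(15) from gold labels y_true and predictions y_pred.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return {"accuracy": acc, "precision": p, "recall": r, "f1": f1}
```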

Comparison and result analysis

To verify the feasibility and effectiveness of enterprise preference division, we use K-means to cluster the 146,934 unstructured enterprise texts. We calculate the silhouette coefficients corresponding to different values of K and take the K with the largest silhouette coefficient as the number of clusters. The comparison results of silhouette coefficients are shown in Fig. 6. It can be seen from Fig. 6 that the silhouette coefficient fluctuates continuously as K changes. The maximum silhouette coefficient is reached when K is set to 9, and the coefficient then slowly decreases. Therefore, we set K to 9, the value corresponding to the maximum silhouette coefficient, as the parameter of the subsequent experiments. We then perform topic clustering and enterprise preference division according to this parameter, and the results are shown in Fig. 7. It can be seen from Fig. 7 that the text points in each preference cluster are dense, and there are clear boundaries between the clusters. The separation of the point clusters shows that ATEMDP can effectively divide the preferences. When K is large enough, we can also obtain the distribution of enterprise preference topics, which is represented by the word cloud in Fig. 8. It can be seen from Fig. 8 that the most popular preference topics include retail, wholesale, software, etc. The results indicate that SMEs and Internet businesses have received more attention. In summary, ATEMDP can effectively divide the preferences of enterprises and has a certain generalization ability in enterprise profiling.
In dynamic tag extraction, we compare ATEMDP with the other models on the THUCNews and Ente-pku data sets, and the results are shown in Table 1. Compared with the other models, the extraction accuracy of ATEMDP on the Ente-pku data set reaches 95.05%, and the precision, recall and F1-score are all improved by more than 1.1%. On the THUCNews data set with sparse features, ATEMDP performs even better, with precision, recall and F1-score increased by 3.46%, 3.56% and 3.53%, respectively. The reason is that ATEMDP has stronger feature mining capabilities than the other models and can discover more deep-level features in data sets with sparse features, thereby improving the efficiency of tag extraction. All of these are beneficial effects brought by the multi-feature design.
Table 1  Experimental comparison results

Models              THUCNews               Ente-pku
                    P/%    R/%    F1/%     P/%    R/%    F1/%
KNN                 64.67  63.06  63.45    85.97  85.90  85.92
SVM                 79.52  78.67  78.91    91.94  91.93  91.93
Transformer         88.25  88.08  88.09    92.80  92.65  92.66
CNN                 89.52  89.46  89.47    93.42  93.43  93.42
BiLSTM              89.24  89.12  89.09    93.61  93.61  93.61
BiLSTM-CNN          89.73  89.63  89.65    93.75  93.74  93.74
BiLSTM-Attention    89.33  89.30  89.27    93.92  93.91  93.91
ATEMDP              93.19  93.19  93.16    95.05  95.05  95.05
To show the advantages and disadvantages of the various models more intuitively, we test on the validation set after each epoch. We compare the performance of each model using the curves of accuracy and loss on the data sets. The iterative changes of each model on the THUCNews data set are shown in Figs. 9 and 10, and those on the Ente-pku data set are shown in Figs. 11 and 12. It can be seen from Figs. 9 and 11 that, in the experiments on the THUCNews and Ente-pku data sets, after the first epoch the accuracy of ATEMDP is significantly higher than that of the other models and is stable at more than 92% and 94%, respectively. This is because ATEMDP uses a bidirectional transformer to generate feature vectors for the neural network, which can discover long-distance dependencies and hidden features in the text, while BiLSTM and CNN further discover contextual and local features. Thus, the accuracy of the model is significantly improved. However, the one-way transformer model cannot find text features effectively, which leads to poor tag extraction, and its accuracy fluctuates greatly across epochs. Figures 10 and 12 show the convergence of each model on the THUCNews and Ente-pku data sets, respectively. It can be seen from the loss curves that ATEMDP acquires features and converges faster than the other models, achieving a better convergence effect in fewer iterations. The loss values of the other models still fluctuate greatly after ATEMDP converges, and they are prone to overfitting. We also give the accuracy curves of the training sets to show that this fast convergence is a property of the model. Figures 13 and 14 are the accuracy curves of the THUCNews and Ente-pku training sets. It can be seen from Figs. 13 and 14 that the accuracy of the training sets fluctuates within a reasonable range, so the fast convergence does not mean the model prematurely falls into a local optimum.
The confusion matrix can visually represent the extraction effect of different models on each category. Figure 15 shows the confusion matrix statistics of the Ente-pku test set. As can be seen from Fig. 15, there are obvious differences in classification among the different models, and the accuracy of the deep learning models is generally higher than that of the other machine learning models. The reason is that the deep learning models can better obtain features to adapt to complex enterprise texts. However, due to sparse features, the deep learning models still perform poorly on some categories. The ATEMDP model with multi-features not only has a good classification effect on categories 2 and 3 with obvious features, but also improves greatly on categories 1 and 4, which have sparse features and are difficult to delineate. Compared with the other models, ATEMDP has a good division effect on data sets with strong features, but its structure is more complicated, so it consumes more computing resources under the same conditions.

Conclusion and future work

We have proposed an enterprise adaptive tag extraction method based on multi-feature dynamic portrait (ATEMDP) to solve the tag updating problem of dynamic portraits. ATEMDP first uses K-means to measure the similarity between enterprise texts and transforms the problem of clustering similar enterprise texts into tag feature clusters to obtain a point cluster structure containing tag preference topics. In addition, in multi-feature selection, ATEMDP feeds the topic texts into the Bert model as a sample set to discover the potential features of the enterprise texts. Finally, ATEMDP uses a neural network-based multi-feature extraction model to fully mine enterprise text information and achieve dynamic tag extraction. ATEMDP makes full use of the feature information of the text and can make adaptive adjustments according to the data type. When the interests of an enterprise change, ATEMDP does not need to update any models or structures and can immediately divide the preferences, which effectively solves the difficulty of updating enterprise portrait tags and the problem of low extraction accuracy.
For future research, we will focus on three aspects: (1) we would like to embed additional enterprise tag data, such as enterprise contextual information, to improve the accuracy of the model; (2) we would like to explore replacing K-means with other clustering methods for better model performance; (3) we will continue to improve ATEMDP and use the idea of knowledge distillation to output results for simple texts early, so as to avoid redundant computation on such samples.

Declarations

Conflict of interest

Not applicable.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References

6. Maria G, Akrivi K, Costas V, George L, Constantin H (2007) Creating an ontology for the user profile: method and applications. In: Proceedings AI*AI Workshop RCIS. Citeseer

13. Wu L, Ge Y, Liu Q, Chen E, Long B, Huang Z (2016) Modeling users preferences and social links in social networking services: a joint-evolving perspective. In: Proceedings of the AAAI Conference on Artificial Intelligence, 30

25. Jung SG, An J, Kwak H, Ahmad M, Nielsen L, Jansen BJ (2017) Persona generation from aggregated social media data. In: Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems, p 1748–1755. https://doi.org/10.1145/3027063.3053120

31. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, p 3111–3119

35. Kim Y (2014) Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), p 1746–1751

36. Johnson R, Zhang T (2017) Deep pyramid convolutional neural networks for text categorization. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p 562–570. https://doi.org/10.18653/v1/P17-1052

43. Liu W, Zhou P, Zhao Z, Wang Z, Ju Q, Deng H, Wang P (2020) K-Bert: Enabling language representation with knowledge graph. AAAI Conf Artif Intell 34:2901–2908
Metadata

Title: An enterprise adaptive tag extraction method based on multi-feature dynamic portrait
Authors: Xiang Li, Xingshuo Ding, Qian Xie, Shangbing Gao, Quanyin Zhu
Publication date: 18.03.2023
Publisher: Springer International Publishing
Published in: Complex & Intelligent Systems, Issue 5/2023
Print ISSN: 2199-4536
Electronic ISSN: 2198-6053
DOI: https://doi.org/10.1007/s40747-023-01029-z
