
Open Access 18.03.2023 | Original Article

An enterprise adaptive tag extraction method based on multi-feature dynamic portrait

Authors: Xiang Li, Xingshuo Ding, Qian Xie, Shangbing Gao, Quanyin Zhu

Published in: Complex & Intelligent Systems | Issue 5/2023


Abstract

User portrait has become a research hot spot in the field of knowledge graph in recent years, and the rationality of tag extraction directly affects the quality of a user portrait. However, most current tag extraction methods for portraits rely only on word frequency statistics and semantic clustering. These methods have some drawbacks: they cannot effectively discover the preferred topics of an enterprise, dynamically update portrait tags, or adapt to the needs of the enterprise. In this paper, we propose an enterprise adaptive tag extraction method based on multi-feature dynamic portrait (ATEMDP). In preference division, ATEMDP first uses K-means to measure the similarity between enterprise texts and converts the clustering of similar enterprise texts into tag feature clusters, obtaining a point cluster structure that contains the distribution of tag preference topics. In multi-feature selection, a professional domain thesaurus is introduced for feature expansion, and the topic texts are fed into the Bert model as a sample set to discover the potential features of the enterprise texts. Finally, in dynamic tag extraction, BiLSTM and CNN are used to extract features, and dynamic preference tags are obtained by updating the enterprise texts. The THUCNews and Ente-pku data sets are used for simulation, and seven other methods are compared. The experimental results indicate that ATEMDP is not only superior to conventional methods in accuracy and F1-score, but also effectively solves the dynamic tagging problem of enterprise portraits.

Introduction

In recent years, with the rapid development of knowledge graph and big data technology, user portrait technology has attracted wide interest from both academia and industry [1]. Currently popular enterprise profile technology has been widely used in e-commerce, risk assessment, market supervision and other fields [2]. Websites containing enterprise portraits not only have the information service functions of traditional websites, but also provide various tagging-related services, such as hot spot analysis and enterprise recommendations [3]. As a new application of portrait technology, an enterprise portrait not only includes multi-modal entity tags such as the name, location and keywords of the enterprise, but also includes many interest and preference topic tags, such as the R&D direction and business scope of the enterprise. These tags mix together to form very complex structural features; in particular, the various types of enterprise-centric relationship networks have become an important extension of traditional enterprise portraits. Enterprises in such networks are connected by similar goals, needs, behaviors and other relationships [4], which helps enterprises gain a deep understanding of the group structure and development rules of their industries.
At present, there have been some research results on the tag construction of portraits [5], which are mainly divided into three dimensions: ontology-based statistical tag construction [6], behavior-based rule tag construction [7–9], and data mining-based tag construction [10–12]. On the one hand, although these tag construction methods of different dimensions can discover text tags, they all have the problem of being unable to effectively discover text information with complex structural features, especially for enterprise texts, which contain not only multi-modal entities such as the name and location of the enterprise, but also the relationship network formed by the connections between entities [13]. This type of enterprise text is more likely to lose significant feature information, which leads to problems such as poor tag generalization ability, difficulty in dealing with fuzzy tags, and unreasonable tag modeling. Although some studies have tried to use integrated methods to fuse multi-dimensional information, these methods have limitations: some ignore the interaction and relevance of multi-source texts, while others cannot truly and accurately reflect the characteristics of enterprises. On the other hand, tag extraction and interest transfer of user preferences in portrait technology have traditionally been challenging problems. In the task of preference tag extraction, most current tag extraction methods only consider word frequency statistics and semantic clustering [14, 15]. These methods have achieved significant results in tasks with prominent features, but they cannot effectively discover preferred topics when dealing with the complex feature texts of enterprise information. In the task of interest transfer, enterprise preferences change under external influences, presenting distinct long-term or short-term preferences [16]. For example, enterprise preferences are related not only to the short-term trends formed by active users in the behavior space, but also to the long-term trends formed by the current environment in the structural space. Therefore, it is necessary to realize the dynamic update of portraits by strengthening the analysis of the interest features of enterprise texts. In this paper, we propose an enterprise adaptive tag extraction method based on multi-feature dynamic portrait (ATEMDP). First, ATEMDP uses K-means to measure the similarity between enterprise texts in preference division and obtains the point cluster structure of enterprise preference topics. In addition, in multi-feature selection, ATEMDP establishes a multi-feature extraction model to obtain the enterprise preference tags and solves the problem of dynamically updating portraits by introducing new enterprise texts.
The contributions of this paper are summarized as follows:
(1)
We propose an enterprise adaptive tag extraction method based on multi-feature dynamic portraits (ATEMDP). ATEMDP can make adaptive adjustments according to the type of input enterprise information data and extract enterprise preference tags.
 
(2)
We propose a solution to update the portrait tag. The trained model is used for tag updating. When an enterprise's interests change, ATEMDP can immediately divide its preferences without updating any models or structures.
 
(3)
ATEMDP uses the large-scale pre-trained language model Bert together with deep learning models. Compared with other traditional methods, ATEMDP is more efficient and more suitable for enterprise data partitioning.
 
The remainder of this paper is organized as follows. We first discuss some related work in the section “Related work”. In the section “Proposed scheme”, we describe the enterprise adaptive tag extraction method based on multi-feature dynamic portraits. In the section “Experimental analysis and discussion”, we present the experimental results and analysis. Finally, we summarize our findings and give a brief overview of future work in the section “Conclusion and future work”.

Related work

Existing research on portrait technology mainly includes studies of portrait applications and the construction of portrait tag systems.
In the research of portrait applications, most portrait applications are based on user portraits. Alan Cooper proposed the concept of Persona, a target user model based on attribute data. Chun et al. [17] proposed the concept of the Corporate Character Scale and discussed the five main dimensions of Agreeableness, Enterprise, Competence, Chic and Ruthlessness, and two secondary dimensions of Informality and Machismo, which were used to evaluate the impact of enterprise reputation on employees and customers. Matova et al. [18] extended the concept of the Corporate Character Scale and used the seven dimensions of tags to investigate two well-known retail enterprises and analyze and reveal their corporate image profiles. Zhang et al. [19] mainly explored the causal and outcome-based factors contributing to mobile social media fatigue by constructing a mobile social media burnout theory model and analyzing user portraits, providing guidance for companies to understand the development trend of mobile social media fatigue. Li et al. [20] proposed a library user portrait model oriented to users' cognitive needs by understanding user needs and the actual application of user portraits in libraries, which promoted the construction and improvement of user portraits. At present, with the rapid development of user portraits, more application research is gradually being carried out on the basis of user portraits.
Portrait construction methods for tag system construction mainly include ontology-based and concept-based methods, topic-based and preference-based methods, behavior-based and log-based methods, and multi-dimensional fusion-based methods. Ontology-based and concept-based methods are widely used to describe users through defined structured information and relationships. Calegari et al. [21] argued that ontology is a powerful way of representing knowledge and defined user portraits through the process of extracting knowledge from existing ontologies to achieve a rich semantic representation of user interests. Leung et al. [22] used a ranking learning method to construct user portraits based on positive and negative concepts according to web query logs and click records. Topic-based and preference-based methods use the user's browsing and attention information to portray the user's image. Tang et al. [23] used topic models to model text information and build user interest models from three dimensions: profile extraction, profile integration and user discovery. Behavior-based and log-based methods use personalized data to portray user characteristics, which play an important role in supplementing the portrait. Zhu et al. [24] proposed a context-aware user portrait construction method by mining user context logs. Multi-dimensional fusion-based methods can model complex information tags and significantly improve portrait quality. Jung et al. [25] proposed a methodology for persona generation using real-time social media data for the distribution of products via online platforms, presenting the overall methodological approach, data analysis process, and system development. In addition, some researchers have proposed methods based on dynamically updating user models. Nasraoui et al. [26] proposed a web usage mining framework for mining evolving user profiles in dynamic web sites. Iglesias et al. [27] studied the dynamic user portrait problem from users' command logs on the Unix shell to update user behavior information. The above methods have achieved remarkable results in dealing with portrait problems, but there are still some drawbacks. For example, relationships between texts are not considered, and simple models cannot effectively deal with texts with complex features. To solve the problem of dynamic portraits in complex situations, deep models provide a new idea for portrait tag modeling.
Another line of research uses machine learning and deep learning for tag extraction and dynamic updating [28]. Wang et al. [29] proposed a generative model, DpRank, within a non-parametric Bayesian framework. By postulating generative assumptions about a user's search behaviors, DpRank identifies each individual user's latent search interests and distinct result preferences in a cooperative manner. To extract the tags of dynamic portraits more accurately, language models are widely used in feature extraction, and neural networks have become a common method for portrait tag modeling owing to their excellent adaptive and real-time learning characteristics [30]. Among language models, Mikolov et al. [31] proposed Word2Vec to prove the validity of word representation in vector space. Pennington et al. [32] and Peters et al. [33] proposed Glove and Elmo, respectively, which achieved great success and solved the problem that Word2Vec only considers the local features of words. Google released the large-scale pre-trained language model Bert [34], based on a bidirectional transformer, in 2018, which has achieved good results in tasks such as tag extraction. Among neural networks, Kim [35] systematically elaborated the TextCNN network and first proposed using TextCNN to extract text tags. Johnson et al. [36] proposed a deep pyramid CNN model, which has low computational complexity and can efficiently represent long-range associations in text and thus more global information. Although the deep pyramid CNN model uses methods such as fixed equal-length convolution and a compressed pooling layer to reduce model training time, it still increases the time complexity of model training. Wang et al. [37] fused a bidirectional RNN and a pooling layer to form the RCNN model to further discover long-distance dependencies in text. This method uses both CNN, which has the advantage of extracting local features, and BiLSTM [38] to discover contextual information. Li et al. [39] also proved that contextual information can accurately describe user characteristics, and models with contextual features have better prediction accuracy. In addition, many researchers have constructed large-scale corpora for feature extension to improve tag accuracy. Zhang et al. [40] combined CRF with BiGRU [41] for entity extraction. This method fuses word embedding features and boundary features to obtain feature vectors.
In summary, current portrait research mainly focuses on system construction and tag extraction. However, most methods only focus on static preferences, ignoring the dynamic updating of preference tags. When faced with complex corporate texts that have both multi-modal entities and heterogeneous relationships, the accuracy of dynamic tag extraction decreases. Although some studies have tried to use integrated methods to fuse multi-dimensional information, these methods also have limitations. Some methods ignore the interaction or relevance of the enterprise texts, while others cannot fully discover the characteristics of the corporate texts, making it easy to lose important information. In view of the problem of enterprise tag modeling, especially the adaptive tag extraction of dynamic portraits, we propose an enterprise adaptive tag extraction method based on multi-feature dynamic portrait (ATEMDP). ATEMDP first uses a clustering algorithm to measure the similarity between enterprise texts. ATEMDP then converts the problem of clustering similar enterprise texts into tag feature clusters to obtain the point cluster structure containing the distribution of tag preference topics. ATEMDP finally uses a neural network-based multi-feature extraction model to fully mine enterprise text information and achieve dynamic tag extraction.

Proposed scheme

Tag extraction is the basic task of tag system construction and portrait construction. To solve the problems of traditional tag extraction models, such as sparse feature vectors, difficulty in dynamically updating tags, and the inability to adaptively adjust tags, we propose an enterprise adaptive tag extraction method based on multi-feature dynamic portrait (ATEMDP). The model structure of ATEMDP is shown in Fig. 1. First, ATEMDP constructs an enterprise thesaurus for feature expansion and uses K-means for topic clustering and preference division of enterprise texts. Then, the divided enterprise texts are fed into the Bert model to extract the hierarchical features, word embedding features, and sentence features of the enterprise texts. Finally, ATEMDP uses BiLSTM and CNN in series to further discover the local and contextual features of the enterprise texts. ATEMDP uses the deep learning model to extract the multi-feature information of the enterprise texts and trains on the sample set with the preference topics obtained by adaptive clustering, which effectively improves the construction efficiency and accuracy of dynamic portrait tags.

Topic clustering and preference division

In the topic clustering and preference division layer of ATEMDP, we first carry out data cleaning on the enterprise network texts. Data cleaning mainly includes the detection and elimination of duplicate data, the counting and filling of missing data, and the screening and removal of abnormal data. In addition, we perform word segmentation, part-of-speech tagging and feature construction, and introduce a professional domain thesaurus to expand the features. Then, we use K-means to cluster the enterprise texts to obtain the enterprise preference topic clusters. K-means is a cluster analysis algorithm based on iterative distance minimization. Compared with other clustering methods, K-means is simple in principle, convenient to implement and fast to converge, and can be widely applied to different types of enterprise text classification, such as enterprise scope and enterprise profile. Finally, we map all enterprises to the generated preference topics to obtain the structured text tags of the enterprises. The process of preference division is shown in Fig. 2.
To divide an appropriate preference topic cluster structure, we calculate the Silhouette Coefficient corresponding to different values of K by combining cohesion and separation, and take the K with the largest Silhouette Coefficient as the number of clusters of the model. We can also adjust the parameter K according to the characteristics of the text and the needs of the enterprise. The Silhouette Coefficient is defined as
$$\begin{aligned} S = \frac{{b - a}}{{\max (a,b)}}, \end{aligned}$$
(1)
where a refers to the average distance between a sample and the other samples in its own cluster (cohesion), and b refers to the average distance between the sample and the samples in the nearest other cluster (separation). The pseudocode of the enterprise preference topic clustering algorithm is described in Algorithm 1.
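As an illustration of this step, the following minimal Python sketch clusters TF-IDF vectors of enterprise texts with K-means and selects K by the largest silhouette coefficient. The vectorizer settings and the candidate range of K are illustrative assumptions, not the exact configuration used in the paper, which additionally expands features with the professional domain thesaurus.

```python
# A minimal sketch of preference division: K-means over TF-IDF vectors with
# silhouette-based selection of K. Names and parameters are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_preference_topics(texts, k_candidates=range(2, 15)):
    # Feature construction; thesaurus-based feature expansion is not shown here.
    X = TfidfVectorizer(max_features=5000).fit_transform(texts)

    best_k, best_score, best_labels = None, -1.0, None
    for k in k_candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        score = silhouette_score(X, labels)  # mean of S = (b - a) / max(a, b)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels
```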

Feature extraction

Feature extraction is the basic task in tag modeling, and the way features are extracted directly affects the quality of ATEMDP. When faced with complex enterprise texts, we build a professional domain thesaurus to effectively handle word embedding and ensure the quality of word segmentation. The Bert model is introduced to extract the hierarchical features of the enterprise texts and obtain the vector sequence \(S = \{ {T_1},{T_2}, \cdots ,{T_i}, \cdots ,{T_{{L_{\max }}}}\}\), which integrates the full-text semantics, where \({T_i}\) is the ith vector integrating semantic information and \({L_{\max }}\) is the length of the vector sequence. Many studies have shown that adding features can effectively improve model performance [42, 43]. BiLSTM is a network that can effectively discover text context information, and CNN can focus on local features of the text. After the Bert model is used to obtain the hierarchical features of the enterprise text, the contextual and local features of the text are still missing, so BiLSTM and CNN are needed to obtain features rich in contextual and local information. Since ATEMDP has rich feature information, it can achieve better results in the face of complex enterprise texts and is more suitable for tag extraction from complex text.
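A minimal sketch of obtaining such a vector sequence from a pre-trained Bert encoder with the Hugging Face transformers library is shown below; the bert-base-chinese checkpoint and the maximum sequence length are assumptions made for illustration, not the authors' exact configuration.

```python
# Sketch: extract the context-aware token vectors T_1..T_Lmax from Bert.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
encoder = BertModel.from_pretrained("bert-base-chinese")

def bert_features(text, max_len=128):
    inputs = tokenizer(text, truncation=True, padding="max_length",
                       max_length=max_len, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)
    # last_hidden_state: (1, L_max, 768), one semantic vector T_i per token
    return outputs.last_hidden_state
```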
Bert is a language model based on a bi-directional transformer, which can discover the relationships between words. The bi-directional process is shown in Fig. 3. The transformer is a stack of six encoders and six decoders. It receives a vector sequence and outputs the processed data sequence: the vector sequence is processed by the six encoders and then transmitted to the six decoders for decoding. The core of the encoder and decoder is the attention mechanism, which can understand the overall meaning of a sentence according to the key points in the sentence. By calculating the attention distribution over the keys and integrating it into the values, the attention value is calculated, as shown in Eq. (2).
$$\begin{aligned} \mathrm {Attention}(Q,K,V)=\mathrm {softmax}\left( \frac{Q \cdot K^T}{\sqrt{d_k}}\right) V, \end{aligned}$$
(2)
where Q, K, V are the query, key and value matrices derived from the input word vectors, and \({d_k}\) is the dimension of the input vector. In the Bert calculation process, the transformer encoder directly connects any two words in a sentence through a one-step calculation and performs a weighted summation over the representations of all words. In this way, the distance between long-distance dependencies is shortened, and the effective utilization of features is improved.
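The following short PyTorch sketch implements the scaled dot-product attention of Eq. (2) on toy tensors, purely to make the computation concrete.

```python
# Scaled dot-product attention, Eq. (2), for a single sequence.
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # pairwise similarity scores
    weights = F.softmax(scores, dim=-1)            # attention distribution over keys
    return weights @ V                             # weighted sum of value vectors

# Example: a sequence of 4 tokens with 8-dimensional representations.
Q = K = V = torch.randn(4, 8)
out = scaled_dot_product_attention(Q, K, V)  # shape (4, 8)
```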

Dynamically extract adaptive tags

We designed a method for dynamically extracting enterprise adaptive tags. On the basis of K-means clustering and Bert language model, the context features are discovered through the BiLSTM network, and then the local features are mined by the CNN network. The vector output sequence is generated after the maximum pooling, which can effectively integrate the context and local features.
RNN is a kind of neural network used to process sequence data. Compared with general neural network models, RNN can effectively process data with sequential changes. LSTM is a special RNN that solves the vanishing gradient and exploding gradient problems in the training of ordinary RNNs. Through a simple network structure, LSTM uses a gating mechanism to retain long-term information in the sequence. LSTM realizes effective training of the model by introducing three gate control units: the forget gate, the input gate, and the output gate. The LSTM structure is shown in Fig. 4.
In Fig. 4, \(\sigma \) refers to the sigmoid function; f, i and o represent the forget gate, input gate and output gate, respectively; W and b represent the weight matrix and bias, respectively; C represents the cell state; \(\overline{{h}} \) represents the hidden layer state. In LSTM, the model first calculates its forgetting degree: the memory state at time t-1 is multiplied by a memory decay coefficient \({f_t}\), which is determined by the network input \({x_t}\) at time t and \({h_{t - 1}}\) at time t-1. \({f_t}\) is shown as
$$\begin{aligned} {f_t} = \sigma ({W_f}[{h_{t - 1}},{x_t}] + {b_f}). \end{aligned}$$
(3)
When there is a new input, it is necessary to calculate whether the input gate \({i_t}\) is activated. \({i_t}\) is determined by \({h_{t - 1}}\) at time t-1 and \({x_t}\) in the network input at time t. \({i_t}\) is shown as
$$\begin{aligned} {i_t} = \sigma ({W_i}[{h_{t - 1}},{x_t}] + {b_i}). \end{aligned}$$
(4)
The new time memory \(\widetilde{{C_t}}\) is obtained through a linear transformation, and it is determined by \({h_{t - 1}}\) at time t-1 and \({x_t}\) in the network input at time t. \(\widetilde{{C_t}}\) is shown as
$$\begin{aligned} \widetilde{{C_t}} = \tanh ({W_C}[{h_{t - 1}},{x_t}] + {b_C}). \end{aligned}$$
(5)
After getting the new candidate memory, multiply \(\widetilde{{C_t}}\) by the input gate \({i_t}\) to get the newly learned memory. Then calculate the memory from time t-1 that needs to be retained, \({f_t}\times {C_{t - 1}}\). The two are added together as the memory state at time t, which is shown as
$$\begin{aligned} {C_t} = {f_t}\times {C_{t - 1}} + {i_t}\times \widetilde{{C_t}}. \end{aligned}$$
(6)
Use \({h_{t - 1}}\) at time t-1 and \({x_t}\) in the network input at time t to calculate the output gate \({o_t}\), which can filter the output data and is shown as
$$\begin{aligned} {o_t} = \sigma ({W_o}[{h_{t - 1}},{x_t}] + {b_o}). \end{aligned}$$
(7)
Finally, the model outputs the result after passing through the tanh function, which is shown as
$$\begin{aligned} \overline{{h_t}} = {o_t}\times \tanh ({C_t}). \end{aligned}$$
(8)
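To make the gate computations concrete, the following NumPy sketch performs a single LSTM step following Eqs. (3)-(8); the weight matrices acting on the concatenation [h_{t-1}, x_t] are assumed to be given.

```python
# One LSTM step following Eqs. (3)-(8); weights act on [h_{t-1}, x_t].
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)        # forget gate, Eq. (3)
    i_t = sigmoid(W_i @ z + b_i)        # input gate, Eq. (4)
    c_tilde = np.tanh(W_C @ z + b_C)    # candidate memory, Eq. (5)
    c_t = f_t * c_prev + i_t * c_tilde  # new cell state, Eq. (6)
    o_t = sigmoid(W_o @ z + b_o)        # output gate, Eq. (7)
    h_t = o_t * np.tanh(c_t)            # hidden state, Eq. (8)
    return h_t, c_t
```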
In text tasks, a one-way LSTM can capture the semantics of the preceding text well, but cannot retain the information of the following text. To compensate for the missing context information, BiLSTM learns the text sequence in both the forward and reverse directions, which makes it easier to find the long-distance dependencies of the text sequence. The forward output \(\overrightarrow{{h_t}} \) and reverse output \(\overleftarrow{{h_t}} \) of BiLSTM are defined in Eqs. (9) and (10), and the final output \({h_t}\) is shown in Eq. (11).
$$\begin{aligned}&\overrightarrow{{h_t}} = \mathrm{{LSTM}}({x_t},\overrightarrow{{h_{t - 1}}} ) ,\end{aligned}$$
(9)
$$\begin{aligned}&\overleftarrow{{h_t}} = \mathrm{{LSTM}}({x_t},\overleftarrow{{h_{t - 1}}} ) ,\end{aligned}$$
(10)
$$\begin{aligned}&{h_t} = \omega \overrightarrow{{h_t}} + \vartheta \overleftarrow{{h_t}} + {b_t} ,\end{aligned}$$
(11)
where \(\omega \), \(\vartheta \), b are the forward output weight matrix, the reverse output weight matrix and the bias, respectively.
After BiLSTM extracts contextual features, they are passed to CNN for maximum pooling processing. This process is shown in Fig. 5. In the pooling layer, the maximum value \({d_i}\) is extracted by the maximum pooling method, where \({d_i}=\max (C)\) and C is the context feature vector extracted by BiLSTM. Compared with average pooling, this process can not only reduce the size of the feature vectors, but also retain the local features of the text. Finally, all the vectors obtained after the maximum pooling are passed to the fully connected layer, and the tag extraction results are output. The detailed process is shown in Algorithm 2.
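A condensed PyTorch sketch of this BiLSTM-CNN head is given below; the hidden sizes, kernel width and number of tags are illustrative assumptions rather than the authors' exact hyperparameters.

```python
# Sketch of the tag-extraction head: Bert features -> BiLSTM -> 1-D convolution
# -> max pooling -> fully connected layer producing tag scores.
import torch
import torch.nn as nn

class TagExtractionHead(nn.Module):
    def __init__(self, bert_dim=768, lstm_hidden=128, conv_channels=100,
                 kernel_size=3, num_tags=10):
        super().__init__()
        self.bilstm = nn.LSTM(bert_dim, lstm_hidden, batch_first=True,
                              bidirectional=True)
        self.conv = nn.Conv1d(2 * lstm_hidden, conv_channels, kernel_size)
        self.fc = nn.Linear(conv_channels, num_tags)

    def forward(self, bert_features):             # (batch, L_max, bert_dim)
        context, _ = self.bilstm(bert_features)   # contextual features
        local = torch.relu(self.conv(context.transpose(1, 2)))
        pooled = local.max(dim=-1).values         # max pooling keeps salient local features
        return self.fc(pooled)                    # scores over preference tags
```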

Experimental analysis and discussion

In this section, we investigate the performances of ATEMDP. First, we describe two real-world data sets. Then we describe the evaluation metrics. Finally, we conduct our experiments on the two data sets and analyze experimental results.

Data set description

The THUCNews data set and the customized enterprise text data set Ente-pku are selected for the experiments. In the experiments, we extracted ten categories and 100,000 news headlines from the THUCNews data set to verify the extraction effect on short texts, covering finance and economics, estate, stocks, education, science, society, politics, sports, games and entertainment, with 10,000 entries in each category. Ente-pku is a data set obtained after crawling, cleaning and filtering the text information of enterprise websites. The Ente-pku data set contains 146,934 pieces of unstructured data and 80,000 pieces of structured data. The unstructured data in Ente-pku are unlabeled data containing the characteristics of enterprises, and the structured data are labeled data with business scope and business preferences, covering four categories of retail, wholesale, metal and construction, with 20,000 entries in each category. We divide the two data sets into training, validation and test sets at a ratio of 8:1:1. We select KNN, SVM, Transformer, CNN, BiLSTM, BiLSTM-CNN and BiLSTM-Attention for comparison to verify the superiority of ATEMDP.
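A minimal sketch of the 8:1:1 stratified split, assuming the samples and labels have already been loaded into Python lists or arrays, could look like this:

```python
# Split data into training/validation/test sets at a ratio of 8:1:1.
from sklearn.model_selection import train_test_split

def split_811(samples, labels, seed=0):
    X_train, X_rest, y_train, y_rest = train_test_split(
        samples, labels, test_size=0.2, stratify=labels, random_state=seed)
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```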

Evaluation metrics

As four widely used evaluation metrics in text classification, we select accuracy, precision, recall and F1-score to evaluate the effect of the model. The accuracy is the ratio of the correct data amount to the total data amount, as shown in Eq. (12). The precision is the ratio of the amount of data predicted to be correct to the amount of data predicted to be positive, as shown in Eq. (13). The recall rate is the ratio of the amount of data that is predicted to be correct to the amount of data that is actually positive, as shown in Eq. (14). F1-score is the harmonic average of precision and recall, as shown in Eq. (15).
$$\begin{aligned}&Accuracy = \frac{{{T_p} + {T_n}}}{{{T_p} + {F_p} + {T_n} + {F_n}}} ,\end{aligned}$$
(12)
$$\begin{aligned}&Precision = \frac{{{T_p}}}{{{T_p} + {F_p}}} ,\end{aligned}$$
(13)
$$\begin{aligned}&Recall = \frac{{{T_p}}}{{{T_p} + {F_n}}} ,\end{aligned}$$
(14)
$$\begin{aligned}&F1 - score = \frac{{2Precision \times Recall}}{{Precision + Recall}} .\end{aligned}$$
(15)
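These four metrics can be computed, for example, with scikit-learn as in the following sketch; macro averaging over categories is our assumption about how the per-class scores are aggregated.

```python
# Compute Eqs. (12)-(15) from gold labels y_true and predictions y_pred.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return {"accuracy": acc, "precision": p, "recall": r, "f1": f1}
```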

Comparison and result analysis

To verify the feasibility and effectiveness of enterprise preference division, we use K-means to cluster the 146,934 unstructured enterprise texts. We calculate the silhouette coefficients corresponding to different values of K and take the K with the largest silhouette coefficient as the number of clusters. The comparison results of silhouette coefficients are shown in Fig. 6. It can be seen from Fig. 6 that the silhouette coefficient fluctuates continuously as K changes. The maximum silhouette coefficient is reached when K is set to 9, and the coefficient then slowly decreases. Therefore, we set K to 9, the value corresponding to the maximum silhouette coefficient, as the parameter of the subsequent experiments. We then perform topic clustering and enterprise preference division according to this parameter, and the results are shown in Fig. 7. It can be seen from Fig. 7 that the text points in each preference cluster are dense, and there are clear boundaries between the clusters. The separation of the point clusters shows that ATEMDP can effectively divide the preferences. When K is large enough, we can also obtain the distribution of enterprise preference topics, which is represented by the word cloud in Fig. 8. It can be seen from Fig. 8 that the most popular preference topics include retail, wholesale, software, etc. The results indicate that SMEs and Internet businesses have received more attention. In summary, ATEMDP can effectively divide the preferences of enterprises and has a certain generalization ability in enterprise profiling.
In dynamic tag extraction, we compare ATEMDP with the other models on the THUCNews and Ente-pku data sets, and the results are shown in Table 1. Compared with the other models, the extraction accuracy of ATEMDP on the Ente-pku data set reaches 95.05%, and the precision, recall and F1-score are all improved by more than 1.1%. On the THUCNews data set with sparse features, ATEMDP performs even better, with precision, recall and F1-score increased by 3.46%, 3.56% and 3.53%, respectively. The reason is that ATEMDP has stronger feature mining capabilities than the other models and can discover more deep-level features in data sets with sparse features, thereby improving the efficiency of tag extraction. All of these are beneficial effects brought by the multi-feature design.
Table 1  Experimental comparison results

Models              THUCNews               Ente-pku
                    P/%    R/%    F1/%     P/%    R/%    F1/%
KNN                 64.67  63.06  63.45    85.97  85.90  85.92
SVM                 79.52  78.67  78.91    91.94  91.93  91.93
Transformer         88.25  88.08  88.09    92.80  92.65  92.66
CNN                 89.52  89.46  89.47    93.42  93.43  93.42
BiLSTM              89.24  89.12  89.09    93.61  93.61  93.61
BiLSTM-CNN          89.73  89.63  89.65    93.75  93.74  93.74
BiLSTM-Attention    89.33  89.30  89.27    93.92  93.91  93.91
ATEMDP              93.19  93.19  93.16    95.05  95.05  95.05
To show the advantages and disadvantages of the various models more intuitively, we test on the validation set after each epoch. We compare the performance of each model using the curves of accuracy and loss on the data sets. The iterative changes of each model on the THUCNews data set are shown in Figs. 9 and 10, and those on the Ente-pku data set are shown in Figs. 11 and 12. It can be seen from Figs. 9 and 11 that, in the experiments on the THUCNews and Ente-pku data sets, after the first epoch the accuracy of ATEMDP is significantly higher than that of the other models and is stable at more than 92% and 94%, respectively. This is because ATEMDP uses a bidirectional transformer to generate feature vectors for the neural network, which can discover long-distance dependencies and hidden features in the text, while BiLSTM and CNN further discover contextual and local features. Thus, the accuracy of the model is significantly improved. However, the one-way transformer model cannot find text features effectively, which leads to poor tag extraction, and its accuracy fluctuates greatly across epochs. Figures 10 and 12 show the convergence of each model on the THUCNews and Ente-pku data sets, respectively. It can be seen from the loss curves that ATEMDP acquires features and converges faster than the other models, achieving a better convergence effect in fewer iterations. The loss values of the other models still fluctuate greatly after ATEMDP converges, and they are prone to overfitting. We also give the accuracy curves of the training sets to show that this fast convergence is a property of the model. Figures 13 and 14 are the accuracy curves of the THUCNews and Ente-pku training sets. It can be seen from Figs. 13 and 14 that the accuracy of the training sets fluctuates within a reasonable range, so the fast convergence does not mean the model prematurely falls into a local optimum.
The confusion matrix can visually represent the extraction effect of different models on each category. Figure 15 shows the confusion matrix statistics of the Ente-pku test set. As can be seen from Fig. 15, there are obvious differences in classification among the different models, and the accuracy of the deep learning models is generally higher than that of the other machine learning models. The reason is that the deep learning models can better obtain features to adapt to complex enterprise texts. However, due to sparse features, the deep learning models still perform poorly on some categories. The ATEMDP model with multi-features not only has a good classification effect on categories 2 and 3 with obvious features, but also improves greatly on categories 1 and 4, which have sparse features and are difficult to delineate. Compared with the other models, ATEMDP has a good division effect on data sets with strong features, but its structure is more complicated, so it consumes more computing resources under the same conditions.

Conclusion and future work

We have proposed an enterprise adaptive tag extraction method based on multi-feature dynamic portrait (ATEMDP) to solve the tag updating problem of dynamic portraits. ATEMDP first uses K-means to measure the similarity between enterprise texts and transforms the problem of clustering similar enterprise texts into tag feature clusters to obtain a point cluster structure containing tag preference topics. In addition, in multi-feature selection, ATEMDP feeds the topic texts into the Bert model as a sample set to discover the potential features of the enterprise texts. Finally, ATEMDP uses a neural network-based multi-feature extraction model to fully mine enterprise text information and achieve dynamic tag extraction. ATEMDP makes full use of the feature information of the text and can make adaptive adjustments according to the data type. When the interests of an enterprise change, ATEMDP does not need to update any models or structures and can immediately divide the preferences, which effectively solves the difficulty of updating enterprise portrait tags and the problem of low extraction accuracy.
For future research, we will focus on three aspects: (1) we would like to embed additional enterprise tag data, such as enterprise contextual information, to improve the accuracy of the model; (2) we would like to explore replacing K-means with other clustering methods for better model performance; (3) we will continue to improve ATEMDP and use the idea of knowledge distillation to output results for simple texts early, so as to avoid redundant computation on such samples.

Declarations

Conflict of interest

Not applicable.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References

6. Maria G, Akrivi K, Costas V, George L, Constantin H (2007) Creating an ontology for the user profile: method and applications. In: Proceedings AI*AI Workshop RCIS. Citeseer

13. Wu L, Ge Y, Liu Q, Chen E, Long B, Huang Z (2016) Modeling users preferences and social links in social networking services: a joint-evolving perspective. In: Proceedings of the AAAI Conference on Artificial Intelligence, 30

25. Jung SG, An J, Kwak H, Ahmad M, Nielsen L, Jansen BJ (2017) Persona generation from aggregated social media data. In: Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems, p 1748–1755. https://doi.org/10.1145/3027063.3053120

31. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, p 3111–3119

35. Kim Y (2014) Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), p 1746–1751

36. Johnson R, Zhang T (2017) Deep pyramid convolutional neural networks for text categorization. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p 562–570. https://doi.org/10.18653/v1/P17-1052

43. Liu W, Zhou P, Zhao Z, Wang Z, Ju Q, Deng H, Wang P (2020) K-Bert: Enabling language representation with knowledge graph. AAAI Conf Artif Intell 34:2901–2908
Metadata

Title: An enterprise adaptive tag extraction method based on multi-feature dynamic portrait
Authors: Xiang Li, Xingshuo Ding, Qian Xie, Shangbing Gao, Quanyin Zhu
Publication date: 18.03.2023
Publisher: Springer International Publishing
Published in: Complex & Intelligent Systems, Issue 5/2023
Print ISSN: 2199-4536
Electronic ISSN: 2198-6053
DOI: https://doi.org/10.1007/s40747-023-01029-z
